---
title: "Lyon Training Code"
author: "Nick J Lyon"
format: pdf
editor: visual
---
# Manipulating, Analyzing and Exporting Data with `tidyverse`
This episode is taken from the larger Carpentries lesson "[Data Analysis and Visualization in R for Ecologists](https://datacarpentry.org/R-ecology-lesson/index.html)".
## Loading Necessary Tools and Data
If you have not already done so, you will need to install the `tidyverse` package. Note that `tidyverse` is really a suite of related packages, and we will only use a few functions from some of them.
```{r, eval = FALSE}
install.packages("tidyverse")
```
Once you have installed this package, you need to load it to be able to use its functions later.
```{r, message = FALSE}
library(tidyverse)
```
We also need a dataset to demonstrate these tools, so we will use the `iris` dataset built into R.
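Because `iris` ships with base R (in the `datasets` package), no import step is needed. As a minimal sketch, a quick look at its structure confirms what we are working with before any wrangling:

```{r}
# iris is available without loading anything; inspect its structure
str(iris)

# Dimensions: rows (individual flowers) first, then columns (variables)
dim(iris)

# The species recorded in the Species column
levels(iris$Species)
```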
## `tidyverse` - Data Wrangling
Below is a list of core `tidyverse` functions with a short description of their purpose. Recall that in the `::` notation the **package name is on the left** and the **function name is on the right**.
- `dplyr::select()`: Choose which columns to include or exclude by their name (rather than by bracket/number indexing)
```{r}
head(dplyr::select(.data = iris, Sepal.Length))
```
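The example above only keeps a column. Excluding columns works the same way with a minus sign in front of the name. This is a small sketch; the object names `no_species` and `sepals_only` are made up for illustration:

```{r}
# Drop a single column by prefixing its name with a minus sign
no_species <- dplyr::select(.data = iris, -Species)
names(no_species)

# Drop several columns at once by wrapping them in c()
sepals_only <- dplyr::select(.data = iris, -c(Petal.Length, Petal.Width, Species))
names(sepals_only)
```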
- `dplyr::filter()`: Subset a dataframe to only those rows that meet a given condition
```{r}
sub <- dplyr::filter(.data = iris, Species == "setosa")
# Can check by looking at the number of rows of the full dataframe
## versus the number of rows of the subsetted dataframe
nrow(iris)
nrow(sub)
```
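`dplyr::filter()` can also take more than one condition. The sketch below shows the two common ways of combining them; the object names `big_setosa` and `two_species` and the cutoff of 5 are arbitrary choices for illustration:

```{r}
# Conditions separated by commas must ALL be true (logical AND)
big_setosa <- dplyr::filter(.data = iris, Species == "setosa", Sepal.Length > 5)

# Conditions joined with | need only ONE to be true (logical OR)
two_species <- dplyr::filter(.data = iris,
                             Species == "setosa" | Species == "virginica")

# Compare the number of rows kept by each filter
nrow(big_setosa)
nrow(two_species)
```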
- `dplyr::mutate()`: Create new columns in a dataframe (or overwrite existing ones)
```{r}
petals <- dplyr::mutate(.data = iris, Petal.Area = (Petal.Length * Petal.Width) )
head(dplyr::select(.data = petals, dplyr::starts_with("Petal.")))
```
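A single `dplyr::mutate()` call can create several columns, and later columns can use ones created earlier in the same call. A minimal sketch, where `Sepal.Area` and `Area.Ratio` are hypothetical columns added only for this illustration:

```{r}
ratios <- dplyr::mutate(.data = iris,
                        Petal.Area = Petal.Length * Petal.Width,
                        Sepal.Area = Sepal.Length * Sepal.Width,
                        # Area.Ratio uses the two columns created just above
                        Area.Ratio = Petal.Area / Sepal.Area)

head(dplyr::select(.data = ratios, Petal.Area, Sepal.Area, Area.Ratio))
```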
- `magrittr` has a "pipe" function (`%>%`) for chaining together multiple other functions without needing to create a new object for each operation.
```{r}
iris %>%
  dplyr::filter(Species == "setosa") %>%
  dplyr::mutate(Petal.Area = (Petal.Length * Petal.Width)) %>%
  dplyr::select(dplyr::starts_with("Petal.")) %>%
  head()
```
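To see what the pipe saves us from, here is a sketch of the same three steps written as nested function calls, next to the piped version. The object names `nested` and `piped` are made up for this comparison:

```{r}
library(magrittr) # provides %>%; already attached if tidyverse is loaded

# The same steps as nested calls, which must be read from the inside out
nested <- head(
  dplyr::select(
    dplyr::mutate(
      dplyr::filter(iris, Species == "setosa"),
      Petal.Area = Petal.Length * Petal.Width
    ),
    dplyr::starts_with("Petal.")
  )
)

# The piped version, which reads top to bottom
piped <- iris %>%
  dplyr::filter(Species == "setosa") %>%
  dplyr::mutate(Petal.Area = Petal.Length * Petal.Width) %>%
  dplyr::select(dplyr::starts_with("Petal.")) %>%
  head()

# Check whether the two approaches give the same result
all.equal(nested, piped)
```

The pipe passes the object on its left as the first argument of the function on its right, which is why `.data` does not need to be written out in each step.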
- `dplyr::group_by()` can be used with `dplyr::summarise()` to calculate summary statistics within groups
  - **WARNING**: `summarise()` keeps only the grouping columns and the summary columns you create; all other columns are dropped
  - *NOTE*: the American English spelling `dplyr::summarize()` also works if you prefer
```{r}
iris %>%
  dplyr::group_by(Species) %>%
  dplyr::summarise(Mean.Petal.Length = mean(x = Petal.Length))
```
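Several summary statistics can be calculated in one `dplyr::summarise()` call. The sketch below also shows which columns survive; the names `petal_stats`, `SD.Petal.Length`, and `Sample.Size` are hypothetical and chosen only for this example:

```{r}
library(magrittr) # provides %>%; already attached if tidyverse is loaded

petal_stats <- iris %>%
  dplyr::group_by(Species) %>%
  dplyr::summarise(Mean.Petal.Length = mean(x = Petal.Length),
                   SD.Petal.Length = sd(x = Petal.Length),
                   # dplyr::n() counts the rows in each group
                   Sample.Size = dplyr::n())

petal_stats

# Only the grouping column and the new summary columns remain
names(petal_stats)
```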
## `tidyverse` - Data Reshaping
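Data can be in either "wide" or "long" format. In wide format, each measured variable has its own column, so one row holds several measurements. In long format, those measurements are stacked into a single value column, with a second column recording which variable each value came from. The `tidyr` package provides one function for reshaping in each direction.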
- `tidyr::pivot_longer()`: Rotate a wide dataframe into long format
```{r}
iris %>%
  tidyr::pivot_longer(cols = -Species,
                      names_to = "Iris.Metric",
                      values_to = "Measurement") %>%
  head()
```
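Comparing dimensions before and after the pivot is a simple way to check what changed. A minimal sketch, storing the result in a hypothetical object called `iris_long`:

```{r}
library(magrittr) # provides %>%; already attached if tidyverse is loaded

iris_long <- iris %>%
  tidyr::pivot_longer(cols = -Species,
                      names_to = "Iris.Metric",
                      values_to = "Measurement")

# Original shape: one row per flower, one column per measurement
dim(iris)

# Long shape: one row per flower-by-measurement combination
dim(iris_long)

# Each pivoted column name became a value in Iris.Metric
unique(iris_long$Iris.Metric)
```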
- `tidyr::pivot_wider()`: Rotate a long dataframe into wide format
```{r}
# Create a smaller dataframe to make the pivot easier to visualize
iris_summary <- iris %>%
  dplyr::group_by(Species) %>%
  dplyr::summarise(Mean.Petal.Length = mean(x = Petal.Length))
head(iris_summary)
iris_summary %>%
  tidyr::pivot_wider(names_from = Species,
                     values_from = Mean.Petal.Length)
```
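The two pivot functions undo each other, provided each row can be uniquely identified. Because `iris` has no identifier column, the sketch below adds one first; `Flower.ID`, `iris_id`, and `iris_round_trip` are hypothetical names introduced for this illustration:

```{r}
library(magrittr) # provides %>%; already attached if tidyverse is loaded

# Add a row identifier so measurements from the same flower stay linked
iris_id <- iris %>%
  dplyr::mutate(Flower.ID = dplyr::row_number())

# Pivot to long format, then straight back to wide format
iris_round_trip <- iris_id %>%
  tidyr::pivot_longer(cols = -c(Species, Flower.ID),
                      names_to = "Iris.Metric",
                      values_to = "Measurement") %>%
  tidyr::pivot_wider(names_from = Iris.Metric,
                     values_from = Measurement)

# The shape matches the dataframe we started with (iris_id)
dim(iris_id)
dim(iris_round_trip)
head(iris_round_trip)
```

Without the identifier, `tidyr::pivot_wider()` cannot tell which measurements belong together, since several flowers of the same species share identical values.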