---
title: 'Selecting and renaming columns'
output:
  html_document:
    number_sections: true
    toc: true
    toc_float: true
    css: !expr here::here("global/style/style.css")
    highlight: kate
editor_options:
  chunk_output_type: console
---
```{r, echo = F, message = F, warning = F}
## Load packages
if(!require(pacman)) install.packages("pacman")
pacman::p_load(rlang, tidyverse, knitr, here)
## Source functions
source(here("global/functions/misc_functions.R"))
## knitr settings
knitr::opts_chunk$set(warning = F, message = F, class.source = "tgc-code-block", error = T)
## autograders
mute(here("lessons/ls01_select_rename_autograder.R"))
```
## Introduction
Today we will begin our exploration of the {dplyr} package! Our first verb on the list is `select()`, which allows you to keep or drop variables from your data frame. Choosing your variables is the first step in cleaning your data.
![Fig: the `select()` function.](images/custom_dplyr_select.png){width="408"}
Let's go!
## Learning objectives
- You can keep or drop columns from a data frame using the `select()` function from the {dplyr} package.
- You can select a range or combination of columns using operators like the colon (`:`), the exclamation mark (`!`), and the `c()` function.
- You can select columns based on patterns in their names with helper functions like `starts_with()`, `ends_with()`, `contains()`, and `everything()`.
- You can use `rename()` and `select()` to change column names.
## The Yaounde COVID-19 dataset
In this lesson, we analyse results from a COVID-19 serological survey conducted in Yaounde, Cameroon in late 2020. The survey estimated how many people had been infected with COVID-19 in the region, by testing for IgG and IgM antibodies. The full dataset can be obtained from [Zenodo](https://zenodo.org/record/5218965){target="_blank"}, and the paper can be viewed [here](https://www.nature.com/articles/s41467-021-25946-0){target="_blank"}.
Spend some time browsing through this dataset. Each line corresponds to one patient surveyed. There are some demographic, socio-economic and COVID-related variables. The results of the IgG and IgM antibody tests are in the columns `igg_result` and `igm_result`.
```{r, message = F, render = head_5_rows}
yaounde <- read_csv(here::here("data/yaounde_data.csv"))
yaounde
```
![Left: the Yaounde survey team. Right: an antibody test being administered.](images/yao_survey_team.png){width="450"}
## Introducing `select()`
![Fig: the `select()` function. (Drawing adapted from Allison Horst).](images/final_dplyr_select.png){width="408"}
`dplyr::select()` lets us pick which columns (variables) to keep or drop.
We can select a column **by name**:
```{r, render = head_5_rows}
yaounde %>% select(age)
```
Or we can select a column **by position**:
```{r, render = head_5_rows}
yaounde %>% select(3) # `age` is the 3rd column
```
To select **multiple variables**, we separate them with commas:
```{r}
yaounde %>% select(age, sex, igg_result)
```
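To see that selecting by name and selecting by position return the same result, here is a minimal sketch on a small made-up data frame. The `toy` data frame and its values are invented for illustration and are not part of the survey data.

```{r}
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(id = 1:3,
              age = c(34, 51, 27),
              sex = c("F", "M", "F"))

by_name <- toy %>% select(age, sex)
by_position <- toy %>% select(2, 3) # `age` and `sex` are the 2nd and 3rd columns of `toy`

# Both approaches should return the same two columns
identical(by_name, by_position)
```

Selecting by name is usually the safer choice, since positions shift whenever columns are added or removed earlier in a pipeline.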
::: {.callout-tip title='Practice'}
- Select the weight and height variables in the `yaounde` data frame.
```{r, eval = F, echo = FALSE}
## For this first question we'll provide the answer. Your code should be:
Q_weight_height <- yaounde %>% select(weight_kg, height_cm)
## Run that line, then run the CHECK and HINT functions below
.CHECK_Q_weight_height()
.HINT_Q_weight_height()
## Now, to obtain the solution, run the line below!
.SOLUTION_Q_weight_height()
## Each question has a solution function similar to this (.SOLUTION_Q_xxx_xxx())
## But you will need to type out the function name on your own.
## (This is to discourage you from looking at the solution before answering the question.)
```
- Select the 16th and 22nd columns in the `yaounde` data frame.
```{r, eval = F, echo = FALSE}
Q_cols_16_22 <- "YOUR ANSWER HERE"
.CHECK_Q_cols_16_22()
.HINT_Q_cols_16_22()
```
:::
------------------------------------------------------------------------
For the next part of the tutorial, let's create a smaller subset of the data, called `yao`.
```{r, render = head_5_rows}
yao <-
  yaounde %>% select(age,
                     sex,
                     highest_education,
                     occupation,
                     is_smoker,
                     is_pregnant,
                     igg_result,
                     igm_result)
yao
```
### Selecting column ranges with `:`
The `:` operator selects a **range of consecutive variables**:
```{r, render = head_5_rows}
yao %>% select(age:occupation) # Select all columns from `age` to `occupation`
```
We can also specify a range with column numbers:
```{r, render = head_5_rows}
yao %>% select(1:4) # Select columns 1 to 4
```
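A range can also be mixed with individual columns in the same call. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the survey data) to show which columns a range picks up.

```{r}
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(id = 1:3,
              age = c(34, 51, 27),
              sex = c("F", "M", "F"),
              occupation = c("Trader", "Farmer", "Student"))

# A range includes both endpoints and everything between them
range_cols <- toy %>% select(age:occupation)
names(range_cols)

# A single column and a range, separated by a comma
mixed_cols <- toy %>% select(id, sex:occupation)
names(mixed_cols)
```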
::: {.callout-tip title='Practice'}
- With the `yaounde` data frame, select the columns between `symptoms` and `sequelae`, inclusive. ("Inclusive" means you should also include `symptoms` and `sequelae` in the selection.)
```{r, eval = F, echo = FALSE}
Q_symp_to_sequel <- "YOUR ANSWER HERE"
.CHECK_Q_symp_to_sequel()
.HINT_Q_symp_to_sequel()
```
:::
### Excluding columns with `!`
The **exclamation point** negates a selection:
```{r, render = head_5_rows}
yao %>% select(!age) # Select all columns except `age`
```
To drop a range of consecutive columns, we negate the whole range, for example `!age:occupation`:
```{r, render = head_5_rows}
yao %>% select(!age:occupation) # Drop columns from `age` to `occupation`
```
To drop several non-consecutive columns, place them inside `!c()`:
```{r, render = head_5_rows}
yao %>% select(!c(age, sex, igg_result))
```
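It may help to know why `!age:occupation` drops the whole range rather than just `age`: in R, `:` binds more tightly than `!`, so the expression is read as `!(age:occupation)`. Here is a minimal sketch on a made-up data frame (`toy` is hypothetical) that checks this.

```{r}
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(age = c(34, 51, 27),
              sex = c("F", "M", "F"),
              occupation = c("Trader", "Farmer", "Student"),
              igg_result = c("Negative", "Positive", "Negative"))

# With and without explicit parentheses: the whole range is dropped either way
dropped_range <- toy %>% select(!age:sex)
dropped_explicit <- toy %>% select(!(age:sex))
identical(names(dropped_range), names(dropped_explicit))

# `!c()` drops columns that are not next to each other
dropped_some <- toy %>% select(!c(age, igg_result))
names(dropped_some)
```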
::: {.callout-tip title='Practice'}
- From the `yaounde` data frame, **remove** all columns between `highest_education` and `consultation`, inclusive.
```{r, eval = F, echo = FALSE}
Q_educ_consult <- "YOUR ANSWER HERE"
.CHECK_Q_educ_consult()
.HINT_Q_educ_consult()
```
:::
## Helper functions for `select()`
{dplyr} has a number of helper functions that make selecting easier by matching patterns in column names. Let's take a look at some of these.
### `starts_with()` and `ends_with()`
These two helpers work exactly as their names suggest!
```{r, render = head_5_rows}
yao %>% select(starts_with("is_")) # Columns that start with "is_"
yao %>% select(ends_with("_result")) # Columns that end with "_result"
```
### `contains()`
`contains()` helps select columns that contain a certain string:
```{r, render = head_5_rows}
yaounde %>% select(contains("drug")) # Columns that contain the string "drug"
```
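The three helpers differ only in where in the name they look for the pattern. As a quick side-by-side sketch, the made-up data frame below (`toy` and its column values are invented for illustration) has a column where "drug" appears in the middle of the name.

```{r}
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(is_smoker = c("no", "yes"),
              is_drug_parac = c(1, 0),
              drug_how = c("Pharmacy", NA),
              igg_result = c("Negative", "Positive"))

starts_is <- toy %>% select(starts_with("is_"))     # match at the start of the name
ends_result <- toy %>% select(ends_with("_result")) # match at the end of the name
has_drug <- toy %>% select(contains("drug"))        # match anywhere in the name

names(starts_is)
names(ends_result)
names(has_drug)
```

By default these helpers ignore case, so `starts_with("IS_")` would match the same columns. Pass `ignore.case = FALSE` for a case-sensitive match.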
### `everything()`
Another helper function, `everything()`, matches all variables that have not yet been selected.
```{r, render = head_5_rows}
## First, `is_pregnant`, then every other column.
yao %>% select(is_pregnant, everything())
```
It is often useful for establishing the order of columns.
Say we wanted to bring the `is_pregnant` column to the start of the `yao` data frame. We could type out all the column names manually:
```{r, render = head_5_rows}
yao %>% select(is_pregnant,
               age,
               sex,
               highest_education,
               occupation,
               is_smoker,
               igg_result,
               igm_result)
```
But this would be painful for larger data frames, such as our original `yaounde` data frame. In such a case, we can use `everything()`:
```{r, render = head_5_rows}
## Bring `is_pregnant` to the front of the data frame
yaounde %>% select(is_pregnant, everything())
```
This helper can be combined with many others.
```{r, render = head_5_rows}
## Bring columns that end with "result" to the front of the data frame
yaounde %>% select(ends_with("result"), everything())
```
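Note that a column named before `everything()` is not duplicated: it keeps the position where it was first mentioned. A minimal sketch on a made-up data frame (`toy` is hypothetical) illustrates this.

```{r}
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(age = c(34, 51, 27),
              sex = c("F", "M", "F"),
              igg_result = c("Negative", "Positive", "Negative"))

reordered <- toy %>% select(igg_result, everything())

names(reordered)              # `igg_result` now comes first
ncol(reordered) == ncol(toy)  # no column was lost or repeated
```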
::: {.callout-tip title='Practice'}
- Select all columns in the `yaounde` data frame that start with "is\_".
```{r, eval = F, echo = FALSE}
Q_starts_with_is <- "YOUR ANSWER HERE"
.CHECK_Q_starts_with_is()
.HINT_Q_starts_with_is()
```
- Move the columns that start with "is\_" to the beginning of the `yaounde` data frame.
```{r, eval = F, echo = FALSE}
Q_rearrange <- "YOUR ANSWER HERE"
.CHECK_Q_rearrange()
.HINT_Q_rearrange()
```
:::
## Change column names with `rename()`
![Fig: the `rename()` function. (Drawing adapted from Allison Horst)](images/dplyr_rename.png)
[`dplyr::rename()`](https://dplyr.tidyverse.org/reference/rename.html) is used to change column names:
```{r, render = head_5_rows}
## Rename `age` and `sex` to `patient_age` and `patient_sex`
yaounde %>%
  rename(patient_age = age,
         patient_sex = sex)
```
::: {.callout-caution title='Watch Out'}
The fact that the new name comes first in the function (`rename(NEWNAME = OLDNAME)`) is sometimes confusing. You should get used to this with time.
:::
### Rename within `select()`
You can also rename columns while selecting them:
```{r, render = head_5_rows}
## Select `age` and `sex`, and rename them to `patient_age` and `patient_sex`
yaounde %>%
  select(patient_age = age,
         patient_sex = sex)
```
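The two approaches differ in what happens to the other columns: `rename()` keeps every column, while `select()` keeps only the ones you name. The sketch below shows this on a small made-up data frame (`toy` is hypothetical, not the survey data).

```{r}
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(id = 1:3,
              age = c(34, 51, 27),
              sex = c("F", "M", "F"))

renamed <- toy %>% rename(patient_age = age)   # all columns kept, one renamed
selected <- toy %>% select(patient_age = age)  # only the named column kept

names(renamed)
names(selected)
```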
## Wrap up
I hope this first lesson has allowed you to see how intuitive and useful the {dplyr} verbs are! This is the first of a series of basic data wrangling verbs: see you in the next lesson to learn more.
![Fig: Basic Data Wrangling Dplyr Verbs.](images/custom_dplyr_basic_1.png){width="400"}
`r tgc_contributors_list(ids = c("lolovanco", "avallecam", "kendavidn"))`
## References {.unlisted .unnumbered}
Some material in this lesson was adapted from the following sources:
- Horst, A. (2021). *Dplyr-learnr*. <https://github.com/allisonhorst/dplyr-learnr> (Original work published 2020)
- *Subset columns using their names and types---Select*. (n.d.). Retrieved 31 December 2021, from <https://dplyr.tidyverse.org/reference/select.html>
Artwork was adapted from:
- Horst, A. (2021). *R & stats illustrations by Allison Horst*. <https://github.com/allisonhorst/stats-illustrations> (Original work published 2018)
## Solutions
```{r}
.SOLUTION_Q_weight_height()
.SOLUTION_Q_cols_16_22()
.SOLUTION_Q_symp_to_sequel()
.SOLUTION_Q_educ_consult()
.SOLUTION_Q_starts_with_is()
.SOLUTION_Q_rearrange()
```