-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathuntangled_ls03_mutate.qmd
371 lines (260 loc) · 13.8 KB
/
untangled_ls03_mutate.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
---
title: 'Mutating columns'
output:
html_document:
number_sections: true
toc: true
toc_float: true
css: !expr here::here("global/style/style.css")
highlight: kate
editor_options:
chunk_output_type: console
---
```{r, echo = F, message = F, warning = F}
if(!require(pacman)) install.packages("pacman")
pacman::p_load(knitr,
here,
janitor,
tidyverse)
### functions
source(here::here("global/functions/misc_functions.R"))
### default render
registerS3method("reactable_10_rows", "data.frame", reactable_10_rows)
knitr::opts_chunk$set(class.source = "tgc-code-block", render = head_5_rows)
### autograders
suppressMessages(source(here::here("lessons/ls03_mutate_autograder.R")))
```
## Intro
You now know how to keep or drop columns and rows from your dataset. Today you will learn how to modify existing variables or create new ones, using the `mutate()` verb from {dplyr}. This is an essential step in most data analysis projects.
Let's go!
![Fig: the `mutate()` verb.](images/custom_dplyr_mutate.png){width="400"}
## Learning objectives
1. You can use the `mutate()` function from the `{dplyr}` package to create new variables or modify existing variables.
2. You can create new numeric, character, factor, and boolean variables
## Packages
This lesson will require the packages loaded below:
```{r}
if(!require(pacman)) install.packages("pacman")
pacman::p_load(here,
janitor,
tidyverse)
```
## Datasets
In this lesson, we will again use the data from the COVID-19 serological survey conducted in Yaounde, Cameroon. Below, we import the dataset `yaounde` and create a smaller subset called `yao`. Note that this dataset is slightly different from the one used in the previous lesson.
```{r, message = F, render = head_10_rows}
yaounde <- read_csv(here::here('data/yaounde_data.csv'))
### a smaller subset of variables
yao <- yaounde %>% select(date_surveyed,
age,
weight_kg, height_cm,
symptoms, is_smoker)
yao
```
We will also use a dataset from a cross-sectional study that aimed to determine the prevalence of sarcopenia in the elderly population (\>60 years) in in Karnataka, India. Sarcopenia is a condition that is common in elderly people and is characterized by progressive and generalized loss of skeletal muscle mass and strength. The data was obtained from Zenodo [here](https://zenodo.org/record/3691939), and the source publication can be found [here](https://doi.org/10.12688/f1000research.22580.1).
Below, we import and view this dataset:
```{r, message = F, render = head_10_rows}
sarcopenia <- read_csv(here::here('data/sarcopenia_elderly.csv'))
sarcopenia
```
## Introducing `mutate()`
![The `mutate()` function. (Drawing adapted from Allison Horst)](images/dplyr_mutate.png){width="336"}
We use `dplyr::mutate()` to create new variables or modify existing variables. The syntax is quite intuitive, and generally looks like `df %>% mutate(new_column_name = what_it_contains)`.
Let's see a quick example.
The `yaounde` dataset currently contains a column called `height_cm`, which shows the height, in centimeters, of survey respondents. Let's create a data frame, `yao_height`, with just this column, for easy illustration:
```{r}
yao_height <- yaounde %>% select(height_cm)
yao_height
```
What if you wanted to **create a new variable**, called `height_meters` where heights are converted to meters? You can use `mutate()` for this, with the argument `height_meters = height_cm/100`:
```{r}
yao_height %>%
mutate(height_meters = height_cm/100)
```
Great. The syntax is beautifully simple, isn't it?
Now, imagine there was a small error in the equipment used to measure respondent heights, and all heights are 5cm too small. You therefore like to add 5cm to all heights in the dataset. To do this, rather than creating a new variable as you did before, you can **modify the existing variable** with mutate:
```{r}
yao_height %>%
mutate(height_cm = height_cm + 5)
```
Again, very easy to do!
::: {.callout-tip title='Practice'}
The `sarcopenia` data frame has a variable `weight_kg`, which contains respondents' weights in kilograms. Create a new column, called `weight_grams`, with respondents' weights in grams. Store your answer in the `Q_weight_to_g` object. (1 kg equals 1000 grams.)
```{r, eval = FALSE}
## Complete the code with your answer:
Q_weight_to_g <-
sarcopenia %>%
_____________________
```
```{r, include = FALSE}
## Check your answer:
.CHECK_Q_weight_to_g()
.HINT_Q_weight_to_g()
## To obtain the solution, run the line below!
.SOLUTION_Q_weight_to_g()
## Each question has a solution function similar to this.
## (Where HINT is replaced with SOLUTION in the function name.)
## But you will need to type out the function name on your own.
## (This is to discourage you from looking at the solution before answering the question.)
```
:::
Hopefully you now see that the mutate function is quite user-friendly. In theory, we could end the lesson here, because you now know how to use `mutate()` 😃. But of course, the devil will be in the details---the interesting thing is not `mutate()` itself but what goes *inside* the `mutate()` call.
The rest of the lesson will go through a few use cases for the `mutate()` verb. In the process, we'll touch on several new functions you have not yet encountered.
## Creating a Boolean variable
You can use `mutate()` to create a Boolean variable to categorize part of your population.
Below we create a Boolean variable, `is_child` which is either `TRUE` if the subject is a child or `FALSE` if the subject is an adult (first, we select just the `age` variable so it's easy to see what is being done; you will likely not need this pre-selection for your own analyses).
```{r}
yao %>%
select(age) %>%
mutate(is_child = age <= 18)
```
The code `age <= 18` evaluates whether each age is less than or equal to 18. Ages that match that condition (ages 18 and under) are `TRUE` and those that fail the condition are `FALSE`.
Such a variable is useful to, for example, count the number of children in the dataset. The code below does this with the `janitor::tabyl()` function:
```{r}
yao %>%
mutate(is_child = age <= 18) %>%
tabyl(is_child)
```
You can observe that 31.8% (0.318...) of respondents in the dataset are children.
Let's see one more example, since the concept of Boolean variables can be a bit confusing. The `symptoms` variable reports any respiratory symptoms experienced by the patient:
```{r}
yao %>%
select(symptoms)
```
You could create a Boolean variable, called `has_no_symptoms`, that is set to `TRUE` if the respondent reported no symptoms:
```{r}
yao %>%
select(symptoms) %>%
mutate(has_no_symptoms = symptoms == "No symptoms")
```
Similarly, you could create a Boolean variable called `has_any_symptoms` that is set to `TRUE` if the respondent reported any symptoms. For this, you'd simply swap the `symptoms == "No symptoms"` code for `symptoms != "No symptoms"`:
```{r}
yao %>%
select(symptoms) %>%
mutate(has_any_symptoms = symptoms != "No symptoms")
```
Still confused by the Boolean examples? That's normal. Pause and play with the code above a little. Then try the practice question below
::: {.callout-tip title='Practice'}
Women with a grip strength below 20kg are considered to have low grip strength. With a female subset of the `sarcopenia` data frame, add a variable called `low_grip_strength` that is `TRUE` for women with a grip strength \< 20 kg and FALSE for other women.
```{r, eval = FALSE}
## Complete the code with your answer:
Q_women_low_grip_strength <-
sarcopenia %>%
filter(sex_male_1_female_0 == 0) # first we filter the dataset to only women
# mutate code here
```
```{r, include = FALSE}
## Check your answer:
.CHECK_Q_women_low_grip_strength()
.HINT_Q_women_low_grip_strength()
```
What percentage of women surveyed have a low grip strength according to the definition above? Enter your answer as a number without quotes (e.g. 43.3 or 12.2), to one decimal place.
```{r, eval = FALSE}
Q_prop_women_low_grip_strength <- YOUR_ANSWER_HERE
```
```{r, include = FALSE}
## Check your answer:
.CHECK_Q_prop_women_low_grip_strength()
.HINT_Q_prop_women_low_grip_strength()
```
:::
## Creating a numeric variable based on a formula
Now, let's look at an example of creating a numeric variable, the body mass index (BMI), which a commonly used health indicator. The formula for the body mass index can be written as:
$$
BMI = \frac{weight (kilograms)}{height (meters)^2}
$$ You can use `mutate()` to calculate BMI in the `yao` dataset as follows:
```{r}
yao %>%
select(weight_kg, height_cm) %>%
# first obtain the height in meters
mutate(height_meters = height_cm/100) %>%
# then use the BMI formula
mutate(bmi = weight_kg / (height_meters)^2)
```
Let's save the data frame with BMIs for later. We will use it in the next section.
```{r}
yao_bmi <-
yao %>%
select(weight_kg, height_cm) %>%
# first obtain the height in meters
mutate(height_meters = height_cm/100) %>%
# then use the BMI formula
mutate(bmi = weight_kg / (height_meters)^2)
```
::: {.callout-tip title='Practice'}
Appendicular muscle mass (ASM), a useful health indicator, is the sum of muscle mass in all 4 limbs. It can predicted with the following formula, called Lee's equation:
$$ASM(kg)= (0.244 \times weight(kg)) + (7.8 \times height(m)) + (6.6 \times sex) - (0.098 \times age) - 4.5$$
The `sex` variable in the formula assumes that men are coded as 1 and women are coded as 0 (which is already the case for our `sarcopenia` dataset.) The `- 4.5` at the end is a constant used for Asians.
Calculate the ASM value for all individuals in the `sarcopenia` dataset. This value should be in a new column called `asm`
```{r, eval = FALSE}
## Complete the code with your answer:
Q_asm_calculation <-
sarcopenia #_____
#________________
```
```{r, include = FALSE}
## Check your answer:
.CHECK_Q_asm_calculation()
.HINT_Q_asm_calculation()
```
:::
## Changing a variable's type
In your data analysis workflow, you often need to redefine variable *types*. You can do so with functions like `as.integer()`, `as.factor()`, `as.character()` and `as.Date()` within your `mutate()` call. Let's see one example of this.
### Integer: `as.integer`
`as.integer()` converts any numeric values to integers:
```{r}
yao_bmi %>%
mutate(bmi_integer = as.integer(bmi))
```
Note that this *truncates* integers rather than rounding them up or down, as you might expect. For example the BMI 22.8 in the third row is truncated to 22. If you want rounded numbers, you can use the `round` function from base R
::: {.callout-note title='Pro Tip'}
Using `as.integer()` on a factor variable is a fast way of encoding strings into numbers. It can be essential to do so for some machine learning data processing.
:::
```{r}
yao_bmi %>%
mutate(bmi_integer = as.integer(bmi),
bmi_rounded = round(bmi))
```
::: {.callout-note title='Side Note'}
The base R `round()` function rounds "half down". That is, the number 3.5, for example, is rounded down to 3 by `round()`. This is weird. Most people expect 3.5 to be rounded *up* to 4, not down to 3. So most of the time, you'll actually want to use the `round_half_up()` function from janitor.
:::
::: {.callout-note title='Challenge'}
In future lessons, you will discover how to manipulate dates and how to convert to a date type using `as.Date()`.
:::
::: {.callout-tip title='Practice'}
Use `as_integer()` to convert the ages of respondents in the `sarcopenia` dataset to integers (truncating them in the process). This should go in a new column called `age_integer`
```{r, eval = FALSE}
## Complete the code with your answer:
Q_age_integer <-
sarcopenia #_____
#________________
```
```{r, include = FALSE}
## Check your answer:
.CHECK_Q_age_integer()
.HINT_Q_age_integer()
```
:::
## Wrap up
As you can imagine, transforming data is an essential step in any data analysis workflow. It is often required to clean data and to prepare it for further statistical analysis or for making plots. And as you have seen, it is quite simple to transform data with dplyr's `mutate()` function, although certain transformations are trickier to achieve than others.
Congrats on making it through.
But your data wrangling journey isn't over yet! In our next lessons, we will learn how to create complex data summaries and how to create and work with data frame groups. Intrigued? See you in the next lesson.
![Fig: Basic Data Wrangling with `select()`, `filter()`, and `mutate()`.](images/custom_dplyr_basic_3.png){width="400"}
`r tgc_contributors_list(ids = c("lolovanco", "avallecam", "kendavidn"))`
## References {.unlisted .unnumbered}
Some material in this lesson was adapted from the following sources:
- Horst, A. (2022). *Dplyr-learnr*. <https://github.com/allisonhorst/dplyr-learnr> (Original work published 2020)
- *Create, modify, and delete columns --- Mutate*. (n.d.). Retrieved 21 February 2022, from <https://dplyr.tidyverse.org/reference/mutate.html>
- *Apply a function (or functions) across multiple columns --- Across*. (n.d.). Retrieved 21 February 2022, from <https://dplyr.tidyverse.org/reference/across.html>
Artwork was adapted from:
- Horst, A. (2022). *R & stats illustrations by Allison Horst*. <https://github.com/allisonhorst/stats-illustrations> (Original work published 2018)
Other references:
- Lee, Robert C, ZiMian Wang, Moonseong Heo, Robert Ross, Ian Janssen, and Steven B Heymsfield. "Total-Body Skeletal Muscle Mass: Development and Cross-Validation of Anthropometric Prediction Models." *The American Journal of Clinical Nutrition* 72, no. 3 (2000): 796--803. <https://doi.org/10.1093/ajcn/72.3.796.>
## Solutions
```{r}
.SOLUTION_Q_weight_to_g()
.SOLUTION_Q_sarcopenia_resp_id()
.SOLUTION_Q_women_low_grip_strength()
.SOLUTION_Q_prop_women_low_grip_strength()
.SOLUTION_Q_asm_calculation()
.SOLUTION_Q_age_integer()
```