---
title: 'Conditional mutating'
output:
  html_document:
    number_sections: true
    toc: true
    toc_float: true
    css: !expr here::here("global/style/style.css")
    highlight: kate
editor_options:
  chunk_output_type: console
---
```{r, echo = F, message = F, warning = F}
if(!require(pacman)) install.packages("pacman")
pacman::p_load(knitr,
here,
janitor,
tidyverse)
### functions
source(here::here("global/functions/misc_functions.R"))
### default render
registerS3method("reactable_10_rows", "data.frame", reactable_10_rows)
knitr::opts_chunk$set(class.source = "tgc-code-block", render = head_10_rows)
### autograders
suppressMessages(source(here::here("lessons/ls04_conditional_mutate_autograder.R")))
```
## Introduction
In the last lesson, you learned the basics of data transformation using the {dplyr} function `mutate()`.
In that lesson, we mostly looked at *global* transformations; that is, transformations that did the same thing to an entire variable. In this lesson, we will look at how to *conditionally* manipulate certain rows based on whether or not they meet defined criteria.
For this, we will mostly use the `case_when()` function, which you will likely come to see as one of the most important functions in {dplyr} for data wrangling tasks.
Let's get started.
![Fig: the case_when() conditions.](images/custom_dplyr_case_when.png){width="400"}
## Learning objectives
1. You can transform or create new variables based on conditions using `dplyr::case_when()`.
2. You know how to use the `TRUE` condition in `case_when()` to match unmatched cases.
3. You can handle `NA` values in `case_when()` transformations.
4. You understand how to keep the default values of a variable in a `case_when()` formula.
5. You can write `case_when()` conditions involving multiple comparators and multiple variables.
6. You understand the priority order of `case_when()` conditions.
7. You can use `dplyr::if_else()` for binary conditional assignment.
## Packages
This lesson requires the tidyverse suite of packages, along with {janitor} (for `tabyl()`) and {here} (for building file paths):
```{r}
if(!require(pacman)) install.packages("pacman")
pacman::p_load(tidyverse, janitor, here)
```
## Datasets
In this lesson, we will again use data from the COVID-19 serological survey conducted in Yaounde, Cameroon.
```{r, message = F, render = head_10_rows}
## Import and view the dataset
yaounde <-
  read_csv(here::here('data/yaounde_data.csv')) %>%
  ## set every 5th age (rows 5, 10, ..., 900) to missing
  mutate(age = case_when(row_number() %in% seq(5, 900, by = 5) ~ NA_real_,
                         TRUE ~ age)) %>%
  ## rename the age variable
  rename(age_years = age) %>%
  ## drop the age category column
  select(-age_category)
yaounde
```
Note that in the code chunk above, we slightly modified the age column, artificially introducing some missing values, and we also dropped the `age_category` column. This is to help illustrate some key points in the tutorial.
------------------------------------------------------------------------
For practice questions, we will also use an outbreak linelist of 136 cases of influenza A H7N9 from a [2013 outbreak](https://en.wikipedia.org/wiki/Influenza_A_virus_subtype_H7N9#Reported_cases_in_2013) in China. This is a modified version of a dataset compiled by Kucharski et al. (2014).
```{r, message = F, render = head_10_rows}
## Import and view the dataset
flu_linelist <- read_csv(here::here('data/flu_h7n9_china_2013.csv'))
flu_linelist
```
## Reminder: relational operators (comparators) in R
Throughout this lesson, you will use a lot of relational operators in R. Recall that relational operators, sometimes called "comparators", test the relation between two values, and return `TRUE`, `FALSE` or `NA`.
A list of the most common operators is given below:
| **Operator** | **is TRUE if** |
|:-------------|:------------------------------------|
| A \< B | A is **less than** B |
| A \<= B | A is **less than or equal to** B |
| A \> B | A is **greater than** B |
| A \>= B | A is **greater than or equal to** B |
| A == B | A is **equal** to B |
| A != B | A is **not equal** to B |
| A %in% B | A **is an element of** B |
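As a quick illustration of the table above, the sketch below applies several of these operators to a small made-up vector (`ages` is a hypothetical name, not part of the lesson data). Note that a comparison involving `NA` returns `NA`, whereas `%in%` returns `FALSE` for a missing value.

```r
# Hypothetical ages, including one missing value
ages <- c(12, 18, 35, NA)

less_than_18 <- ages < 18            # TRUE FALSE FALSE NA
at_least_18  <- ages >= 18           # FALSE TRUE TRUE NA
equal_to_18  <- ages == 18           # FALSE TRUE FALSE NA
not_18       <- ages != 18           # TRUE FALSE TRUE NA
in_set       <- ages %in% c(12, 35)  # TRUE FALSE TRUE FALSE (NA is simply "not found")
```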
## Introduction to `case_when()`
To get familiar with `case_when()`, let's begin with a simple conditional transformation on the `age_years` column of the `yaounde` dataset. First we subset the data frame to just the `age_years` column for easy illustration:
```{r}
yaounde_age <-
yaounde %>%
select(age_years)
yaounde_age
```
Now, using `case_when()`, we can make a new column, called "age_group", that has the value "Child" if the person is below 18, and "Adult" if the person is 18 and up:
```{r}
yaounde_age %>%
mutate(age_group = case_when(age_years < 18 ~ "Child",
age_years >= 18 ~ "Adult"))
```
The `case_when()` syntax may seem a bit foreign, but it is quite simple: on the left-hand side (LHS) of the `~` sign (called a "tilde"), you provide the condition(s) you want to evaluate, and on the right-hand side (RHS), you provide a value to put in if the condition is true.
So the statement `case_when(age_years < 18 ~ "Child", age_years >= 18 ~ "Adult")` can be read as: "if `age_years` is below 18, input 'Child', else if `age_years` is greater than or equal to 18, input 'Adult'".
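To see the same logic outside a data frame, here is a minimal sketch on a made-up vector (`toy_ages` and `toy_groups` are hypothetical names). `case_when()` works on plain vectors, so a set of conditions can be tried in isolation before being placed inside `mutate()`.

```r
library(dplyr)

# Hypothetical ages for illustration only
toy_ages <- c(4, 17, 18, 60)

# Each formula pairs a condition (LHS) with the value to input (RHS)
toy_groups <- case_when(toy_ages < 18 ~ "Child",
                        toy_ages >= 18 ~ "Adult")
toy_groups
# "Child" "Child" "Adult" "Adult"
```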
::: {.callout-note title='Vocab'}
**Formulas, LHS and RHS**
Each line of a `case_when()` call is termed a "formula" or, sometimes, a "two-sided formula". And each formula has a left-hand side (abbreviated LHS) and right-hand side (abbreviated RHS).
For example, the code `age_years < 18 ~ "Child"` is a "formula": its LHS is `age_years < 18` and its RHS is `"Child"`.
You are likely to come across these terms when reading the documentation for the `case_when()` function, and we will also refer to them in this lesson.
:::
------------------------------------------------------------------------
After creating a new variable with `case_when()`, it is a good idea to inspect it thoroughly to make sure it worked as intended.
To inspect the variable, you can pipe your data frame into the `View()` function to view it in spreadsheet form:
```{r eval = FALSE}
yaounde_age %>%
mutate(age_group = case_when(age_years < 18 ~ "Child",
age_years >= 18 ~ "Adult")) %>%
View()
```
This would open up a new tab in RStudio where you should manually scan through the new column, `age_group`, and the referenced column, `age_years`, to make sure your `case_when()` statement did what you wanted it to do.
You could also pass the new column into the `tabyl()` function to ensure that the proportions "make sense":
```{r}
yaounde_age %>%
mutate(age_group = case_when(age_years < 18 ~ "Child",
age_years >= 18 ~ "Adult")) %>%
tabyl(age_group)
```
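A similar check can be sketched with dplyr's `count()`, a function not otherwise covered in this lesson. The toy data frame below is hypothetical. The point is simply that rows matching no condition show up as their own `NA` group, which makes them easy to spot.

```r
library(dplyr)

# Hypothetical toy data with one missing age
toy <- tibble(age_years = c(3, 17, 18, 45, NA))

toy_check <-
  toy %>%
  mutate(age_group = case_when(age_years < 18 ~ "Child",
                               age_years >= 18 ~ "Adult")) %>%
  count(age_group)

toy_check
# Three rows: "Adult" (n = 2), "Child" (n = 2) and NA (n = 1)
```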
::: {.callout-tip title='Practice'}
With the `flu_linelist` data, make a new column, called `age_group`, that has the value "Below 50" for people under 50 and "50 and above" for people aged 50 and up. Use the `case_when()` function.
```{r eval = FALSE}
## Complete the code with your answer:
Q_age_group <-
flu_linelist %>%
mutate(age_group = ______________________________)
```
```{r include = FALSE}
## Check your answer
.CHECK_Q_age_group()
.HINT_Q_age_group()
## To get the solution, run the line below!
.SOLUTION_Q_age_group()
## Each question has a solution function similar to this.
## (Where HINT is replaced with SOLUTION in the function name.)
## But you will need to type out the function name on your own.
## (This is to discourage you from looking at the solution before answering the question.)
```
Out of the entire sample of individuals in the `flu_linelist` dataset, what percentage are confirmed to be below 60? (Repeat the above procedure but with the `60` cutoff, then call `tabyl()` on the age group variable. Use the `percent` column, not the `valid_percent` column.)
```{r eval = FALSE}
## Enter your answer as a WHOLE number without quotes:
Q_age_group_percentage <- YOUR_ANSWER_HERE
```
```{r include = FALSE}
## Check your answer
.CHECK_Q_age_group_percentage()
.HINT_Q_age_group_percentage()
```
:::
## The `TRUE` default argument
In a `case_when()` statement, you can use a literal `TRUE` condition to match any rows not yet matched with provided conditions.
For example, if we keep only the first condition from the previous example, `age_years < 18`, and define the default value to be `TRUE ~ "Not child"`, then all adults and `NA` values in the dataset will be labeled `"Not child"` by default.
```{r render = head_10_rows}
yaounde_age %>%
mutate(age_group = case_when(age_years < 18 ~ "Child",
TRUE ~ "Not child"))
```
This `TRUE` condition can be read as "for everything else...".
So the full `case_when()` statement used above, `age_years < 18 ~ "Child", TRUE ~ "Not child"`, would then be read as: "if age is below 18, input 'Child' and *for everyone else not yet matched*, input 'Not child'".
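The sketch below, on a hypothetical vector `toy_ages`, shows why missing ages also end up as "Not child": the comparison `NA < 18` evaluates to `NA`, which does not count as a match, so the row falls through to the `TRUE` condition.

```r
library(dplyr)

# Hypothetical ages: one child, one adult, one missing
toy_ages <- c(10, 25, NA)

toy_default <- case_when(toy_ages < 18 ~ "Child",
                         TRUE ~ "Not child")
toy_default
# "Child" "Not child" "Not child"
```

If missing values should not be swept into the default label, they need their own condition placed before the `TRUE` line.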
::: {.callout-caution title='Watch Out'}
It is important to use `TRUE` as the *final* condition in `case_when()`. If you use it as the first condition, it will take precedence over all others, as seen here:
```{r}
yaounde_age %>%
mutate(age_group = case_when(TRUE ~ "Not child",
age_years < 18 ~ "Child"))
```
As you can observe, all individuals are now coded with "Not child", because the `TRUE` condition was placed first, and therefore took precedence. We will explore the issue of precedence further below.
:::
## Matching NA's with `is.na()`
We can match missing values manually with `is.na()`. Below we match `NA` ages with `is.na()` and set their age group to "Missing age":
```{r render = head_10_rows}
yaounde_age %>%
mutate(age_group = case_when(age_years < 18 ~ "Child",
age_years >= 18 ~ "Adult",
is.na(age_years) ~ "Missing age"))
```
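A common slip is to test for missing values with `== NA`. The sketch below (with the hypothetical names `toy_ages`, `with_equals` and `with_is_na`) illustrates why this does not work: comparing anything with `NA` returns `NA`, never `TRUE`, so the condition matches no rows.

```r
library(dplyr)

# Hypothetical ages: one child, one adult, one missing
toy_ages <- c(10, 25, NA)

toy_ages == NA    # NA NA NA: never TRUE, so it cannot match anything
is.na(toy_ages)   # FALSE FALSE TRUE

# Using `== NA` leaves the missing age unmatched (it stays NA)
with_equals <- case_when(toy_ages < 18 ~ "Child",
                         toy_ages >= 18 ~ "Adult",
                         toy_ages == NA ~ "Missing age")

# Using is.na() labels the missing age as intended
with_is_na <- case_when(toy_ages < 18 ~ "Child",
                        toy_ages >= 18 ~ "Adult",
                        is.na(toy_ages) ~ "Missing age")

with_equals   # "Child" "Adult" NA
with_is_na    # "Child" "Adult" "Missing age"
```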
::: {.callout-tip title='Practice'}
As before, using the `flu_linelist` data, make a new column, called `age_group`, that has the value "Below 60" for people under 60 and "60 and above" for people aged 60 and up. But this time, also set those with missing ages to "Missing age".
```{r eval = FALSE}
## Complete the code with your answer:
Q_age_group_nas <-
flu_linelist %>%
```
```{r include = FALSE}
## Check your answer
.CHECK_Q_age_group_nas()
.HINT_Q_age_group_nas()
```
:::
::: {.callout-tip title='Practice'}
The `gender` column of the `flu_linelist` dataset contains the values "f", "m" and `NA`:
```{r}
flu_linelist %>%
tabyl(gender)
```
Recode "f", "m" and `NA` to "Female", "Male" and "Missing gender" respectively. You should modify the existing `gender` column, not create a new column.
```{r eval = FALSE}
## Complete the code with your answer:
Q_gender_recode <-
flu_linelist %>%
```
```{r include = FALSE}
## Check your answer
.CHECK_Q_gender_recode()
.HINT_Q_gender_recode()
```
:::
## Keeping default values of a variable
The right-hand side (RHS) of a `case_when()` formula can also take in a variable from your data frame. This is often useful when you want to change just a few values in a column.
Let's see an example with the `highest_education` column, which contains the highest education level attained by a respondent:
```{r}
yaounde_educ <-
yaounde %>%
select(highest_education)
yaounde_educ
```
Below, we create a new column, `highest_educ_recode`, where we recode both "University" and "Doctorate" to the value "Post-secondary":
```{r render = head_10_rows}
yaounde_educ %>%
mutate(
highest_educ_recode =
case_when(
highest_education %in% c("University", "Doctorate") ~ "Post-secondary"
)
)
```
It worked, but now we have `NA`s for all other rows. To keep these other rows at their default values, we can add the line `TRUE ~ highest_education` (with a variable, `highest_education`, on the right-hand side of a formula):
```{r}
yaounde_educ %>%
mutate(
highest_educ_recode =
case_when(
highest_education %in% c("University", "Doctorate") ~ "Post-secondary",
TRUE ~ highest_education
)
)
```
Now the `case_when()` statement reads: 'If highest education is "University" or "Doctorate", input "Post-secondary". For everyone else, input the value from `highest_education`'.
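The same pattern can be tried on a plain character vector. In this minimal sketch, `toy_educ` is a hypothetical vector standing in for the `highest_education` column. Values that do not match the first condition are passed through unchanged.

```r
library(dplyr)

# Hypothetical education levels for illustration only
toy_educ <- c("Primary", "University", "Doctorate", "Secondary")

toy_recode <- case_when(
  toy_educ %in% c("University", "Doctorate") ~ "Post-secondary",
  TRUE ~ toy_educ   # keep the original value for everything else
)
toy_recode
# "Primary" "Post-secondary" "Post-secondary" "Secondary"
```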
------------------------------------------------------------------------
Above we have been putting the recoded values in a separate column, `highest_educ_recode`, but for this kind of replacement, it is more common to simply overwrite the existing column:
```{r}
yaounde_educ %>%
mutate(
highest_education =
case_when(
highest_education %in% c("University", "Doctorate") ~ "Post-secondary",
TRUE ~ highest_education
)
)
```
We can read this last `case_when()` statement as: 'If highest education is "University" or "Doctorate", *change the value to* "Post-secondary". For everyone else, *leave in* the value from `highest_education`'.
::: {.callout-tip title='Practice'}
Using the `flu_linelist` data, modify the existing column `outcome` by replacing the value "Recover" with "Recovery".
```{r eval = FALSE}
## Complete the code with your answer:
Q_recode_recovery <-
flu_linelist
```
```{r include = FALSE}
## Check your answer
.CHECK_Q_recode_recovery()
.HINT_Q_recode_recovery()
```
(We know it's a lot of code for such a simple change. Later you will see easier ways to do this.)
:::
::: {.callout-note title='Pro Tip'}
**Avoiding long code lines** As you start to write increasingly complex `case_when()` statements, it will become helpful to use line breaks to avoid long lines of code.
To assist with creating line breaks, you can use the {styler} package. Install it with `pacman::p_load(styler)`. Then to reformat any piece of code, highlight the code, click the "Addins" button in RStudio, then click on "Style selection":
![](images/styler_style_selection.png){width="317"}
Alternatively, you could highlight the code and use the shortcut `Shift` + `Command/Control` + `A` to use RStudio's built-in code reformatter.
Sometimes {styler} does a better job at reformatting. Sometimes the built-in reformatter does a better job.
:::
## Multiple conditions on a single variable
LHS conditions in `case_when()` formulas can have multiple parts. Let's see an example of this.
But first, we will draw on what we learned in the `mutate()` lesson and recreate the BMI variable. This involves first converting the `height_cm` variable to meters, then calculating BMI.
```{r}
yaounde_BMI <-
yaounde %>%
mutate(height_m = height_cm/100,
BMI = (weight_kg / (height_m)^2)) %>%
select(BMI)
yaounde_BMI
```
Recall the following BMI categories:
- A BMI below 18.5 is considered underweight.
- A normal BMI is greater than or equal to 18.5 and less than 25.
- An overweight BMI is greater than or equal to 25 and less than 30.
- An obese BMI is greater than or equal to 30.
The condition `BMI >= 18.5 & BMI < 25` to define `Normal weight` is a compound condition because it has *two* comparators: `>=` and `<`.
```{r}
yaounde_BMI <-
yaounde_BMI %>%
mutate(BMI_classification = case_when(
BMI < 18.5 ~'Underweight',
BMI >= 18.5 & BMI < 25 ~ 'Normal weight',
BMI >= 25 & BMI < 30 ~ 'Overweight',
BMI >= 30 ~ 'Obese'))
yaounde_BMI
```
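With compound conditions it is worth checking the boundary values explicitly. The sketch below uses a hypothetical vector, `toy_bmi`, whose values sit on or next to each cutoff, to confirm that each boundary lands in the intended category.

```r
library(dplyr)

# Hypothetical BMI values chosen to sit on or next to each cutoff
toy_bmi <- c(18.4, 18.5, 24.9, 25, 29.9, 30)

toy_class <- case_when(
  toy_bmi < 18.5 ~ "Underweight",
  toy_bmi >= 18.5 & toy_bmi < 25 ~ "Normal weight",
  toy_bmi >= 25 & toy_bmi < 30 ~ "Overweight",
  toy_bmi >= 30 ~ "Obese"
)
toy_class
# "Underweight" "Normal weight" "Normal weight" "Overweight" "Overweight" "Obese"
```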
Let's use `tabyl()` to have a look at our data:
```{r eval = FALSE}
yaounde_BMI %>%
tabyl(BMI_classification)
```
Notice that `tabyl()` lists the BMI categories in alphabetical order, from "Normal weight" to "Underweight", instead of from lightest (Underweight) to heaviest (Obese). Remember that if you want a specific order, you can convert `BMI_classification` to a factor using `mutate()` and define its levels.
```{r eval = FALSE}
yaounde_BMI %>%
mutate(BMI_classification = factor(
BMI_classification,
levels = c("Underweight",
"Normal weight",
"Overweight",
"Obese")
)) %>%
tabyl(BMI_classification)
```
::: {.callout-caution title='Watch Out'}
With compound conditions, you should remember to input the variable name *every time* there is a comparator. R learners often forget this and will try to run code that looks like this:
```{r eval = FALSE}
yaounde_BMI %>%
mutate(BMI_classification = case_when(BMI < 18.5 ~'Underweight',
BMI >= 18.5 & < 25 ~ 'Normal weight',
BMI >= 25 & < 30 ~ 'Overweight',
BMI >= 30 ~ 'Obese'))
```
The definitions for the "Normal weight" and "Overweight" categories are mistaken. Do you see the problem? Try to run the code to spot the error.
:::
::: {.callout-tip title='Practice'}
With the `flu_linelist` data, make a new column, called `adolescent`, that has the value "Yes" for people in the 10-19 (at least 10 and less than 20) age group, and "No" for everyone else.
```{r eval = FALSE}
## Complete the code with your answer:
Q_adolescent_grouping <-
flu_linelist %>%
```
```{r include = FALSE}
## Check your answer
.CHECK_Q_adolescent_grouping()
.HINT_Q_adolescent_grouping()
```
:::
## Multiple conditions on multiple variables
In all examples seen so far, you have only used conditions involving a single variable at a time. But LHS conditions often refer to multiple variables at once.
Let's see a simple example with age and sex in the `yaounde` data frame. First, we select just these two variables for easy illustration:
```{r}
yaounde_age_sex <-
yaounde %>%
select(age_years, sex)
yaounde_age_sex
```
Now, imagine we want to recruit women and men in the 20-29 age group into two studies. For this we'd like to create a column, called `recruit`, with the following schema:
- Women aged 20-29 should have the value "Recruit to female study"
- Men aged 20-29 should have the value "Recruit to male study"
- Everyone else should have the value "Do not recruit"
To do this, we run the following `case_when()` statement:
```{r}
yaounde_age_sex %>%
mutate(recruit = case_when(
sex == "Female" & age_years >= 20 & age_years <= 29 ~ "Recruit to female study",
sex == "Male" & age_years >= 20 & age_years <= 29 ~ "Recruit to male study",
TRUE ~ "Do not recruit"
))
```
You could also add extra pairs of parentheses around the age criteria within each condition:
```{r eval = FALSE}
yaounde_age_sex %>%
mutate(recruit = case_when(
sex == "Female" & (age_years >= 20 & age_years <= 29) ~ "Recruit to female study",
sex == "Male" & (age_years >= 20 & age_years <= 29) ~ "Recruit to male study",
TRUE ~ "Do not recruit"
))
```
These extra parentheses do not change the output, but they improve readability because the reader can see at a glance that your condition is made of two parts, one for sex, `sex == "Female"`, and another for age, `(age_years >= 20 & age_years <= 29)`.
::: {.callout-tip title='Practice'}
With the `flu_linelist` data, make a new column, called `recruit`, with the following schema:
- Individuals aged 30-59 (at least 30, younger than 60) from the Jiangsu province should have the value "Recruit to Jiangsu study"
- Individuals aged 30-59 from the Zhejiang province should have the value "Recruit to Zhejiang study"
- Everyone else should have the value "Do not recruit"
```{r eval = FALSE}
## Complete the code with your answer:
Q_age_province_grouping <-
flu_linelist %>%
mutate(recruit = ______________________________)
```
```{r include = FALSE}
## Check your answer
.CHECK_Q_age_province_grouping()
.HINT_Q_age_province_grouping()
```
:::
## Order of priority of conditions in `case_when()`
Note that the order of conditions is important, because conditions listed at the top of your `case_when()` statement take priority over others.
To understand this, run the example below:
```{r render = head_10_rows}
yaounde_age_sex %>%
mutate(age_group = case_when(age_years < 18 ~ "Child",
age_years < 30 ~ "Young adult",
age_years < 120 ~ "Older adult"))
```
This initially looks like a faulty `case_when()` statement because the age conditions overlap. For example, the statement `age_years < 120 ~ "Older adult"` (which reads "if age is below 120, input 'Older adult'") suggests that *anyone* between ages 0 and 120 (even a 1-year-old baby!) would be coded as "Older adult".
But as you saw, the code actually works fine! People under 18 are still coded as "Child".
What's going on? Essentially, the `case_when()` statement is interpreted as a series of branching logical steps, starting with the first condition. So this particular statement can be read as: "If age is below 18, input 'Child', *and otherwise*, if age is below 30, input 'Young adult', *and otherwise*, if age is below 120, input 'Older adult'".
This is illustrated in the schematic below:
![](images/case_when_eval_order_4.png){width="530"}
This means that if you swap the order of the conditions, you will end up with a faulty `case_when()` statement:
```{r render = head_10_rows}
yaounde_age %>%
mutate(age_group = case_when(age_years < 120 ~ "Older adult",
age_years < 30 ~ "Young adult",
age_years < 18 ~ "Child"))
```
As you can see, everyone with a recorded age is coded as "Older adult" (rows with a missing age remain `NA`). This happens because the first condition matches everyone whose age is known, so there is no one left to match with the subsequent conditions. The statement can be read "If age is below 120, input 'Older adult', *and otherwise* if age is below 30...." But there is no "otherwise" because everyone has already been matched!
This is illustrated in the diagram below:
![](images/case_when_faulty_order.png){width="532"}
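The "first match wins" behaviour can be reproduced on a tiny example. The sketch below uses a hypothetical vector, `toy_ages`, and evaluates the same three open-ended conditions in both orders.

```r
library(dplyr)

# Hypothetical ages: one child, one young adult, one older adult
toy_ages <- c(5, 25, 70)

# Narrowest condition first: each age falls into the first bracket it fits
correct_order <- case_when(toy_ages < 18 ~ "Child",
                           toy_ages < 30 ~ "Young adult",
                           toy_ages < 120 ~ "Older adult")

# Broadest condition first: it matches everyone, so later conditions never apply
swapped_order <- case_when(toy_ages < 120 ~ "Older adult",
                           toy_ages < 30 ~ "Young adult",
                           toy_ages < 18 ~ "Child")

correct_order   # "Child" "Young adult" "Older adult"
swapped_order   # "Older adult" "Older adult" "Older adult"
```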
Although we have spent much time explaining the importance of the order of conditions, in this specific example, there would be a much clearer way to write this code that would not depend on the order of conditions. Rather than leave the age groups open-ended like this:
> `age_years < 120 ~ "Older adult"`
you should actually use *closed* age bounds like this:
> `age_years >= 30 & age_years < 120 ~ "Older adult"`
which is read: "if age is greater than or equal to 30 and less than 120, input 'Older adult'".
With such closed conditions, the order of conditions no longer matters. You get the same result no matter how you arrange the conditions:
```{r render = head_10_rows}
## start with "Older adult" condition
yaounde_age %>%
mutate(age_group = case_when(
age_years >= 30 & age_years < 120 ~ "Older adult",
age_years >= 18 & age_years < 30 ~ "Young adult",
age_years >= 0 & age_years < 18 ~ "Child"
))
```
```{r render = head_10_rows}
## start with "Child" condition
yaounde_age %>%
mutate(age_group = case_when(
age_years >= 0 & age_years < 18 ~ "Child",
age_years >= 18 & age_years < 30 ~ "Young adult",
age_years >= 30 & age_years < 120 ~ "Older adult"
))
```
Nice and clean!
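One way to convince yourself that the order no longer matters is to compare the two results directly. This is a minimal sketch on a hypothetical vector, `toy_ages`, using base R's `identical()`:

```r
library(dplyr)

# Hypothetical ages, including boundary values and one missing value
toy_ages <- c(5, 18, 29, 30, 85, NA)

older_first <- case_when(
  toy_ages >= 30 & toy_ages < 120 ~ "Older adult",
  toy_ages >= 18 & toy_ages < 30 ~ "Young adult",
  toy_ages >= 0 & toy_ages < 18 ~ "Child"
)

child_first <- case_when(
  toy_ages >= 0 & toy_ages < 18 ~ "Child",
  toy_ages >= 18 & toy_ages < 30 ~ "Young adult",
  toy_ages >= 30 & toy_ages < 120 ~ "Older adult"
)

same_result <- identical(older_first, child_first)
same_result
# TRUE
```

Because the closed conditions do not overlap, each age can match at most one of them, which is why rearranging them has no effect.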
So why did we spend so much time explaining the importance of condition order if you can simply avoid open-ended categories and not have to worry about condition order?
One reason is that understanding condition order should now help you see why it is important to put the `TRUE` condition as the final line in your `case_when()` statement. The `TRUE` condition matches *every row that has not yet been matched*, so if you use it first in the `case_when()`, it will match *everyone*!
The other reason is that there are certain cases where you *may* want to use open-ended overlapping conditions, and so you will have to pay attention to the order of conditions. Let's see one such example now: identifying COVID-like symptoms. Note that this is somewhat advanced material, likely a bit above your current needs. We are introducing it now so you are aware and can stay vigilant with `case_when()` in the future.
### Overlapping conditions within `case_when()`
We want to identify COVID-like symptoms in our data. Consider the symptom columns in the `yaounde` data frame, which indicate which symptoms respondents experienced over a 6-month period:
```{r}
yaounde %>%
select(starts_with("symp_"))
```
We would like to use this to assess whether a person may have had COVID, partly following guidelines recommended by the [WHO](https://apps.who.int/iris/handle/10665/333752).
- Individuals with cough are to be classed as "possible COVID cases"
- Individuals with anosmia/ageusia (loss of smell or loss of taste) are to be classed as "probable COVID cases".
Now, keeping these criteria in mind, consider an individual, let's call her Osma, who has cough AND anosmia/ageusia. How should we classify Osma?
She meets the criteria for "possible COVID" (because she has cough), but she *also* meets the criteria for "probable COVID" (because she has anosmia/ageusia). So which group should she be classed as, "possible COVID" or "probable COVID"? Think about it for a minute.
Hopefully you guessed that she should be classed as a "probable COVID case". "Probable" is more likely than "Possible"; and the anosmia/ageusia symptom is more *significant* than the cough symptom. One might say that the criterion for "probable COVID" has a higher specificity or a higher *precedence* than the criterion for "possible COVID".
Therefore, when constructing a `case_when()` statement, the "probable COVID" condition should also take higher precedence---it should come *first* in the conditions provided to `case_when()`. Let's see this now.
First we select the relevant variables, for easy illustration. We also identify and `slice()` specific rows that are useful for the demonstration:
```{r}
yaounde_symptoms_slice <-
yaounde %>%
select(symp_cough, symp_anosmia_or_ageusia) %>%
# slice of specific rows useful for demo
# Once you find the right code, you would remove this slice
slice(32, 711, 625, 651 )
yaounde_symptoms_slice
```
Now, the correct `case_when()` statement, which has the "Probable COVID" condition first:
```{r}
yaounde_symptoms_slice %>%
mutate(covid_status = case_when(
symp_anosmia_or_ageusia == "Yes" ~ "Probable COVID",
symp_cough == "Yes" ~ "Possible COVID"
))
```
This `case_when()` statement can be read in simple terms as 'If the person has anosmia/ageusia, input "Probable COVID", and otherwise, if the person has cough, input "Possible COVID"'.
Now, spend some time looking through the output data frame, especially the last three individuals. The individual in row 2 meets the criterion for "Possible COVID" because they have cough (`symp_cough == "Yes"`), and the individual in row 3 meets the criterion for "Probable COVID" because they have anosmia/ageusia (`symp_anosmia_or_ageusia == "Yes"`).
The individual in row 4 is Osma, who meets the criteria for both "Possible COVID" *and* "Probable COVID". And because we arranged our `case_when()` conditions in the right order, she is coded correctly as "Probable COVID". Great!
But notice what happens if we swap the order of the conditions:
```{r}
yaounde_symptoms_slice %>%
mutate(covid_status = case_when(
symp_cough == "Yes" ~ "Possible COVID",
symp_anosmia_or_ageusia == "Yes" ~ "Probable COVID"
))
```
Oh no! Osma in row 4 is now misclassified as "Possible COVID" even though she has the more significant anosmia/ageusia symptom. This is because the first condition, `symp_cough == "Yes"`, matched her first, so the second condition never got the chance to match her!
So now you see why you sometimes need to think deeply about the order of your `case_when()` conditions. It is a minor point, but it can bite you at unexpected times. Even experienced analysts tend to make mistakes that can be traced to improper arrangement of `case_when()` statements.
::: {.callout-note title='Challenge'}
In reality, there *is* still another solution to avoid misclassifying the person with cough and anosmia/ageusia. That is to add `symp_anosmia_or_ageusia != "Yes"` (not equal to "Yes") to the conditions for "Possible COVID". Can you think of why this works?
```{r}
yaounde_symptoms_slice %>%
mutate(covid_status = case_when(
symp_cough == "Yes" & symp_anosmia_or_ageusia != "Yes" ~ "Possible COVID",
symp_anosmia_or_ageusia == "Yes" ~ "Probable COVID"))
```
:::
::: {.callout-tip title='Practice'}
With the `flu_linelist` dataset, create a new column called `follow_up_priority` that implements the following schema:
- Women should be considered "High priority"
- All children (under 18 years) of any gender should be considered "Highest priority".
- Everyone else should have the value "No priority"
```{r eval = FALSE}
## Complete the code with your answer:
Q_priority_groups <-
flu_linelist %>%
mutate(follow_up_priority = ________________
)
```
```{r include = FALSE}
## Check your answer
.CHECK_Q_priority_groups()
.HINT_Q_priority_groups()
```
:::
## Binary conditions: `dplyr::if_else()`
![Fig: the if_else() conditions.](images/custom_dplyr_if_else.png){width="400"}
There is another {dplyr} verb similar to `case_when()` for when we want to apply a binary condition to a variable: `if_else()`. A binary condition is either `TRUE` or `FALSE`.
`if_else()` works much like `case_when()`: if the condition is true, one value is returned; if the condition is false, the alternative is returned. The syntax is: `if_else(CONDITION, IF_TRUE, IF_FALSE)`. As you can see, this only allows for a binary condition (not multiple cases, such as those handled by `case_when()`).
If we take one of the first examples about recoding the `highest_education` variable, we can write it either with `case_when()` or with `if_else()`.
Here is the version we already explored:
```{r}
yaounde_educ %>%
mutate(
highest_education =
case_when(
highest_education %in% c("University", "Doctorate") ~ "Post-secondary",
TRUE ~ highest_education
)
)
```
And this is how we would write it using `if_else()`:
```{r}
yaounde_educ %>%
mutate(highest_education =
if_else(
highest_education %in% c("University", "Doctorate"),
# if TRUE then we recode
"Post-secondary",
# if FALSE then we keep default value
highest_education
))
```
As you can see, we get the same output, whether we use `if_else()` or `case_when()`.
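The two functions can differ when the condition itself evaluates to `NA`, which cannot happen with `%in%` but can with comparators such as `<`. The sketch below, on a hypothetical vector `toy_ages`, illustrates this and shows the `missing` argument of `if_else()`, which sets the value used for such rows.

```r
library(dplyr)

# Hypothetical ages: one child, one adult, one missing
toy_ages <- c(10, 25, NA)

# if_else() returns NA when the condition is NA
plain_if_else <- if_else(toy_ages < 18, "Child", "Adult")

# The `missing` argument supplies a value for those rows instead
labelled_if_else <- if_else(toy_ages < 18, "Child", "Adult",
                            missing = "Missing age")

# case_when() with a TRUE default assigns the default to the NA row
with_case_when <- case_when(toy_ages < 18 ~ "Child",
                            TRUE ~ "Adult")

plain_if_else      # "Child" "Adult" NA
labelled_if_else   # "Child" "Adult" "Missing age"
with_case_when     # "Child" "Adult" "Adult"
```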
::: {.callout-tip title='Practice'}
With the `flu_linelist` data, make a new column, called `age_group`, that has the value "Below 50" for people under 50 and "50 and above" for people aged 50 and up. Use the `if_else()` function.
This is exactly the same question as your first practice question, but this time you need to use `if_else()`.
```{r eval = FALSE}
## Complete the code with your answer:
Q_age_group_if_else <-
flu_linelist %>%
mutate(age_group = if_else(______________________________))
```
```{r include = FALSE}
## Check your answer
.CHECK_Q_age_group_if_else()
.HINT_Q_age_group_if_else()
```
:::
## Wrap up
Changing or constructing variables based on conditions on other variables is one of the most common data wrangling tasks, to the point that it deserved its very own lesson!
We hope that you now feel comfortable using `case_when()` and `if_else()` within `mutate()` and that you are excited to learn more complex {dplyr} operations such as grouping variables and summarizing them.
See you next time!
![Fig: the if_else() and case_when() conditions.](images/custom_dplyr_conditional.png){width="400"}
`r tgc_contributors_list(ids = c("lolovanco", "kendavidn"))`
## References {.unlisted .unnumbered}
Some material in this lesson was adapted from the following sources:
- Horst, A. (2022). *Dplyr-learnr*. <https://github.com/allisonhorst/dplyr-learnr> (Original work published 2020)
- *Create, modify, and delete columns --- Mutate*. (n.d.). Retrieved 21 February 2022, from <https://dplyr.tidyverse.org/reference/mutate.html>
Artwork was adapted from:
- Horst, A. (2022). *R & stats illustrations by Allison Horst*. <https://github.com/allisonhorst/stats-illustrations> (Original work published 2018)
## Solutions
```{r}
.SOLUTION_Q_age_group()
.SOLUTION_Q_age_group_percentage()
.SOLUTION_Q_age_group_nas()
.SOLUTION_Q_gender_recode()
.SOLUTION_Q_recode_recovery()
.SOLUTION_Q_adolescent_grouping()
.SOLUTION_Q_age_province_grouping()
.SOLUTION_Q_priority_groups()
.SOLUTION_Q_age_group_if_else()
```