-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathuntangled_ls02_filter.qmd
385 lines (273 loc) · 12.9 KB
/
untangled_ls02_filter.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
---
title: 'Filtering rows'
output:
html_document:
number_sections: true
toc: true
toc_float: true
css: !expr here::here("global/style/style.css")
highlight: kate
editor_options:
chunk_output_type: console
---
```{r, echo = F, message = F, warning = F}
## Load packages
if(!require(pacman)) install.packages("pacman")
pacman::p_load(rlang, tidyverse, knitr, here)
### functions
source(here::here("global/functions/misc_functions.R"))
### default render
registerS3method("reactable_5_rows", "data.frame", reactable_5_rows)
knitr::opts_chunk$set(class.source = "tgc-code-block", render = head_5_rows)
### autograders
suppressMessages(source(here::here("lessons/ls02_filter_autograder.R")))
```
## Intro
Onward with the {dplyr} package, discovering the `filter` verb. Last time we saw how to `select` variables (columns) and today we will see how to keep or drop data entries, rows, using `filter`. Dropping abnormal data entries or keeping subsets of your data points is another essential aspect of data wrangling.
Let's go !
![Fig: the filter() verb.](images/custom_dplyr_filter.png){width="400"}
## Learning objectives
1. You can use `dplyr::filter()` to keep or drop rows from a dataframe.
2. You can filter rows by specifying conditions on numbers or strings using relational operators like greater than (`>`), less than (`<`), equal to (`==`), and not equal to (`!=`).
3. You can filter rows by combining conditions using logical operators like the ampersand (`&`) and the vertical bar (`|`).
4. You can filter rows by negating conditions using the exclamation mark (`!`) logical operator.
5. You can filter rows with missing values using the `is.na()` function.
## The Yaounde COVID-19 dataset
In this lesson, we will again use the data from the COVID-19 serological survey conducted in Yaounde, Cameroon.
```{r, message = F}
yaounde <- read_csv(here::here('data/yaounde_data.csv'))
### a smaller subset of variables
yao <- yaounde %>%
select(age, sex, weight_kg, highest_education, neighborhood,
occupation, is_smoker, is_pregnant,
igg_result, igm_result)
yao
```
## Introducing `filter()`
We use `filter()` to keep rows that satisfy a set of conditions. Let's take a look at a simple example. If we want to keep just the male records, we run:
```{r}
yao %>% filter(sex == "Male")
```
Note the use of the double equal sign `==` rather than the single equal sign `=`. The `==` sign tests for equality, as demonstrated below:
```{r, render = normal_print}
### create the object `sex_vector` with three elements
sex_vector <- c("Male", "Female", "Female")
### test which elements are equal to "Male"
sex_vector == "Male"
```
So the code `yao %>% filter(sex == "Male")` will keep all rows where the equality test `sex == "Male"` evaluates to `TRUE`.
------------------------------------------------------------------------
It is often useful to chain `filter()` with `nrow()` to get the number of rows fulfilling a condition.
```{r, render = normal_print}
### how many respondents were male?
yao %>%
filter(sex == "Male") %>%
nrow()
```
::: {.callout-note title='Key Point'}
The double equal sign, `==`, tests for equality, while the single equals sign, `=`, is used for specifying values to arguments inside functions.
:::
::: {.callout-tip title='Practice'}
Filter the `yao` data frame to respondents who were pregnant during the survey.
```{r, eval = F, echo = FALSE}
## Since this is the first question, we'll give you the answer directly.
## Your answer should look like this:
Q_is_pregnant<- yao %>% filter(is_pregnant == "Yes")
## Run that line, then run the CHECK and HINT functions below
```
```{r, eval = F, echo = FALSE}
.CHECK_Q_is_pregnant()
.HINT_Q_is_pregnant()
```
```{r, eval = F, echo = FALSE}
## Now, to obtain the solution, run the line below!
.SOLUTION_Q_is_pregnant()
## Each question has a solution function similar to this.
## (Where HINT is replaced with SOLUTION in the function name.)
## But you will need to type out the function name on your own.
## (This is to discourage you from looking at the solution before answering the question.)
```
How many respondents were female? (Use `filter()` and `nrow()`)
```{r, eval = F, echo = FALSE}
Q_female_nrow <- yao ___________________-
```
```{r, eval = F, echo = FALSE}
.CHECK_Q_female_nrow()
.HINT_Q_female_nrow()
```
:::
## Relational operators
The `==` operator introduced above is an example of a "relational" operator, as it tests the relation between two values. Here is a list of some of these operators:
| | |
|:-------------|:------------------------------------|
| **Operator** | **is TRUE if** |
| A \< B | A is **less than** B |
| A \<= B | A is **less than or equal** to B |
| A \> B | A is **greater than** B |
| A \>= B | A is **greater than or equal to** B |
| A == B | A is **equal** to B |
| A != B | A is **not equal** to B |
| A %in% B | A **is an element of** B |
![Fig: AND and OR operators visualized.](images/venn_diagram_and_or.png){width="450"}
Let's see how to use these within `filter()`:
```{r}
yao %>% filter(sex != "Male") ## keep rows where `sex` is not "Male"
yao %>% filter(age < 6) ## keep respondents under 6
yao %>% filter(age >= 70) ## keep respondents aged at least 70
### keep respondents whose highest education is "Primary" or "Secondary"
yao %>% filter(highest_education %in% c("Primary", "Secondary"))
```
::: {.callout-tip title='Practice'}
From `yao`, keep only respondents who were children (under 18).
```{r, eval = F, echo = FALSE}
Q_under_18 <- "YOUR_ANSWER_HERE"
.CHECK_Q_under_18()
.HINT_Q_under_18()
```
With `%in%`, keep only respondents who live in the "Tsinga" or "Messa" neighborhoods.
```{r, eval = F, echo = FALSE}
Q_tsinga_messa <- "YOUR_ANSWER_HERE"
.CHECK_Q_tsinga_messa()
.HINT_Q_tsinga_messa()
```
:::
## Combining conditions with `&` and `|`
We can pass multiple conditions to a single `filter()` statement separated by commas:
```{r}
### keep respondents who are pregnant and are ex-smokers
yao %>% filter(is_pregnant == "Yes", is_smoker == "Ex-smoker") ## only one row
```
When multiple conditions are separated by a comma, they are implicitly combined with an **and** (`&`).
It is best to replace the comma with `&` to make this more explicit.
```{r}
### same result as before, but `&` is more explicit
yao %>% filter(is_pregnant == "Yes" & is_smoker == "Ex-smoker")
```
::: {.callout-note title='Side Note'}
Don't confuse:
- the "," in listing several conditions in filter `filter(A,B)` i.e. filter based on condition A and (`&`) condition B
- the "," in lists `c(A,B)` which is listing different components of the list (and has nothing to do with the `&` operator)
:::
If we want to combine conditions with an **or**, we use the vertical bar symbol, `|`.
```{r}
### respondents who are pregnant OR who are ex-smokers
yao %>% filter(is_pregnant == "Yes" | is_smoker == "Ex-smoker")
```
::: {.callout-tip title='Practice'}
Filter `yao` to only keep men who tested IgG positive.
```{r, eval = F, echo = FALSE}
Q_male_positive <- "YOUR_ANSWER_HERE"
.CHECK_Q_male_positive()
.HINT_Q_male_positive()
```
Filter `yao` to keep both children (under 18) and anyone whose highest education is primary school.
```{r, eval = F, echo = FALSE}
Q_child_primary <- "YOUR_ANSWER_HERE"
.CHECK_Q_child_primary()
.HINT_Q_child_primary()
```
:::
## Negating conditions with `!`
To negate conditions, we wrap them in `!()`.
Below, we drop respondents who are children (less than 18 years) or who weigh less than 30kg:
```{r}
### drop respondents < 18 years OR < 30 kg
yao %>% filter(!(age < 18 | weight_kg < 30))
```
The `!` operator is also used to negate `%in%` since R does not have an operator for **NOT in**.
```{r}
### drop respondents whose highest education is NOT "Primary" or "Secondary"
yao %>% filter(!(highest_education %in% c("Primary", "Secondary")))
```
::: {.callout-note title='Key Point'}
It is easier to read `filter()` statements as **keep** statements, to avoid confusion over whether we are filtering **in** or filtering **out**!
So the code below would read: "**keep** respondents who are under 18 or who weigh less than 30kg".
```{r, eval = F}
yao %>% filter(age < 18 | weight_kg < 30)
```
And when we wrap conditions in `!()`, we can then read `filter()` statements as **drop** statements.
So the code below would read: "**drop** respondents who are under 18 or who weigh less than 30kg".
```{r, eval = F}
yao %>% filter(!(age < 18 | weight_kg < 30))
```
:::
::: {.callout-tip title='Practice'}
From `yao`, drop respondents who live in the Tsinga or Messa neighborhoods.
```{r, eval = F, echo = FALSE}
Q_not_tsinga_messa <- "YOUR_ANSWER_HERE"
.CHECK_Q_not_tsinga_messa()
.HINT_Q_not_tsinga_messa()
```
:::
## `NA` values
The relational operators introduced so far do not work with `NA`.
Let's make a data subset to illustrate this.
```{r}
yao_mini <- yao %>%
select(sex, is_pregnant) %>%
slice(1,11,50,2) ## custom row order
yao_mini
```
In `yao_mini`, the last respondent has an `NA` for the `is_pregnant` column, because he is male.
Trying to select this row using `== NA` will not work.
```{r}
yao_mini %>% filter(is_pregnant == NA) ## does not work
yao_mini %>% filter(is_pregnant == "NA") ## does not work
```
This is because `NA` is a non-existent value. So R cannot evaluate whether it is "equal to" or "not equal to" anything.
The special function `is.na()` is therefore necessary:
```{r}
### keep rows where `is_pregnant` is NA
yao_mini %>% filter(is.na(is_pregnant))
```
This function can be negated with `!`:
```{r}
### drop rows where `is_pregnant` is NA
yao_mini %>% filter(!is.na(is_pregnant))
```
::: {.callout-note title='Side Note'}
For tibbles, RStudio will highlight `NA` values bright red to distinguish them from other values:
![A common error with `NA`](images/tibble_with_na.png){width="300"}
:::
::: {.callout-note title='Side Note'}
`NA` values can be identified but any other encoding such as `"NA"` or `"NaN"`, which are encoded as strings, will be imperceptible to the functions (they are strings, like any others).
:::
::: {.callout-tip title='Practice'}
From the `yao` dataset, keep all the respondents who had missing records for the report of their smoking status.
```{r, eval = F, echo = FALSE}
Q_na_smoker <- "YOUR_ANSWER_HERE"
.CHECK_Q_na_smoker()
.HINT_Q_na_smoker()
```
:::
::: {.callout-tip title='Practice'}
For some respondents the respiration rate, in breaths per minute, was recorded in the `respiration_frequency` column.
From `yaounde`, drop those with a respiration frequency under 20. Think about NAs while doing this! You should avoid also dropping the NA values.
```{r, eval = F, echo = FALSE}
Q_drop_resp_under_20 <- "YOUR_ANSWER_HERE"
.CHECK_Q_drop_resp_under_20()
.HINT_Q_drop_resp_under_20()
```
:::
## Wrap up
Now you know the two essential verbs to `select()` columns and to `filter()` rows. This way you keep the variables you are interested in by selecting your columns and you keep the data entries you judge relevant by filtering your rows.
But what about modifying, transforming your data? We will learn about this in the next lesson. See you there!
![Fig: Basic Data Wrangling: `select()` and `filter()`.](images/custom_dplyr_basic_2.png){width="400"}
`r tgc_contributors_list(ids = c("lolovanco", "avallecam", "kendavidn"))`
## References {.unlisted .unnumbered}
Some material in this lesson was adapted from the following sources:
- Horst, A. (2021). *Dplyr-learnr*. <https://github.com/allisonhorst/dplyr-learnr> (Original work published 2020)
- *Subset rows using column values---Filter*. (n.d.). Retrieved 12 January 2022, from <https://dplyr.tidyverse.org/reference/filter.html>
Artwork was adapted from:
- Horst, A. (2021). *R & stats illustrations by Allison Horst*. <https://github.com/allisonhorst/stats-illustrations> (Original work published 2018)
## Solutions
```{r}
.SOLUTION_Q_is_pregnant()
.SOLUTION_Q_female_nrow()
.SOLUTION_Q_under_18()
.SOLUTION_Q_tsinga_messa()
.SOLUTION_Q_male_positive()
.SOLUTION_Q_child_primary()
.SOLUTION_Q_not_tsinga_messa()
.SOLUTION_Q_na_smoker()
```