-
Notifications
You must be signed in to change notification settings - Fork 0
/
wrangle_subset.qmd
249 lines (171 loc) · 9.41 KB
/
wrangle_subset.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
---
title: "Subsetting Data"
---
## Library Loading
Begin by loading any needed libraries.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
Load the `tidyverse` meta-package.
```{r r-libs, warning = F, message = F}
library(tidyverse)
```
## [{{< fa brands python >}} Python]{.py}
Load the `pandas` and `os` libraries.
```{python py-libs}
# Load needed libraries
import os
import pandas as pd
```
:::
## Conditionals
Often when we work with data we want to retrieve only the rows that meet some condition(s). These conditions can be defined in code by using "relational operators". Fortunately, the fundamental relational operators are the same between [{{< fa brands python >}} Python]{.py} and [{{< fa brands r-project >}} R]{.r}!
Conditional statements always return "boolean" [BOO-lee-un] values. These are [type]{.py} boolean and [class]{.r} logical in [{{< fa brands python >}} Python]{.py} and [{{< fa brands r-project >}} R]{.r} respectively.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
In [{{< fa brands r-project >}} R]{.r} `TRUE` and `FALSE` are booleans. Note that their abbreviations `T` and `F` are also accepted though they _must_ be capitalized.
```{r r-bool1}
# Assess whether a number equals itself and assign to an object
bool_r <- 20 == 20
# See what that object contains
bool_r
```
```{r r-bool2}
# And check class
class(bool_r)
```
## [{{< fa brands python >}} Python]{.py}
In [{{< fa brands python >}} Python]{.py} `True` or `False` are booleans. Note that they must be capitalized in this way to register as the correct type (i.e., only first letter capitalized).
```{python py-bool1}
# Assess whether a number equals itself and assign to an object
bool_py = 20 == 20
# See what that object contains
bool_py
```
```{python py-bool2}
# Also check type
type(bool_py)
```
:::
In addition to asking whether something 'is exactly equal to' something else (`==`), we can also ask whether two values are _not_ equal (`!=`) or pose greater/less than conditionals (`>`/`<`).
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
```{r r-bool3}
# Ask whether a number is less than or equal to a particular value
7 <= 5
```
## [{{< fa brands python >}} Python]{.py}
```{python py-bool3}
# Ask whether a number is less than or equal to a particular value
7 <= 5
```
:::
## Subsetting with One Condition
We can leverage these conditional values to return only rows of a data table that meet criteria that are valuable to us. Let's begin by loading the external dataset derived from the `lterdatasampler` [{{< fa brands r-project >}} R]{.r} package (see [here](https://lter.github.io/lterdatasampler/)) that we used in the previous section.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
Read in vertebrate data CSV and check out the first few rows.
```{r r-ext-data}
# Load data
vert_r <- utils::read.csv(file = file.path("data", "verts.csv"))
# Check out first few rows
head(vert_r, n = 3)
```
Recall from our file path module that the `file.path` function accounts for computer operating system differences.
## [{{< fa brands python >}} Python]{.py}
Read in vertebrate data CSV (remember we _must_ namespace the `read_csv` function) and check out the first few rows.
```{python py-ext-data}
# Load data
vert_py = pd.read_csv(os.path.join("data", "verts.csv"))
# Check out first few rows
vert_py.head(3)
```
Recall from our file path module that the `join` function accounts for computer operating system differences.
:::
Now that we have some data we can use conditional statements to actually subset it! We'll use the `len` and `nrow` functions in [{{< fa brands python >}} Python]{.py} and [{{< fa brands r-project >}} R]{.r} respectively to demonstrate that we successfully subset. Either will show whether we've successfully removed the rows that don't meet our criteria.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
Let's subset to only one particular site of vertebrate data. Note that base [{{< fa brands r-project >}} R]{.r} does include a `subset` function but we'll use the equivalent `filter` function from the `dplyr` package.
```{r r-subset1}
# Subset to only site "MACKOG-U"
r_sub1 <- dplyr::filter(vert_r, sitecode == "MACKOG-U")
# Check whether that worked
message("Before subsetting the data had ", nrow(vert_r), " rows.")
message("Subsetting changed this to ", nrow(r_sub1))
```
## [{{< fa brands python >}} Python]{.py}
Let's subset to only one particular site of vertebrate data.
```{python py-subset1}
# Subset to only site "MACKOG-U"
py_sub1 = vert_py[vert_py.sitecode == "MACKOG-U"]
# Check whether that worked
print("Before subsetting the data had", len(vert_py), "rows.")
print("Subsetting changed this to", len(py_sub1))
```
:::
## Subsetting with Multiple Conditions
If desired we can also specify multiple criteria to subset by. We must decide whether _all_ criteria must be met or if _any_ criterion being met is sufficient to retain the row. If we want **all** (in either language) we need to separate each criterion with an ampersand (`&`). If instead we want **any** (in either language) we need to separate each criterion with a pipe (`|`); note that this is _not_ the same as a "pipe operator" which we'll discuss in detail later.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
If we want a only values between a minimum and maximum value that means we need all criteria to be met so we need to use a `&` between conditions.
```{r r-subset2}
# Subset to only data from after 1990 before 1995 (inclusive)
r_sub2 <- dplyr::filter(vert_r, year >= 1990 & year <= 1995)
# Check whether that worked
message("Before subsetting the data had ", nrow(vert_r), " rows.")
message("Subsetting changed this to ", nrow(r_sub2))
```
## [{{< fa brands python >}} Python]{.py}
If we want a only values between a minimum and maximum value that means we need all criteria to be met so we need to use a `&` between conditions. Note that when we specify multiple criteria in [{{< fa brands python >}} Python]{.py} we _must_ wrap each conditional in parentheses.
```{python py-subset2}
# Subset to only data from after 1990 before 1995 (inclusive)
py_sub2 = vert_py[(vert_py.year >= 1990) & (vert_py.year <= 1995)]
# Check whether that worked
print("Before subsetting the data had", len(vert_py), "rows.")
print("Subsetting changed this to", len(py_sub2))
```
:::
## Subsetting with Masks
We can define "boolean masks" when we want to leverage a helper function that tests for specific crtieria. For instance we can use the `isin` [method]{.py} or the `%in%` [operator]{.r} (in [{{< fa brands python >}} Python]{.py} and [{{< fa brands r-project >}} R]{.r} respectively) to ask whether the value in a given row matches any of a set of options. This is much simpler than writing out one "or" criterion for every option individually.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
Let's subset to only salamander species in the dataset using a mask. In [{{< fa brands r-project >}} R]{.r}, that means providing the column name to the left of the `%in%` operator and giving a vector of options to the right.
```{r r-mask1}
# Make the subset
r_mask1 <- dplyr::filter(vert_r, species %in% c("Coastal giant salamander", "Cascade torrent salamander"))
# Check whether that worked
message("Before subsetting the data had ", nrow(vert_r), " rows.")
message("Subsetting changed this to ", nrow(r_mask1))
```
## [{{< fa brands python >}} Python]{.py}
Let's subset to only salamander species in the dataset using a mask. In [{{< fa brands python >}} Python]{.py}, `isin` is a method so it is appended to the right of the column it accesses (separated by a period) and the options must be provided as a list (i.e., wrapped in square brackets and separated by commas).
```{python py-mask1}
# Make the subset
py_mask1 = vert_py[vert_py.species.isin(["Coastal giant salamander", "Cascade torrent salamander"])]
# Check whether that worked
print("Before subsetting the data had", len(vert_py), "rows.")
print("Subsetting changed this to", len(py_mask1))
```
:::
## Negating Criteria
Sometimes it is helpful to _negate_ criteria and subset to only conditions that _don't_ meet our criteria. This works with either simpler conditional statements or with masks! The symbol we use to do this negation differs between the two languages but in either it is placed to the left of the criterion it is negating.
In the below example we'll use the `isnull` [function]{.py} or the `is.na` [function]{.r} to see whether a column contains missing values and invert it to return only rows where the specified column _is not_ missing values.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
In [{{< fa brands r-project >}} R]{.r} we negate conditions by adding an exclamation point (`!`) to the left of the relevant criterion.
```{r r-mask2}
# Subset to only species that *have* a recorded value in the "length_2_mm" column
r_mask2 <- dplyr::filter(vert_r, !is.na(length_2_mm))
# Check whether that worked
message("Before subsetting the data had ", nrow(vert_r), " rows.")
message("Subsetting changed this to ", nrow(r_mask2))
```
## [{{< fa brands python >}} Python]{.py}
In [{{< fa brands python >}} Python]{.py} we negate conditions by adding a tilde (`~`) to the left of the relevant criterion.
```{python py-mask2}
# Subset to only species that *have* a recorded value in the "length_2_mm" column
py_mask2 = vert_py[~pd.isnull(vert_py.length_2_mm)]
# Check whether that worked
print("Before subsetting the data had", len(vert_py), "rows.")
print("Subsetting changed this to", len(py_mask2))
```
:::