wrangle_subset.qmd

---
title: "Subsetting Data"
---

## Library Loading

Begin by loading any needed libraries.

:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}

Load the `tidyverse` meta-package.

```{r r-libs, warning = F, message = F}
library(tidyverse)
```

## [{{< fa brands python >}} Python]{.py}

Load the `pandas` and `os` libraries.

```{python py-libs}
# Load needed libraries
import os
import pandas as pd
```
:::

## Conditionals

Often when we work with data we want to retrieve only the rows that meet some condition(s). These conditions can be defined in code by using "relational operators". Fortunately, the fundamental relational operators are the same between [{{< fa brands python >}} Python]{.py} and [{{< fa brands r-project >}} R]{.r}!

Conditional statements always return "boolean" [BOO-lee-un] values. These are [type]{.py} boolean and [class]{.r} logical in [{{< fa brands python >}} Python]{.py} and [{{< fa brands r-project >}} R]{.r} respectively.

:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}

In [{{< fa brands r-project >}} R]{.r} `TRUE` and `FALSE` are booleans. Note that their abbreviations `T` and `F` are also accepted though they _must_ be capitalized.

```{r r-bool1}
# Assess whether a number equals itself and assign to an object
bool_r <- 20 == 20

# See what that object contains
bool_r
```

```{r r-bool2}
# And check class
class(bool_r)
```

## [{{< fa brands python >}} Python]{.py}

In [{{< fa brands python >}} Python]{.py} `True` or `False` are booleans. Note that they must be capitalized in this way to register as the correct type (i.e., only first letter capitalized).

```{python py-bool1}
# Assess whether a number equals itself and assign to an object
bool_py = 20 == 20

# See what that object contains
bool_py
```

```{python py-bool2}
# Also check type
type(bool_py)
```
:::

In addition to asking whether something 'is exactly equal to' something else (`==`), we can also ask whether two values are _not_ equal (`!=`) or pose greater/less than conditionals (`>`/`<`).

:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}

```{r r-bool3}
# Ask whether a number is less than or equal to a particular value
7 <= 5
```

## [{{< fa brands python >}} Python]{.py}

```{python py-bool3}
# Ask whether a number is less than or equal to a particular value
7 <= 5
```
:::

## Subset with One Condition

We can leverage these conditional values to return only rows of a data table that meet criteria that are valuable to us. Let's begin by loading the external dataset derived from the `lterdatasampler` [{{< fa brands r-project >}} R]{.r} package (see [here](https://lter.github.io/lterdatasampler/)) that we used in the previous section.

:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}

Read in vertebrate data CSV and check out the first few rows.

```{r r-ext-data}
# Load data
vert_r <- utils::read.csv(file = file.path("data", "verts.csv"))

# Check out first few rows
head(vert_r, n = 3)
```

Recall from our file path module that the `file.path` function accounts for computer operating system differences.

## [{{< fa brands python >}} Python]{.py}

Read in vertebrate data CSV (remember we _must_ namespace the `read_csv` function) and check out the first few rows.

```{python py-ext-data}
# Load data
vert_py = pd.read_csv(os.path.join("data", "verts.csv"))

# Check out first few rows
vert_py.head(3)
```

Recall from our file path module that the `join` function accounts for computer operating system differences.
:::

Now that we have some data we can use conditional statements to actually subset it! We'll use the `len` and `nrow` functions in [{{< fa brands python >}} Python]{.py} and [{{< fa brands r-project >}} R]{.r} respectively to demonstrate that we successfully subset. Either will show whether we've successfully removed the rows that don't meet our criteria.

:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}

Let's subset to only one particular site of vertebrate data. Note that base [{{< fa brands r-project >}} R]{.r} does include a `subset` function but we'll use the equivalent `filter` function from the `dplyr` package.

```{r r-subset1}
# Subset to only site "MACKOG-U"
r_sub1 <- dplyr::filter(vert_r, sitecode == "MACKOG-U")

# Check whether that worked
message("Before subsetting the data had ", nrow(vert_r), " rows.")
message("Subsetting changed this to ", nrow(r_sub1))
```

## [{{< fa brands python >}} Python]{.py}

Let's subset to only one particular site of vertebrate data.

```{python py-subset1}
# Subset to only site "MACKOG-U"
py_sub1 = vert_py[vert_py.sitecode == "MACKOG-U"]

# Check whether that worked
print("Before subsetting the data had", len(vert_py), "rows.")
print("Subsetting changed this to", len(py_sub1))
```
:::

## Subset with Multiple Conditions

If desired we can also specify multiple criteria to subset by. We must decide whether _all_ criteria must be met or if _any_ criterion being met is sufficient to retain the row. If we want **all** (in either language) we need to separate each criterion with an ampersand (`&`). If instead we want **any** (in either language) we need to separate each criterion with a pipe (`|`); note that this is _not_ the same as a "pipe operator" which we'll discuss in detail later.

:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}

If we want a only values between a minimum and maximum value that means we need all criteria to be met so we need to use a `&` between conditions.

```{r r-subset2}
# Subset to only data from after 1990 before 1995 (inclusive)
r_sub2 <- dplyr::filter(vert_r, year >= 1990 & year <= 1995)

# Check whether that worked
message("Before subsetting the data had ", nrow(vert_r), " rows.")
message("Subsetting changed this to ", nrow(r_sub2))
```

## [{{< fa brands python >}} Python]{.py}

If we want a only values between a minimum and maximum value that means we need all criteria to be met so we need to use a `&` between conditions. Note that when we specify multiple criteria in [{{< fa brands python >}} Python]{.py} we _must_ wrap each conditional in parentheses.

```{python py-subset2}
# Subset to only data from after 1990 before 1995 (inclusive)
py_sub2 = vert_py[(vert_py.year >= 1990) & (vert_py.year <= 1995)]

# Check whether that worked
print("Before subsetting the data had", len(vert_py), "rows.")
print("Subsetting changed this to", len(py_sub2))
```
:::

## Subset with Masks

We can define "boolean masks" when we want to leverage a helper function that tests for specific crtieria. For instance we can use the `isin` [method]{.py} or the `%in%` [operator]{.r} (in [{{< fa brands python >}} Python]{.py} and [{{< fa brands r-project >}} R]{.r} respectively) to ask whether the value in a given row matches any of a set of options. This is much simpler than writing out one "or" criterion for every option individually.

:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}

Let's subset to only salamander species in the dataset using a mask. In [{{< fa brands r-project >}} R]{.r}, that means providing the column name to the left of the `%in%` operator and giving a vector of options to the right.

```{r r-mask1}
# Make the subset
r_mask1 <- dplyr::filter(vert_r, species %in% c("Coastal giant salamander", "Cascade torrent salamander"))

# Check whether that worked
message("Before subsetting the data had ", nrow(vert_r), " rows.")
message("Subsetting changed this to ", nrow(r_mask1))
```

## [{{< fa brands python >}} Python]{.py}

Let's subset to only salamander species in the dataset using a mask. In [{{< fa brands python >}} Python]{.py}, `isin` is a method so it is appended to the right of the column it accesses (separated by a period) and the options must be provided as a list (i.e., wrapped in square brackets and separated by commas).

```{python py-mask1}
# Make the subset
py_mask1 = vert_py[vert_py.species.isin(["Coastal giant salamander", "Cascade torrent salamander"])]

# Check whether that worked
print("Before subsetting the data had", len(vert_py), "rows.")
print("Subsetting changed this to", len(py_mask1))
```
:::

## Negating Criteria

Sometimes it is helpful to _negate_ criteria and subset to only conditions that _don't_ meet our criteria. This works with either simpler conditional statements or with masks! The symbol we use to do this negation differs between the two languages but in either it is placed to the left of the criterion it is negating.

In the below example we'll use the `isnull` [function]{.py} or the `is.na` [function]{.r} to see whether a column contains missing values and invert it to return only rows where the specified column _is not_ missing values.

:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}

In [{{< fa brands r-project >}} R]{.r} we negate conditions by adding an exclamation point (`!`) to the left of the relevant criterion.

```{r r-mask2}
# Subset to only species that *have* a recorded value in the "length_2_mm" column
r_mask2 <- dplyr::filter(vert_r, !is.na(length_2_mm))

# Check whether that worked
message("Before subsetting the data had ", nrow(vert_r), " rows.")
message("Subsetting changed this to ", nrow(r_mask2))
```

## [{{< fa brands python >}} Python]{.py}

In [{{< fa brands python >}} Python]{.py} we negate conditions by adding a tilde (`~`) to the left of the relevant criterion.

```{python py-mask2}
# Subset to only species that *have* a recorded value in the "length_2_mm" column
py_mask2 = vert_py[~pd.isnull(vert_py.length_2_mm)]

# Check whether that worked
print("Before subsetting the data had", len(vert_py), "rows.")
print("Subsetting changed this to", len(py_mask2))
```
:::

## Partial String Matching

We can also filter by instances where part of a larger [string]{.py}/[character]{.r} matches a specific pattern. This type of criterion is known as "partial string matching".

:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}

```{r r-partialstring}
# Subset to only rows where the 'species' column includes the word "salamander"
r_part <- dplyr::filter(vert_r, stringr::str_detect(string = species, pattern = "salamander"))

# Check whether that worked
message("Before subsetting the data had ", nrow(vert_r), " rows.")
message("Subsetting changed this to ", nrow(r_part))
```

## [{{< fa brands python >}} Python]{.py}

```{python py-partialstring}
# Remove rows missing species information
py_knownsp = vert_py[~pd.isnull(vert_py.species)]

# Subset to only rows where the 'species' column includes the word "salamander"
py_part = py_knownsp[py_knownsp["species"].str.contains("salamander")]

# Check whether that worked
print("Before subsetting the data had", len(vert_py), "rows.")
print("Subsetting changed this to", len(py_part))
```
:::