# Arrow
**Learning objectives:**
- Use {arrow} to load large data files into R efficiently
- Partition large datasets into parquet files for faster access, lower memory usage, and faster wrangling
- Wrangle {arrow} data using existing {dplyr} operations
## Why learn {arrow}? {-}
- CSV files = very common for ease of access and use
- Big/messy CSVs = slow
- {arrow} 📦 reads large datasets quickly & uses {dplyr} syntax
## Packages used {-}
```{r arrow-library, warning=F, message=F}
library(arrow)
library(curl)
library(duckdb)
library(tidyverse)
```
## Download data {-}
- Case study: [item checkouts dataset from Seattle libraries](https://data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6)
- **DON'T `download.file()`!** The CSV has 41,389,465 rows (~9 GB); `curl::multi_download()` gives a progress bar and can resume if interrupted
```{r, eval=F}
dir.create("data", showWarnings = FALSE)
curl::multi_download(
"https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv",
"data/seattle-library-checkouts.csv",
resume = TRUE
)
```
## Open the data {-}
- Size in memory ≈ 2 × size on disk
- ~~`read_csv()`~~ ➡️ `arrow::open_dataset()`
- Scans a few thousand rows to determine dataset structure
- `ISBN` is blank for the first ~80,000 rows, so we must tell {arrow} its type
- Does NOT load the dataset into memory (quick check below)
```{r, eval=F}
seattle_csv <- open_dataset(
sources = "data/seattle-library-checkouts.csv",
col_types = schema(ISBN = string()),
format = "csv"
)
```
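A quick sanity check (ours, not from the book): the object `open_dataset()` returns is just a handle to the files on disk, so it is tiny compared with the ~9 GB CSV.
```{r, eval=F}
# seattle_csv only points at the CSV on disk; rows stay out of R's
# memory until a query is collect()ed
format(object.size(seattle_csv), units = "KB")
```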
## Glimpse the data {-}
```{r, eval=F}
seattle_csv |> glimpse()
#> FileSystemDataset with 1 csv file
#> 41,389,465 rows x 12 columns
#> $ UsageClass <string> "Physical", "Physical", "Digital", "Physical", "Ph…
#> $ CheckoutType <string> "Horizon", "Horizon", "OverDrive", "Horizon", "Hor…
#> $ MaterialType <string> "BOOK", "BOOK", "EBOOK", "BOOK", "SOUNDDISC", "BOO…
#> $ CheckoutYear <int64> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 20…
#> $ CheckoutMonth <int64> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
#> $ Checkouts <int64> 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 2, 3, 2, 1, 3, 2,…
#> $ Title <string> "Super rich : a guide to having it all / Russell S…
#> $ ISBN <string> "", "", "", "", "", "", "", "", "", "", "", "", ""…
#> $ Creator <string> "Simmons, Russell", "Barclay, James, 1965-", "Tim …
#> $ Subjects <string> "Self realization, Conduct of life, Attitude Psych…
#> $ Publisher <string> "Gotham Books,", "Pyr,", "Random House, Inc.", "Di…
#> $ PublicationYear <string> "c2011.", "2010.", "2015", "2005.", "c2004.", "c20…
```
## Manipulate the data {-}
- Can use {dplyr} verbs directly on the dataset
- The pipeline stays lazy until `collect()` forces {arrow} to compute and return a tibble
```{r, eval=F}
seattle_csv |>
group_by(CheckoutYear) |>
summarise(Checkouts = sum(Checkouts)) |>
arrange(CheckoutYear) |>
collect()
#> # A tibble: 18 × 2
#> CheckoutYear Checkouts
#> <int> <int>
#> 1 2005 3798685
#> 2 2006 6599318
#> 3 2007 7126627
#> 4 2008 8438486
#> 5 2009 9135167
#> 6 2010 8608966
#> # ℹ 12 more rows
```
## Parquet > CSV {-}
- Slow: Manipulating large CSV datasets with {readr}
- Faster: Manipulating large CSV datasets with {arrow}
- Much faster: Manipulating large `parquet` datasets with {arrow}
- Data subdivided into multiple files
## Benefits of parquet {-}
- Smaller files than CSV (efficient encodings + compression; see the size check below)
- Stores datatypes (vs CSV storing all as character & guessing)
- "Column-oriented" ("thinks" like a dataframe)
- Splits data into chunks you can (often) skip (faster)
**But:**
- Not human-readable
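A rough way to see the size difference for yourself; this is our sketch using ggplot2's built-in `diamonds` data rather than the Seattle data:
```{r, eval=F}
# write the same data frame both ways and compare on-disk sizes;
# parquet's encodings + compression usually win by a wide margin
readr::write_csv(ggplot2::diamonds, "diamonds.csv")
arrow::write_parquet(ggplot2::diamonds, "diamonds.parquet")
file.size("diamonds.csv") / file.size("diamonds.parquet")
```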
## Partitioning {-}
- Split data across files so analyses can skip unused data
- Experiment to find best partition for your data
- Recommendations:
  - Keep individual files between 20 MB and 2 GB
  - Avoid layouts that produce more than 10,000 files
## Seattle library CSV to parquet {-}
- `dplyr::group_by()` to define partitions
- `arrow::write_dataset()` to save as parquet
```{r, eval=F}
pq_path <- "data/seattle-library-checkouts"
seattle_csv |>
group_by(CheckoutYear) |>
write_dataset(path = pq_path, format = "parquet")
```
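If the yearly files were still too big, a hypothetical variant (not in the book) partitions on two variables, producing nested `CheckoutYear=.../CheckoutMonth=...` directories:
```{r, eval=F}
# finer-grained partitions: one parquet file per year-month combination
seattle_csv |>
  group_by(CheckoutYear, CheckoutMonth) |>
  write_dataset(path = "data/seattle-by-year-month", format = "parquet")
```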
## Seattle library parquet files {-}
- [Apache Hive](https://hive.apache.org/) "self-describing" directory/file names
```{r, eval=F}
tibble(
files = list.files(pq_path, recursive = TRUE),
size_MB = file.size(file.path(pq_path, files)) / 1024^2
)
#> # A tibble: 18 × 2
#> files size_MB
#> <chr> <dbl>
#> 1 CheckoutYear=2005/part-0.parquet 109.
#> 2 CheckoutYear=2006/part-0.parquet 164.
#> 3 CheckoutYear=2007/part-0.parquet 178.
#> 4 CheckoutYear=2008/part-0.parquet 195.
#> 5 CheckoutYear=2009/part-0.parquet 214.
#> 6 CheckoutYear=2010/part-0.parquet 222.
#> # ℹ 12 more rows
```
## parquet + {arrow} + {dplyr} {-}
```{r, eval=F}
seattle_pq <- open_dataset(pq_path)
query <- seattle_pq |>
filter(CheckoutYear >= 2018, MaterialType == "BOOK") |>
group_by(CheckoutYear, CheckoutMonth) |>
summarize(TotalCheckouts = sum(Checkouts)) |>
arrange(CheckoutYear, CheckoutMonth)
```
## Results (uncollected) {-}
```{r, eval=F}
query
#> FileSystemDataset (query)
#> CheckoutYear: int32
#> CheckoutMonth: int64
#> TotalCheckouts: int64
#>
#> * Grouped by CheckoutYear
#> * Sorted by CheckoutYear [asc], CheckoutMonth [asc]
#> See $.data for the source Arrow object
```
## Results (collected) {-}
```{r, eval=F}
query |> collect()
#> # A tibble: 58 × 3
#> # Groups: CheckoutYear [5]
#> CheckoutYear CheckoutMonth TotalCheckouts
#> <int> <int> <int>
#> 1 2018 1 355101
#> 2 2018 2 309813
#> 3 2018 3 344487
#> 4 2018 4 330988
#> 5 2018 5 318049
#> 6 2018 6 341825
#> # ℹ 52 more rows
```
## Available verbs {-}
- [`?arrow-dplyr`](https://arrow.apache.org/docs/r/reference/acero.html) for supported functions
- (book uses `?acero` but that's way harder to remember)
## Performance {-}
```{r, eval=F}
# run this pipeline once with seattle_csv and once with seattle_pq
seattle_pq |>
  filter(CheckoutYear == 2021, MaterialType == "BOOK") |>
  group_by(CheckoutMonth) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  arrange(desc(CheckoutMonth)) |>
  collect() |>
  system.time()
```
- CSV: 11.951s
- Parquet: **0.263s**
- Why: the parquet query reads only the `CheckoutYear=2021` partition and only the columns it needs
## Using {duckdb} with {arrow} {-}
- `arrow::to_duckdb()` hands an {arrow} dataset to {duckdb} (sketch below)
- No copy is made; the data stays outside R's memory
- Discussion: Why switch engines?
- {duckdb} supports more tidyverse verbs?
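A minimal sketch of the hand-off: the same style of pipeline as before, now executed by {duckdb} after `to_duckdb()`:
```{r, eval=F}
seattle_pq |>
  to_duckdb() |>
  filter(CheckoutYear >= 2018, MaterialType == "BOOK") |>
  group_by(CheckoutYear) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  arrange(desc(CheckoutYear)) |>
  collect()
```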
## Meeting Videos {-}
### Cohort 7 {-}
`r knitr::include_url("https://www.youtube.com/embed/qsgZAgmmpt4")`
<details>
<summary> Meeting chat log </summary>
```
00:06:25 Oluwafemi Oyedele: Hi Tim, Good Evening!!!
00:06:41 Tim Newby: Hi Oluwafemi
00:07:10 Oluwafemi Oyedele: We will start the meeting in 5 minute time!!!
00:08:15 Tim Newby: Reacted to "We will start the me..." with 👍
00:11:45 Oluwafemi Oyedele: https://github.com/hadley/r4ds/issues/1374
00:12:27 Oluwafemi Oyedele: start
00:13:56 Oluwafemi Oyedele: https://parquet.apache.org/
00:14:53 Oluwafemi Oyedele: https://data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6
00:28:35 Oluwafemi Oyedele: https://data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6
00:40:49 Oluwafemi Oyedele: stop
```
</details>
### Cohort 8 {-}
`r knitr::include_url("https://www.youtube.com/embed/dUQ3vCNejbI")`