This repository contains the materials for the meetup on "Unlocking Big Data in R Using Arrow" I presented to Oman R Users on Nov 8 2023.
Explore the nuances of handling large datasets in R through the Arrow package. This session aims to provide an understanding of Arrow's capabilities, detailing its application in real-world scenarios. Arrow is not only easy to adopt; it will drastically improve your ability to handle massive datasets in R.
You'll need to download the datasets from the sources and place them in the `data` folder to run the code.
To replicate the dataset locally, run the following code:
```r
library(arrow)
library(dplyr)

# Destination for the local, partitioned copy of the dataset
local_folder <- here::here("data/nyc_part")
fs::dir_create(local_folder)

# Stream the NYC taxi data from S3 and write it locally,
# partitioned by year and month
open_dataset("s3://voltrondata-labs-datasets/nyc-taxi") |>
  filter(year %in% 2012:2021) |>
  group_by(year, month) |>
  write_dataset(local_folder)
```
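Once the partitioned copy exists, it can be queried lazily without loading everything into memory. A minimal sketch, assuming the `total_amount` column from the NYC taxi data:

```r
library(arrow)
library(dplyr)

# Open the local partitioned dataset; no data is read into memory yet
nyc <- open_dataset(here::here("data/nyc_part"))

# Queries are built lazily and only evaluated on collect()
nyc |>
  filter(year == 2019) |>
  summarise(mean_fare = mean(total_amount, na.rm = TRUE)) |>
  collect()
```

Because the dataset is partitioned by `year` and `month`, filters on those columns let Arrow skip entire files rather than scanning everything.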
Create a folder called `airlines` in the `data` folder. Download the `Combined_Flights_2021` CSV and parquet files from the Flight Status Prediction dataset on Kaggle.
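After downloading, both formats can be read with arrow. A minimal sketch, assuming the files keep their Kaggle names inside `data/airlines`:

```r
library(arrow)

# Read the parquet version (typically much faster than CSV)
flights <- read_parquet(here::here("data/airlines/Combined_Flights_2021.parquet"))

# The CSV version can be read with arrow's multithreaded reader for comparison
flights_csv <- read_csv_arrow(here::here("data/airlines/Combined_Flights_2021.csv"))
```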
Links:
📽️ The slides are created using Quarto, in the `presentation.qmd` file. The slide deck is published on GitHub Pages here.
The examples shown in my talk are stored in the `code` folder.