rsangole/oman-rusers-arrow

[Oman R Users] Unlocking Big Data in R Using Arrow

This repository contains the materials for my talk "Unlocking Big Data in R Using Arrow", presented at the Oman R Users meetup on November 8, 2023.

Abstract

Explore the nuances of handling large datasets in R through the Arrow package. This session aims to provide an understanding of Arrow's capabilities and shows how it applies in real-world scenarios. The package is easy to adopt, and it substantially improves your ability to work with massive datasets in R.

Data Sources

To run the code, download the datasets from the sources below and place them in the data folder.

NYC Taxi Data

To replicate the dataset locally, run the following code:

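```r
library(arrow)
library(dplyr)

# Local destination for the partitioned copy
local_folder <- here::here("data/nyc_part")

fs::dir_create(local_folder)

# Read from the public S3 bucket, keep 2012-2021, and write Parquet files
# partitioned by year and month
open_dataset("s3://voltrondata-labs-datasets/nyc-taxi") |>
    filter(year %in% 2012:2021) |>
    group_by(year, month) |>
    write_dataset(local_folder)
```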
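
Once the download finishes, the local copy can be queried lazily. The following is a minimal sketch, assuming the folder was written by the code above (so it is partitioned by year and month) and that the working directory is the project root. `count_trips_by_month()` is a hypothetical helper name used for illustration, not something taken from the talk's code.

```r
library(arrow)
library(dplyr)

# Hypothetical helper (not part of the repository): count rows per
# year/month partition without loading the whole dataset into memory.
count_trips_by_month <- function(folder) {
  open_dataset(folder) |>     # lazily discovers the files under `folder`
    count(year, month) |>     # the aggregation is evaluated by Arrow
    arrange(year, month) |>
    collect()                 # only the small summary comes back to R
}

# Assumes the working directory is the project root.
local_folder <- file.path("data", "nyc_part")
if (dir.exists(local_folder)) {
  print(count_trips_by_month(local_folder))
}
```

Because `collect()` is called only after the aggregation, R receives one row per year and month rather than the full set of trips.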

Airlines Data

Create a folder called airlines in the data folder.

Download the Combined_Flights_2021 CSV and Parquet files from the Flight Status Prediction dataset on Kaggle, and place them in the airlines folder.

Links:

  1. CSV file
  2. Parquet file
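
With both files in place, each can be opened as an Arrow dataset to check that the download worked. This is a minimal sketch, assuming the files keep the names Combined_Flights_2021.csv and Combined_Flights_2021.parquet (an assumption; adjust the paths if your downloads are named differently) and that the working directory is the project root. `open_flights()` is a hypothetical helper introduced only for this example.

```r
library(arrow)
library(dplyr)

# Hypothetical helper: open a single CSV or Parquet file as an Arrow Dataset,
# choosing the format from the file extension.
open_flights <- function(path) {
  fmt <- if (tolower(tools::file_ext(path)) == "csv") "csv" else "parquet"
  open_dataset(path, format = fmt)
}

airlines_dir <- file.path("data", "airlines")
csv_path <- file.path(airlines_dir, "Combined_Flights_2021.csv")          # assumed file name
parquet_path <- file.path(airlines_dir, "Combined_Flights_2021.parquet")  # assumed file name

if (file.exists(csv_path) && file.exists(parquet_path)) {
  # On-disk size of each file, in megabytes
  print(file.size(c(csv_path, parquet_path)) / 1024^2)

  # Column names and types Arrow infers for the CSV file
  print(open_flights(csv_path)$schema)

  # Pull only the first few rows of the Parquet file into R
  print(open_flights(parquet_path) |> head(5) |> collect())
}
```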

Slides

📽️ The slides are built with Quarto from the presentation.qmd file. The slide deck is published on GitHub Pages here.

Code

The examples shown in my talk are stored in the code folder.
