Quick start guide for new users working with Hive partitioned data in R or Python.
The example data are generated by running a container, so you will need Podman or Docker installed.
- generate_test_data.R: standalone script in the R folder to make some sample data; it is also accessed from the container.
- generate_test_data_2.R: standalone script in the R folder to make some sample data with persistent cash balances but no parallel processing. The Dockerfile can be changed to use this script in the container instead.
- Dockerfile: instruction set for building the data generator container.
- pd_datazone.csv: CSV file in the data-input folder containing a list of LSOAs (a.k.a. Data Zones in Scotland) and their corresponding postal districts. These are accessed by the data generation script and copied into the container. This file contains columns 2 and 51 of the table accessible here. The postcodes were converted to postal districts by taking the portion of the postcode before the space (e.g. AB1 2CD becomes AB1). Duplicates and rows with missing LSOA values were removed.
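The postcode-to-district conversion described for pd_datazone.csv can be sketched in a few lines of Python. This is only an illustration of the stated rule (keep the text before the space, drop duplicates, drop rows with no LSOA), not the script that produced the file. The column names `postcode` and `lsoa` and the sample rows are made up for the example.

```python
import csv
import io


def postal_district(postcode: str) -> str:
    """Return the part of a postcode before the space, e.g. 'AB1 2CD' -> 'AB1'."""
    return postcode.strip().split(" ")[0]


def build_lookup(rows):
    """Yield unique (lsoa, district) pairs, skipping rows with no LSOA value."""
    seen = set()
    for row in rows:
        lsoa = (row.get("lsoa") or "").strip()
        if not lsoa:
            continue  # rows with a missing LSOA are dropped
        pair = (lsoa, postal_district(row["postcode"]))
        if pair not in seen:  # duplicates are dropped
            seen.add(pair)
            yield pair


# Made-up sample rows; the column names "postcode" and "lsoa" are hypothetical.
sample = io.StringIO(
    "postcode,lsoa\n"
    "AB1 2CD,S01000001\n"
    "AB1 3EF,S01000001\n"
    "AB2 4GH,\n"
    "EH1 1AA,S01000002\n"
)
lookup = list(build_lookup(csv.DictReader(sample)))
print(lookup)
```

Two of the four sample rows survive: the second collapses into the first once both postcodes reduce to AB1, and the third is dropped because its LSOA is empty.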
To create the data in a container using Podman or Docker:
- Build the container
podman build --platform linux/amd64 -t demo_data .
- Run the container
podman run --name demo_data demo_data
- To opt into parallel processing, run
podman run --name demo_data -e PARALLEL_WORKERS=n demo_data
where n is the number of cores to use
- Fetch the sample data from the container with:
podman cp demo_data:data-output/test_data/ data-output/
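If you want to script these steps, or switch between Podman and Docker without retyping, a small Python helper can assemble the same commands. This is a sketch, not part of the repository: the function names are hypothetical, and it assumes Docker accepts the same arguments as the Podman commands above. The example only prints the commands rather than running them.

```python
import shutil
from typing import List, Optional


def container_engine() -> str:
    """Pick whichever of podman or docker is on PATH, preferring podman."""
    for engine in ("podman", "docker"):
        if shutil.which(engine):
            return engine
    raise RuntimeError("Neither podman nor docker was found on PATH")


def demo_data_commands(engine: str, workers: Optional[int] = None) -> List[List[str]]:
    """Return the build, run and copy commands from this guide as argument lists."""
    run = [engine, "run", "--name", "demo_data"]
    if workers is not None:
        # Opt into parallel processing via the PARALLEL_WORKERS variable
        run += ["-e", f"PARALLEL_WORKERS={workers}"]
    run.append("demo_data")
    return [
        [engine, "build", "--platform", "linux/amd64", "-t", "demo_data", "."],
        run,
        [engine, "cp", "demo_data:data-output/test_data/", "data-output/"],
    ]


# Print the commands for inspection; nothing is executed here.
for cmd in demo_data_commands("podman", workers=2):
    print(" ".join(cmd))
```

To actually run them, each list could be passed to `subprocess.run(cmd, check=True)` from the repository root, with `container_engine()` supplying the first argument.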
Caution
When running the container with parallelisation, be sensible with the number of workers. The program will crash if you request more workers than can be used, surfacing an error starting with "MultisessionFuture (future_lapply-6) failed to receive message results from cluster RichSOCKnode". If this happens, reduce the number of workers and try again.
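One way to choose a conservative value for n is to cap the request at the number of CPUs the host reports, leaving one free. The helper below is a hypothetical sketch of that heuristic. It is not a guarantee: the container (or the virtual machine Podman or Docker runs in on some platforms) may see fewer cores than the host, so a further reduction may still be needed.

```python
import os


def safe_worker_count(requested: int, reserve: int = 1) -> int:
    """Clamp a requested worker count to between 1 and (host CPUs - reserve)."""
    available = os.cpu_count() or 1  # cpu_count() can return None
    upper = max(1, available - reserve)
    return max(1, min(requested, upper))


# Value that could be passed as PARALLEL_WORKERS when running the container
print(safe_worker_count(8))
```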
- R_hive_example.qmd: Quarto doc with R code chunks
- R_hive_example.md: read this file on GitHub for the worked examples
- Python example: WIP
If you're interested in some Arrow benchmarks, check out this repo: https://github.com/mikerspencer/arrow_test/, which was used for an EdinbR talk.