Hive guide

Quick start guide for new users working with Hive partitioned data in R or Python.

The example data are generated by running a container, so you will need Podman or Docker installed.

Files

Generate example data

generate_test_data.R: standalone script in the R folder that generates sample data. The container also uses this script.
generate_test_data_2.R: standalone script in the R folder that generates sample data with persistent cash balances but without parallel processing. To use it in the container instead, change the Dockerfile to point to it.
Dockerfile: instruction set for building the data generator container.
pd_datazone.csv: CSV file in the data-input folder listing LSOAs (also known as Data Zones in Scotland) and their corresponding postal districts. The data generation script reads this file, and it is copied into the container. It contains columns 2 and 51 of the table accessible here. The postcodes were converted to postal districts by taking the portion of the postcode before the space (e.g. AB1 2CD becomes AB1). Duplicates and rows with missing LSOA values were removed.
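The postcode-to-postal-district conversion described above can be sketched in Python as follows. This is a minimal illustration, not the code used to build pd_datazone.csv: the column names (postcode, lsoa) and the sample rows are hypothetical.

```python
import csv
import io


def postal_district(postcode: str) -> str:
    """Return the part of a postcode before the space, e.g. 'AB1 2CD' -> 'AB1'."""
    return postcode.strip().split(" ")[0]


def build_lookup(rows):
    """Yield unique (lsoa, postal_district) pairs, skipping rows with no LSOA."""
    seen = set()
    for row in rows:
        lsoa = (row.get("lsoa") or "").strip()
        if not lsoa:
            continue  # drop rows with a missing LSOA value
        pair = (lsoa, postal_district(row["postcode"]))
        if pair not in seen:  # drop duplicates
            seen.add(pair)
            yield pair


# Hypothetical sample rows; the column names and codes are made up for illustration.
sample = io.StringIO(
    "postcode,lsoa\n"
    "AB1 2CD,S01000001\n"
    "AB1 3EF,S01000001\n"
    "EH1 1AA,\n"
)
lookup = list(build_lookup(csv.DictReader(sample)))
print(lookup)
```

The second row collapses into the first because both map to the same LSOA and postal district, and the third row is dropped because its LSOA is empty.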

To create the data in a container using Podman or Docker (with Docker, replace podman with docker in the commands):

1. Build the container: podman build --platform linux/amd64 -t demo_data .
2. Run the container: podman run --name demo_data demo_data
   To opt into parallel processing, run podman run --name demo_data -e PARALLEL_WORKERS=n demo_data instead, where n is the number of cores to use.
3. Fetch the sample data from the container: podman cp demo_data:data-output/test_data/ data-output/

Caution

When running the container with parallelisation, choose a sensible number of workers. The program will crash if you request more workers than can be used, surfacing an error that starts with MultisessionFuture (future_lapply-6) failed to receive message results from cluster RichSOCKnode. If this happens, reduce the number of workers and try again.
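One way to avoid asking for too many workers is to cap the request before building the run command. The sketch below is an assumption-laden helper, not part of this repository: safe_worker_count and run_command are hypothetical names, and the host core count reported by os.cpu_count() may be higher than what the container can actually use (for example when Podman runs inside a virtual machine), so treat it as an upper bound only.

```python
import os
from typing import List, Optional


def safe_worker_count(requested: int, available: Optional[int] = None) -> int:
    """Clamp a requested worker count to the range 1..available cores."""
    if available is None:
        # Assumption: the host core count is a reasonable upper bound.
        available = os.cpu_count() or 1
    return max(1, min(requested, available))


def run_command(workers: Optional[int] = None, engine: str = "podman") -> List[str]:
    """Build the container run command, optionally opting into parallel processing."""
    cmd = [engine, "run", "--name", "demo_data"]
    if workers is not None:
        cmd += ["-e", f"PARALLEL_WORKERS={workers}"]
    cmd.append("demo_data")
    return cmd


# Example: request 64 workers on a machine assumed to have 4 usable cores.
print(" ".join(run_command(safe_worker_count(64, available=4))))
```

The returned list can be passed to subprocess.run if you prefer to launch the container from Python rather than from the shell.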

Hive examples

R_hive_example.qmd: Quarto document with R code chunks
R_hive_example.md: read this file on GitHub for the worked examples
Python example: work in progress
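For readers new to the layout, Hive partitioning stores the values of the partition columns in the directory names as key=value pairs. The following sketch uses only the Python standard library to show how those pairs can be recovered from a file path. It is not the repository's Python example: the partition columns (year, month), the file name, and the Parquet format in the example path are hypothetical.

```python
from pathlib import PurePosixPath


def partition_values(path: str) -> dict:
    """Extract key=value partition pairs from the directory parts of a path."""
    values = {}
    for part in PurePosixPath(path).parent.parts:
        key, sep, value = part.partition("=")
        if sep and key:  # keep only directory names of the form key=value
            values[key] = value
    return values


# Hypothetical path; the partition columns and file name are made up for illustration.
example = "data-output/test_data/year=2024/month=01/part-0.parquet"
print(partition_values(example))
```

Libraries that understand Hive partitioning use the same directory names to skip files that cannot match a filter, which is why filtering on a partition column is typically cheap.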

The need for speed

If you're interested in some Arrow benchmarks, check out this repo: https://github.com/mikerspencer/arrow_test/, which was used for an EdinbR talk.
