smartdatafoundry/hive_guide

Hive guide

Quick start guide for new users working with Hive partitioned data in R or Python.

The example data are generated by running a container, so you will need Podman or Docker installed.

Files

Generate example data

  • generate_test_data.R: standalone script in the R folder that generates sample data; the container also runs it
  • generate_test_data_2.R: standalone script in the R folder that generates sample data with persistent cash balances but without parallel processing. To use it in the container instead, change the Dockerfile
  • Dockerfile: instruction set for building the data generator container
  • pd_datazone.csv: CSV file in the data-input folder listing LSOAs (a.k.a. Data Zones in Scotland) and their corresponding postal districts. It is read by the data generation script and copied into the container. The file contains columns 2 and 51 of the table accessible here. Postcodes were converted to postal districts by taking the portion of the postcode before the space (e.g. AB1 2CD becomes AB1). Duplicates and rows with missing LSOA values were removed.
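
The clean-up described for pd_datazone.csv can be sketched in Python as below. This is a minimal illustration, not the script used to build the file: the function names, the (postcode, LSOA) column order and the sample rows are all assumptions made up for the example.

```python
def postal_district(postcode: str) -> str:
    """Return the part of a postcode before the first space (AB1 2CD -> AB1)."""
    return postcode.strip().split(" ", 1)[0]


def clean_rows(rows):
    """Yield unique (lsoa, district) pairs, skipping rows with a missing LSOA."""
    seen = set()
    for postcode, lsoa in rows:
        if not lsoa or not lsoa.strip():
            continue  # drop rows with a missing LSOA value
        pair = (lsoa.strip(), postal_district(postcode))
        if pair not in seen:  # drop duplicates
            seen.add(pair)
            yield pair


# Made-up rows for illustration only; not taken from the real source table.
sample = [
    ("AB1 2CD", "ZONE_A"),
    ("AB1 9ZZ", "ZONE_A"),  # same zone and district: a duplicate after conversion
    ("AB2 1AA", ""),        # missing LSOA: removed
    ("AB2 4GH", "ZONE_B"),
]
result = list(clean_rows(sample))
print(result)
```

Two postcodes in the same zone and district collapse into one row, which is why duplicates are removed after the conversion rather than before it.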

To create the data in a container using Podman or Docker (with Docker, replace podman with docker in the commands below):

  1. Build the container: podman build --platform linux/amd64 -t demo_data .
  2. Run the container: podman run --name demo_data demo_data
    1. To opt into parallel processing, run podman run --name demo_data -e PARALLEL_WORKERS=n demo_data instead, where n is the number of cores to use
  3. Fetch the sample data from the container: podman cp demo_data:data-output/test_data/ data-output/
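
If you prefer to drive these steps from a script, the three commands can be assembled in Python. This is a minimal sketch: container_commands is a hypothetical helper, not part of this repository, and it only builds and prints the commands rather than running them.

```python
import shlex


def container_commands(engine="podman", workers=None):
    """Return the build, run and copy commands as argument lists.

    engine:  "podman" or "docker" (both accept the same arguments here).
    workers: optional value for PARALLEL_WORKERS; omitted when None.
    """
    if engine not in ("podman", "docker"):
        raise ValueError("engine must be 'podman' or 'docker'")
    run = [engine, "run", "--name", "demo_data"]
    if workers is not None:
        run += ["-e", f"PARALLEL_WORKERS={workers}"]
    run.append("demo_data")
    return [
        [engine, "build", "--platform", "linux/amd64", "-t", "demo_data", "."],
        run,
        [engine, "cp", "demo_data:data-output/test_data/", "data-output/"],
    ]


# Print the commands in order; nothing is executed here.
for cmd in container_commands(workers=2):
    print(shlex.join(cmd))
```

To execute the commands, pass each list to subprocess.run(cmd, check=True) from the repository root, so that the build context (.) and the data-output folder resolve correctly.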

Caution

When running the container with parallelisation, choose the number of workers sensibly. The program crashes if you request more workers than can be used, with an error starting with: MultisessionFuture (future_lapply-6) failed to receive message results from cluster RichSOCKnode. If this happens, reduce the number of workers and try again.
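
One way to avoid over-requesting workers is to cap the value before passing it to PARALLEL_WORKERS. The sketch below is a hypothetical helper that assumes the host core count is a reasonable upper bound. The container may see fewer cores than the host (for example, inside a Podman or Docker virtual machine), so treat the result as a starting point rather than a guarantee.

```python
import os


def capped_workers(requested, available=None):
    """Return the requested worker count, capped at the available core count.

    available defaults to the host core count reported by os.cpu_count();
    this is an assumption, as the container may be limited to fewer cores.
    """
    if requested < 1:
        raise ValueError("requested must be at least 1")
    if available is None:
        available = os.cpu_count() or 1  # cpu_count() can return None
    return min(requested, available)


# Example: asking for 8 workers when only 4 cores are available gives 4.
print(capped_workers(8, available=4))
```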

Hive examples

  • R_hive_example.qmd: Quarto document with R code chunks
  • R_hive_example.md: read this file on GitHub for the worked examples
  • Python example: work in progress
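
Hive partitioning stores each partition in a directory named key=value, so partition values can be recovered from a file path alone. The Python sketch below shows that convention using only the standard library. The path and the partition columns (year, month) are hypothetical and may not match the columns in the generated test data.

```python
from pathlib import PurePosixPath


def partition_values(path):
    """Extract key=value directory segments from a Hive-style path."""
    values = {}
    for part in PurePosixPath(path).parent.parts:
        key, sep, value = part.partition("=")
        if sep and key:  # only directories of the form key=value
            values[key] = value
    return values


# Hypothetical path for illustration; the real partition columns may differ.
example = "data-output/test_data/year=2024/month=1/part-0.parquet"
parts = partition_values(example)
print(parts)
```

The values come back as strings because the directory names carry no type information. Libraries that read Hive partitioned data usually infer or accept a schema for the partition columns.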

The need for speed

If you're interested in some Arrow benchmarks, check out this repo: https://github.com/mikerspencer/arrow_test/, which was used for an EdinbR talk.
