Using Poetry:
poetry install
This installs the packages pinned in the lockfile, so your environment matches mine exactly.
Download source files from Microsoft's website:
> mkdir -p data/source/
# With 8 parallel wget processes:
> cat files.txt | xargs -n 1 -P 8 wget -q -P data/source/
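Because wget runs with -q, a failed download produces no output. The following is a minimal sketch, not part of this repository, for listing which URLs still lack a non-empty file in data/source/. It assumes files.txt holds whitespace-separated URLs (which is how xargs reads it) and that wget saved each file under the last path component of its URL. The function name missing_downloads is hypothetical.

```python
from pathlib import Path
from urllib.parse import urlparse


def missing_downloads(url_list: Path, source_dir: Path) -> list:
    """Return the URLs in url_list with no non-empty file in source_dir."""
    missing = []
    # Split on any whitespace, mirroring how `xargs -n 1` tokenizes the list.
    for url in url_list.read_text().split():
        # Assumption: wget names the local file after the URL's last path part.
        filename = Path(urlparse(url).path).name
        target = source_dir / filename
        # Treat zero-byte files as failed downloads too.
        if not target.is_file() or target.stat().st_size == 0:
            missing.append(url)
    return missing
```

Calling missing_downloads(Path("files.txt"), Path("data/source")) returns an empty list once every file is present, and the returned URLs can be written to a new list and passed to wget again.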
Microsoft distributes the data as zipped GeoJSON, which is slow to load. To make later steps faster, we'll convert all the input files to Parquet. This uses the osgeo/gdal:latest Docker image (as of May 7, 2022) for simplicity.
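cd data
mkdir -p preprocessed
for file in source/*.zip; do
    # Name the output after the input file, minus its .zip extension
    # (assumption: the original suffix argument was cut off here)
    state=$(basename "$file" .zip)
    echo "$state"
    # /vsizip/ lets GDAL read the GeoJSON directly from inside the zip
    docker run --rm -it -v "$(pwd)":/data osgeo/gdal:latest \
        ogr2ogr \
        "/data/preprocessed/$state.parquet" \
        "/vsizip//data/$file"
done
cd ..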
This takes about an hour on my computer.
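For anyone who prefers to drive the conversion from Python, here is a rough sketch of the same loop. It is an illustration, not the script used in this repository: the function names are hypothetical, the output name is assumed to be the zip's file name with .zip replaced by .parquet, and -it is left out because the run is non-interactive. It relies on ogr2ogr choosing the Parquet driver from the output extension, as the shell command does.

```python
import subprocess
from pathlib import Path

GDAL_IMAGE = "osgeo/gdal:latest"


def build_command(zip_path: Path, data_dir: Path) -> list:
    """Build the docker/ogr2ogr argument list for one zipped GeoJSON file."""
    # Path of the zip relative to the directory mounted at /data.
    rel = zip_path.relative_to(data_dir).as_posix()
    # Assumption: Alabama.geojson.zip becomes Alabama.geojson.parquet.
    out_name = zip_path.with_suffix(".parquet").name
    return [
        "docker", "run", "--rm",
        "-v", f"{data_dir.resolve()}:/data",
        GDAL_IMAGE,
        "ogr2ogr",
        f"/data/preprocessed/{out_name}",
        f"/vsizip//data/{rel}",  # read the GeoJSON from inside the zip
    ]


def convert_all(data_dir: Path) -> None:
    """Convert every zip under data_dir/source, stopping on the first failure."""
    (data_dir / "preprocessed").mkdir(parents=True, exist_ok=True)
    for zip_path in sorted((data_dir / "source").glob("*.zip")):
        print(zip_path.name)
        subprocess.run(build_command(zip_path, data_dir), check=True)
```

Running convert_all(Path("data")) from the repository root would mirror the shell loop. Keeping the command construction in its own function makes it easy to print and inspect the arguments before starting Docker.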
poetry run python --input data/manual-hilbert-shuffle/shuffled.parquet