Introduction to HDF5 in Python.
If you are just curious and want to look at the notebook without installing anything, go to http://nbviewer.jupyter.org/ and type jackdbd/hdf5-pydata-munich in the search bar.
Create a Python 3.5 virtual environment (at the time of writing, Bokeh seems to have some issues with Python 3.6), then install the dependencies:
pip install -r requirements.txt
# start the notebook server
jupyter notebook --port 8085
# open your browser and go to:
# http://localhost:8085/notebooks/hdf5_in_python.ipynb
- Visit the NYC Taxi & Limousine Commission website and download the CSV files from the 2015 Yellow taxi dataset (TLC Trip Record Data). You can also download just one month (e.g. January) to try these snippets out.
- Place the CSV files here:
hdf5-pydata-munich/data/nyctaxi/2015/<your-file-here>.csv
- Create the HDF5 file, which contains all the tables (1 table per month), with:
cd snippets
python create_taxi_table.py
This creates the HDF5 file NYC-yellow-taxis-10k.h5.
- Store a sample of each CSV file in the tables with:
python append_to_taxi_table.py
This reads a chunk of 10000 rows from each CSV file that you downloaded, then stores the results in the HDF5 file NYC-yellow-taxis-10k.h5. This is just a small sample of the original dataset. If you want to store the entire dataset (~12 million rows per month), just remove the break statement in append_to_taxi_table.py.
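The effect of the break statement can be illustrated with a minimal sketch that uses only the standard library. It is not the code from append_to_taxi_table.py: the function names iter_chunks and sample_rows and the first_chunk_only flag are hypothetical, and the step that writes each chunk to the HDF5 table is left out.

```python
import csv
from itertools import islice


def iter_chunks(csv_path, chunk_size=10000):
    """Yield lists of up to chunk_size rows (as dicts) from a CSV file."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                return
            yield chunk


def sample_rows(csv_path, chunk_size=10000, first_chunk_only=True):
    """Collect rows chunk by chunk, stopping early when only a sample is wanted."""
    rows = []
    for chunk in iter_chunks(csv_path, chunk_size):
        # A real script would append the chunk to the HDF5 table here.
        rows.extend(chunk)
        if first_chunk_only:
            break  # without this break, every chunk of the file is processed
    return rows
```

With first_chunk_only=True only the first 10000 rows of a file are kept; with first_chunk_only=False the loop runs through the whole file, one chunk at a time, so memory use per iteration stays bounded by chunk_size.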
To view the structure of the tables you can use an HDF5 viewer like HDFView, HDF Compass or ViTables.
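If no graphical viewer is at hand, the structure can also be listed from Python. The following is a small sketch that assumes PyTables is installed (it is imported as tables); the helper name describe_hdf5 is hypothetical and not part of this repository.

```python
import tables


def describe_hdf5(path):
    """Return one line per node; tables also show row count and column names."""
    lines = []
    with tables.open_file(path, mode="r") as h5:
        # walk_nodes visits the root group and everything below it
        for node in h5.walk_nodes("/"):
            if isinstance(node, tables.Table):
                lines.append(
                    f"{node._v_pathname}: {node.nrows} rows, columns={node.colnames}"
                )
            else:
                lines.append(node._v_pathname)
    return lines
```

Calling describe_hdf5 with the path to NYC-yellow-taxis-10k.h5 and printing the returned lines should show one entry per monthly table.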
If you want to play around with a huge HDF5 file, I created a snippet that generates some synthetic data. You can run it with:
python create_huge_hdf5_file.py
This takes roughly 5 minutes to run and creates the HDF5 file pytables-clinical-study.h5, which should be around 5 GB in size. You can tweak the code just a little bit to create even bigger files.
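To give an idea of how such a generator can be scaled, here is a minimal sketch that appends random rows to a PyTables table in batches. It is not the content of create_huge_hdf5_file.py: the Measurement schema, the table name measurements and the function write_synthetic are all made up for illustration, and NumPy and PyTables are assumed to be installed.

```python
import numpy as np
import tables


class Measurement(tables.IsDescription):
    # Hypothetical schema, chosen only for this example
    patient_id = tables.Int32Col(pos=0)
    value = tables.Float64Col(pos=1)


def write_synthetic(path, n_batches=10, batch_size=1000, seed=0):
    """Write n_batches * batch_size random rows and return the final row count."""
    rng = np.random.default_rng(seed)
    with tables.open_file(path, mode="w") as h5:
        table = h5.create_table(
            "/", "measurements", Measurement,
            expectedrows=n_batches * batch_size,
        )
        for _ in range(n_batches):
            # Build one batch as a structured array matching the table layout
            batch = np.zeros(batch_size, dtype=table.dtype)
            batch["patient_id"] = rng.integers(0, 1000, size=batch_size)
            batch["value"] = rng.normal(size=batch_size)
            table.append(batch)
        table.flush()
        return table.nrows
```

Because rows are written batch by batch, the file size grows with n_batches while memory use depends only on batch_size, so raising n_batches is the simplest way to produce a bigger file.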