The code behind these case studies is intended as a communication tool for the ideas expressed in the book. This code is as far from production code as it can get. Moreover, many of the techniques presented are already available in the underlying libraries used. The intent is to show a step-by-step implementation of the different feature engineering techniques.
See the companion website for details about the book: http://artoffeatureengineering.com/.
The Python style is also intentionally kept simple for people coming from other languages who plan to use the ideas described in the book outside of Python.
As described in the book, the case studies explore as many feature engineering ideas as possible within the following limits:
- At most one day of execution time per notebook.
- No GPU required.
- Minimal dependencies.
- At most 8Gb of RAM.
As a result of these constraints, these notebooks do not undergo as much hyperparameter tuning as would be necessary. This is a shortcoming of these case studies; keep it in mind if you want to follow a similar path with your experiments.
Minor issues:
As these case studies are foremost an educational tool, I expect readers might want to try variants of some cells in isolation. To help with that, I have tried to make the cells executable without having to re-run the whole notebook. That means that most cells read everything they need from disk and write all their results back to disk. This is unnecessary in normal notebooks, where the values remain in memory, so the code for each cell might look long and somewhat unusual. In a sense, each cell tries to be a separate Python program. To support this kind of independent tweaking, I am also distributing these intermediate files alongside the input data.
I dislike Pandas with a passion and discourage its use at any level. These notebooks are Pandas-free, which might seem unusual to some.
The last topic in the last chapter (recommendation as imputation) uses more than 8Gb of RAM.
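The "each cell is a separate program" convention mentioned above can be sketched as follows. This is a minimal illustration and not code taken from the notebooks: the function name `run_cell`, the file names and the `name_length` feature are all made up for the example. It uses only the standard library, in keeping with the Pandas-free style.

```python
import csv
import os
import tempfile


def run_cell(input_path, output_path):
    """Hypothetical cell: read rows from disk, add a feature, write back to disk."""
    # Read everything the cell needs from disk (no state shared with other cells).
    with open(input_path, newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)
        rows = [row for row in reader]

    # Illustrative feature: the length of the text in the first column.
    header.append("name_length")
    for row in rows:
        row.append(str(len(row[0])))

    # Write all results back to disk so that a later cell can pick them up.
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)


# Usage with throwaway files; the file names are invented for this sketch.
work_dir = tempfile.mkdtemp()
input_path = os.path.join(work_dir, "cell_input.tsv")
output_path = os.path.join(work_dir, "cell_output.tsv")
with open(input_path, "w", newline="") as f:
    f.write("name\tpopulation\nParis\t2100000\nLyon\t500000\n")

n_rows = run_cell(input_path, output_path)
print(n_rows)
```

Because the cell only depends on files, a variant of it can be re-run on its own as long as its input file exists on disk.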
The data is available in both Zip and Tar BZip2 files. Chapter 9 (images) uses a tile set provided by NASA. It contains 88 thousand tiles occupying 6Gb of space. These tiles are used at the beginning of Chapter 9's notebook to generate 80 thousand boxes around each city or town. These boxes occupy less than 1Gb of disk space. As such, I am distributing the boxes and leaving the tiles for a separate download, in the event you might want to experiment with other techniques extracting more data from the original tiles. Otherwise the feature engineering techniques in Chapter 9's notebook should run fine from the extracted boxes.
- [Book Data, Zip format 2.4Gb](http://artoffeatureengineering.com/data/feateng_data.zip)
- [Book Data, Tar BZip2 format 1.8Gb](http://artoffeatureengineering.com/data/feateng_data.bz2)
- [Tile Data, Zip format 5.4Gb](http://artoffeatureengineering.com/data/feateng_tiles.zip)
- [Tile Data, Tar BZip2 format 5.3Gb](http://artoffeatureengineering.com/data/feateng_tiles.bz2)
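Once one of the archives above has been downloaded, it can be unpacked with any standard tool. For readers who prefer to stay in Python, the sketch below shows one possible way using only the standard library. The helper name `extract_archive` is hypothetical, and the code assumes that a file which is not a Zip archive is a bzip2-compressed tar file, as the download labels suggest. The internal layout of the archives is not assumed, so the demonstration builds a tiny archive of its own.

```python
import os
import tarfile
import tempfile
import zipfile


def extract_archive(archive_path, dest_dir):
    """Unpack a Zip or Tar BZip2 archive into dest_dir; return the member names."""
    os.makedirs(dest_dir, exist_ok=True)
    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(dest_dir)
            return archive.namelist()
    # Assumption: anything that is not a Zip file is a bzip2-compressed tar.
    with tarfile.open(archive_path, mode="r:bz2") as archive:
        names = archive.getnames()
        archive.extractall(dest_dir)
        return names


# Demonstration on a tiny archive created on the fly (not the book data).
demo_dir = tempfile.mkdtemp()
sample_path = os.path.join(demo_dir, "sample.txt")
with open(sample_path, "w") as f:
    f.write("hello")
demo_archive = os.path.join(demo_dir, "demo.bz2")
with tarfile.open(demo_archive, mode="w:bz2") as archive:
    archive.add(sample_path, arcname="sample.txt")

out_dir = os.path.join(demo_dir, "out")
extracted = extract_archive(demo_archive, out_dir)
print(extracted)
```

For the real downloads, `archive_path` would point at the downloaded file and `dest_dir` at wherever the notebooks expect their data.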
    python3 -m venv feateng
    source ./feateng/bin/activate
    sudo apt install python-pydot python-pydot-ng graphviz
    pip3 install jupyter
    pip3 install ipykernel
    pip3 install scikit-learn
    pip3 install lxml
    pip3 install numpy
    pip3 install scipy
    pip3 install matplotlib
    pip3 install graphviz
    pip3 install statsmodels
    pip3 install stemming
    pip3 install gensim
    pip3 install opencv-python
    pip3 install geopy
    python -m ipykernel install --user --name feateng
    jupyter notebook --no-browser .
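After running the installation commands above, a quick way to confirm that the environment is complete is to check that each package can be located from Python. The sketch below is an optional convenience and not part of the notebooks. The helper `missing_modules` is a hypothetical name, and the list uses import names, which differ from the pip names in two cases (`scikit-learn` is imported as `sklearn`, `opencv-python` as `cv2`).

```python
import importlib.util

# Import names for the packages installed with pip; some differ from their
# pip names (scikit-learn -> sklearn, opencv-python -> cv2).
REQUIRED_MODULES = [
    "sklearn", "lxml", "numpy", "scipy", "matplotlib", "graphviz",
    "statsmodels", "stemming", "gensim", "cv2", "geopy",
]


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules were found.")
```

Run it from inside the activated `feateng` environment (or from a notebook using the `feateng` kernel) so that it inspects the right set of packages.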
The folder tourism contains a case study for the feature engineering chapter in the book Applied Data Science in Tourism: Interdisciplinary Approaches, Methodologies and Applications. It uses pyspark to solve an AirBnB price prediction task.
An extension and improvement of the case studies in Chapter 10 is available in the repository for the RIIAA'20 Workshop "Feature Engineering for Spatial and Temporal Data".