This is a tutorial demonstrating a pipeline-oriented data analytics approach applied to taxi trip duration data. This project should NOT be viewed as an example of how to solve a particular regression problem. Rather, it demonstrates how to organize computation when solving data analytics problems. While solving the toy problem, some features were introduced artificially, purely for demo purposes.
The proposed approach is described in the following articles:
- Pipeline-Oriented Data Analytics with Spark ML
- Pipeline-Oriented Data Analytics with Spark ML. Part 2
- Run `make init test` to initialize the conda environment and launch the tests.
- (Optional, sample datasets are available) Download the complete train and test datasets from Kaggle's New York City Trip Duration, extract them and overwrite `train.csv`, `test.csv` in the `data/raw` folder.
- Run `make distance_matrix` to generate the distance matrix.
- Run `make prepare_train features_train train` to pre-process the train data, extract train features and train.
- Run `make prepare_test features_test predict` to pre-process the test data, extract test features and predict.
- Run `make select_params` to run hyper-parameter tuning.
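
The steps above can be chained in a small shell script. This is a minimal sketch and not part of the project: the function name `run_workflow` and the `MAKE_CMD` override are hypothetical names introduced here for illustration, while the make targets are exactly those listed above, in the same order. It assumes the script is sourced or run from the repository root, where the Makefile lives.

```shell
# Sketch: run the documented make targets in order, stopping at the first failure.
set -eu

# MAKE_CMD is a hypothetical override (not defined by the project).
# Setting MAKE_CMD="echo make" previews the commands without running them.
MAKE_CMD="${MAKE_CMD:-make}"

run_workflow() {
  # Initialize the conda environment and launch the tests.
  $MAKE_CMD init test
  # Generate the distance matrix.
  $MAKE_CMD distance_matrix
  # Pre-process train data, extract train features and train.
  $MAKE_CMD prepare_train features_train train
  # Pre-process test data, extract test features and predict.
  $MAKE_CMD prepare_test features_test predict
  # Run hyper-parameter tuning.
  $MAKE_CMD select_params
}
```

Calling `run_workflow` after these definitions executes the whole sequence; because of `set -e`, a failing target aborts the run before the later steps start. The optional dataset download is left out, since the bundled sample datasets are enough to run every step.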