pipeline-oriented-analytics

This is a tutorial demonstrating a pipeline-oriented data analytics approach applied to taxi trip duration data. This project should NOT be viewed as an example of how to solve a particular regression problem. Rather, it demonstrates how to organize computation when solving data analytics problems. While solving the toy problem, some features were introduced artificially, purely for demo purposes.

The proposed approach is described in the following articles:

Prerequisites

Getting started

Run make init test to initialize the conda environment and run the tests.
(Optional; sample datasets are included.) Download the complete train and test datasets from Kaggle's New York City Taxi Trip Duration competition, extract them, and overwrite train.csv and test.csv in the data/raw folder.

Running examples

Run make distance_matrix to generate the distance matrix.
Run make prepare_train features_train train to pre-process the train data, extract train features, and train the model.
Run make prepare_test features_test predict to pre-process the test data, extract test features, and predict.
Run make select_params to run hyper-parameter tuning.

bbiletskyy/pipeline-oriented-analytics
