Welcome to the repository hosting the code for the workshop on end-to-end data analysis workflows in Python!
Author: Katie Malone, data scientist @ Civis Analytics
Getting started: There are two IPython notebooks containing all the relevant code:
- african_wells.ipynb (starter code, and some relevant explanations/links)
- african_wells_solutions.ipynb
The first has prompts only and will hopefully be the only one you need. If you get stuck, or you are going through this workshop asynchronously, the second one might be a useful reference.
To work through this example, you will need the training and testing data for the "Pump it Up: Data Mining the Water Table" competition hosted on drivendata.org. In the notebooks, we call the training features and labels files wells_features.csv and wells_labels.csv, respectively.
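As a quick check that the two files are in place, the sketch below loads them with pandas and joins them into a single table. It assumes both files share a row identifier column named `id`; that column name and the helper `load_wells` are illustrative assumptions, not something taken from the notebooks, so adjust them to match your downloaded files.

```python
import pandas as pd


def load_wells(features_path="wells_features.csv",
               labels_path="wells_labels.csv",
               key="id"):
    """Load the training features and labels and join them on a shared key.

    `key` is an assumed column name; change it if your files differ.
    """
    features = pd.read_csv(features_path)
    labels = pd.read_csv(labels_path)
    # validate="one_to_one" raises an error if the key is duplicated in
    # either file, which catches a mismatched or corrupted download early.
    return features.merge(labels, on=key, how="inner", validate="one_to_one")


# Example usage, once the two CSV files are in the working directory:
# wells = load_wells()
# print(wells.shape)
```

If the merged table has fewer rows than the features file, some rows had no matching label and are worth inspecting before you start modeling.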
Software requirements are as follows:
- python 3 (2.x might work with minimal changes, but no guarantees)
- ipython
- pandas
- numpy
- scipy
- scikit-learn (imported as sklearn)
If you are starting from scratch, and have none of the above installed, consider getting them via Anaconda, which includes all of the above (and more!) in an easy-to-use bundle.
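To confirm that everything in the list above is available before opening the notebooks, a small check like the following may help. It is only a sketch: the helper name `check_requirements` is hypothetical, and the list uses the import names of the packages (for example, `IPython` and `sklearn`), which can differ from the names used to install them.

```python
import importlib

# Import names for the packages in the requirements list.
REQUIRED = ["IPython", "pandas", "numpy", "scipy", "sklearn"]


def check_requirements(names=REQUIRED):
    """Return a dict mapping each module name to its version, or None if missing."""
    report = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            # Fall back to "unknown" for modules that do not expose __version__.
            report[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            report[name] = None
    return report


for name, version in check_requirements().items():
    print(f"{name}: {version if version is not None else 'NOT INSTALLED'}")
```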
Finally, as you get toward the bottom of the notebook and GridSearchCV, you may want to consider porting your notebook workflow into a Python script that can be run from your terminal. Anecdotally, I have found that some commands run very slowly in the notebook but faster in a script.
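A minimal sketch of what such a script could look like is shown below. It is not the workshop's solution: the data is synthetic (generated with `make_classification`) so the script runs without the competition files, and the estimator, the parameter grid, and the function name `run_search` are placeholders to be replaced with whatever you built in the notebook. The timing printout makes it easy to compare against the same search run inside the notebook.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


def run_search(n_jobs=1):
    """Run a small grid search and return the fitted search and elapsed seconds."""
    # Synthetic stand-in data; replace with your prepared wells features/labels.
    X, y = make_classification(n_samples=300, n_features=10, random_state=0)

    # Placeholder grid; swap in the parameters you want to tune.
    param_grid = {"n_estimators": [10, 20], "max_depth": [3, None]}
    search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid,
        cv=3,
        n_jobs=n_jobs,  # n_jobs=-1 would use all available cores
    )

    start = time.perf_counter()
    search.fit(X, y)
    elapsed = time.perf_counter() - start
    return search, elapsed


if __name__ == "__main__":
    search, elapsed = run_search()
    print(f"best params: {search.best_params_}")
    print(f"best CV score: {search.best_score_:.3f}")
    print(f"search took {elapsed:.1f} s")
```

Saved under a name of your choosing (say, `wells_search.py`, a hypothetical filename), this would be run from the terminal with `python wells_search.py`.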
When you've made a workflow that you're satisfied with, I strongly suggest that you submit it to drivendata.org, and get involved in the competition!