- You probably haven't learned everything you need to from your business data.
- Dashboards and SQL display results; they are not adequate analysis tools.
- Machine learning isn't just for making predictions; you can use it for investigation.
- "What am I missing?" is the most important question in data.
- Investigators must over-measure then distill with ML in order to miss less.
- Statistical testing (A/B/n) is important to assess an idea, but we need to generate better ideas.
- ML-powered exploration, by taking more into account, can lead to deeper, more complete theories about what is happening.
- See notebook 3 for an example, "Latent dimensions: The most powerful analysis nobody does."
This project uses a simple Linux GPU setup to model stock market data. It uses the Anaconda Python distribution and data from the Sharadar Core US Equities Bundle on Nasdaq Data Link. The environment variables `NASDAQ_DATA_API_KEY` and `DATA_HOME` are expected.
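Because both variables are expected, a script can fail early with a clear message when one is missing. The following is a minimal sketch and makes no claim about how `download.py` actually reads them; the helper names `require_env` and `demo_dir` are hypothetical.

```python
import os
from pathlib import Path

# The two variables this README says are expected.
REQUIRED_VARS = ("NASDAQ_DATA_API_KEY", "DATA_HOME")


def require_env(environ=None):
    """Return the required settings, or raise if any is unset or empty.

    `environ` defaults to os.environ; passing a dict makes the check testable.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}


def demo_dir(environ=None):
    """Build the project data directory, $DATA_HOME/analytics_demo."""
    return Path(require_env(environ)["DATA_HOME"]) / "analytics_demo"
```

Calling `demo_dir()` at the top of a script or notebook surfaces a missing variable before any download or database work starts.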
The `download.py` script fetches the tables, stores them in Parquet files, then loads them into a DuckDB file, all within `$DATA_HOME/analytics_demo`. The dbt project, located in `dbt_sharadar_demo`, must then be run for preprocessing.

[Figure: dbt lineage graph]
The notebooks:
- 1_prep.ipynb prepares the target and feature data for notebooks 2 and 3.
- 2_prediction.ipynb gives a classic ML use case for prediction.
- 3_understanding.ipynb demonstrates the main purpose of this repo. It gives a simple, powerful example of how to use ML to build understanding.
- 7_business_credit_risk_proxy.ipynb is a more stand-alone notebook about a fun industry-specific problem.
- Create the Anaconda environment, then activate it.

      conda env create -f environment.yml --solver=libmamba
      conda activate analytics_demo
- Set up the dbt config, typically found at `~/.dbt/profiles.yml`, to include the database filepath.

      dbt_sharadar_demo:
        outputs:
          prod:
            type: duckdb
            path: "{{ env_var('DATA_HOME') }}/analytics_demo/sharadar.duckdb"
            threads: 2
        target: prod
- Fetch tables.

      python3 download.py

- Move to the dbt folder to run dbt.

      cd dbt_sharadar_demo
      dbt run
      cd ..

- Launch Jupyter to host notebooks if you prefer this over an IDE.

      jupyter lab
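The fetch and dbt steps above can also be chained from Python, which may be convenient when refreshing the data repeatedly. This is a sketch only: the helper names `build_steps` and `run_steps` are hypothetical, and it assumes the Anaconda environment is already active so that `python3` and `dbt` resolve to the right executables.

```python
import subprocess
from pathlib import Path


def build_steps(repo_root):
    """Return (command, working directory) pairs for the fetch and dbt steps."""
    root = Path(repo_root)
    return [
        # Fetch the tables from the repository root.
        (["python3", "download.py"], root),
        # Run dbt from inside the dbt project folder.
        (["dbt", "run"], root / "dbt_sharadar_demo"),
    ]


def run_steps(repo_root):
    """Run each step in order; check=True raises on the first failing command."""
    for command, cwd in build_steps(repo_root):
        subprocess.run(command, cwd=cwd, check=True)
```

Calling `run_steps(".")` from the repository root runs both steps in sequence. Passing `cwd` to each command replaces the `cd dbt_sharadar_demo` and `cd ..` pair, so the caller's working directory never changes.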
To remove the Anaconda environment, run:

    conda env remove --name analytics_demo

See the Anaconda uninstall documentation if you also wish to remove Anaconda itself.