(You can work in pairs if you want to.)
In this tutorial, you are going to use DVC and MLflow to set up an existing ML pipeline in a GitHub repository, look at the experiment results, and finally build a CI pipeline with GitHub Actions. This is the starter repository you should use as the basis for your own GitHub repository.
Prerequisites:
- Install DVC
- Create a GitHub account (if you don't have one already)
- Create a new GitHub repository for your account and initialize the repo with a README to allow direct cloning (alternative: fork this starter repo)
- Clone the new or forked repo to your local machine
- Install the dependencies (`pip install -r requirements.txt`), which will also install MLflow (one possible setup sequence is sketched after this list)
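The following shell sketch shows one possible setup sequence. The repository URL is a placeholder for your own new or forked repo, and installing DVC via pip is only one of several supported options:

```bash
# Clone your own repo (placeholder URL) and install everything
git clone https://github.com/<your-username>/<your-repo>.git
cd <your-repo>

pip install dvc                  # one way to install DVC
pip install -r requirements.txt  # also installs MLflow
```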
Before you start working on the tasks below, it might be a good idea to first check out the quickstart guides for both DVC and MLflow.
Tasks:
- Copy the files from the starter repo to your local repo (not needed if you forked it).
- Initialize DVC in the local repo (the relevant DVC commands are sketched after this list).
- The training script (`src/train.py`) requires a CSV dataset for the quality of red wine. Use the `dvc list` command to display all DVC artifacts in the `data` folder of the following repo: https://github.com/se4ai-lecture/dvc-artifacts
- After you've identified the correct CSV file, use `dvc import` to import the remote dataset into your own repo (both commands are sketched after this list).
- Extend the script (`src/train.py`) with the necessary MLflow experiment tracking and model storing code (see `TODO` comments). Two hyperparameters and three evaluation metrics should be tracked (a minimal tracking sketch follows this list).
- Execute the updated training script to start the experiments (`python src/train.py`).
- Start the MLflow web UI and look at the experiment results (see the commands after this list). Which hyperparameter configuration leads to the best R2 score?
- Commit and push everything to your GitHub repo (except for the `mlruns` folder and the actual dataset, but those are already excluded via `.gitignore`).
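For the DVC steps above, the commands could look roughly like this; the CSV filename is a placeholder until you've identified the real one with `dvc list`:

```bash
# Initialize DVC in the repo and commit its config files
dvc init
git commit -m "Initialize DVC"

# List the DVC artifacts in the data folder of the artifact repo
dvc list https://github.com/se4ai-lecture/dvc-artifacts data

# Import the identified dataset into your own repo (placeholder filename)
dvc import https://github.com/se4ai-lecture/dvc-artifacts data/<dataset>.csv
```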
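For the MLflow extension of `src/train.py`, a minimal sketch could look like the following. It assumes a scikit-learn ElasticNet model on the red-wine CSV; the dataset path and separator, the hyperparameter names (`alpha`, `l1_ratio`), and the metric names are all assumptions, so follow the actual `TODO` comments in the script:

```python
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Path and separator are assumptions; the UCI red-wine CSV uses semicolons
data = pd.read_csv("data/winequality-red.csv", sep=";")
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(columns=["quality"]), data["quality"], random_state=42
)

alpha, l1_ratio = 0.5, 0.5  # the two hyperparameters (names are assumptions)

with mlflow.start_run():
    # Track the two hyperparameters
    mlflow.log_param("alpha", alpha)
    mlflow.log_param("l1_ratio", l1_ratio)

    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)

    # Track the three evaluation metrics
    mlflow.log_metric("rmse", mean_squared_error(y_test, preds) ** 0.5)
    mlflow.log_metric("mae", mean_absolute_error(y_test, preds))
    mlflow.log_metric("r2", r2_score(y_test, preds))

    # Store the trained model as an MLflow artifact
    mlflow.sklearn.log_model(model, "model")
```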
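Running the experiments and inspecting them is then just two commands; `mlflow ui` reads the local `mlruns` folder and serves the web UI (by default at http://localhost:5000):

```bash
python src/train.py
mlflow ui   # then open http://localhost:5000 in your browser
```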
Before you start working on the task below, it might be a good idea to first check out the quickstart guide for GitHub Actions.
Set up a CI pipeline using GitHub Actions to automatically run the training script created in part 1 on every push. The Python Starter Workflow might be a good basis that you can adapt. Remember that the CI runner will also need to retrieve the dataset via DVC.
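A workflow sketch, loosely adapted from the Python starter workflow, might look like this. The file name, Python version, and the data-retrieval step are assumptions; for data brought in with `dvc import`, `dvc pull` (or `dvc update`) can fetch it from the source repo:

```yaml
# .github/workflows/train.yml (file name is an assumption)
name: train

on: [push]

jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install dvc  # in case DVC is not in requirements.txt
      - name: Retrieve dataset via DVC
        run: dvc pull
      - name: Run training
        run: python src/train.py
```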