This repository shows one way to detect data drift when using Valohai. Here, we use WhyLabs to generate drift reports for the input image data. We also show how to automatically trigger retraining of the model, how to add a human approval step to the pipeline, and how to use Valohai actions.
Drift in machine learning refers to a change over time in input data or the relationship between input and output data, impacting model performance. Drift can lead to reduced model accuracy as the model becomes less effective over time, necessitating regular updates.
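To make the idea concrete, here is a minimal sketch that quantifies drift in a single numeric feature by comparing a reference sample with a newer one. The feature (mean image brightness), the sample values, and the function name `ks_statistic` are illustrative assumptions, and this is not necessarily the statistic WhyLabs computes.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples (0 = identical, 1 = disjoint)."""
    ref, cur = sorted(reference), sorted(current)
    if not ref or not cur:
        raise ValueError("both samples must be non-empty")

    def ecdf(sorted_values, x):
        # Fraction of values that are <= x.
        return bisect_right(sorted_values, x) / len(sorted_values)

    # The largest gap is always reached at one of the observed values.
    return max(abs(ecdf(ref, x) - ecdf(cur, x)) for x in ref + cur)


# Hypothetical mean-brightness values for training and production images.
train_brightness = [0.42, 0.45, 0.47, 0.50, 0.52, 0.55]
prod_brightness = [0.61, 0.64, 0.66, 0.70, 0.72, 0.75]

print(ks_statistic(train_brightness, train_brightness))  # identical samples: 0.0
print(ks_statistic(train_brightness, prod_brightness))   # fully shifted sample: 1.0
```

A value near 0 suggests the new data resembles the reference data, while a value near 1 suggests the feature has shifted substantially.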
A basic pipeline that preprocesses the data, trains the model, and evaluates the results.
It consists of three steps:
- Data Preprocessing:
  - Load compressed data from the S3 bucket
  - Preprocess images
  - Save to Valohai datasets
- YOLO finetuning:
  - Create a YAML file with the data path, in a format readable by YOLO
  - Train YOLO using the ultralytics library
  - Save the best model with a Valohai alias
- Evaluation:
  - Load and evaluate the model using ultralytics
  - Save the results to Valohai
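The YAML step above can be sketched as follows. The dataset layout (`images/train`, `images/val`), the class name, and the helper `write_dataset_yaml` are hypothetical; the keys `path`, `train`, `val`, and `names` follow the dataset format that ultralytics documents.

```python
import json
import tempfile
from pathlib import Path


def write_dataset_yaml(dataset_root, class_names, out_path):
    """Write a minimal YOLO dataset description file and return its path."""
    lines = [
        f"path: {json.dumps(str(dataset_root))}",  # dataset root directory
        "train: images/train",  # training images, relative to `path`
        "val: images/val",      # validation images, relative to `path`
        "names:",
    ]
    # Map each class index to its name; JSON strings are valid YAML scalars.
    lines += [f"  {i}: {json.dumps(name)}" for i, name in enumerate(class_names)]
    out_path = Path(out_path)
    out_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out_path


# Example with a hypothetical single-class dataset in a temporary directory.
work_dir = Path(tempfile.mkdtemp())
yaml_path = write_dataset_yaml(work_dir / "dataset", ["ship"], work_dir / "data.yaml")
print(yaml_path.read_text(encoding="utf-8"))
```

The resulting file can then be passed to ultralytics training through its `data` argument.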
Training Pipeline view in Valohai:
Runs inference with the fine-tuned model and detects data drift using WhyLabs.
It consists of two steps:
- Drift detection:
  - Load the data and the model
  - Run inference on the data
  - Log the data to WhyLabs
  - Create inference and reference (from train data) profiles
  - Generate a summary drift report with WhyLabs in HTML (summary_drift_report.html)
    _Note: We set a threshold on the number of image characteristics showing drift in WhyLabs. Once this threshold is reached, we initiate the training pipeline._
  - If drift is detected, change the Status detail.
  - If drift is not detected, the pipeline is stopped (Valohai actions docs; see valohai.yaml -> drift-detection-pipeline)
- Call retrain:
  - The node starts only if drift was detected in the previous step.
  - When the node starts, it requires human approval (Valohai actions docs)
    - You can set up a notification for when the pipeline requires human approval under project Settings -> Notifications -> pipeline node approval required.
  - If approved, an API call starts the Training Pipeline to retrain the model because drift was detected.
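The thresholding described in the note can be sketched as below. The feature names, the p-values, and both cut-offs are made-up placeholders rather than values from this repository. The sketch assumes the drift report yields one p-value per image characteristic and that the result is reported by printing a JSON line, which Valohai collects as execution metadata.

```python
import json

P_VALUE_CUTOFF = 0.05       # assumed: below this, a feature counts as drifted
MAX_DRIFTED_FEATURES = 2    # assumed: retrain once this many features drift


def drifted_features(p_values, cutoff=P_VALUE_CUTOFF):
    """Return the names of features whose p-value falls below the cutoff."""
    return sorted(name for name, p in p_values.items() if p < cutoff)


def should_retrain(p_values, max_drifted=MAX_DRIFTED_FEATURES):
    """Request retraining once the number of drifted features reaches the threshold."""
    return len(drifted_features(p_values)) >= max_drifted


# Hypothetical per-feature p-values taken from a drift report.
report = {
    "image.Brightness.mean": 0.01,
    "image.Hue.mean": 0.30,
    "image.Saturation.mean": 0.02,
    "image.ImagePixelWidth": 0.90,
}

# One JSON object per line on stdout is recorded by Valohai as metadata.
print(json.dumps({
    "drift": int(should_retrain(report)),
    "drifted_features": len(drifted_features(report)),
}))
```

A pipeline action in valohai.yaml could then read such a metadata value to decide whether the pipeline continues to the retraining node.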
Drift Detection Pipeline view in Valohai:
Overall flow of the project:
To run your code on Valohai using the terminal, follow these steps:
- Install the Valohai command-line client and the valohai-utils library on your machine by running the following command:
pip install valohai-cli valohai-utils
- Log in to Valohai from the terminal using the command:
vh login
- Create a project for your Valohai workflow.
Start by creating a directory for your project:
mkdir valohai-drift-example
cd valohai-drift-example
Then, create the Valohai project:
vh project create
- Clone the repository to your local machine:
git clone https://github.com/valohai/drift-example.git .
Congratulations! You have successfully cloned the repository, and you can now modify the code and run it using Valohai.
To run individual steps, execute the following command:
vh execution run <step-name> --adhoc
For example, to run the prepare_data step, use the command:
vh execution run prepare_data --adhoc
To run pipelines, use the following command:
vh pipeline run <pipeline-name> --adhoc
For example, to run the train-val-pipeline pipeline, use the command:
vh pipeline run train-val-pipeline --adhoc
In this project, you need private tokens in two places: for WhyLabs and for the Valohai API in call-retrain.py.
Note that you should never include the token in your version control. Instead of pasting it directly into your code, we recommend storing it as a secret environment variable.
You can add environment variables in a couple of ways in Valohai.
- Add the environment variable when creating an execution from the UI (Create Execution -> Environment Variables). The environment variable is only available in the execution where it was created.
- Add a project environment variable (Project Settings -> "Environment Variables" tab -> check the "Secret" checkbox). In this case, the environment variable is available to all executions of the project.
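Inside an execution, the secret is then read from the environment instead of the source code. A minimal sketch, assuming a variable named `VALOHAI_API_TOKEN` (a hypothetical name; use whatever you configured) and the token-style Authorization header used by the Valohai REST API:

```python
import os


def valohai_auth_headers(env=os.environ):
    """Build request headers from a token stored in the environment."""
    token = env.get("VALOHAI_API_TOKEN")  # hypothetical variable name
    if not token:
        # Fail early with a clear message instead of sending an empty token.
        raise RuntimeError("VALOHAI_API_TOKEN is not set")
    return {
        "Authorization": f"Token {token}",
        "Content-Type": "application/json",
    }


# Demonstration with a fake value; a real token should only come from the
# secret environment variable, never from the source code.
headers = valohai_auth_headers({"VALOHAI_API_TOKEN": "not-a-real-token"})
print(sorted(headers))
```

Passing the environment as a parameter keeps the helper easy to test without touching real credentials.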
WhyLabs is presented here as one option for detecting data drift in image data. Valohai does not restrict you from using other monitoring tools, such as EvidentlyAI, Fiddler, Censius, or NeptuneAI.