This repository demonstrates how to detect data drift in a machine learning pipeline using EvidentlyAI integrated with Valohai. It showcases the steps to preprocess data, train a model, and monitor data drift, with automated retraining triggered upon drift detection.
Data drift in machine learning refers to the change in input data distribution or the relationship between input and output data over time, which can adversely affect model performance. Monitoring and managing drift is crucial to maintaining model accuracy and reliability in production.
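To make the idea of a shifted input distribution concrete, here is a minimal sketch that compares a reference sample with a current sample using the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs). This is only an illustration written with the standard library; it is not necessarily the test EvidentlyAI applies in this repository, and the function name `ks_statistic` and the synthetic samples are assumptions made for the example.

```python
import random
from bisect import bisect_right


def ks_statistic(reference, current):
    """Return the largest gap between the empirical CDFs of two samples."""
    ref_sorted = sorted(reference)
    cur_sorted = sorted(current)
    max_gap = 0.0
    # The maximum gap between two step functions occurs at an observed value.
    for value in ref_sorted + cur_sorted:
        ref_cdf = bisect_right(ref_sorted, value) / len(ref_sorted)
        cur_cdf = bisect_right(cur_sorted, value) / len(cur_sorted)
        max_gap = max(max_gap, abs(ref_cdf - cur_cdf))
    return max_gap


# Synthetic data for illustration only (not from the repository).
rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(1000)]
same_distribution = [rng.gauss(0.0, 1.0) for _ in range(1000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(1000)]

print("no drift:", ks_statistic(reference, same_distribution))
print("drift:   ", ks_statistic(reference, shifted))
```

A statistic close to 0 means the two samples look alike; a larger value suggests the feature's distribution has moved.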
This pipeline preprocesses the data and trains the model.
Data Preprocessing:
- Load the dataset from Valohai inputs, or fetch the California Housing dataset if no input is provided.
- Preprocess the data.
- Save the processed data to Valohai with an alias.
Model Training:
- Load the preprocessed data.
- Train the model using scikit-learn.
- Save the trained model with a Valohai alias.
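Both steps above save an output under a Valohai alias. The sketch below shows one way this could be done, assuming Valohai's sidecar-metadata convention: a `<filename>.metadata.json` file with a `valohai.alias` key, written next to the output in the directory given by `VH_OUTPUTS_DIR`. The helper name `save_with_alias`, the file names, and the alias values are hypothetical, and the repository's scripts may use `valohai-utils` helpers instead.

```python
import json
import os
from pathlib import Path


def save_with_alias(content: bytes, filename: str, alias: str, outputs_dir=None) -> Path:
    """Write an output file plus a sidecar metadata file that requests an alias."""
    # Assumption: Valohai collects outputs from VH_OUTPUTS_DIR (default /valohai/outputs).
    out_dir = Path(outputs_dir or os.environ.get("VH_OUTPUTS_DIR", "/valohai/outputs"))
    out_dir.mkdir(parents=True, exist_ok=True)

    target = out_dir / filename
    target.write_bytes(content)

    # Sidecar file: same name as the output with ".metadata.json" appended.
    sidecar = out_dir / f"{filename}.metadata.json"
    sidecar.write_text(json.dumps({"valohai.alias": alias}))
    return target
```

For example, `save_with_alias(csv_bytes, "processed.csv", "processed-data")` would let a later step refer to the file by alias rather than by a specific datum ID (both names here are placeholders).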
This pipeline performs inference with the trained model and detects data drift using EvidentlyAI.
Inference and Drift Detection:
- Load the reference dataset, current dataset, and the trained model.
- Perform inference on the current dataset.
- Generate data drift reports using EvidentlyAI.
- Save the drift reports in JSON and HTML formats.
Conditional Retraining:
- Check if drift is detected based on the reports.
- If drift is detected, update the status and trigger the retraining pipeline.
- If no drift is detected, stop the pipeline.
- Data is preprocessed and stored.
- Model is trained and evaluated.
- Inference is performed on new data to detect drift.
- If drift is detected, the pipeline triggers retraining with human approval.
- If no drift is detected, the pipeline stops.
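The drift check that decides between retraining and stopping can be sketched as follows. The code assumes the saved JSON report has a top-level `metrics` list whose entries carry a `result` object with a boolean `dataset_drift` field; the actual layout depends on the EvidentlyAI version and the metrics used, so treat the keys as assumptions. The function names and the hand-written example report are illustrative only.

```python
import json


def dataset_drift_detected(report: dict) -> bool:
    """Return the dataset-level drift flag from a parsed drift report."""
    # Assumed layout: {"metrics": [{"result": {"dataset_drift": <bool>, ...}}, ...]}
    for entry in report.get("metrics", []):
        result = entry.get("result", {})
        if "dataset_drift" in result:
            return bool(result["dataset_drift"])
    raise KeyError("no 'dataset_drift' field found in report")


def next_action(report: dict) -> str:
    """Map the drift flag to the pipeline decision described above."""
    return "retrain" if dataset_drift_detected(report) else "stop"


# Hand-written report for illustration; not output from a real run.
example_report = json.loads(
    '{"metrics": [{"metric": "DatasetDriftMetric",'
    ' "result": {"dataset_drift": true, "share_of_drifted_columns": 0.5}}]}'
)
print(next_action(example_report))  # prints "retrain"
```

Raising an error when the flag is missing, rather than defaulting to "no drift", avoids silently skipping retraining if the report format changes.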
To run this code on Valohai from your terminal, follow these steps:
Install Valohai CLI and utilities:
pip install valohai-cli
Log in to Valohai from the terminal:
vh login
Create a directory for your project and initialize a Valohai project:
mkdir valohai-evidently-example
cd valohai-evidently-example
vh project create
Clone this repository into your project directory:
git clone https://github.com/valohai/evidently-example.git .
To run individual steps:
vh execution run <step-name> --adhoc
Example to run the preprocessing step:
vh execution run preprocess --adhoc
To run the entire pipeline:
vh pipeline run <pipeline-name> --adhoc
Example to run the inference and drift detection pipeline:
vh pipeline run inference-drift-detection-pipeline --adhoc
This project needs a private API token so that call-retrain.py can use the Valohai API.
Note that you should never include the token in your version control. Instead of pasting it directly into your code, we recommend storing it as a secret environment variable.
You can add environment variables in a couple of ways in Valohai.
- Add the environment variable when creating an execution from the UI (Create Execution -> Environment Variables). The variable is only available in the execution where it was created.
- Add the project environment variable (Project Settings -> "Environment Variables" tab -> Check "Secret" checkbox). In this case, the env variable will be available for all executions of the project.
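Once the token is stored as a secret environment variable, a script can read it at runtime instead of hard-coding it. The following is a minimal sketch, not the contents of call-retrain.py: the variable name `VALOHAI_API_TOKEN` and the helper `build_auth_headers` are assumptions chosen for the example, while the `Authorization: Token <token>` header follows the scheme used by the Valohai API.

```python
import os

# Hypothetical variable name; use whatever name you configured in Valohai.
TOKEN_ENV_VAR = "VALOHAI_API_TOKEN"


def build_auth_headers(environ=os.environ) -> dict:
    """Build request headers from a token stored in the environment."""
    token = environ.get(TOKEN_ENV_VAR)
    if not token:
        # Fail early with a clear message instead of sending an unauthenticated request.
        raise RuntimeError(f"Environment variable {TOKEN_ENV_VAR} is not set")
    return {"Authorization": f"Token {token}"}
```

The returned dictionary can be passed as the `headers` argument of whichever HTTP client the script uses. Because the token only ever lives in the environment, it never appears in version control.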