This is a personal MLOps project based on a Kaggle dataset for credit default predictions.
It was developed as part of this End-to-end MLOps with Databricks course, and you can walk through it together with this Medium publication.
Feel free to ⭐ and clone this repo 😉
The project has been structured with the following folders and files:
```
├── .github/workflows/              # CI/CD configuration files
│   ├── cd.yml
│   └── ci.yml
├── data/                           # raw data
│   └── data.csv
├── notebooks/                      # notebooks for the various stages of the project
│   ├── create_source_data/         # notebook for generating synthetic data
│   │   └── create_source_data_notebook.py
│   ├── feature_engineering/        # feature engineering and MLflow experiments
│   │   ├── basic_mlflow_experiment_notebook.py
│   │   ├── combined_mlflow_experiment_notebook.py
│   │   ├── custom_mlflow_experiment_notebook.py
│   │   └── prepare_data_notebook.py
│   ├── model_feature_serving/      # notebooks for serving models and features
│   │   ├── AB_test_model_serving_notebbok.py
│   │   ├── feature_serving_notebook.py
│   │   ├── model_serving_feat_lookup_notebook.py
│   │   └── model_serving_notebook.py
│   └── monitoring/                 # monitoring and alerts setup
│       ├── create_alert.py
│       ├── create_inference_data.py
│       ├── lakehouse_monitoring.py
│       └── send_request_to_endpoint.py
├── src/credit_default/             # source code for the project
│   ├── data_cleaning.py
│   ├── data_cleaning_spark.py
│   ├── data_preprocessing.py
│   ├── data_preprocessing_spark.py
│   └── utils.py
├── tests/                          # unit tests for the project
│   ├── test_data_cleaning.py
│   └── test_data_preprocessor.py
├── workflows/                      # workflows for the Databricks asset bundle
│   ├── deploy_model.py
│   ├── evaluate_model.py
│   ├── preprocess.py
│   ├── refresh_monitor.py
│   └── train_model.py
├── .pre-commit-config.yaml         # configuration for pre-commit hooks
├── Makefile                        # helper commands for installing requirements, formatting, testing, linting, and cleaning
├── project_config.yml              # configuration settings for the project
├── databricks.yml                  # Databricks asset bundle configuration
└── bundle_monitoring.yml           # monitoring settings for the Databricks asset bundle
```
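The project_config.yml file centralizes the configuration values that the notebooks and workflows read. Its exact contents are not reproduced here; a hypothetical minimal sketch, reusing only the catalog and schema names from the volume setup further below, could look like this:

```yaml
# Hypothetical sketch of project_config.yml -- key names are illustrative;
# only the catalog and schema values come from the volume setup in this README
catalog_name: credit
schema_name: default
```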
The Python version used for this project is Python 3.11.
- Clone the repo:

  ```bash
  git clone https://github.com/benitomartin/mlops-databricks-credit-default.git
  ```
- Create the virtual environment using uv with Python 3.11 and install the requirements:

  ```bash
  uv venv -p 3.11.0 .venv
  source .venv/bin/activate
  uv pip install -r pyproject.toml --all-extras
  uv lock
  ```
- Build the wheel package:

  ```bash
  # Build
  uv build
  ```
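  The wheel lands in the dist/ folder (e.g. dist/credit_default_databricks-0.0.1-py3-none-any.whl) and is uploaded to a Databricks volume in a later step.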
- Install the Databricks extension for VS Code and the Databricks CLI:

  ```bash
  curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
  ```
- Authenticate on Databricks:

  ```bash
  # Authentication
  databricks auth login --configure-cluster --host <workspace-url>

  # Profiles
  databricks auth profiles
  cat ~/.databrickscfg
  ```
After entering your information, the CLI will prompt you to save it as a Databricks configuration profile in ~/.databrickscfg.
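The saved profile looks roughly like the sketch below; the exact fields depend on the authentication method and the values you enter, so treat it as illustrative only.

```ini
# Illustrative ~/.databrickscfg profile -- values are placeholders
[DEFAULT]
host       = https://<workspace-url>
cluster_id = <cluster-id>
auth_type  = databricks-cli
```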
Once the project is set up, you need to create the volumes that store the data and the wheel package, which you will then have to install on the cluster:
- catalog name: credit
- schema name: default
- volume names: data and packages
```bash
# Create volumes
databricks volumes create credit default data MANAGED
databricks volumes create credit default packages MANAGED

# Push volumes
databricks fs cp data/data.csv dbfs:/Volumes/credit/default/data/data.csv
databricks fs cp dist/credit_default_databricks-0.0.1-py3-none-any.whl dbfs:/Volumes/credit/default/packages

# Show volumes
databricks fs ls dbfs:/Volumes/credit/default/data
databricks fs ls dbfs:/Volumes/credit/default/packages
```
Some project files require a Databricks authentication token. This token allows secure access to Databricks resources and APIs:
- Create a token in the Databricks UI:

  - Navigate to Settings --> User --> Developer --> Access tokens
  - Generate a new personal access token

- Create a secret scope for securely storing the token:
  ```bash
  # Create Scope
  databricks secrets create-scope secret-scope

  # Add secret after running command
  databricks secrets put-secret secret-scope databricks-token

  # List secrets
  databricks secrets list-secrets secret-scope
  ```
Note: For GitHub Actions (in cd.yml), the token must also be added as a GitHub Secret in your repository settings.
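As an illustration only (not the actual contents of cd.yml), a workflow step can expose such a secret to the Databricks CLI through environment variables; the secret and step names below are assumptions:

```yaml
# Illustrative snippet -- secret and step names are assumptions, not taken from cd.yml
- name: Deploy Databricks asset bundle
  env:
    DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
    DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
  run: databricks bundle deploy
```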
Now you can follow the code along with the Medium publication or use it as supporting material if you enroll in the course. The blog does not explain every file, only the main ones used for the final deployment, but you can try out the other files as well 🙂.