This is a personal MLOps project based on a BigQuery dataset for taxi ride prices in Chicago. Below you can find some instructions to understand the project content. Feel free to ⭐ and clone this repo 😉
The project has been structured with the following folders and files:
- `images`: images from results
- `notebooks`: EDA and modelling performed at the beginning of the project to establish a baseline
- `src`: source code. It is divided into:
  - `api`: FastAPI app code
  - `interface`: main workflows
  - `ml_logic`: data/preprocessing/modelling functions
- `requirements.txt`: project requirements
- `Dockerfile`: Docker image for deployment
- `.env.sample`: sample file of environment variables
- `.env.sample.yaml`: sample file of environment variables for Dockerfile deployment
The dataset was obtained from BigQuery and contains 200 million rows; from its many columns, the following were selected for this project: prices, pick-up and drop-off locations, and timestamps. To prepare the data for modelling, an Exploratory Data Analysis was conducted to preprocess the time and distance features, and suitable scalers and encoders were chosen for the preprocessing pipeline.
The following two charts show the fare distribution of the rides. Since the full dataset is too large to query at once, an environment variable (`DATA_SIZE`) was set up to decide how many rows to query. The price distribution for the first 1 million rows shows a large concentration below 100 USD.
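As a rough sketch, the row limit could be applied by building the SQL string from that variable (the table and column names below are assumptions for illustration, not the project's exact query):

```python
import os

def build_query(table: str = "bigquery-public-data.chicago_taxi_trips.taxi_trips") -> str:
    """Build a BigQuery SQL string whose row count is capped by DATA_SIZE."""
    data_size = int(os.environ.get("DATA_SIZE", 1_000_000))
    return (
        "SELECT fare, pickup_latitude, pickup_longitude, "
        "dropoff_latitude, dropoff_longitude, trip_start_timestamp "
        f"FROM `{table}` LIMIT {data_size}"
    )

query = build_query()
```

The resulting string would then be passed to the BigQuery client, so changing `DATA_SIZE` in `.env` changes how many rows every downstream step sees.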
In order to detect outliers, the z-score is calculated for each query, so that outliers are removed relative to the rows actually downloaded. The following chart represents the fare distribution after removing outliers.
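A minimal sketch of such z-score filtering on the fare column (the threshold of 3 and the column name are illustrative assumptions):

```python
import pandas as pd

def remove_fare_outliers(df: pd.DataFrame, column: str = "fare",
                         threshold: float = 3.0) -> pd.DataFrame:
    """Drop rows whose z-score in `column` exceeds the threshold."""
    z = (df[column] - df[column].mean()) / df[column].std()
    return df[z.abs() <= threshold]

# 20 ordinary fares plus one extreme value
rides = pd.DataFrame({"fare": list(range(5, 25)) + [5000]})
filtered = remove_fare_outliers(rides)
```

Because the mean and standard deviation are recomputed per query, the cutoff automatically adapts to whatever sample size `DATA_SIZE` selects.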
For the distance preprocessing, the first approach was to plot the pick-up and drop-off locations on a map and in a histogram (excluding outliers) to see the distribution.
It can be seen that the distance distribution is heavily concentrated in the 10 km to 50 km range. The preprocessing approach was to calculate the Manhattan and Haversine distances for each ride and encode them.
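The two distance features can be sketched as follows (a simplified version, assuming an Earth radius of 6371 km and approximating the Manhattan distance as two Haversine legs, which may differ from the project's exact implementation):

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371 * np.arcsin(np.sqrt(a))

def manhattan_km(lat1, lon1, lat2, lon2):
    """L1-style distance: one north-south leg plus one east-west leg."""
    return (haversine_km(lat1, lon1, lat2, lon1)
            + haversine_km(lat2, lon1, lat2, lon2))
```

For example, the Chicago Loop to O'Hare is roughly 25 km by Haversine, and the Manhattan distance is always at least as large, which better reflects grid-based city driving.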
For the time preprocessing, the idea was to extract the hour, day, and month as separate features and encode them. The hour was first transformed into sine and cosine components to preserve its cyclical nature.
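A possible sketch of this cyclical encoding (the timestamp column name is an assumption): mapping the hour onto the unit circle keeps 23:00 and 00:00 close together, which a plain integer encoding would not.

```python
import numpy as np
import pandas as pd

def encode_time_features(df: pd.DataFrame,
                         ts_col: str = "trip_start_timestamp") -> pd.DataFrame:
    """Extract day/month and project the hour onto sine/cosine components."""
    ts = pd.to_datetime(df[ts_col])
    return df.assign(
        day=ts.dt.day,
        month=ts.dt.month,
        hour_sin=np.sin(2 * np.pi * ts.dt.hour / 24),
        hour_cos=np.cos(2 * np.pi * ts.dt.hour / 24),
    )

rides = pd.DataFrame(
    {"trip_start_timestamp": ["2023-05-01 23:30:00", "2023-05-02 00:15:00"]}
)
encoded = encode_time_features(rides)
```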
Subsequently, a neural network model was trained with several Dense, BatchNormalization, and Dropout layers. The results showed an MAE of around 3 USD against an average price of 20 USD. However, predictions for rides above 10 USD were more accurate than for rides up to 10 USD.
Afterwards, the models went through model registry and deployment using MLflow, Prefect, and FastAPI. The Docker image was pushed to Google Container Registry and deployed on Google Cloud Run.
In order to train a model, the file `main.py` in the `src/interface` folder must be run. This will log the models in MLflow and allow registration and model transition from the None stage to the Staging and Production stages. These options can be set up in the file `registry.py` in the `src/ml_logic` folder. Additionally, the environment variable `MODEL_TARGET` must be set to either `local` or `gcs`, so that the model is saved either locally or in a GCS bucket.
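The branching on `MODEL_TARGET` might look roughly like this (the function name and file handling are assumptions, and the GCS upload is only indicated in comments since it needs credentials and the `google-cloud-storage` package):

```python
import os
import pickle

def save_model(model, filename: str = "model.pkl") -> str:
    """Save the model locally or to a GCS bucket depending on MODEL_TARGET."""
    target = os.environ.get("MODEL_TARGET", "local")
    if target == "local":
        with open(filename, "wb") as f:
            pickle.dump(model, f)
        return f"saved locally to {filename}"
    if target == "gcs":
        # Sketch only -- requires google-cloud-storage and a BUCKET_NAME variable:
        # from google.cloud import storage
        # storage.Client().bucket(os.environ["BUCKET_NAME"]) \
        #        .blob(filename).upload_from_string(pickle.dumps(model))
        return f"saved to GCS as {filename}"
    raise ValueError(f"MODEL_TARGET must be 'local' or 'gcs', got {target!r}")
```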
Once a model is saved/registered, the `workflow.py` file in the `src/interface` folder runs a Prefect workflow that predicts new data with the saved model and trains a new model on these data to compare the results. If the MAE of the new model is lower, it can be moved to the Production stage and the old model is archived.
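Stripped of the MLflow and Prefect plumbing, the compare-and-promote logic can be sketched as (function names here are illustrative, not the project's API):

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute deviation between predictions and targets."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def choose_stage(old_mae: float, new_mae: float) -> dict:
    """Promote the new model only if its MAE improves on the old one."""
    if new_mae < old_mae:
        return {"new_model": "Production", "old_model": "Archived"}
    return {"new_model": "None", "old_model": "Production"}
```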
To run Prefect and MLflow and see the logs, the following commands must be run in the terminal from the `src/interface` directory:
- MLflow: `mlflow ui --backend-store-uri sqlite:///mlflow.db`
- Prefect Cloud (with your own account): `prefect cloud login`
- Prefect (locally): `prefect server start`, then `prefect config set PREFECT_API_URL=http://127.0.0.1:4200/api`
With a model saved and in Production, the `fast.py` file can be run to get a prediction. This can be done either locally by running the prediction API, by building and running the Docker image, or by pushing the image to Google Cloud Run to get a service URL.
To run the prediction API, run the following from the project root directory and check the results at http://127.0.0.1:8000/predict:
`uvicorn src.api.fast:app --reload`
To run the Docker image, build and run it, then check the results at http://127.0.0.1:8000/predict:
`docker build --tag=image .`
`docker run -it -e PORT=8000 -p 8000:8000 --env-file your/path/to/.env image`
To get a service URL, first build the image:
docker build --tag=image .
Then tag the image for Google Container Registry and push it, replacing `your-project-id` with your GCP project ID (a plain `docker push image` would target Docker Hub instead):
`docker tag image gcr.io/your-project-id/image`
`docker push gcr.io/your-project-id/image`
Finally, deploy it and get the Service URL in the terminal to run predictions on your own website:
`gcloud run deploy --image gcr.io/your-project-id/image --region your-gcp-region --env-vars-file .env.yaml`
You should get something like this: `Service URL: https://yourimage-jdhsk768sdfd-rt.a.run.app`