This repository is part of a larger project.
This project was completed as part of a Data Engineering Bootcamp at Le Wagon Paris and presented at Demo Day on November 8, 2024 (View Project Demo Slides).
The objective of this project was to build a complete ETL and machine learning pipeline—from data ingestion to an end-user interface—using tools covered in the bootcamp. Given a four-day timeframe, we leveraged previous bootcamp exercises as a foundation, enabling us to focus on optimizing and studying the performance of the pipeline.
Repositories that are part of the Taxifare project:

- **Taxifare**: A data engineering pipeline that ingests, processes, and stores NYC taxi ride data in cloud storage and a data warehouse.
  - Distributed processing with Spark, on Dataproc
  - Job orchestration using Airflow
  - Cloud storage on Google Cloud Storage
  - Analytical warehouse with BigQuery
- **Taxifare API** (this repository): A cloud-deployed API providing a prediction endpoint.
  - Built with FastAPI and Gunicorn
  - Deployed on Google Cloud Run, using a Docker image hosted in Artifact Registry
- **Taxifare Front**: A Streamlit application that allows users to predict taxi fares with our model.
This FastAPI application provides taxi fare predictions based on ride parameters. The application uses a machine learning model stored in Google Cloud Storage (GCS) and can be deployed both locally and on Google Cloud Run.
*(Diagram: the API flow)*
- Features
- Prerequisites
- Makefile Commands
- Environment Setup
- Quickstart Local Development
- Production Deployment to Cloud Run
- API Endpoints
- Project Structure
- Testing and Logging
- Troubleshooting
- License
## Features

- **FastAPI Framework**: High-performance, easy-to-use API with automatic documentation (see the minimal sketch below).
- **Prediction Endpoint**: Predicts taxi fares based on inputs like pickup location, dropoff location, and time.
- **Deployment-Ready**: Configured for production deployment on Google Cloud Run with Docker.
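For readers new to FastAPI, here is a minimal, self-contained sketch of the pattern this project builds on. It is illustrative only, not the repository's actual code; serving it with an ASGI server such as uvicorn exposes interactive docs at `/docs` automatically:

```python
from fastapi import FastAPI

app = FastAPI(title="minimal-example")


@app.get("/")
def health_check() -> dict:
    # A trivial health-check endpoint; FastAPI auto-generates
    # Swagger UI (/docs) and ReDoc (/redoc) pages from it.
    return {"status": "ok"}
```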
## Prerequisites

- Python 3.8 or higher
- Poetry
- Docker
- Google Cloud SDK (includes the gcloud CLI), installed separately from https://cloud.google.com/sdk/docs/install. Required for the deployment commands, not for local development.
- A Google Cloud project. We will enable the following services during deployment:
  - Cloud Run
  - Artifact Registry
- A GCP service account key that includes the following IAM roles:
  - Google Cloud Storage: `roles/storage.objectViewer`
  - Artifact Registry: `roles/artifactregistry.repoAdmin`
  - Cloud Run: `roles/run.admin` and `roles/iam.serviceAccountUser`
## Makefile Commands

The project includes a `Makefile` with useful commands for local development, Docker usage, and Google Cloud deployment.
The commands mentioned in this README are part of the Makefile. Refer to the Makefile to see the full list of commands and additional options to support your development and deployment workflow.
## Environment Setup

- Clone the repository:

  ```bash
  git clone https://github.com/Arivima/taxifare-fastapi.git
  cd taxifare-fastapi
  ```

- Place your GCP service account key JSON file in the root directory.

- Set up the environment variables. Create a `.env` file in the root directory from the `.env.sample` template:

  ```bash
  cp .env.sample .env
  ```

  Modify the variables to fit your GCP project and cloud storage bucket:

  ```bash
  # GCP PROJECT
  GCP_PROJECT_ID="your-gcp-project-id"
  GCS_BUCKET_NAME="your-gcs-bucket-name"
  REGION="region"

  # DOCKER LOCAL
  DOCKER_IMAGE_NAME="your-docker-image-name"
  PATH_SERVICE_ACCOUNT_KEY="path-to-your-service-account-key.json"

  # ARTIFACT
  ARTIFACT_REPO_NAME="your-artifact-repo-name"

  # CLOUD RUN
  PACKAGE_NAME="your-package-name"
  ```

**Note on environment variables:** This project stores sensitive configuration details (like GCP credentials) in a `.env` file, which is required to interact with GCP. The environment variables are loaded automatically using `python-dotenv` (installed automatically by `poetry`). This setup ensures that configuration data remains secure and manageable.
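As a sketch of how `python-dotenv` makes these values available to the application (variable names taken from the `.env.sample` above; the actual loading code in `app/config.py` may differ):

```python
import os

from dotenv import load_dotenv

# Read key=value pairs from the .env file into the process environment.
load_dotenv()

GCP_PROJECT_ID = os.getenv("GCP_PROJECT_ID")
GCS_BUCKET_NAME = os.getenv("GCS_BUCKET_NAME")

if not GCP_PROJECT_ID or not GCS_BUCKET_NAME:
    raise RuntimeError("Missing GCP configuration; check your .env file")
```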
## Quickstart Local Development

Install the dependencies:

```bash
poetry install
```

Start the server in development mode, with hot reload, at http://localhost:8080:

```bash
make local_start_dev
```

To simulate the production environment using Gunicorn instead:

```bash
make local_start_prod
```

- Open http://127.0.0.1:8080 to view the root endpoint.
- View the API documentation:
  - Swagger UI: http://127.0.0.1:8080/docs
  - ReDoc: http://127.0.0.1:8080/redoc
- Build the Docker image:

  ```bash
  make local_docker_build
  ```

- Run the container (this uses the service account key):

  ```bash
  make local_docker_run
  ```

- To inspect the container, run it detached:

  ```bash
  make local_docker_run_detached
  ```
## Production Deployment to Cloud Run

Check your GCP settings:

```bash
make gcloud_check_config
```

Configure GCP authentication and set the project:

```bash
make gcloud_set_auth
make gcloud_set_project
```

Check whether the required services are enabled:

```bash
make gcloud_check_enabled_services
```

Enable the required services:

```bash
make gcloud_enable_services
```

Create the Artifact Registry repository (the command checks whether it already exists):

```bash
make create_artifact_repo
```

Configure authentication to the repository:

```bash
make gcloud_set_artifact_repo
make authenticate_docker_to_artifact
```

If there is an existing local Docker image, rename it for the cloud:

```bash
make cloud_docker_rename_for_artifact
```

Or build one from scratch for the cloud:

```bash
make cloud_docker_build
```

Then push it to Artifact Registry:

```bash
make cloud_docker_push_to_artifact
```

For debugging, the following Makefile targets list and delete images in Artifact Registry:

```bash
# List the repo / images / files in the artifact repo
cloud_artifact_list_repo
cloud_artifact_list_images
cloud_artifact_list_files

# Delete the image from the artifact registry
cloud_artifact_delete_image

# Delete the cached layers from the artifact registry
cloud_artifact_delete_files
```

Deploy the application (this includes the environment variables):

```bash
make cloud_run
```

Set up the IAM permissions and check the deployment:

```bash
make cloud_run_set_permissions
make check_deployment
```
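Beyond `make check_deployment`, you can sanity-check the deployed service manually. A sketch using Python's `requests` library (the `SERVICE_URL` value is a placeholder; use the URL printed by Cloud Run after deployment):

```python
import requests

# Placeholder: replace with the URL printed after `make cloud_run`.
SERVICE_URL = "https://your-service-url.a.run.app"

response = requests.get(f"{SERVICE_URL}/", timeout=10)
response.raise_for_status()
# The root endpoint is a health check (see API Endpoints below).
assert response.json() == {"status": "ok"}
print("Deployment looks healthy")
```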
## API Endpoints

- `GET /`: Health check endpoint.
  - Response: `{"status": "ok"}`
- `GET /model_reload`: Reloads the model from GCS. This endpoint was set up to force-reload the model before Demo Day, to prevent latency.
  - Response: `{"status": "reloaded"}`, or an error message
- `GET /predict`: Returns a fare prediction.
  - Parameters:
    - `pickup_datetime`: string (format: "YYYY-MM-DD HH:MM:SS")
    - `pickup_longitude`: float
    - `pickup_latitude`: float
    - `dropoff_longitude`: float
    - `dropoff_latitude`: float
    - `passenger_count`: integer
  - Response: `{"fare": float}`
Example request:

```
http://localhost:8080/predict?pickup_datetime=2014-07-06 19:18:00&pickup_longitude=-73.950655&pickup_latitude=40.783282&dropoff_longitude=-73.984365&dropoff_latitude=40.769802&passenger_count=2
```
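The same request, sketched with Python's `requests` library (parameter values copied from the example above; assumes the server is running locally):

```python
import requests

params = {
    "pickup_datetime": "2014-07-06 19:18:00",
    "pickup_longitude": -73.950655,
    "pickup_latitude": 40.783282,
    "dropoff_longitude": -73.984365,
    "dropoff_latitude": 40.769802,
    "passenger_count": 2,
}

response = requests.get("http://localhost:8080/predict", params=params)
print(response.json())  # {"fare": <float>}
```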
## Project Structure

```
.
├── app/
│   ├── config.py                # Configuration settings
│   ├── logging.py               # Logging configuration
│   ├── main.py                  # FastAPI application
│   └── utils/
│       └── gcp.py               # GCP utilities
├── tests/                       # Test files
├── .dockerignore                # Files excluded from the Docker image
├── .env                         # Your environment variables (to set up)
├── .env.sample                  # .env template with placeholder values
├── Dockerfile                   # Docker configuration
├── Makefile                     # Development and deployment commands
├── pyproject.toml               # Project dependencies
├── README.md
└── service-account-key.json     # Your GCP service account key (to copy in)
```
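As an illustration of what `app/utils/gcp.py` might do, here is a minimal sketch of loading a model from GCS. The function name, blob path, and use of `joblib` are assumptions for illustration, not the repository's actual code:

```python
import joblib
from google.cloud import storage


def load_model_from_gcs(bucket_name: str, blob_path: str,
                        local_path: str = "/tmp/model.joblib"):
    """Download a serialized model from GCS and deserialize it."""
    # Uses the credentials configured for the environment,
    # e.g. the service account key referenced in .env.
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_path)
    blob.download_to_filename(local_path)
    return joblib.load(local_path)
```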
## Testing and Logging

Run the test suite:

```bash
poetry run pytest
```
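A minimal example of the kind of test that could live in `tests/` (the `app.main` import path is inferred from the project structure above; adjust if the app object is named differently):

```python
from fastapi.testclient import TestClient

from app.main import app  # the FastAPI application

client = TestClient(app)


def test_health_check():
    # The root endpoint should return the documented health-check payload.
    response = client.get("/")
    assert response.status_code == 200
    assert response.json() == {"status": "ok"}
```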
The application logs are written to:

- Console output
- The `app.log` file
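A sketch of a logging setup that writes to both targets (the actual configuration lives in `app/logging.py` and may differ):

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s - %(message)s",
    handlers=[
        logging.StreamHandler(),         # console output
        logging.FileHandler("app.log"),  # app.log file
    ],
)

logger = logging.getLogger(__name__)
logger.info("Logging configured")
```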
## Troubleshooting

- If the model fails to load, check that:
  - GCP credentials are properly set
  - Bucket permissions are correct
  - The model file exists at the specified GCS path
- For deployment issues:
  - Verify the GCP service account permissions
  - Check the Cloud Run logs for detailed error messages
  - Ensure all required environment variables are set
## License

This project is licensed under the MIT License - see the LICENSE file for details.