Sample data store project to be hosted on a remote server or cluster. CI/CD uses GitHub Actions to SSH into the remote server and deploy via Docker Compose.
- Orchestration: Docker Compose
- Reverse Proxy: Traefik
- ETL/ELT and Scheduling: Airflow
- Blob Storage: Minio
- Database: Postgres
- API: FastAPI
- Web Scraping: Selenium
- Notebooks (local | dev): Jupyter
- Docker and Docker Compose are installed on the host machine.
- Check out `./.setups/PROJECT_SETUP.md` and `./.setups/SERVER_SETUP.md` for local | dev | prod deployments.
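The SSH deploy mentioned above can be sketched as a GitHub Actions workflow. This uses the community `appleboy/ssh-action` as one common approach; the secret names and the remote checkout path are assumptions, not this project's actual values:

```yaml
name: deploy

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Deploy over SSH
        uses: appleboy/ssh-action@v1.0.0
        with:
          host: ${{ secrets.SSH_HOST }}
          username: ${{ secrets.SSH_USER }}
          key: ${{ secrets.SSH_PRIVATE_KEY }}
          script: |
            cd ~/datastore          # assumed checkout path on the server
            git pull
            docker compose up -d --build
```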
- Acts as a reverse proxy to handle HTTPS connections and SSL certificates.
- Provides a web dashboard for managing the services at `https://admin.HOSTNAME`.
- Automatically fetches and renews SSL certificates using Let's Encrypt.
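As one illustration of how a service gets exposed through Traefik, a Compose service carries router labels along these lines (the router name, entrypoint, certificate resolver, and image name here are assumptions; check the actual `docker-compose.yml`):

```yaml
services:
  fastapi:
    image: datastore/fastapi        # assumed image name
    networks: [datastore]
    labels:
      - traefik.enable=true
      - traefik.http.routers.fastapi.rule=Host(`fastapi.${HOSTNAME}`)
      - traefik.http.routers.fastapi.entrypoints=websecure
      - traefik.http.routers.fastapi.tls.certresolver=letsencrypt
      - traefik.http.services.fastapi.loadbalancer.server.port=80
```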
- The PostgreSQL database service for storing relational data.
- Data is persisted in the `./postgres-data` directory on the host machine.
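Other services reach Postgres by its Compose service name rather than `localhost`. A minimal sketch of building a connection URL from `.env`-style variables (the variable names and the `postgres` service hostname are assumptions; align them with your `.env` file):

```python
import os
from urllib.parse import quote


def postgres_dsn(env=None):
    """Build a PostgreSQL connection URL from environment variables.

    Variable names here are assumptions; match them to your .env file.
    """
    env = os.environ if env is None else env
    user = env.get("POSTGRES_USER", "postgres")
    password = quote(env.get("POSTGRES_PASSWORD", ""), safe="")  # URL-encode special chars
    host = env.get("POSTGRES_HOST", "postgres")  # Compose service name, not localhost
    db = env.get("POSTGRES_DB", "datastore")
    return f"postgresql://{user}:{password}@{host}:5432/{db}"
```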
- The MinIO blob storage service for storing unstructured data.
- The web-based console is available at `https://s3console.HOSTNAME`.
- Data is persisted in the `./file-storage` directory on the host machine.
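The bind mount that backs MinIO persistence can be sketched as the following Compose fragment (the console port and credential variable names are assumptions):

```yaml
services:
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    volumes:
      - ./file-storage:/data
    environment:
      MINIO_ROOT_USER: ${MINIO_ROOT_USER}
      MINIO_ROOT_PASSWORD: ${MINIO_ROOT_PASSWORD}
    networks: [datastore]
```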
- A FastAPI-based API service for interacting with the data stored in MinIO and Postgres.
- Public and private routing via an `API_KEY` header.
- Available at `https://fastapi.HOSTNAME`.
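The public/private split can be implemented by comparing the `API_KEY` request header against a secret from the environment. A minimal, framework-agnostic sketch of that check (the helper itself is hypothetical; the header and variable names follow this README):

```python
import hmac
import os


def is_authorized(headers, env=None):
    """Return True when the request's API_KEY header matches the configured secret.

    Uses hmac.compare_digest for a constant-time comparison.
    """
    env = os.environ if env is None else env
    expected = env.get("API_KEY", "")
    provided = headers.get("API_KEY", "")
    return bool(expected) and hmac.compare_digest(provided, expected)
```

In FastAPI this would typically live in a dependency that raises `HTTPException(status_code=401)` when the check fails, applied only to the private routes.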
- Apache Airflow for workflow orchestration and task scheduling.
- Provides a web interface for managing and monitoring workflows.
- Custom DAGs can be added in the `./airflow_data/dags` directory.
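A Python file dropped into that directory defines a DAG. A minimal sketch using Airflow's TaskFlow API (the DAG id, schedule, and task bodies are placeholders, not part of this project):

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def example_pipeline():
    @task
    def extract():
        # e.g., scrape with Selenium or pull from an upstream API
        return {"rows": 0}

    @task
    def load(payload: dict):
        # e.g., write to MinIO or Postgres
        print(payload)

    load(extract())


example_pipeline()
```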
- `file-storage`: The volume to store MinIO data.
- `postgres-data`: The volume to store PostgreSQL data.
- `traefik-public-certificates`: The volume to store Traefik SSL certificates.
- `downloads`: The volume for Apache Airflow downloads (e.g., Selenium drivers).
- `datastore`: The network shared by all services that need to be publicly available via Traefik.
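In `docker-compose.yml` terms, these map onto top-level declarations roughly like the following (whether each volume is a named volume or a `./` bind mount depends on the actual compose file):

```yaml
volumes:
  file-storage:
  postgres-data:
  traefik-public-certificates:
  downloads:

networks:
  datastore:
```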
- Ensure that your DNS records point to your server's IP address to access the services via the defined domain names. To set up on a local server:
  - Set up port forwarding for ports 22, 80, 443, and 5432.
  - Check out DuckDNS for free DNS.
- The `.env` file should be kept private and never committed to version control, as it contains sensitive information.
- Extend the functionality of the stack by building upon the FastAPI-based API service and using Apache Airflow for more complex workflows and data pipeline executions.
- MLFlow: End-to-end ML lifecycle management
- Minikube or K8s: Kubernetes beats Docker Compose for orchestration; I just haven't learned it yet
- React: Front end for analytics or an admin dashboard
- Multi-Stage Deployments: local | feature_branch -> dev -> main
- FastAPI-Traefik-Postgres
- FastAPI-Traefik
- Minio-Traefik
- Postgres-SSL
- Traefik
- Apache Airflow
- Data Pipelines with Apache Airflow
- SSH Server
- GitHub Actions
This data store project is a powerful toolset for managing data workflows and is ready for further customization and expansion to suit specific data science needs.