This repo centralizes the batch data pipelines (ELT) built on Apache Airflow that I have implemented across several personal projects.
Side note: the tools and technologies employed in this project are intended for exploration rather than optimized for a specific use case. As a result, they may not represent the most suitable architectures for particular scenarios.
So far there are three main projects:
Project 1: Data ingestion and transformation with the YouTube trending dataset:
- Load JSON and CSV files from a local server to Azure Blob Storage.
- Ingest them into Snowflake, where the transformation steps are performed (a minimal sketch follows the DAG paths below).
- DAG locations:
- dags/youtube_trend_init.py
- dags/youtube_trend.py
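For illustration, a minimal sketch of this flow, not the repo's actual DAG: the connection IDs, container, stage and table names are assumptions, and it presumes an external Snowflake stage pointing at the blob container already exists.

```python
# Minimal sketch (not the repo's actual DAG). Connection IDs, container,
# stage and table names below are assumptions for illustration.
import pendulum
from airflow.decorators import dag, task
from airflow.providers.microsoft.azure.hooks.wasb import WasbHook
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator


@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def youtube_trend_sketch():
    @task
    def upload_to_blob():
        # Push a local CSV file to an Azure Blob Storage container.
        hook = WasbHook(wasb_conn_id="azure_blob_default")
        hook.load_file(
            file_path="/data/youtube/US_trending.csv",
            container_name="youtube-trending",
            blob_name="raw/US_trending.csv",
        )

    # COPY the staged file into Snowflake via an external stage that is
    # assumed to point at the blob container.
    copy_into = SnowflakeOperator(
        task_id="copy_into_snowflake",
        snowflake_conn_id="snowflake_default",
        sql="""
            COPY INTO raw.youtube_trending
            FROM @youtube_stage/raw/US_trending.csv
            FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);
        """,
    )

    upload_to_blob() >> copy_into


youtube_trend_sketch()
```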
Project 2: Data pre-processing for an ML model with the nyc_taxi dataset using Spark:
- Load parquet files from a local server to Google Cloud Storage.
- Clean this large dataset (~730 million records) using Spark SQL on Cloud Dataproc and load the cleaned data back to GCS (see the sketch below).
- DAG location: dags/nyc_taxi.py
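A hedged sketch of the Dataproc step (project, region, cluster name and GCS paths are assumptions; the real logic lives in dags/nyc_taxi.py):

```python
# Hedged sketch of the Spark SQL cleaning job on Dataproc; all resource
# names and paths below are assumptions.
import pendulum
from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

with DAG(
    dag_id="nyc_taxi_sketch",
    schedule=None,
    start_date=pendulum.datetime(2024, 1, 1),
    catchup=False,
):
    clean_taxi_data = DataprocSubmitJobOperator(
        task_id="clean_taxi_data",
        project_id="my-gcp-project",   # assumption
        region="us-central1",          # assumption
        job={
            "reference": {"project_id": "my-gcp-project"},
            "placement": {"cluster_name": "nyc-taxi-cluster"},  # assumption
            # Spark SQL script that filters invalid trips and writes the
            # cleaned data back to GCS.
            "spark_sql_job": {"query_file_uri": "gs://my-bucket/sql/clean_taxi.sql"},
        },
    )
```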
Project 3: Data ingestion and transformation with the Airbnb dataset using dbt, Astronomer Cosmos and Airflow. There are two phases in this project:
- Phase 1: dbt on Postgres database.
- Workflow:
- Ingest data directly into the data warehouse (PostgreSQL hosted on Google Cloud SQL).
- Transform the ingested data in that PostgreSQL instance using dbt.
- Orchestrate the dbt models using Airflow via Astronomer Cosmos (a minimal sketch follows the code paths below).
- Limitations:
- Raw data is not stored in a staging area, so the workflow does not support complex transformations, error handling and auditing, or data versioning and rollback.
- PostgreSQL is a good fit for transactional workloads rather than analytical operations.
- Code:
- DAG location: dags/airbnb/airbnb_postgres_dbt_cosmos.py
- dbt models: dags/dbt/airbnb/airbnb_postgres
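A minimal sketch of how Cosmos can render this dbt project as an Airflow DAG; the connection ID, profile and schema names are assumptions, not the project's actual configuration.

```python
# Minimal Cosmos sketch; conn_id, profile and schema names are assumptions.
import pendulum
from cosmos import DbtDag, ProfileConfig, ProjectConfig
from cosmos.profiles import PostgresUserPasswordProfileMapping

profile_config = ProfileConfig(
    profile_name="airbnb",
    target_name="dev",
    # Build the dbt profile from an existing Airflow Postgres connection.
    profile_mapping=PostgresUserPasswordProfileMapping(
        conn_id="postgres_default",            # assumption
        profile_args={"schema": "analytics"},  # assumption
    ),
)

# Cosmos turns each dbt model into its own Airflow task, preserving the
# dependency graph defined in the dbt project.
airbnb_postgres = DbtDag(
    dag_id="airbnb_postgres_dbt_cosmos_sketch",
    project_config=ProjectConfig("dags/dbt/airbnb/airbnb_postgres"),
    profile_config=profile_config,
    schedule=None,
    start_date=pendulum.datetime(2024, 1, 1),
    catchup=False,
)
```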
- ⭐ Phase 2: dbt on BigQuery
- To address the limitations in phase 1, the workflow is modified:
- Load the data and their schemas from a local server to Google Cloud Storage (stored in Hive partitions for better retrieval).
- Ingest data from GCS into BigQuery (sketched after this list).
- Transform the ingested data using dbt.
- Orchestrate dbt models using Airflow via Astronomer Cosmos.
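A hedged sketch of the GCS-to-BigQuery load (bucket, project, dataset and table names are assumptions; Parquet is assumed as the file format). Note the WRITE_TRUNCATE disposition, which is exactly the limitation called out below.

```python
# Hedged sketch of the GCS -> BigQuery load; all resource names are assumptions.
import pendulum
from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG(
    dag_id="airbnb_bigquery_ingest_sketch",
    schedule=None,
    start_date=pendulum.datetime(2024, 1, 1),
    catchup=False,
):
    load_listings = BigQueryInsertJobOperator(
        task_id="load_listings",
        configuration={
            "load": {
                "sourceUris": ["gs://my-bucket/airbnb/listings/*"],
                "destinationTable": {
                    "projectId": "my-gcp-project",
                    "datasetId": "airbnb_raw",
                    "tableId": "listings",
                },
                "sourceFormat": "PARQUET",
                # Replaces the whole table on every run -- the limitation
                # noted in the TO-DO below.
                "writeDisposition": "WRITE_TRUNCATE",
                # Read the Hive-partitioned layout written in the previous step.
                "hivePartitioningOptions": {
                    "mode": "AUTO",
                    "sourceUriPrefix": "gs://my-bucket/airbnb/listings/",
                },
            }
        },
    )
```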
- Limitations & TO-DO:
- Ingestion truncates the whole table instead of merging. TO-DO: implement a more robust ingestion workflow that merges and deduplicates data (one possible shape is sketched below).
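One possible shape for that TO-DO, sketched with hypothetical column names (id, scraped_date, price) and a hypothetical staging table: deduplicate on a natural key, then MERGE into the target instead of truncating.

```python
# Hypothetical MERGE-based upsert; the table and column names (id,
# scraped_date, price) and the staging table are assumptions.
import pendulum
from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG(
    dag_id="airbnb_bigquery_merge_sketch",
    schedule=None,
    start_date=pendulum.datetime(2024, 1, 1),
    catchup=False,
):
    merge_listings = BigQueryInsertJobOperator(
        task_id="merge_listings",
        configuration={
            "query": {
                "query": """
                    MERGE `airbnb_raw.listings` AS t
                    USING (
                        -- Keep only the latest record per natural key.
                        SELECT * EXCEPT (rn) FROM (
                            SELECT *, ROW_NUMBER() OVER (
                                PARTITION BY id ORDER BY scraped_date DESC
                            ) AS rn
                            FROM `airbnb_raw.listings_staging`
                        )
                        WHERE rn = 1
                    ) AS s
                    ON t.id = s.id
                    WHEN MATCHED THEN UPDATE SET price = s.price
                    WHEN NOT MATCHED THEN INSERT ROW
                """,
                "useLegacySql": False,
            }
        },
    )
```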
- Code:
- DAG location: dags/airbnb/airbnb_bigquery_dbt_cosmos.py
- dbt models: dags/dbt/airbnb_bigquery
- Example of the DAG: