batch-data-pipeline

This repo centralizes the batch data pipelines (ELT) built on Apache Airflow that I have implemented across different personal projects.

  • Side note: the tools and technologies employed in this project were chosen for exploration rather than optimized for a specific use case. As a result, they may not represent the most suitable architecture for a particular scenario.

  • Implemented pipelines are shown in the image below: [image: batch-data-pipeline]

Project overview:

Up to now there are three main projects (minimal, illustrative code sketches for each follow the project list):

  • Project 1: Data ingestion and transformation with the YouTube trending dataset:

    • Load JSON and CSV files from a local server to Azure Blob Storage.
    • Ingest them into Snowflake, where the transformation steps are performed.
    • DAG location:
      • dags/youtube_trend_init.py
      • dags/youtube_trend.py
  • Project 2: Data pre-processing for an ML model with the nyc_taxi dataset using Spark:

    • Load parquet files from a local server to Google Cloud Storage.
    • Clean this large dataset (~730 million records) using Spark SQL on Cloud Dataproc and load the cleaned data back to GCS.
    • DAG location: dags/nyc_taxi.py
  • Project 3: Data ingestion and transformation with the Airbnb dataset using dbt, Astronomer Cosmos and Airflow. There are two phases in this project:

    • Phase 1: dbt on Postgres database.
      • Workflow:
        • Ingest data directly into the data warehouse (PostgreSQL hosted on Google Cloud SQL).
        • Transform the ingested data in PostgreSQL using dbt.
        • Orchestrate dbt models using Airflow via Astronomer Cosmos.
      • Limitations:
        • Raw data is not stored in a staging area, so the pipeline does not support complex transformations, error handling and auditing, or data versioning and rollback.
        • A Postgres database is a better fit for transactional workloads than for analytical operations.
      • Code:
    • Phase 2: dbt on BigQuery
      • To address the limitations in phase 1, the workflow is modified:
        • Load the data and its schema from a local server to Google Cloud Storage (stored in Hive partitions for better retrieval).
        • Ingest data from GCS into BigQuery.
        • Transform the ingested data using dbt.
        • Orchestrate dbt models using Airflow via Astronomer Cosmos.
      • Limitations & TO-DO:
        • Ingestion currently truncates the whole table instead of merging -> TO-DO: implement a more robust ingestion workflow that merges and deduplicates data.
      • Code:
      • Example of the DAG:
[screenshot: example of the DAG]
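
The sketches below are illustrative only and are not the DAGs in this repo; connection IDs, paths, buckets, stages, schemas and table names are all hypothetical.

A minimal sketch of the Project 1 flow, assuming the Azure and Snowflake Airflow providers are installed, Airflow connections for both services exist (`azure_blob_default`, `snowflake_default`), and a Snowflake external stage (`azure_blob_stage`) already points at the blob container:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.microsoft.azure.transfers.local_to_wasb import (
    LocalFilesystemToWasbOperator,
)
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

with DAG(
    dag_id="youtube_trend_sketch",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Upload a local CSV export to Azure Blob Storage (path and container are hypothetical).
    upload_to_blob = LocalFilesystemToWasbOperator(
        task_id="upload_to_blob",
        file_path="/data/youtube_trending.csv",
        container_name="youtube-trending",
        blob_name="raw/youtube_trending.csv",
        wasb_conn_id="azure_blob_default",
    )

    # Ingest the uploaded file into Snowflake with COPY INTO, reading from an
    # external stage that is assumed to point at the blob container.
    copy_into_snowflake = SnowflakeOperator(
        task_id="copy_into_snowflake",
        snowflake_conn_id="snowflake_default",
        sql="""
            COPY INTO raw.youtube_trending
            FROM @azure_blob_stage/raw/youtube_trending.csv
            FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);
        """,
    )

    upload_to_blob >> copy_into_snowflake
```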
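
A minimal sketch of the Project 2 flow, assuming an existing Dataproc cluster and a Spark SQL cleaning query already stored in GCS; project, bucket, cluster and file names are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
from airflow.providers.google.cloud.transfers.local_to_gcs import (
    LocalFilesystemToGCSOperator,
)

PROJECT_ID = "my-gcp-project"      # hypothetical
REGION = "us-central1"
CLUSTER_NAME = "nyc-taxi-cluster"  # hypothetical, assumed to already exist
BUCKET = "nyc-taxi-bucket"         # hypothetical

with DAG(
    dag_id="nyc_taxi_sketch",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # Stage the raw parquet files in Google Cloud Storage.
    upload_parquet = LocalFilesystemToGCSOperator(
        task_id="upload_parquet",
        src="/data/nyc_taxi/*.parquet",
        dst="raw/nyc_taxi/",
        bucket=BUCKET,
    )

    # Run the cleaning query as a Spark SQL job on the Dataproc cluster; the query
    # file is assumed to write the cleaned data back to GCS (e.g. via INSERT OVERWRITE).
    clean_with_spark_sql = DataprocSubmitJobOperator(
        task_id="clean_with_spark_sql",
        project_id=PROJECT_ID,
        region=REGION,
        job={
            "reference": {"project_id": PROJECT_ID},
            "placement": {"cluster_name": CLUSTER_NAME},
            "spark_sql_job": {"query_file_uri": f"gs://{BUCKET}/sql/clean_nyc_taxi.sql"},
        },
    )

    upload_parquet >> clean_with_spark_sql
```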
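
A minimal sketch of the Phase 1 orchestration with Astronomer Cosmos, which expands every model in the dbt project into its own Airflow task; the dbt project path, connection ID and target schema are placeholders, and dbt is assumed to be installed in the Airflow environment:

```python
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig
from cosmos.profiles import PostgresUserPasswordProfileMapping

# Build the dbt profile from an existing Airflow Postgres connection
# (connection ID and target schema are hypothetical).
profile_config = ProfileConfig(
    profile_name="airbnb",
    target_name="dev",
    profile_mapping=PostgresUserPasswordProfileMapping(
        conn_id="postgres_cloudsql",
        profile_args={"schema": "analytics"},
    ),
)

# Cosmos renders the dbt project as an Airflow DAG: one task per model,
# with dependencies taken from the dbt graph.
airbnb_dbt_postgres = DbtDag(
    dag_id="airbnb_dbt_postgres_sketch",
    project_config=ProjectConfig("/usr/local/airflow/dags/dbt/airbnb"),  # hypothetical path
    profile_config=profile_config,
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
)
```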
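
A minimal sketch of the Phase 2 ingestion step: a single BigQuery load job that reads the Hive-partitioned files from GCS, with WRITE_TRUNCATE mirroring the truncate-and-reload limitation noted above; bucket, project, dataset, table and partition-column names are placeholders, and the repo may use a different operator for this step:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

BUCKET = "airbnb-raw-data"     # hypothetical GCS bucket
PROJECT_ID = "my-gcp-project"  # hypothetical

with DAG(
    dag_id="airbnb_bq_ingest_sketch",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Load the Hive-partitioned listing files from GCS into BigQuery in one load job.
    load_listings = BigQueryInsertJobOperator(
        task_id="load_listings",
        configuration={
            "load": {
                "sourceUris": [f"gs://{BUCKET}/listings/*"],
                "destinationTable": {
                    "projectId": PROJECT_ID,
                    "datasetId": "raw",
                    "tableId": "listings",
                },
                "sourceFormat": "CSV",
                # Truncate-and-reload, as described in the limitations above;
                # a MERGE-based upsert would replace this per the TO-DO.
                "writeDisposition": "WRITE_TRUNCATE",
                "autodetect": True,
                # Let BigQuery infer partition columns from Hive-style paths such as
                # gs://<bucket>/listings/scraped_date=2024-01-01/part-000.csv
                "hivePartitioningOptions": {
                    "mode": "AUTO",
                    "sourceUriPrefix": f"gs://{BUCKET}/listings/",
                },
            }
        },
    )
```

The downstream dbt transformations could then be orchestrated with Cosmos as in the Phase 1 sketch, with the Postgres profile mapping swapped for a BigQuery one.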
