batch-data-pipeline

This repo centralizes the batch data pipelines (ELT) built on Apache Airflow that I have implemented across different personal projects.

  • Side note: the tools and technologies employed in this project were chosen for exploration rather than optimized for a specific use case. As a result, they may not represent the most suitable architecture for a particular scenario.

  • Implemented pipelines are shown in the image below: [image: batch-data-pipeline]

Project overview:

Up to now there are three main projects (minimal, illustrative code sketches for each follow the project list):

  • Project 1: Data ingestion and transformation with the YouTube trending dataset:

    • Load JSON and CSV files from a local server to Azure Blob Storage.
    • Ingest them into Snowflake, where the transformation steps are performed.
    • DAG location:
      • dags/youtube_trend_init.py
      • dags/youtube_trend.py
  • Project 2: Data pre-processing for an ML model with the nyc_taxi dataset using Spark:

    • Load parquet files from a local server to Google Cloud Storage.
    • Clean this large dataset (~730 million records) using Spark SQL on Cloud Dataproc and load the cleaned data back to GCS.
    • DAG location: dags/nyc_taxi.py
  • Project 3: Data ingestion and transformation with the Airbnb dataset using dbt, Astronomer Cosmos and Airflow. There are two phases in this project:

    • Phase 1: dbt on Postgres database.
      • Workflow:
        • Ingest data directly into the data warehouse (PostgreSQL hosted on Google Cloud SQL).
        • Transform the ingested data in PostgreSQL using dbt.
        • Orchestrate dbt models using Airflow via Astronomer Cosmos.
      • Limitations:
        • Raw data is not stored in a staging area, so the pipeline does not support complex transformations, error handling and auditing, or data versioning and rollback.
        • A Postgres database is a better fit for transactional workloads than for analytical operations.
      • Code:
    • Phase 2: dbt on BigQuery
      • To address the limitations in phase 1, the workflow is modified:
        • Load the data and its schema from a local server to Google Cloud Storage (stored in Hive partitions for better retrieval).
        • Ingest data from GCS into BigQuery.
        • Transform the ingested data using dbt.
        • Orchestrate dbt models using Airflow via Astronomer Cosmos.
      • Limitations & TO-DO:
        • Ingestion currently truncates the whole table instead of merging -> TO-DO: implement a more robust ingestion workflow that merges and deduplicates data.
      • Code:
      • Example of the DAG:
[screenshot: example of the DAG]
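
The sketches below are illustrative only and are not the DAGs in this repo; connection IDs, paths, buckets, stages, schemas and table names are all hypothetical.

A minimal sketch of the Project 1 flow, assuming the Azure and Snowflake Airflow providers are installed, Airflow connections for both services exist (`azure_blob_default`, `snowflake_default`), and a Snowflake external stage (`azure_blob_stage`) already points at the blob container:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.microsoft.azure.transfers.local_to_wasb import (
    LocalFilesystemToWasbOperator,
)
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

with DAG(
    dag_id="youtube_trend_sketch",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Upload a local CSV export to Azure Blob Storage (path and container are hypothetical).
    upload_to_blob = LocalFilesystemToWasbOperator(
        task_id="upload_to_blob",
        file_path="/data/youtube_trending.csv",
        container_name="youtube-trending",
        blob_name="raw/youtube_trending.csv",
        wasb_conn_id="azure_blob_default",
    )

    # Ingest the uploaded file into Snowflake with COPY INTO, reading from an
    # external stage that is assumed to point at the blob container.
    copy_into_snowflake = SnowflakeOperator(
        task_id="copy_into_snowflake",
        snowflake_conn_id="snowflake_default",
        sql="""
            COPY INTO raw.youtube_trending
            FROM @azure_blob_stage/raw/youtube_trending.csv
            FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);
        """,
    )

    upload_to_blob >> copy_into_snowflake
```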
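
A minimal sketch of the Project 2 flow, assuming an existing Dataproc cluster and a Spark SQL cleaning query already stored in GCS; project, bucket, cluster and file names are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
from airflow.providers.google.cloud.transfers.local_to_gcs import (
    LocalFilesystemToGCSOperator,
)

PROJECT_ID = "my-gcp-project"      # hypothetical
REGION = "us-central1"
CLUSTER_NAME = "nyc-taxi-cluster"  # hypothetical, assumed to already exist
BUCKET = "nyc-taxi-bucket"         # hypothetical

with DAG(
    dag_id="nyc_taxi_sketch",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # Stage the raw parquet files in Google Cloud Storage.
    upload_parquet = LocalFilesystemToGCSOperator(
        task_id="upload_parquet",
        src="/data/nyc_taxi/*.parquet",
        dst="raw/nyc_taxi/",
        bucket=BUCKET,
    )

    # Run the cleaning query as a Spark SQL job on the Dataproc cluster; the query
    # file is assumed to write the cleaned data back to GCS (e.g. via INSERT OVERWRITE).
    clean_with_spark_sql = DataprocSubmitJobOperator(
        task_id="clean_with_spark_sql",
        project_id=PROJECT_ID,
        region=REGION,
        job={
            "reference": {"project_id": PROJECT_ID},
            "placement": {"cluster_name": CLUSTER_NAME},
            "spark_sql_job": {"query_file_uri": f"gs://{BUCKET}/sql/clean_nyc_taxi.sql"},
        },
    )

    upload_parquet >> clean_with_spark_sql
```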
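
A minimal sketch of the Phase 1 orchestration with Astronomer Cosmos, which expands every model in the dbt project into its own Airflow task; the dbt project path, connection ID and target schema are placeholders, and dbt is assumed to be installed in the Airflow environment:

```python
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig
from cosmos.profiles import PostgresUserPasswordProfileMapping

# Build the dbt profile from an existing Airflow Postgres connection
# (connection ID and target schema are hypothetical).
profile_config = ProfileConfig(
    profile_name="airbnb",
    target_name="dev",
    profile_mapping=PostgresUserPasswordProfileMapping(
        conn_id="postgres_cloudsql",
        profile_args={"schema": "analytics"},
    ),
)

# Cosmos renders the dbt project as an Airflow DAG: one task per model,
# with dependencies taken from the dbt graph.
airbnb_dbt_postgres = DbtDag(
    dag_id="airbnb_dbt_postgres_sketch",
    project_config=ProjectConfig("/usr/local/airflow/dags/dbt/airbnb"),  # hypothetical path
    profile_config=profile_config,
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
)
```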
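
A minimal sketch of the Phase 2 ingestion step: a single BigQuery load job that reads the Hive-partitioned files from GCS, with WRITE_TRUNCATE mirroring the truncate-and-reload limitation noted above; bucket, project, dataset, table and partition-column names are placeholders, and the repo may use a different operator for this step:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

BUCKET = "airbnb-raw-data"     # hypothetical GCS bucket
PROJECT_ID = "my-gcp-project"  # hypothetical

with DAG(
    dag_id="airbnb_bq_ingest_sketch",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Load the Hive-partitioned listing files from GCS into BigQuery in one load job.
    load_listings = BigQueryInsertJobOperator(
        task_id="load_listings",
        configuration={
            "load": {
                "sourceUris": [f"gs://{BUCKET}/listings/*"],
                "destinationTable": {
                    "projectId": PROJECT_ID,
                    "datasetId": "raw",
                    "tableId": "listings",
                },
                "sourceFormat": "CSV",
                # Truncate-and-reload, as described in the limitations above;
                # a MERGE-based upsert would replace this per the TO-DO.
                "writeDisposition": "WRITE_TRUNCATE",
                "autodetect": True,
                # Let BigQuery infer partition columns from Hive-style paths such as
                # gs://<bucket>/listings/scraped_date=2024-01-01/part-000.csv
                "hivePartitioningOptions": {
                    "mode": "AUTO",
                    "sourceUriPrefix": f"gs://{BUCKET}/listings/",
                },
            }
        },
    )
```

The downstream dbt transformations could then be orchestrated with Cosmos as in the Phase 1 sketch, with the Postgres profile mapping swapped for a BigQuery one.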
