This repository contains four Data Engineering projects created by Ting Lu, which use AWS to build ETL/ELT pipelines and perform big-data analysis for a range of business requirements:
- Data Modeling with Apache Cassandra: Build an ETL pipeline with the Python driver that loads a directory of CSV files into an Apache Cassandra NoSQL database, improving the efficiency of queries over user activity data.
- Cloud Data Warehouse & ELT Pipeline: Build an ELT pipeline that extracts JSON logs and metadata from S3, loads them into AWS Redshift staging tables, and transforms the data into a star-schema database with dimension tables, enabling the marketing and analytics teams to query song-play insights.
- STEDI Human Balance Analytics (Data Lakehouse solution): Construct a lakehouse with landing, trusted, and curated data lake zones in AWS, using Spark, Python, Glue Studio, S3, and Athena to meet the STEDI data scientists' requirements.
- Automatic Data Pipeline with Apache Airflow: Design, automate, and monitor ETL pipelines in Apache Airflow that process JSON logs and metadata from AWS S3 into a Redshift data warehouse, using custom operators for staging, data loading, and data quality checks to create versatile pipelines with monitoring and backfill capabilities.
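The data quality checks mentioned in the Airflow project can be sketched as a plain Python function (a minimal illustration with hypothetical names; in the actual project this logic lives inside a custom Airflow operator and the records come from a Redshift hook):

```python
def check_has_rows(table: str, records: list) -> None:
    """Fail the pipeline if a table has no rows.

    `records` is assumed to be the result of running
    `SELECT COUNT(*) FROM <table>` through a database hook,
    e.g. [(42,)].
    """
    # No result set at all means the query itself failed to return rows.
    if not records or not records[0]:
        raise ValueError(f"Data quality check failed: no results for {table}")
    # A zero count means the load step produced an empty table.
    if records[0][0] < 1:
        raise ValueError(f"Data quality check failed: {table} is empty")
```

Raising an exception is what marks the Airflow task as failed, so downstream tasks are skipped and the run can be retried or backfilled.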
Apache Airflow, Apache Spark, Python, PostgreSQL, Apache Cassandra, NoSQL, Data Warehouse, Data Lakehouse, AWS S3, Redshift, Athena, Glue Studio, Database, Schema, ETL & ELT Pipeline, Data Modeling, Big Data