Data Modeling, Airflow, Spark, Apache Cassandra, Data Warehouse, Data Lakehouse, AWS S3, AWS Redshift, AWS Athena, AWS Glue Studio

Ting-DS/Data-Engineering-with-AWS-Nanodegree

Data Engineering with AWS Nanodegree

Introduction

This repository contains four data engineering projects by Ting Lu that use AWS to build ETL/ELT pipelines and run big data analysis for a range of business requirements. The projects are listed below:

  • Data Modeling with Apache Cassandra: Build an ETL pipeline with the Python driver that loads a directory of CSV files into an Apache Cassandra NoSQL database, improving the efficiency of queries on user activity data.

  • Cloud Data Warehouse & ELT Pipeline: Build an ELT pipeline that extracts JSON logs and metadata from S3, loads them into AWS Redshift staging tables, and transforms the data into a star schema database with dimension tables, so that marketing and analytics teams can query song play insights.

  • STEDI Human Balance Analytics - Data Lakehouse Solution: Construct a lakehouse solution with landing, trusted, and curated data lake zones in AWS, using Spark, Python, Glue Studio, S3, and Athena to meet the STEDI data scientists' requirements.

  • Automatic Data Pipeline with Apache Airflow: Design, automate, and monitor ETL pipelines in Apache Airflow that process JSON logs and metadata from AWS S3 into a Redshift data warehouse. The pipelines use custom operators for staging, data loading, and data quality checks, and support monitoring and backfills.
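The data quality checks mentioned in the Airflow project can be illustrated with a minimal sketch. This is not the repository's operator code: the `run_quality_checks` helper, the `songplays` table, and its columns are hypothetical, and an in-memory SQLite database stands in for Redshift so the example runs on its own. Each check pairs a SQL query that returns a single number with a predicate that the number must satisfy.

```python
import sqlite3
from typing import Callable, List, Tuple

# Hypothetical check format: (SQL returning one number, predicate on that number).
Check = Tuple[str, Callable[[int], bool]]


def run_quality_checks(conn: sqlite3.Connection, checks: List[Check]) -> List[str]:
    """Run every check and return the SQL text of the checks that failed."""
    failures = []
    for sql, is_ok in checks:
        value = conn.execute(sql).fetchone()[0]
        if not is_ok(value):
            failures.append(sql)
    return failures


# Assumption: SQLite stands in for the warehouse; table and columns are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE songplays (songplay_id INTEGER, user_id INTEGER)")
conn.executemany("INSERT INTO songplays VALUES (?, ?)", [(1, 10), (2, None)])

checks = [
    # The table must not be empty after loading.
    ("SELECT COUNT(*) FROM songplays", lambda n: n > 0),
    # No row may have a missing user_id.
    ("SELECT COUNT(*) FROM songplays WHERE user_id IS NULL", lambda n: n == 0),
]

failed = run_quality_checks(conn, checks)
print(failed)
```

In an Airflow setting, a custom operator could run a list of checks like this after the load step and raise an error when the returned list is non-empty, which would mark the task as failed.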

Keywords & Reference:

Apache Airflow, Apache Spark, Python, PostgreSQL, Apache Cassandra, NoSQL, Data Warehouse, Data Lakehouse, AWS S3, Redshift, Athena, Glue Studio, Database, Schema, ETL & ELT pipeline, Data Modeling, Big Data

Some examples

Data Warehouse Schema in Redshift for Song Play Analysis

Data Lakehouse Solution for STEDI Human Balance Analytics

Airflow DAG for User Activity Analysis
