Data Modeling, Data Pipelines with Airflow, Data Lakes, Infrastructure Setup on AWS, Data Warehousing, Pipeline Monitoring, and Pipeline Alerts
In this project, I applied Data Modeling with Postgres and built an ETL pipeline using Python. A startup wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. Currently, the startup is collecting data in H5 format, and the analytics team is particularly interested in understanding what songs users are listening to.
Link: Data Modeling with Postgres
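Below is a minimal sketch of the ETL idea under stated assumptions: create a `songplays` fact table in Postgres and insert already-parsed activity records with psycopg2. The table, column names, sample values, and connection settings are illustrative placeholders, not the project's actual schema.

```python
# Minimal sketch: create a songplays fact table and insert parsed activity
# records with psycopg2. Table, columns, and connection settings are
# placeholders, not the project's actual schema.
import psycopg2

CREATE_SONGPLAYS = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id SERIAL PRIMARY KEY,
        start_time  BIGINT,
        user_id     INT,
        song_id     TEXT,
        session_id  INT,
        location    TEXT
    );
"""

INSERT_SONGPLAY = """
    INSERT INTO songplays (start_time, user_id, song_id, session_id, location)
    VALUES (%(ts)s, %(user_id)s, %(song_id)s, %(session_id)s, %(location)s);
"""

def load_songplays(conn, records):
    """Insert an iterable of already-parsed activity records (dicts)."""
    with conn.cursor() as cur:
        cur.execute(CREATE_SONGPLAYS)
        for record in records:
            cur.execute(INSERT_SONGPLAY, record)
    conn.commit()

if __name__ == "__main__":
    # Placeholder record and connection string for illustration only.
    sample = [{"ts": 1541106106796, "user_id": 8, "song_id": None,
               "session_id": 139, "location": "Phoenix, AZ"}]
    conn = psycopg2.connect("host=localhost dbname=musicdb user=student password=student")
    load_songplays(conn, sample)
    conn.close()
```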
In this project, I applied Data Modeling with Cassandra and built an ETL pipeline using Python. The Data Model was designed around the queries needed to answer the following (a sketch of the query-first table design follows the project link below):
- Get details of a song that was heard on the music app history during a particular session
- Get songs played by a user during a particular session on the music app
- Get all users from the music app history who listened to a particular song
Link: Data Modeling with Cassandra
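As a minimal sketch of the query-first modeling approach, the table below is shaped around the first question (song details for a particular session), using the DataStax cassandra-driver. The keyspace, table, column names, and sample values are assumptions for illustration, not the project's exact model.

```python
# Query-first Cassandra modeling sketch: the table is partitioned so that the
# target query ("song details for a given session") is a single-partition read.
# Keyspace, table, columns, and sample values are illustrative assumptions.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])  # placeholder contact point
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS music_app
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("music_app")

# Partition by session_id and cluster by item_in_session, matching the WHERE
# clause the query needs.
session.execute("""
    CREATE TABLE IF NOT EXISTS song_plays_by_session (
        session_id      INT,
        item_in_session INT,
        artist          TEXT,
        song_title      TEXT,
        song_length     FLOAT,
        PRIMARY KEY (session_id, item_in_session)
    )
""")

# Placeholder row for illustration.
session.execute(
    "INSERT INTO song_plays_by_session "
    "(session_id, item_in_session, artist, song_title, song_length) "
    "VALUES (%s, %s, %s, %s, %s)",
    (338, 4, "Some Artist", "Some Song", 240.5),
)

rows = session.execute(
    "SELECT artist, song_title, song_length FROM song_plays_by_session "
    "WHERE session_id = %s AND item_in_session = %s",
    (338, 4),
)
for row in rows:
    print(row.artist, row.song_title, row.song_length)
```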
In this project, I constructed a Data Warehouse on AWS and engineered an ETL pipeline that extracts and transforms data stored in S3 buckets and loads it into the Data Warehouse hosted on Amazon Redshift.
The Redshift cluster is provisioned using an IaC script: Redshift IaC README
Link: Data Warehouse
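A hedged sketch of the S3-to-Redshift load pattern this pipeline follows, assuming the cluster has already been provisioned with the IaC script above: COPY raw data into a staging table, then transform it into an analytics table with INSERT ... SELECT. The bucket, IAM role ARN, table names, and connection string are placeholders, not the project's actual configuration.

```python
# Sketch of the S3 -> Redshift load pattern: COPY into a staging table, then
# INSERT ... SELECT into an analytics table. All names are placeholders.
import psycopg2

COPY_STAGING_EVENTS = """
    COPY staging_events
    FROM 's3://example-bucket/log_data'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-read'
    FORMAT AS JSON 'auto'
    REGION 'us-west-2';
"""

INSERT_SONGPLAYS = """
    INSERT INTO songplays (start_time, user_id, song_id, session_id, location)
    SELECT ts, user_id, song_id, session_id, location
    FROM staging_events
    WHERE page = 'NextSong';
"""

def run_etl(dsn):
    """Run the COPY and INSERT steps in one transaction against Redshift."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(COPY_STAGING_EVENTS)
            cur.execute(INSERT_SONGPLAYS)

if __name__ == "__main__":
    # Endpoint and credentials are placeholders for the Redshift cluster.
    run_etl("host=example.redshift.amazonaws.com port=5439 dbname=dwh user=dwh_user password=***")
```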
In this project, I built a Data Lake on AWS using Spark and an AWS EMR cluster. The data lake serves as the single source for the analytics platform. Spark jobs perform ELT operations that pick up data from the S3 landing zone, then transform it and store it in the S3 processed zone.
Link: Data Lake
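A minimal PySpark sketch of the ELT step described above: read raw events from the S3 landing zone, transform them, and write partitioned Parquet to the processed zone. The bucket paths, column names, and partitioning scheme are assumptions, not the project's actual layout.

```python
# Landing-zone -> processed-zone ELT sketch in PySpark. Paths, columns, and
# partitioning are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("data-lake-elt")
    .getOrCreate()
)

# Extract: raw JSON events from the landing zone.
events = spark.read.json("s3a://example-landing-zone/events/")

# Transform: keep listening events and derive partition columns from the
# epoch-millisecond timestamp.
songplays = (
    events
    .filter(F.col("page") == "NextSong")
    .withColumn("event_time", (F.col("ts") / 1000).cast("timestamp"))
    .withColumn("year", F.year("event_time"))
    .withColumn("month", F.month("event_time"))
)

# Load: write the curated data to the processed zone as partitioned Parquet.
(
    songplays.write
    .mode("overwrite")
    .partitionBy("year", "month")
    .parquet("s3a://example-processed-zone/songplays/")
)

spark.stop()
```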
For this project, a Data Pipeline workflow was created with Apache Airflow. I scheduled the ETL jobs and created project-specific custom plugins and operators to automate the pipeline execution.
Link: Airflow Data Pipeline
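A compact sketch of the idea, assuming Airflow 2.x: a scheduled DAG that wires a toy custom operator into the flow. The DAG id, schedule, and operator logic are illustrative and stand in for the project's actual plugins and operators.

```python
# Scheduled Airflow DAG with a toy custom operator. DAG id, schedule, and the
# operator's check logic are placeholders for illustration.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.models.baseoperator import BaseOperator
from airflow.operators.empty import EmptyOperator


class DataQualityOperator(BaseOperator):
    """Toy custom operator: fail the task if the check callable returns False."""

    def __init__(self, check_callable, **kwargs):
        super().__init__(**kwargs)
        self.check_callable = check_callable

    def execute(self, context):
        if not self.check_callable():
            raise ValueError("Data quality check failed")
        self.log.info("Data quality check passed")


default_args = {
    "owner": "data-engineer",
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "depends_on_past": False,
}

with DAG(
    dag_id="example_etl_pipeline",
    default_args=default_args,
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    start = EmptyOperator(task_id="begin_execution")
    quality_check = DataQualityOperator(
        task_id="run_data_quality_checks",
        check_callable=lambda: True,  # placeholder check
    )
    end = EmptyOperator(task_id="stop_execution")

    start >> quality_check >> end
```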