A music streaming company, Sparkify, has decided that it is time to introduce more automation and monitoring to their data warehouse ETL pipelines and has concluded that the best tool to achieve this is Apache Airflow.
They have brought you onto the project and expect you to create high-grade data pipelines that are dynamic, built from reusable tasks, can be monitored, and allow easy backfills. They have also noted that data quality plays a big part when analyses are executed on top of the data warehouse, and they want to run tests against their datasets after the ETL steps have been executed to catch any discrepancies.
The source data resides in S3 and needs to be processed in Sparkify's data warehouse in Amazon Redshift. The source datasets consist of JSON logs that record user activity in the application and JSON metadata about the songs the users listen to.
- Log data: `s3://udacity-dend/log_data`
- Song data: `s3://udacity-dend/song_data`
- Python, SQL
- Apache Airflow
- Amazon Web Services (AWS IAM, S3, Redshift Serverless, CLI)
- Create a project S3 bucket using the AWS CloudShell:
```sh
aws s3 mb s3://dend-phoebe/
```
- Copy the data from the udacity bucket to the home CloudShell directory:
```sh
aws s3 cp s3://udacity-dend/log-data/ ~/log-data/ --recursive
aws s3 cp s3://udacity-dend/song-data/ ~/song-data/ --recursive
```
- Copy the data from the home CloudShell directory to your own bucket:
```sh
aws s3 cp ~/log-data/ s3://dend-phoebe/log-data/ --recursive
aws s3 cp ~/song-data/ s3://dend-phoebe/song-data/ --recursive
```
- List the data in the bucket to be sure it copied over:
```sh
aws s3 ls s3://dend-phoebe/log-data/
aws s3 ls s3://dend-phoebe/song-data/
```
The `create_tables` DAG contains the following tasks:

- `Begin_execution` & `Stop_execution`: dummy operators representing the DAG start and end points
- `Create_tables`: create the tables in Redshift

The `finaly_project` DAG contains the following tasks:

- `Begin_execution` & `Stop_execution`: dummy operators representing the DAG start and end points
- `Stage_events` & `Stage_songs`: extract and load data from S3 into Amazon Redshift
- `Load_songplays_fact_table` & `Load_*_dim_table`: load and transform data from the staging tables into the fact and dimension tables
- `Run_data_quality_checks`: run data quality checks to ensure no empty tables
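A minimal sketch of what the "no empty tables" check can look like, assuming the star-schema table names `songplays`, `users`, `songs`, `artists`, and `time` and the `redshift` Airflow connection configured below:

```python
from airflow.providers.postgres.hooks.postgres import PostgresHook

def check_no_empty_tables(tables=("songplays", "users", "songs", "artists", "time")):
    """Fail the task if any of the given Redshift tables has zero rows."""
    redshift = PostgresHook(postgres_conn_id="redshift")
    for table in tables:
        records = redshift.get_records(f"SELECT COUNT(*) FROM {table}")
        if not records or not records[0] or records[0][0] < 1:
            raise ValueError(f"Data quality check failed: {table} is empty")
        print(f"Data quality check passed: {table} has {records[0][0]} rows")
```

The same logic could be wrapped in a `PythonOperator` or a custom operator; the function above is just the core of the check.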
- Create S3 bucket and copy data from source
- Set Up AWS and Airflow Configurations
- Run the `create_tables` DAG to create tables in Redshift
- Run the `finaly_project` DAG to trigger the ETL data pipeline

NOTE: Set `start_date=datetime(2018, 11, 1)` and `end_date=datetime(2018, 11, 2)` to test the data pipeline.
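For reference, here is a minimal sketch of how that test window fits into the `finaly_project` DAG, with `EmptyOperator` placeholders standing in for the project's actual tasks; the hourly schedule and the `user`/`song`/`artist`/`time` dimension table names are assumptions:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator  # DummyOperator on Airflow < 2.4

# Placeholder tasks that only illustrate the dependency graph; each EmptyOperator
# stands in for the corresponding staging, load, or quality-check task.
with DAG(
    dag_id="finaly_project",
    start_date=datetime(2018, 11, 1),
    end_date=datetime(2018, 11, 2),
    schedule_interval="@hourly",  # assumed schedule
    catchup=True,                 # backfill the hourly runs between the two dates
) as dag:
    begin = EmptyOperator(task_id="Begin_execution")
    stage_events = EmptyOperator(task_id="Stage_events")
    stage_songs = EmptyOperator(task_id="Stage_songs")
    load_songplays = EmptyOperator(task_id="Load_songplays_fact_table")
    load_dims = [
        EmptyOperator(task_id=f"Load_{name}_dim_table")
        for name in ("user", "song", "artist", "time")
    ]
    run_quality_checks = EmptyOperator(task_id="Run_data_quality_checks")
    stop = EmptyOperator(task_id="Stop_execution")

    begin >> [stage_events, stage_songs] >> load_songplays
    load_songplays >> load_dims >> run_quality_checks >> stop
```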
Permissions - attach existing policies:
- AdministratorAccess
- AmazonRedshiftFullAccess
- AmazonS3FullAccess
- Create a Redshift role `my-redshift-service-role` from the AWS CloudShell
- Give the role S3 full access (a boto3 alternative is sketched after this list):
```sh
aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess --role-name my-redshift-service-role
```
- Set up AWS Redshift Serverless
- Copy and store the Redshift Workgroup endpoint locally; we will need it while configuring the Airflow `redshift` connection
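If you would rather script the role setup than run it step by step, a rough boto3 equivalent of the two role steps above looks like this; the trust policy allowing Redshift to assume the role is an assumption and may need adjusting for your account:

```python
import json

import boto3

iam = boto3.client("iam")

# Trust policy letting the Redshift service assume the role (assumed).
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "redshift.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Create the role, then attach the same S3 full-access managed policy as above.
iam.create_role(
    RoleName="my-redshift-service-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.attach_role_policy(
    RoleName="my-redshift-service-role",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess",
)
```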
Add Airflow Connections:
- Connection ID: `aws_credentails`, Connection Type: Amazon Web Services
- Connection ID: `redshift`, Connection Type: Amazon Redshift

Add Airflow Variable:
- Key: `s3_bucket`, Value: `dend-phoebe` (bucket name)
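Once the connections and variable are in place, a quick way to sanity-check them from inside a task is something like the following sketch (the query and log lines are illustrative only):

```python
from airflow.hooks.base import BaseHook
from airflow.models import Variable
from airflow.providers.postgres.hooks.postgres import PostgresHook

# Read the variable and connections configured above (must run where the
# Airflow metadata database is reachable, e.g. inside a PythonOperator task).
s3_bucket = Variable.get("s3_bucket")                  # -> "dend-phoebe"
aws_conn = BaseHook.get_connection("aws_credentails")  # AWS credentials connection
redshift = PostgresHook(postgres_conn_id="redshift")   # Redshift connection

print(f"Staging data will be read from s3://{s3_bucket}/log-data/")
print("Connected to:", redshift.get_first("SELECT current_database()"))
```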