This repository shows how to build an ETL pipeline for a star schema on Amazon Redshift, based on a large amount of data stored in multiple S3 buckets.
The raw data (the Song and Log Datasets) is stored in S3, partitioned into many small JSON files, and our goal is to load it into a Redshift cluster. Additionally, a JSONPaths schema for reading the Log Dataset (log_json_path.json) is provided and is used by the COPY command to extract and process the data files.
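As an illustration only, a JSONPaths file maps each target table column to a path in the source JSON; the field names below are assumptions about the log events, not the actual contents of log_json_path.json:

```json
{
  "jsonpaths": [
    "$['artist']",
    "$['song']",
    "$['length']",
    "$['ts']",
    "$['userId']",
    "$['level']"
  ]
}
```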
An example COPY command that reads multiple JSON files in parallel is shown below:
```sql
COPY {table} FROM 's3://{bucket_name}'
CREDENTIALS 'aws_iam_role={role_arn}'
COMPUPDATE OFF REGION '{region}' FORMAT AS JSON 's3://{bucket_name}/log_json_path.json'
TRUNCATECOLUMNS BLANKSASNULL EMPTYASNULL
TIMEFORMAT AS 'epochmillisecs';
```
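As a rough sketch, the placeholders in this template can be filled from dwh.cfg before the statement is executed. The section and key names used here (S3, IAM_ROLE, CLUSTER, ...) and the staging_events table name are assumptions about the configuration layout, not the repository's exact values:

```python
import configparser

# Load the configuration written/updated by create_cluster.py.
config = configparser.ConfigParser()
config.read('dwh.cfg')

# Hypothetical section/key names -- adjust to match your dwh.cfg.
copy_staging_events = """
    COPY staging_events FROM 's3://{bucket_name}/log_data'
    CREDENTIALS 'aws_iam_role={role_arn}'
    COMPUPDATE OFF REGION '{region}'
    FORMAT AS JSON 's3://{bucket_name}/log_json_path.json'
    TRUNCATECOLUMNS BLANKSASNULL EMPTYASNULL
    TIMEFORMAT AS 'epochmillisecs';
""".format(
    bucket_name=config.get('S3', 'BUCKET_NAME'),
    role_arn=config.get('IAM_ROLE', 'ARN'),
    region=config.get('CLUSTER', 'REGION'),
)

print(copy_staging_events)  # the rendered statement, ready to be executed by etl.py
```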
The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset:

```
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
```
And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like.
```json
{
    "num_songs": 1,
    "artist_id": "ARJIE2Y1187B994AB7",
    "artist_latitude": null,
    "artist_longitude": null,
    "artist_location": "",
    "artist_name": "Line Renaud",
    "song_id": "ABCPIRU12A6D4FA1E1",
    "title": "Der Kleine Dompfaff",
    "duration": 152.92036,
    "year": 0
}
```
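For orientation, a staging table matching these fields might be declared in sql_queries.py along the following lines; the table name and column types are assumptions based on the sample record, not the repository's actual definitions:

```python
# Sketch of a staging-table definition in the style of sql_queries.py.
# Types are guesses derived from the sample song record above.
staging_songs_table_create = ("""
    CREATE TABLE IF NOT EXISTS staging_songs (
        num_songs        INTEGER,
        artist_id        VARCHAR,
        artist_latitude  FLOAT,
        artist_longitude FLOAT,
        artist_location  VARCHAR,
        artist_name      VARCHAR,
        song_id          VARCHAR,
        title            VARCHAR,
        duration         FLOAT,
        year             INTEGER
    );
""")
```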
The second dataset consists of JSON log files generated by an event simulator using the songs in the dataset above; they simulate activity logs from a music streaming app according to specified configurations.
The log files in the dataset you'll be working with are partitioned by year and month. For example, here are filepaths to two files in this dataset:

```
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
```
There are multiple options for creating a Redshift cluster:
1. From the AWS Console
2. Using the Boto3 library
3. Using the AWS CLI
In this repo, we use Approach #2 via create_cluster.py.
The script additionally creates the necessary IAM roles and writes the IAM role ARN and the cluster endpoint back into the configuration file dwh.cfg
to simplify later usage.
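A minimal sketch of what create_cluster.py might do with Boto3 is shown below; the cluster parameters, role name, and identifiers are illustrative assumptions rather than the script's actual values:

```python
import json
import boto3

# Illustrative values -- in the real script these come from dwh.cfg.
ROLE_NAME = 'dwhRole'
CLUSTER_ID = 'dwhCluster'
REGION = 'us-west-2'

iam = boto3.client('iam', region_name=REGION)
redshift = boto3.client('redshift', region_name=REGION)

# 1. Create an IAM role that Redshift can assume to read from S3.
iam.create_role(
    RoleName=ROLE_NAME,
    AssumeRolePolicyDocument=json.dumps({
        'Version': '2012-10-17',
        'Statement': [{
            'Effect': 'Allow',
            'Principal': {'Service': 'redshift.amazonaws.com'},
            'Action': 'sts:AssumeRole',
        }],
    }),
)
iam.attach_role_policy(
    RoleName=ROLE_NAME,
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess',
)
role_arn = iam.get_role(RoleName=ROLE_NAME)['Role']['Arn']

# 2. Launch the cluster with that role attached.
redshift.create_cluster(
    ClusterIdentifier=CLUSTER_ID,
    ClusterType='multi-node',
    NodeType='dc2.large',
    NumberOfNodes=4,
    DBName='dwh',
    MasterUsername='dwhuser',
    MasterUserPassword='Passw0rd',
    IamRoles=[role_arn],
)

# 3. Once the cluster is 'available', fetch its endpoint so it can be
#    written back into dwh.cfg together with role_arn.
cluster = redshift.describe_clusters(ClusterIdentifier=CLUSTER_ID)['Clusters'][0]
endpoint = cluster['Endpoint']['Address'] if cluster['ClusterStatus'] == 'available' else None
print(role_arn, endpoint)
```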
To create the necessary tables, run create_tables.py.
Run etl.py to issue the COPY commands that extract data from S3 and load it into Redshift.
All necessary SQL queries are defined in sql_queries.py; be sure to modify them to suit your needs.
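As a rough sketch, and under the assumption that sql_queries.py exposes lists of COPY and INSERT statements (e.g. copy_table_queries and insert_table_queries), etl.py could be structured like this:

```python
import configparser
import psycopg2

# Assumed names -- adjust to whatever sql_queries.py actually exports.
from sql_queries import copy_table_queries, insert_table_queries


def run_queries(cur, conn, queries):
    """Execute a list of SQL statements, committing after each one."""
    for query in queries:
        cur.execute(query)
        conn.commit()


def main():
    config = configparser.ConfigParser()
    config.read('dwh.cfg')

    # Connection parameters are assumed to live in a [CLUSTER] section of dwh.cfg.
    conn = psycopg2.connect(
        "host={host} dbname={db_name} user={db_user} "
        "password={db_password} port={db_port}".format(**config['CLUSTER'])
    )
    cur = conn.cursor()

    run_queries(cur, conn, copy_table_queries)    # S3 -> staging tables
    run_queries(cur, conn, insert_table_queries)  # staging -> star schema

    conn.close()


if __name__ == '__main__':
    main()
```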
Important! After you have finished, you can optionally uncomment the delete_cluster() call inside create_cluster.py to tear down the cluster and avoid high expenses.
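For reference, a delete_cluster() helper built on Boto3 might look roughly like this (the identifiers are again assumptions):

```python
import boto3


def delete_cluster(cluster_id='dwhCluster', region='us-west-2'):
    """Tear down the Redshift cluster without keeping a final snapshot."""
    redshift = boto3.client('redshift', region_name=region)
    redshift.delete_cluster(
        ClusterIdentifier=cluster_id,
        SkipFinalClusterSnapshot=True,
    )
```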
You can double-check whether the cluster is still running by issuing the commands below (this example assumes the cluster runs in the us-west-2 region):
```bash
aws redshift describe-clusters --region us-west-2
aws redshift delete-cluster --region us-west-2 --cluster-identifier ${cluster_name} --skip-final-cluster-snapshot
```
The graph below shows the baseline performance of creating the star schema in Redshift: