This repository shows how to build an ETL pipeline for a star schema on Amazon Redshift, based on a large amount of data stored in multiple S3 buckets.
The raw data (the Song and Log Datasets) is stored in S3, partitioned into many small JSON files, and our goal is to load it into a Redshift cluster. Additionally, a JSONPaths schema for reading the Log Dataset (log_json_path.json) is provided and is used by the COPY command to extract and process the data files.
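As an illustration only, a JSONPaths file maps each target table column to a path in the source JSON; the field names below are assumptions about the log events, not the actual contents of log_json_path.json:

```json
{
  "jsonpaths": [
    "$['artist']",
    "$['song']",
    "$['length']",
    "$['ts']",
    "$['userId']",
    "$['level']"
  ]
}
```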
An example COPY command that reads multiple JSON files in parallel is shown below:
```sql
COPY {table} FROM 's3://{bucket_name}'
CREDENTIALS 'aws_iam_role={role_arn}'
COMPUPDATE OFF REGION '{region}' FORMAT AS JSON 's3://{bucket_name}/log_json_path.json'
TRUNCATECOLUMNS BLANKSASNULL EMPTYASNULL
TIMEFORMAT AS 'epochmillisecs';
```
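As a rough sketch, the placeholders in this template can be filled from dwh.cfg before the statement is executed. The section and key names used here (S3, IAM_ROLE, CLUSTER, ...) and the staging_events table name are assumptions about the configuration layout, not the repository's exact values:

```python
import configparser

# Load the configuration written/updated by create_cluster.py.
config = configparser.ConfigParser()
config.read('dwh.cfg')

# Hypothetical section/key names -- adjust to match your dwh.cfg.
copy_staging_events = """
    COPY staging_events FROM 's3://{bucket_name}/log_data'
    CREDENTIALS 'aws_iam_role={role_arn}'
    COMPUPDATE OFF REGION '{region}'
    FORMAT AS JSON 's3://{bucket_name}/log_json_path.json'
    TRUNCATECOLUMNS BLANKSASNULL EMPTYASNULL
    TIMEFORMAT AS 'epochmillisecs';
""".format(
    bucket_name=config.get('S3', 'BUCKET_NAME'),
    role_arn=config.get('IAM_ROLE', 'ARN'),
    region=config.get('CLUSTER', 'REGION'),
)

print(copy_staging_events)  # the rendered statement, ready to be executed by etl.py
```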
The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset:

```
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
```
And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like.
```json
{
    "num_songs": 1,
    "artist_id": "ARJIE2Y1187B994AB7",
    "artist_latitude": null,
    "artist_longitude": null,
    "artist_location": "",
    "artist_name": "Line Renaud",
    "song_id": "ABCPIRU12A6D4FA1E1",
    "title": "Der Kleine Dompfaff",
    "duration": 152.92036,
    "year": 0
}
```
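For orientation, a staging table matching these fields might be declared in sql_queries.py along the following lines; the table name and column types are assumptions based on the sample record, not the repository's actual definitions:

```python
# Sketch of a staging-table definition in the style of sql_queries.py.
# Types are guesses derived from the sample song record above.
staging_songs_table_create = ("""
    CREATE TABLE IF NOT EXISTS staging_songs (
        num_songs        INTEGER,
        artist_id        VARCHAR,
        artist_latitude  FLOAT,
        artist_longitude FLOAT,
        artist_location  VARCHAR,
        artist_name      VARCHAR,
        song_id          VARCHAR,
        title            VARCHAR,
        duration         FLOAT,
        year             INTEGER
    );
""")
```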
The second dataset consists of JSON log files generated by an event simulator using the songs in the dataset above; they simulate activity logs from a music streaming app according to specified configurations.
The log files in the dataset you'll be working with are partitioned by year and month. For example, here are filepaths to two files in this dataset:

```
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
```
There are multiple options for creating a Redshift cluster:
1. From the AWS Console
2. Using the Boto3 library
3. Using the AWS CLI
In this repo, we use Approach #2 via create_cluster.py.
The script additionally creates the necessary IAM roles and writes the IAM role ARN and the cluster endpoint back into the configuration file dwh.cfg
to simplify later usage.
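A minimal sketch of what create_cluster.py might do with Boto3 is shown below; the cluster parameters, role name, and identifiers are illustrative assumptions rather than the script's actual values:

```python
import json
import boto3

# Illustrative values -- in the real script these come from dwh.cfg.
ROLE_NAME = 'dwhRole'
CLUSTER_ID = 'dwhCluster'
REGION = 'us-west-2'

iam = boto3.client('iam', region_name=REGION)
redshift = boto3.client('redshift', region_name=REGION)

# 1. Create an IAM role that Redshift can assume to read from S3.
iam.create_role(
    RoleName=ROLE_NAME,
    AssumeRolePolicyDocument=json.dumps({
        'Version': '2012-10-17',
        'Statement': [{
            'Effect': 'Allow',
            'Principal': {'Service': 'redshift.amazonaws.com'},
            'Action': 'sts:AssumeRole',
        }],
    }),
)
iam.attach_role_policy(
    RoleName=ROLE_NAME,
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess',
)
role_arn = iam.get_role(RoleName=ROLE_NAME)['Role']['Arn']

# 2. Launch the cluster with that role attached.
redshift.create_cluster(
    ClusterIdentifier=CLUSTER_ID,
    ClusterType='multi-node',
    NodeType='dc2.large',
    NumberOfNodes=4,
    DBName='dwh',
    MasterUsername='dwhuser',
    MasterUserPassword='Passw0rd',
    IamRoles=[role_arn],
)

# 3. Once the cluster is 'available', fetch its endpoint so it can be
#    written back into dwh.cfg together with role_arn.
cluster = redshift.describe_clusters(ClusterIdentifier=CLUSTER_ID)['Clusters'][0]
endpoint = cluster['Endpoint']['Address'] if cluster['ClusterStatus'] == 'available' else None
print(role_arn, endpoint)
```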
To create the necessary tables, run create_tables.py.
Run etl.py to issue the COPY commands that extract data from S3 and load it into Redshift.
All necessary SQL queries are defined in sql_queries.py; be sure to modify them to suit your needs.
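As a rough sketch, and under the assumption that sql_queries.py exposes lists of COPY and INSERT statements (e.g. copy_table_queries and insert_table_queries), etl.py could be structured like this:

```python
import configparser
import psycopg2

# Assumed names -- adjust to whatever sql_queries.py actually exports.
from sql_queries import copy_table_queries, insert_table_queries


def run_queries(cur, conn, queries):
    """Execute a list of SQL statements, committing after each one."""
    for query in queries:
        cur.execute(query)
        conn.commit()


def main():
    config = configparser.ConfigParser()
    config.read('dwh.cfg')

    # Connection parameters are assumed to live in a [CLUSTER] section of dwh.cfg.
    conn = psycopg2.connect(
        "host={host} dbname={db_name} user={db_user} "
        "password={db_password} port={db_port}".format(**config['CLUSTER'])
    )
    cur = conn.cursor()

    run_queries(cur, conn, copy_table_queries)    # S3 -> staging tables
    run_queries(cur, conn, insert_table_queries)  # staging -> star schema

    conn.close()


if __name__ == '__main__':
    main()
```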
Important! After you have finished, you can optionally uncomment the delete_cluster() call inside create_cluster.py to tear down the cluster and avoid high expenses.
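For reference, a delete_cluster() helper built on Boto3 might look roughly like this (the identifiers are again assumptions):

```python
import boto3


def delete_cluster(cluster_id='dwhCluster', region='us-west-2'):
    """Tear down the Redshift cluster without keeping a final snapshot."""
    redshift = boto3.client('redshift', region_name=region)
    redshift.delete_cluster(
        ClusterIdentifier=cluster_id,
        SkipFinalClusterSnapshot=True,
    )
```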
You can double-check whether the cluster is still running by issuing the commands below (this example assumes the cluster runs in the us-west-2 region):
```bash
aws redshift describe-clusters --region us-west-2
aws redshift delete-cluster --region us-west-2 --cluster-identifier ${cluster_name} --skip-final-cluster-snapshot
```
The graph below shows the baseline performance of creating the star schema in Redshift: