Creating Data Warehouse in Redshift

This repository shows how to build an ETL pipeline for a star schema on Amazon Redshift, loading a large amount of data stored in multiple S3 buckets.

Data description

The raw data (the Song and Log datasets) is stored in S3, partitioned into many small JSON files, and our goal is to load it into a Redshift cluster. Additionally, a JSONPaths schema for reading the Log dataset (log_json_path.json) is provided and is used during the COPY command to extract and process the data files.

An example COPY command that reads multiple JSON files in parallel is shown below:

COPY {table} FROM 's3://{bucket_name}'
   CREDENTIALS 'aws_iam_role={role_arn}'
   COMPUPDATE OFF REGION '{region}' FORMAT AS JSON 's3://{bucket_name}/log_json_path.json'
   TRUNCATECOLUMNS BLANKSASNULL EMPTYASNULL
   TIMEFORMAT AS 'epochmillisecs';

Song Dataset

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset.

song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json

And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like.

{
    "num_songs": 1,
    "artist_id": "ARJIE2Y1187B994AB7",
    "artist_latitude": null,
    "artist_longitude": null,
    "artist_location": "",
    "artist_name": "Line Renaud",
    "song_id": "ABCPIRU12A6D4FA1E1",
    "title": "Der Kleine Dompfaff",
    "duration": 152.92036,
    "year": 0
}
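
For reference, a staging table mirroring these fields might look like the sketch below. It follows the sql_queries.py convention of keeping SQL statements as Python strings; the column types are assumptions inferred from the sample record, and the actual DDL used by the project lives in sql_queries.py.

# Sketch of a staging table for the Song dataset, written as a Python string
# in the style of sql_queries.py. Column types are assumptions inferred from
# the sample JSON record above.
staging_songs_table_create = """
    CREATE TABLE IF NOT EXISTS staging_songs (
        num_songs        INTEGER,
        artist_id        VARCHAR,
        artist_latitude  FLOAT,
        artist_longitude FLOAT,
        artist_location  VARCHAR,
        artist_name      VARCHAR,
        song_id          VARCHAR,
        title            VARCHAR,
        duration         FLOAT,
        year             INTEGER
    );
"""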

Log Dataset

The second dataset consists of log files in JSON format generated by this event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations.

The log files in the dataset you'll be working with are partitioned by year and month. For example, here are filepaths to two files in this dataset.

log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
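
To see how this partitioning maps to S3 keys, the short sketch below lists one month of log files with Boto3; the bucket name is a placeholder, not the project's actual bucket.

import boto3

# Hypothetical sketch: list the log files for a single month to illustrate
# the year/month partitioning. The bucket name below is a placeholder.
bucket_name = "your-log-bucket"
s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket=bucket_name, Prefix="log_data/2018/11/")
for obj in response.get("Contents", []):
    print(obj["Key"])  # e.g. log_data/2018/11/2018-11-12-events.json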

Walkthrough

Creating Redshift Cluster

There are multiple options for creating a Redshift cluster:

  1. From AWS Console
  2. Using Boto3 library
  3. Using AWS CLI

In this repo, we use approach #2, implemented in create_cluster.py.

The script also creates the necessary IAM role and writes the IAM role ARN and cluster endpoint back into the configuration file dwh.cfg to simplify later usage.
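
For illustration, a minimal Boto3 sketch of the cluster-creation call is shown below. The node type, node count, and configuration keys are assumptions; the full logic, including IAM role creation and waiting for the cluster to become available, lives in create_cluster.py.

import configparser
import boto3

# Minimal sketch of the Boto3 call behind create_cluster.py. The dwh.cfg
# section and key names used here are assumptions.
config = configparser.ConfigParser()
config.read("dwh.cfg")

redshift = boto3.client("redshift", region_name=config.get("AWS", "REGION"))

redshift.create_cluster(
    ClusterType="multi-node",          # assumed cluster layout
    NodeType="dc2.large",              # assumed node type
    NumberOfNodes=4,                   # assumed node count
    ClusterIdentifier=config.get("CLUSTER", "IDENTIFIER"),
    DBName=config.get("CLUSTER", "DB_NAME"),
    MasterUsername=config.get("CLUSTER", "DB_USER"),
    MasterUserPassword=config.get("CLUSTER", "DB_PASSWORD"),
    IamRoles=[config.get("IAM_ROLE", "ARN")],
)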

Creating Tables

To create the necessary tables, run create_tables.py.
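
Under the hood, this boils down to opening a psycopg2 connection using dwh.cfg and executing the DROP/CREATE statements from sql_queries.py; a minimal sketch is shown below (the list names and the exact config layout are assumptions).

import configparser
import psycopg2
# List names are assumptions about how sql_queries.py organizes its statements.
from sql_queries import create_table_queries, drop_table_queries

# Minimal sketch of create_tables.py: connect, drop old tables, create new ones.
config = configparser.ConfigParser()
config.read("dwh.cfg")

conn = psycopg2.connect(
    "host={} dbname={} user={} password={} port={}".format(*config["CLUSTER"].values())
)
cur = conn.cursor()

for query in drop_table_queries + create_table_queries:
    cur.execute(query)
    conn.commit()

conn.close()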

Loading the data

Run etl.py to issue COPY commands that extract the data from S3 and load it into Redshift.

All necessary SQL queries are defined in sql_queries.py; be sure to modify them to fit your needs.
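
Conceptually, etl.py just iterates over these queries on an open psycopg2 connection: first the COPY statements that fill the staging tables, then the INSERT ... SELECT statements that populate the star schema. The sketch below assumes the query lists are named copy_table_queries and insert_table_queries in sql_queries.py.

# Sketch of the two loading phases in etl.py; the list names imported here
# are assumptions about sql_queries.py.
from sql_queries import copy_table_queries, insert_table_queries

def load_staging_tables(cur, conn):
    # COPY raw JSON from S3 into the staging tables.
    for query in copy_table_queries:
        cur.execute(query)
        conn.commit()

def insert_tables(cur, conn):
    # INSERT ... SELECT from the staging tables into the star-schema tables.
    for query in insert_table_queries:
        cur.execute(query)
        conn.commit()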

Cleanup

Important! After you are finished, you can optionally uncomment the delete_cluster() call inside create_cluster.py to tear down the cluster and avoid unnecessary costs.
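
Alternatively, the same teardown can be done from Python; a rough equivalent of such a delete call is sketched below (the cluster identifier and region are placeholders).

import boto3

# Sketch of deleting the cluster via Boto3, roughly what a delete_cluster()
# helper is expected to do. Identifier and region below are placeholders.
redshift = boto3.client("redshift", region_name="us-west-2")
redshift.delete_cluster(
    ClusterIdentifier="my-redshift-cluster",
    SkipFinalClusterSnapshot=True,
)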

Delete cluster via CLI:

You can double-check whether the cluster is still running by issuing the commands below.

For example, assuming the cluster has been running in the us-west-2 region:

aws redshift describe-clusters --region us-west-2
aws redshift delete-cluster --region us-west-2 --cluster-identifier ${cluster_name} --skip-final-cluster-snapshot

Query Performance

The figures below show the baseline performance of creating the star schema in Redshift:

(Runtime)

And the graph:

(RuntimeGraph)