Create an ETL pipeline that extracts data from S3, transforms it into a set of dimensional tables, and loads them into another S3 bucket for the analytics team to continue finding insights into what songs and artists their users are listening to. Example questions the team can answer:
- What time of day is an artist/song typically listened to?
- What are paid users listening to the most?
- What is the most listened-to artist/song by location?
- What is an artist's most listened-to song?
- What is an artist's total listens across all of their songs?
Data is dimensionalised into the users, songs, artists, and time tables, and the required fact table, **songplays**, is created for the team to run their queries against. Initial analysis can be done on the IDs in the fact table alone, which keeps those queries fast.
**Fact Table**

1. **songplays** - records in log data associated with song plays, i.e. records with page `NextSong`
   - *songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent*

**Dimension Tables**

1. **users** - users in the app
   - *user_id, first_name, last_name, gender, level*
2. **songs** - songs in the music database
   - *song_id, title, artist_id, year, duration*
3. **artists** - artists in the music database
   - *artist_id, name, location, latitude, longitude*
4. **time** - timestamps of records in **songplays** broken down into specific units
   - *start_time, hour, day, week, month, year, weekday*
Song and event log data is loaded from S3 into Spark.
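A minimal sketch of this extraction step is shown below. The bucket name and wildcard paths are placeholders (in `etl.py` they would come from the `INPUT` setting in the configuration), the nesting of the wildcards assumes the usual `song_data`/`log_data` directory layout, and reading `s3a://` paths assumes the hadoop-aws jars are on the Spark classpath.

```python
from pyspark.sql import SparkSession

# Build (or reuse) a Spark session for the ETL run.
spark = SparkSession.builder.appName("sparkify-etl").getOrCreate()

# Placeholder S3 locations; substitute the INPUT value from the config.
song_df = spark.read.json("s3a://input-bucket/song_data/*/*/*/*.json")
log_df = spark.read.json("s3a://input-bucket/log_data/*/*/*.json")
```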
Transformation is required on the data coming from the existing JSON logs in order to create the dimension and fact tables:
- Grouping data points into dimension tables (e.g. user_id with first_name)
- Transforming the UNIX epoch timestamp into a datetime and its components (hour, day, month, etc.) for easy access to each
- Matching the artist name and song title from the event logs against the songs data to add a song_id and artist_id to the songplays table (see the sketch after this list)
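A hedged sketch of these transformations, assuming `song_df` and `log_df` were read as above, that the event-log columns follow the usual layout (`ts` in milliseconds, plus `song`, `artist`, `userId`, `sessionId`, `userAgent`), and that the song data carries `title` and `artist_name` columns; these column names are assumptions, not taken from the project files:

```python
from pyspark.sql import functions as F

# Keep only song-play events and convert the millisecond epoch `ts`
# column into a proper timestamp named start_time.
log_df = (
    log_df.filter(F.col("page") == "NextSong")
          .withColumn("start_time", (F.col("ts") / 1000).cast("timestamp"))
)

# time dimension: break start_time into the units listed above.
time_table = log_df.select(
    "start_time",
    F.hour("start_time").alias("hour"),
    F.dayofmonth("start_time").alias("day"),
    F.weekofyear("start_time").alias("week"),
    F.month("start_time").alias("month"),
    F.year("start_time").alias("year"),
    F.dayofweek("start_time").alias("weekday"),
).dropDuplicates(["start_time"])

# songplays fact table: attach song_id and artist_id by matching the
# log's song title and artist name against the song data.
songplays_table = log_df.join(
    song_df,
    (log_df.song == song_df.title) & (log_df.artist == song_df.artist_name),
    "left",
).select(
    F.monotonically_increasing_id().alias("songplay_id"),
    "start_time",
    F.col("userId").alias("user_id"),
    "level",
    "song_id",
    "artist_id",
    F.col("sessionId").alias("session_id"),
    "location",
    F.col("userAgent").alias("user_agent"),
)
```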
Number of listens by artist:

```sql
SELECT a.name, COUNT(s.session_id)
FROM songplays s
JOIN artists a ON s.artist_id = a.artist_id
GROUP BY a.name;
```

Listen counts by song by location:

```sql
SELECT sp.location, s.title, COUNT(sp.session_id)
FROM songplays sp
JOIN songs s ON sp.song_id = s.song_id
GROUP BY s.title, sp.location;
```

Song listen counts by paid users:

```sql
SELECT s.title, COUNT(sp.session_id)
FROM songplays sp
JOIN songs s ON sp.song_id = s.song_id
WHERE sp.level = 'paid'
GROUP BY s.title;
```
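As an illustration of how the analytics team might run such queries against the data lake with Spark SQL (the parquet paths below are placeholders for the `OUTPUT` location):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkify-analysis").getOrCreate()

# Placeholder paths; point these at the parquet files written by etl.py.
spark.read.parquet("s3a://output-bucket/songplays/").createOrReplaceTempView("songplays")
spark.read.parquet("s3a://output-bucket/artists/").createOrReplaceTempView("artists")

# Number of listens by artist, mirroring the first example query above.
spark.sql("""
    SELECT a.name, COUNT(s.session_id) AS listens
    FROM songplays s
    JOIN artists a ON s.artist_id = a.artist_id
    GROUP BY a.name
    ORDER BY listens DESC
""").show()
```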
To set up and run the pipeline:
- In the AWS console or via the AWS CLI, create an S3 bucket for the parquet file output (the data lake)
- Install Python, Java (OpenJDK 8+), and Apache Spark
- Start a locally accessible Spark cluster (or use `docker-compose up` to auto-create a local cluster on `spark-network`, as shown below)
- Run `python etl.py` (or alternatively submit the script to Spark)
If running Docker, use the following to start a local cluster with an attached Jupyter Notebook (the trailing `rm` cleans up the containers once the cluster is stopped):

```bash
docker-compose up --scale spark-worker=3 && docker-compose rm -fsv
```
Before running, supply AWS credentials and the input/output locations in the project configuration:

```
AWS_ACCESS_KEY_ID=''
AWS_SECRET_ACCESS_KEY=''
INPUT='location of song and log files'
OUTPUT='location to write processed parquet files'
```
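One way `etl.py` could consume these values is sketched below; the `dl.cfg` file name and the `[default]` section header are assumptions (configparser requires some section header), and the `.strip("'")` calls simply remove the surrounding quotes shown above.

```python
import configparser
import os

# Read the settings from a local config file (name and section are assumed).
config = configparser.ConfigParser()
config.read("dl.cfg")

# Export the AWS credentials so the S3A filesystem can pick them up.
os.environ["AWS_ACCESS_KEY_ID"] = config["default"]["AWS_ACCESS_KEY_ID"].strip("'")
os.environ["AWS_SECRET_ACCESS_KEY"] = config["default"]["AWS_SECRET_ACCESS_KEY"].strip("'")

# Input and output locations for the ETL run.
input_data = config["default"]["INPUT"].strip("'")
output_data = config["default"]["OUTPUT"].strip("'")
```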