Create an ETL pipeline that extracts data from S3, transforms it into a set of dimensional tables, and loads them into another S3 bucket for the analytics team to continue finding insights into what songs and artists their users are listening to. Example questions the team can answer:
- What time of day is an artist/song typically listened to?
- What are paid users listening to the most?
- What is the most listened-to artist/song by location?
- What is an artist's most listened-to song?
- What is an artist's total listens across all of their songs?
Data is dimensionalised into the users, songs, artists, and time tables, and the required fact table, **songplays**, is created for the team to run their queries against. Initial analysis can be done on the IDs in the fact table alone, which keeps those queries fast.
**Fact Table**

1. **songplays** - records in log data associated with song plays, i.e. records with page `NextSong`
   - *songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent*

**Dimension Tables**

1. **users** - users in the app
   - *user_id, first_name, last_name, gender, level*
2. **songs** - songs in the music database
   - *song_id, title, artist_id, year, duration*
3. **artists** - artists in the music database
   - *artist_id, name, location, latitude, longitude*
4. **time** - timestamps of records in **songplays** broken down into specific units
   - *start_time, hour, day, week, month, year, weekday*
Song and event log data is loaded from S3 into Spark.
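A minimal sketch of this extraction step is shown below. The bucket name and wildcard paths are placeholders (in `etl.py` they would come from the `INPUT` setting in the configuration), the nesting of the wildcards assumes the usual `song_data`/`log_data` directory layout, and reading `s3a://` paths assumes the hadoop-aws jars are on the Spark classpath.

```python
from pyspark.sql import SparkSession

# Build (or reuse) a Spark session for the ETL run.
spark = SparkSession.builder.appName("sparkify-etl").getOrCreate()

# Placeholder S3 locations; substitute the INPUT value from the config.
song_df = spark.read.json("s3a://input-bucket/song_data/*/*/*/*.json")
log_df = spark.read.json("s3a://input-bucket/log_data/*/*/*.json")
```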
Transformation is required on the data coming from the existing JSON logs in order to create the dimension and fact tables:
- Grouping data points into dimension tables (e.g. user_id with first_name)
- Transforming the UNIX epoch timestamp into a datetime and its components (hour, day, month, etc.) for easy access to each
- Matching the artist name and song title from the event logs against the songs data to add a song_id and artist_id to the songplays table (see the sketch after this list)
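A hedged sketch of these transformations, assuming `song_df` and `log_df` were read as above, that the event-log columns follow the usual layout (`ts` in milliseconds, plus `song`, `artist`, `userId`, `sessionId`, `userAgent`), and that the song data carries `title` and `artist_name` columns; these column names are assumptions, not taken from the project files:

```python
from pyspark.sql import functions as F

# Keep only song-play events and convert the millisecond epoch `ts`
# column into a proper timestamp named start_time.
log_df = (
    log_df.filter(F.col("page") == "NextSong")
          .withColumn("start_time", (F.col("ts") / 1000).cast("timestamp"))
)

# time dimension: break start_time into the units listed above.
time_table = log_df.select(
    "start_time",
    F.hour("start_time").alias("hour"),
    F.dayofmonth("start_time").alias("day"),
    F.weekofyear("start_time").alias("week"),
    F.month("start_time").alias("month"),
    F.year("start_time").alias("year"),
    F.dayofweek("start_time").alias("weekday"),
).dropDuplicates(["start_time"])

# songplays fact table: attach song_id and artist_id by matching the
# log's song title and artist name against the song data.
songplays_table = log_df.join(
    song_df,
    (log_df.song == song_df.title) & (log_df.artist == song_df.artist_name),
    "left",
).select(
    F.monotonically_increasing_id().alias("songplay_id"),
    "start_time",
    F.col("userId").alias("user_id"),
    "level",
    "song_id",
    "artist_id",
    F.col("sessionId").alias("session_id"),
    "location",
    F.col("userAgent").alias("user_agent"),
)
```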
Number of listens by artist:

```sql
SELECT a.name, COUNT(s.session_id)
FROM songplays s
JOIN artists a ON s.artist_id = a.artist_id
GROUP BY a.name;
```

Listen counts by song by location:

```sql
SELECT sp.location, s.title, COUNT(sp.session_id)
FROM songplays sp
JOIN songs s ON sp.song_id = s.song_id
GROUP BY s.title, sp.location;
```

Song listen counts by paid users:

```sql
SELECT s.title, COUNT(sp.session_id)
FROM songplays sp
JOIN songs s ON sp.song_id = s.song_id
WHERE sp.level = 'paid'
GROUP BY s.title;
```
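As an illustration of how the analytics team might run such queries against the data lake with Spark SQL (the parquet paths below are placeholders for the `OUTPUT` location):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkify-analysis").getOrCreate()

# Placeholder paths; point these at the parquet files written by etl.py.
spark.read.parquet("s3a://output-bucket/songplays/").createOrReplaceTempView("songplays")
spark.read.parquet("s3a://output-bucket/artists/").createOrReplaceTempView("artists")

# Number of listens by artist, mirroring the first example query above.
spark.sql("""
    SELECT a.name, COUNT(s.session_id) AS listens
    FROM songplays s
    JOIN artists a ON s.artist_id = a.artist_id
    GROUP BY a.name
    ORDER BY listens DESC
""").show()
```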
To set up and run the pipeline:
- In the AWS console or via the AWS CLI, create an S3 bucket for the parquet file output (the data lake)
- Install Python, Java (OpenJDK 8+), and Apache Spark
- Start a locally accessible Spark cluster (or use `docker-compose up` to auto-create a local cluster on `spark-network`, as shown below)
- Run `python etl.py` (or alternatively submit the script to Spark)
If running Docker, use the following to start a local cluster with an attached Jupyter Notebook (the trailing `rm` cleans up the containers once the cluster is stopped):

```bash
docker-compose up --scale spark-worker=3 && docker-compose rm -fsv
```
Before running, supply AWS credentials and the input/output locations in the project configuration:

```
AWS_ACCESS_KEY_ID=''
AWS_SECRET_ACCESS_KEY=''
INPUT='location of song and log files'
OUTPUT='location to write processed parquet files'
```
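One way `etl.py` could consume these values is sketched below; the `dl.cfg` file name and the `[default]` section header are assumptions (configparser requires some section header), and the `.strip("'")` calls simply remove the surrounding quotes shown above.

```python
import configparser
import os

# Read the settings from a local config file (name and section are assumed).
config = configparser.ConfigParser()
config.read("dl.cfg")

# Export the AWS credentials so the S3A filesystem can pick them up.
os.environ["AWS_ACCESS_KEY_ID"] = config["default"]["AWS_ACCESS_KEY_ID"].strip("'")
os.environ["AWS_SECRET_ACCESS_KEY"] = config["default"]["AWS_SECRET_ACCESS_KEY"].strip("'")

# Input and output locations for the ETL run.
input_data = config["default"]["INPUT"].strip("'")
output_data = config["default"]["OUTPUT"].strip("'")
```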