A music streaming startup, Sparkify, has grown their user base and song database even more and want to move their data warehouse to a data lake. Their data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.
They would like a data engineer to build an ETL pipeline that extracts their data from S3, processes them using Spark, and loads the data back into S3 as a set of dimensional tables. This will allow their analytics team to continue finding insights in what songs their users are listening to.
In this project, the data engineer is required to build an ETL pipeline for a data lake hosted on S3: load the data from S3, process it into analytics tables using Spark, and write those tables back to S3. The Spark job needs to be deployed on a cluster on AWS (EMR).
We will be working with two datasets in this project, Song Data and Log Data, which reside as JSON files in AWS S3.
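At a high level, `etl.py` follows the structure sketched below. This skeleton is illustrative rather than the project's actual code: the helper names, placeholder bucket paths, and the hadoop-aws package version are assumptions.

```python
from pyspark.sql import SparkSession


def create_spark_session():
    """Build a Spark session able to read from and write to S3 (s3a://)."""
    return (
        SparkSession.builder
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0")
        .getOrCreate()
    )


def process_song_data(spark, input_data, output_data):
    """Extract the songs and artists dimension tables from the song dataset."""
    pass  # see the song-data sketch further below


def process_log_data(spark, input_data, output_data):
    """Extract the users, time, and songplays tables from the log dataset."""
    pass  # see the log-data sketch further below


def main():
    spark = create_spark_session()
    input_data = "s3a://<input-bucket>/"    # placeholder: source bucket
    output_data = "s3a://<output-bucket>/"  # placeholder: destination bucket

    process_song_data(spark, input_data, output_data)
    process_log_data(spark, input_data, output_data)


if __name__ == "__main__":
    main()
```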
The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID.
Below is an example of what a single song JSON file looks like:
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate app activity logs from an imaginary music streaming app based on configuration settings.
The log files in the dataset are partitioned by year and month.
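The log files can be read the same way and reduced to song-play events, which is the basis for the songplays and time tables described below. The path, wildcard depth, and timestamp handling shown are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Log files are partitioned by year and month (path is a placeholder)
log_df = spark.read.json("s3a://<input-bucket>/log_data/*/*/*.json")

# Keep only song-play events
log_df = log_df.filter(F.col("page") == "NextSong")

# Convert the epoch-millisecond 'ts' field into a proper timestamp column
log_df = log_df.withColumn("start_time", (F.col("ts") / 1000).cast("timestamp"))
```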
The database schema consists of one fact table and four dimension tables:
- songplay_table (fact) - records in event data associated with song plays, i.e. records with page `NextSong` - songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
- user_table (dimension) - users in the app - user_id, first_name, last_name, gender, level
- song_table (dimension) - songs in the music database - song_id, title, artist_id, year, duration
- artist_table (dimension) - artists in the music database - artist_id, name, location, latitude, longitude
- time_table (dimension) - timestamps of records in songplays broken down into specific units - start_time, hour, day, week, month, year, weekday
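Continuing the sketches above (with `song_df` and `log_df` as read there, and `output_data` as configured in the setup steps below), the time and songplays tables can be derived and written back to S3 as Parquet. The join keys, column renames, and partition columns are assumptions based on the schema above, not a prescription:

```python
from pyspark.sql import functions as F

# time dimension: break start_time into the units listed above
time_table = (
    log_df.select("start_time").dropDuplicates()
    .withColumn("hour", F.hour("start_time"))
    .withColumn("day", F.dayofmonth("start_time"))
    .withColumn("week", F.weekofyear("start_time"))
    .withColumn("month", F.month("start_time"))
    .withColumn("year", F.year("start_time"))
    .withColumn("weekday", F.dayofweek("start_time"))
)

# songplays fact: match each NextSong event to a song by title, artist, and duration
songplays_table = (
    log_df.join(
        song_df,
        (log_df.song == song_df.title)
        & (log_df.artist == song_df.artist_name)
        & (log_df.length == song_df.duration),
        how="left",
    )
    .select(
        F.monotonically_increasing_id().alias("songplay_id"),
        log_df.start_time,
        log_df.userId.alias("user_id"),
        log_df.level,
        song_df.song_id,
        song_df.artist_id,
        log_df.sessionId.alias("session_id"),
        log_df.location,
        log_df.userAgent.alias("user_agent"),
    )
    .withColumn("year", F.year("start_time"))
    .withColumn("month", F.month("start_time"))
)

# Write the analytics tables back to S3 as partitioned Parquet
time_table.write.mode("overwrite").partitionBy("year", "month").parquet(output_data + "time/")
songplays_table.write.mode("overwrite").partitionBy("year", "month").parquet(output_data + "songplays/")
```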
| File / Folder | Description |
|---|---|
| static_resources | Folder at the root of the project containing static resources/images |
| etl.py | Reads data from S3, processes it using Spark, and writes the results back to S3 |
| dl.cfg | Sample configuration file for AWS |
| README | This readme file |
- First, provide the credentials to access AWS in `dl.cfg`:

```
AWS_ACCESS_KEY_ID='<YOUR_AWS_ACCESS_KEY>'
AWS_SECRET_ACCESS_KEY='<YOUR_AWS_SECRET_KEY>'
```
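A common way for `etl.py` to pick up these credentials is via `configparser`, exporting them as environment variables before the Spark session is created. The sketch below assumes `dl.cfg` has an `[AWS]` section header; adjust it to match the actual file layout.

```python
import os
import configparser

config = configparser.ConfigParser()
config.read("dl.cfg")

# The [AWS] section name is an assumption; match it to your dl.cfg
os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]
```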
- Set the S3 path for the output data in `etl.py`:

```python
output_data = ""
```
- If using local as the development environment, move the scripts from local to EMR:

```
scp -i <.pem-file> <Local-Path> <username>@<EMR-MasterNode-Endpoint>:~/<EMR-path>
```
- In order to run the script as a Spark job (note that `spark-submit` options go before the application file):

```
spark-submit --master yarn --deploy-mode client etl.py
```
NOTE: Before running the job, make sure the EMR role has access to S3.