A music streaming startup, Sparkify, has grown their user base and song database even more and want to move their data warehouse to a data lake. Their data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.
They would like a data engineer to build an ETL pipeline that extracts their data from S3, processes them using Spark, and loads the data back into S3 as a set of dimensional tables. This will allow their analytics team to continue finding insights in what songs their users are listening to.
In this project, the data engineer is required to build an ETL pipeline for a data lake hosted on S3: load the data from S3, process it into analytics tables using Spark, and write those tables back to S3. The Spark job needs to be deployed on a cluster on AWS (EMR).
We will be working with two datasets in this project, Song Data and Log Data, which reside as JSON files in AWS S3.
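At a high level, `etl.py` follows the structure sketched below. This skeleton is illustrative rather than the project's actual code: the helper names, placeholder bucket paths, and the hadoop-aws package version are assumptions.

```python
from pyspark.sql import SparkSession


def create_spark_session():
    """Build a Spark session able to read from and write to S3 (s3a://)."""
    return (
        SparkSession.builder
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0")
        .getOrCreate()
    )


def process_song_data(spark, input_data, output_data):
    """Extract the songs and artists dimension tables from the song dataset."""
    pass  # see the song-data sketch further below


def process_log_data(spark, input_data, output_data):
    """Extract the users, time, and songplays tables from the log dataset."""
    pass  # see the log-data sketch further below


def main():
    spark = create_spark_session()
    input_data = "s3a://<input-bucket>/"    # placeholder: source bucket
    output_data = "s3a://<output-bucket>/"  # placeholder: destination bucket

    process_song_data(spark, input_data, output_data)
    process_log_data(spark, input_data, output_data)


if __name__ == "__main__":
    main()
```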
The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID.
Below is an example of what a single song JSON file looks like:
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate app activity logs from an imaginary music streaming app based on configuration settings.
The log files in the dataset are partitioned by year and month.
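The log files can be read the same way and reduced to song-play events, which is the basis for the songplays and time tables described below. The path, wildcard depth, and timestamp handling shown are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Log files are partitioned by year and month (path is a placeholder)
log_df = spark.read.json("s3a://<input-bucket>/log_data/*/*/*.json")

# Keep only song-play events
log_df = log_df.filter(F.col("page") == "NextSong")

# Convert the epoch-millisecond 'ts' field into a proper timestamp column
log_df = log_df.withColumn("start_time", (F.col("ts") / 1000).cast("timestamp"))
```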
The database schema consists of one fact table and four dimension tables:
- songplay_table (fact) - records in event data associated with song plays, i.e. records with page `NextSong` - songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
- user_table (dimension) - users in the app - user_id, first_name, last_name, gender, level
- song_table (dimension) - songs in the music database - song_id, title, artist_id, year, duration
- artist_table (dimension) - artists in the music database - artist_id, name, location, latitude, longitude
- time_table (dimension) - timestamps of records in songplays broken down into specific units - start_time, hour, day, week, month, year, weekday
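Continuing the sketches above (with `song_df` and `log_df` as read there, and `output_data` as configured in the setup steps below), the time and songplays tables can be derived and written back to S3 as Parquet. The join keys, column renames, and partition columns are assumptions based on the schema above, not a prescription:

```python
from pyspark.sql import functions as F

# time dimension: break start_time into the units listed above
time_table = (
    log_df.select("start_time").dropDuplicates()
    .withColumn("hour", F.hour("start_time"))
    .withColumn("day", F.dayofmonth("start_time"))
    .withColumn("week", F.weekofyear("start_time"))
    .withColumn("month", F.month("start_time"))
    .withColumn("year", F.year("start_time"))
    .withColumn("weekday", F.dayofweek("start_time"))
)

# songplays fact: match each NextSong event to a song by title, artist, and duration
songplays_table = (
    log_df.join(
        song_df,
        (log_df.song == song_df.title)
        & (log_df.artist == song_df.artist_name)
        & (log_df.length == song_df.duration),
        how="left",
    )
    .select(
        F.monotonically_increasing_id().alias("songplay_id"),
        log_df.start_time,
        log_df.userId.alias("user_id"),
        log_df.level,
        song_df.song_id,
        song_df.artist_id,
        log_df.sessionId.alias("session_id"),
        log_df.location,
        log_df.userAgent.alias("user_agent"),
    )
    .withColumn("year", F.year("start_time"))
    .withColumn("month", F.month("start_time"))
)

# Write the analytics tables back to S3 as partitioned Parquet
time_table.write.mode("overwrite").partitionBy("year", "month").parquet(output_data + "time/")
songplays_table.write.mode("overwrite").partitionBy("year", "month").parquet(output_data + "songplays/")
```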
| File / Folder | Description |
|---|---|
| static_resources | Folder at the root of the project containing static resources/images |
| etl.py | Reads data from S3, processes it using Spark, and writes the results back to S3 |
| dl.cfg | Sample configuration file for AWS |
| README | This readme file |
- First, provide the credentials to access AWS in `dl.cfg`:

```
AWS_ACCESS_KEY_ID='<YOUR_AWS_ACCESS_KEY>'
AWS_SECRET_ACCESS_KEY='<YOUR_AWS_SECRET_KEY>'
```
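A common way for `etl.py` to pick up these credentials is via `configparser`, exporting them as environment variables before the Spark session is created. The sketch below assumes `dl.cfg` has an `[AWS]` section header; adjust it to match the actual file layout.

```python
import os
import configparser

config = configparser.ConfigParser()
config.read("dl.cfg")

# The [AWS] section name is an assumption; match it to your dl.cfg
os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]
```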
- Set the S3 path for the output data in `etl.py`:

```python
output_data = ""
```
- If using local as the development environment, move the scripts from local to EMR:

```
scp -i <.pem-file> <Local-Path> <username>@<EMR-MasterNode-Endpoint>:~/<EMR-path>
```
- In order to run the script as a Spark job (note that `spark-submit` options go before the application file):

```
spark-submit --master yarn --deploy-mode client etl.py
```
NOTE: Before running the job, make sure the EMR role has access to S3.