
# Data Lake Project

`Python3` `PySpark` `Udacity Project` `Data Engineering` `ETL` `AWS` `Git`



## About The Project

A music streaming startup, Sparkify, has grown their user base and song database even more and wants to move their data warehouse to a data lake. Their data resides in S3: a directory of JSON logs on user activity in the app, as well as a directory of JSON metadata on the songs in their app.

They would like a data engineer to build an ETL pipeline that extracts their data from S3, processes it using Spark, and loads the data back into S3 as a set of dimensional tables. This will allow their analytics team to continue finding insights into what songs their users are listening to.


## Project Description

In this project, the data engineer is required to build an ETL pipeline for a data lake hosted on S3: load the data from S3, process it into analytics tables using Spark, and load the tables back into S3. The Spark job needs to be deployed on an AWS EMR cluster.
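
At a high level, the pipeline is one script with a read/process/write stage per dataset. The skeleton below is an illustrative sketch of that shape; the function names and bucket paths are assumptions, not necessarily those in the actual `etl.py`:

```python
from pyspark.sql import SparkSession


def create_spark_session():
    # The hadoop-aws package provides the s3a:// filesystem connector.
    return (
        SparkSession.builder
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.4")
        .getOrCreate()
    )


def process_song_data(spark, input_data, output_data):
    # Read the song JSON files; write songs and artists tables back as parquet.
    ...


def process_log_data(spark, input_data, output_data):
    # Read the log JSON files; write users, time, and songplays tables as parquet.
    ...


def main():
    spark = create_spark_session()
    input_data = "s3a://<input-bucket>/"    # placeholder source bucket
    output_data = "s3a://<output-bucket>/"  # set to your own bucket
    process_song_data(spark, input_data, output_data)
    process_log_data(spark, input_data, output_data)


if __name__ == "__main__":
    main()
```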



## Project Datasets


In this project we will be working with two datasets, Song Data and Log Data, which reside as JSON files in AWS S3.

### Song Dataset

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID, e.g. `song_data/A/B/C/<track_id>.json`.

Below is an example of what a single song JSON file looks like:

{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

### Log Dataset

The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from an imaginary music streaming app, based on configuration settings.

The log files in the dataset are partitioned by year and month.


*(sample log_data records, shown as an image in static_resources)*
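
The log files can be read the same way. Below is a minimal sketch of filtering to song plays (the `NextSong` records that feed the fact table) and deriving a timestamp, with placeholder paths and the standard log field names assumed:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_unixtime

spark = SparkSession.builder.getOrCreate()

# The glob reflects the year/month partitioning described above.
log_data = "s3a://<input-bucket>/log_data/*/*/*.json"  # placeholder path
logs_df = spark.read.json(log_data)

# Only NextSong events represent actual song plays.
logs_df = logs_df.filter(col("page") == "NextSong")

# ts holds milliseconds since the epoch; derive a proper timestamp column.
logs_df = logs_df.withColumn(
    "start_time", from_unixtime(col("ts") / 1000).cast("timestamp")
)

# users dimension table: user_id, first_name, last_name, gender, level
users_table = logs_df.select(
    col("userId").alias("user_id"),
    col("firstName").alias("first_name"),
    col("lastName").alias("last_name"),
    "gender",
    "level",
).dropDuplicates(["user_id"])
```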

## Database Schema

The database schema consists of the following tables:

### Fact Table

- **songplay_table** - records in the event data associated with song plays, i.e. records with page `NextSong` - *songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent*

### Dimension Tables

- **user_table** - users in the app - *user_id, first_name, last_name, gender, level*
- **song_table** - songs in the music database - *song_id, title, artist_id, year, duration*
- **artist_table** - artists in the music database - *artist_id, name, location, latitude, longitude*
- **time_table** - timestamps of records in songplays broken down into specific units - *start_time, hour, day, week, month, year, weekday*
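
The `songplay_table` needs columns from both datasets, so building it involves joining the log events with the song metadata. A minimal sketch follows, assuming `logs_df` and `songs_df` from the earlier snippets; matching on song title and artist name is an assumption about the join keys:

```python
from pyspark.sql.functions import col, monotonically_increasing_id

# Join NextSong log events to song metadata on title and artist name.
songplays_table = (
    logs_df.join(
        songs_df,
        (logs_df.song == songs_df.title) & (logs_df.artist == songs_df.artist_name),
        how="left",
    )
    .select(
        "start_time",
        col("userId").alias("user_id"),
        "level",
        "song_id",
        "artist_id",
        col("sessionId").alias("session_id"),
        "location",
        col("userAgent").alias("user_agent"),
    )
    # Surrogate key for the fact table.
    .withColumn("songplay_id", monotonically_increasing_id())
)
```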



## Project Structure

| File / Folder | Description |
| --- | --- |
| static_resources | Folder at the root of the project, where static resources/images are kept |
| etl.py | Reads data from S3, processes it using Spark, and writes it back to S3 |
| dl.cfg | Sample configuration file for AWS |
| README | This readme file |


## Project Workflow

*(project workflow diagram, in static_resources)*



## Deployment


- First, provide the credentials for accessing the cluster in `dl.cfg` (a sketch of how the script might read these follows this list):

  ```
  AWS_ACCESS_KEY_ID='<YOUR_AWS_ACCESS_KEY>'
  AWS_SECRET_ACCESS_KEY='<YOUR_AWS_SECRET_KEY>'
  ```

- Set the S3 path for the output data in `etl.py`:

  ```python
  output_data = ""
  ```

- If using local as the development environment, copy the scripts from the local machine to the EMR master node:

  ```
  scp -i <.pem-file> <Local-Path> <username>@<EMR-MasterNode-Endpoint>:~/<EMR-path>
  ```

- Run the script as a Spark job (note that spark-submit options must come before the script name):

  ```
  spark-submit --master yarn --deploy-mode client etl.py
  ```

NOTE: Before running the job, make sure the EMR role has access to S3.
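
For reference, below is a minimal sketch of how `etl.py` could load the credentials from `dl.cfg` and expose them to Spark's S3A connector. The `[AWS]` section name is an assumption about the config layout, not necessarily how the actual script reads the file:

```python
import configparser
import os

config = configparser.ConfigParser()
config.read("dl.cfg")

# Export the credentials so Spark's S3A connector can pick them up.
# The [AWS] section header is an assumption; the values are stripped of
# the surrounding quotes shown in the sample dl.cfg above.
os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"].strip("'")
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"].strip("'")
```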




LinkedIn

Github


