A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app.
The analytics team is particularly interested in understanding what songs users are listening to.
We can extract insights such as:
- The most played songs in the entire app
- The top 5 artists in the app by song plays
- The day of the week on which people listen to songs the most
- and more...
The ETL extracts the data from the song and log files into a single read-only database that uses the star schema model for analysis. The database management system used in the ETL is PostgreSQL, a relational (SQL) database.
- To run the ETL, first create the database and the tables by executing the `create_tables.py` file.
- Now we can run `etl.py` to transfer our data from the song and log files into the database tables.

Important: update the database and account credentials in the files to your own credentials.

Congrats, you executed your ETL successfully!
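With the data loaded, the analytic questions above can be answered with plain SQL. For example, a minimal sketch of a "most played songs" query using psycopg2 (the host and credentials below are assumptions; use your own):

```python
import psycopg2

# Connect to the sparkifydb created by create_tables.py.
# Host and credentials are assumptions; replace with your own.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# The most played songs in the app: count plays per song in the fact table
cur.execute("""
    SELECT s.title, COUNT(*) AS plays
    FROM songplays sp
    JOIN songs s ON sp.song_id = s.song_id
    GROUP BY s.title
    ORDER BY plays DESC
    LIMIT 5;
""")
for title, plays in cur.fetchall():
    print(title, plays)

conn.close()
```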
The Sparkify schema is a star schema built for analysis; it can be updated only through the ETL, and users cannot edit it.
A star schema consists of fact and dimension tables:
- songplays (fact table): records in the log data associated with song plays, i.e. records with page `NextSong`.
  Members: songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
The fact table contains the business event under analysis, which here is the song-listening transaction. The fact table also contains foreign keys to the dimensions around it.
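To illustrate, here is a minimal sketch of the songplays DDL in the style of `sql_queries.py`; the exact types and constraints are assumptions and may differ from the project's actual queries:

```python
# Sketch of the songplays fact table: a surrogate key plus foreign keys
# to the four dimensions. Types/constraints are assumptions.
songplay_table_create = ("""
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id SERIAL PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL REFERENCES time (start_time),
        user_id     INT NOT NULL REFERENCES users (user_id),
        level       VARCHAR,
        song_id     VARCHAR REFERENCES songs (song_id),
        artist_id   VARCHAR REFERENCES artists (artist_id),
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
""")
```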
Dimensions are the business entities around the business event. We have 4 dimensions: songs, artists, users, and time.
- time: timestamps of records in songplays broken down into specific units.
  Members: start_time, hour, day, week, month, year, weekday
The time dimension helps us keep the data consistent across the schema and removes redundancies.
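As an illustration, a minimal sketch of how a log timestamp breaks down into these units, assuming pandas and that the log timestamps are Unix epoch values in milliseconds (as in the dataset):

```python
import pandas as pd

# Illustrative log timestamps: Unix epoch values in milliseconds
ts = pd.Series([1541106106796, 1541106352796])
t = pd.to_datetime(ts, unit="ms")

# Break each timestamp into the units stored in the time dimension
time_df = pd.DataFrame({
    "start_time": t,
    "hour": t.dt.hour,
    "day": t.dt.day,
    "week": t.dt.isocalendar().week,
    "month": t.dt.month,
    "year": t.dt.year,
    "weekday": t.dt.weekday,
})
print(time_df)
```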
- songs: songs in the music database.
  Members: song_id, title, artist_id, year, duration
- artists: artists in the music database.
  Members: artist_id, name, location, latitude, longitude
- users: users in the app.
  Members: user_id, first_name, last_name, gender, level
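Since a user's level (subscription tier) can change over time, the users insert is a natural place for an upsert. A sketch in the style of `sql_queries.py`; the `ON CONFLICT` handling is an assumption, not necessarily the project's exact query:

```python
# Sketch of a users upsert: re-inserting an existing user updates
# their level instead of failing. Conflict handling is an assumption.
user_table_insert = ("""
    INSERT INTO users (user_id, first_name, last_name, gender, level)
    VALUES (%s, %s, %s, %s, %s)
    ON CONFLICT (user_id) DO UPDATE SET level = EXCLUDED.level;
""")
```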
The `sql_queries.py` file contains all the create, drop, insert, and select queries that `create_tables.py` executes to create the database schema and that `etl.py` uses to integrate the data.
`create_tables.py` has 3 functions, plus a main function which executes them:
- `create_database()`
  - Creates and connects to the sparkifydb
  - Returns the connection and cursor to sparkifydb (sketched below)
- `drop_tables(cur, conn)`
  - Drops each table using the queries in the `drop_table_queries` list in `sql_queries.py`
- `create_tables(cur, conn)`
  - Creates each table using the queries in the `create_table_queries` list in `sql_queries.py`
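For reference, a minimal sketch of what `create_database()` can look like (the host and credentials are assumed defaults; replace them with your own, as noted above):

```python
import psycopg2

def create_database():
    """Drop and recreate sparkifydb; return a cursor and connection to it."""
    # Connect to a default database first so sparkifydb can be (re)created.
    # Host and credentials are assumptions; use your own.
    conn = psycopg2.connect("host=127.0.0.1 dbname=studentdb user=student password=student")
    conn.set_session(autocommit=True)
    cur = conn.cursor()

    cur.execute("DROP DATABASE IF EXISTS sparkifydb")
    cur.execute("CREATE DATABASE sparkifydb WITH ENCODING 'utf8' TEMPLATE template0")
    conn.close()

    # Reconnect, this time to the new sparkifydb
    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
    cur = conn.cursor()
    return cur, conn
```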
`etl.py` processes all the data files and runs the insert queries to load the data into the fact and dimension tables. It consists of 3 functions:
- `process_song_file(cur, filepath)`
  - Reads the song file at the given path
  - Inserts the song data and artist data into the songs and artists dimensions (see the sketch after this list)
- `process_log_file(cur, filepath)`
  - Reads the log file data at the given path, filtered by the `NextSong` page
  - Inserts the time records from the log files into the time dimension
  - Inserts the user records from the log files into the users dimension
  - Inserts the listening transactions into the songplays fact table
- `process_data(cur, conn, filepath, func)`
  - Gets all the files with the `.json` extension, which are the song files and log files, and applies the given processing function to each
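For reference, a minimal sketch of `process_song_file` and `process_data` (assuming pandas, the field names used in the song files, and the insert statements from `sql_queries.py`; the actual project code may differ):

```python
import glob
import os

import pandas as pd

from sql_queries import song_table_insert, artist_table_insert

def process_song_file(cur, filepath):
    """Read one song JSON file and insert its song and artist records."""
    df = pd.read_json(filepath, lines=True)

    song_data = df[["song_id", "title", "artist_id", "year", "duration"]].values[0].tolist()
    cur.execute(song_table_insert, song_data)

    artist_data = df[["artist_id", "artist_name", "artist_location",
                      "artist_latitude", "artist_longitude"]].values[0].tolist()
    cur.execute(artist_table_insert, artist_data)

def process_data(cur, conn, filepath, func):
    """Collect every .json file under filepath and apply func to each."""
    all_files = []
    for root, _, _ in os.walk(filepath):
        all_files.extend(os.path.abspath(f)
                         for f in glob.glob(os.path.join(root, "*.json")))

    print(f"{len(all_files)} files found in {filepath}")

    for i, datafile in enumerate(all_files, 1):
        func(cur, datafile)
        conn.commit()
        print(f"{i}/{len(all_files)} files processed.")
```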
This Jupyter notebook was used during development to get the data from the first file only and to ensure the data is valid for insertion.
This notebook contains test queries and sanity tests for checking the database schema and its quality. It uses SQL magic to execute its queries.
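For example, checks of this kind can be run in the notebook with the `ipython-sql` magic (the connection string below is an assumption; use your own credentials):

```
%load_ext sql
%sql postgresql://student:student@127.0.0.1/sparkifydb
%sql SELECT COUNT(*) FROM songplays;
%sql SELECT * FROM songplays LIMIT 5;
```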
The data folder contains the log files in JSON format and the song files; each song file contains the data for one song.
Create a dashboard for analytic queries on the sparkify database in Microsoft Power BI.