A startup named Sparkify wants to analyze user activity using its song and user data. That data is currently spread across several JSON files, making it hard to query and analyze.
This project builds an ETL pipeline that loads the song and user data into a Postgres database, where it is much easier to query and analyze.
Data is currently collected for songs and user activities as JSON files in two directories: `data/song_data` and `data/log_data`.
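For orientation, the layout looks roughly like this (the individual file names below are hypothetical examples, not taken from the project):

```
data/
├── log_data/
│   └── 2018/11/2018-11-01-events.json
└── song_data/
    └── A/A/A/TRAAAAW128F429D538.json
```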
A sample record from `song_data`:

```json
{
  "num_songs": 1,
  "artist_id": "ARGSJW91187B9B1D6B",
  "artist_latitude": 35.21962,
  "artist_longitude": -80.01955,
  "artist_location": "North Carolina",
  "artist_name": "JennyAnyKind",
  "song_id": "SOQHXMF12AB0182363",
  "title": "Young Boy Blues",
  "duration": 218.77506,
  "year": 0
}
```
A sample record from `log_data`:

```json
{
  "artist": "Survivor",
  "auth": "Logged In",
  "firstName": "Jayden",
  "gender": "M",
  "itemInSession": 0,
  "lastName": "Fox",
  "length": 245.36771,
  "level": "free",
  "location": "New Orleans-Metairie, LA",
  "method": "PUT",
  "page": "NextSong",
  "registration": 1541033612796,
  "sessionId": 100,
  "song": "Eye Of The Tiger",
  "status": 200,
  "ts": 1541110994796,
  "userAgent": "\"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36\"",
  "userId": "101"
}
```
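As an illustration of how these files can be read, here is a minimal sketch, assuming pandas is available (the file path is a hypothetical example, and this is not the project's actual `etl.py`), that loads one log file, keeps only `NextSong` records, and converts the millisecond `ts` field to a timestamp:

```python
import pandas as pd

# Each line of a log file is a separate JSON record.
df = pd.read_json("data/log_data/2018/11/2018-11-01-events.json", lines=True)

# Keep only the records that represent song plays.
df = df[df["page"] == "NextSong"]

# ts holds milliseconds since the epoch; convert it to a timestamp.
df["start_time"] = pd.to_datetime(df["ts"], unit="ms")

print(df[["userId", "song", "start_time"]].head())
```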
`songplays`: records in the log data associated with song plays, i.e. records with `page` set to `NextSong`.
Column | Type | Nullable |
---|---|---|
songplay_id | integer | not null |
start_time | timestamp without time zone | not null |
user_id | integer | not null |
level | character varying | not null |
song_id | character varying(18) | |
artist_id | character varying(18) | |
session_id | integer | not null |
location | character varying | not null |
user_agent | character varying | not null |
Primary key: songplay_id
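For reference, here is a minimal sketch of the DDL these columns imply, written as a query string in the style a Python script could use (using `SERIAL` to generate `songplay_id` is an assumption; the project's actual DDL may differ):

```python
# Sketch of the songplays DDL implied by the table above.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INTEGER NOT NULL,
    level       VARCHAR NOT NULL,
    song_id     VARCHAR(18),
    artist_id   VARCHAR(18),
    session_id  INTEGER NOT NULL,
    location    VARCHAR NOT NULL,
    user_agent  VARCHAR NOT NULL
);
"""
```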
`users`: users in the app.
Column | Type | Nullable |
---|---|---|
user_id | integer | not null |
first_name | character varying | not null |
last_name | character varying | not null |
gender | character(1) | not null |
level | character varying | not null |
Primary key: user_id
`songs`: songs in the music database.
Column | Type | Nullable |
---|---|---|
song_id | character varying(18) | not null |
title | character varying | not null |
artist_id | character varying(18) | not null |
year | integer | not null |
duration | double precision | not null |
Primary key: song_id
`artists`: artists in the music database.
Column | Type | Nullable |
---|---|---|
artist_id | character varying(18) | not null |
name | character varying | not null |
location | character varying | not null |
latitude | double precision | |
longitude | double precision | |
Primary key: artist_id
`time`: timestamps of records in `songplays`, broken down into specific units.
Column | Type | Nullable |
---|---|---|
start_time | timestamp without time zone | not null |
hour | integer | not null |
day | integer | not null |
week | integer | not null |
month | integer | not null |
year | integer | not null |
weekday | integer | not null |
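Together these tables effectively form a star schema with `songplays` as the fact table. As a sketch of the kind of analysis this enables (the query is a hypothetical example; the credentials match the local setup described later in this README):

```python
import psycopg2

# Count paid song plays per hour by joining the fact table
# to the time dimension.
conn = psycopg2.connect(
    "host=localhost dbname=sparkifydb user=student password=student"
)
cur = conn.cursor()
cur.execute("""
    SELECT t.hour, COUNT(*) AS plays
    FROM songplays sp
    JOIN time t ON sp.start_time = t.start_time
    WHERE sp.level = 'paid'
    GROUP BY t.hour
    ORDER BY plays DESC;
""")
for hour, plays in cur.fetchall():
    print(f"hour {hour:02d}: {plays} plays")
conn.close()
```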
Prerequisites:
- Python 3
- pipenv
- pyenv (optional)
- PostgreSQL Database
To install the project's Python dependencies, run:

```
pipenv install
```
The database can be installed locally or run using Docker, which is the preferred method.
To run Postgres using Docker:

```
docker run --net=host --name postgres -e POSTGRES_PASSWORD=your_password -d postgres
```
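Note that `--net=host` generally only works on Linux hosts; on Docker Desktop for macOS or Windows, publishing the port explicitly is a common alternative (this variant is an assumption, not part of the original setup):

```
docker run -p 5432:5432 --name postgres -e POSTGRES_PASSWORD=your_password -d postgres
```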
To access the database for the first time, run:

```
psql -h localhost -U postgres
```
Then run the following commands under `psql` to set up user access and create the initial `sparkifydb` database:

```sql
CREATE ROLE student WITH ENCRYPTED PASSWORD 'student';
ALTER ROLE student WITH LOGIN;
ALTER ROLE student CREATEDB;
CREATE DATABASE sparkifydb OWNER student;
```
To run the project locally, use pipenv to activate the virtual environment:

```
pipenv shell
```

Then run the script that creates the database tables:

```
./create_tables.py
```

and the script that populates them with the JSON data:

```
./etl.py
```
The loaded data can be verified using the provided `test.ipynb` Jupyter notebook:

```
jupyter notebook
```
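Alternatively, a quick sanity check can be scripted outside the notebook; this sketch (hypothetical, not part of the project files) prints the row count of each table after `etl.py` has run:

```python
import psycopg2

conn = psycopg2.connect(
    "host=localhost dbname=sparkifydb user=student password=student"
)
cur = conn.cursor()
# Table names come from the schema described above.
for table in ("songplays", "users", "songs", "artists", "time"):
    cur.execute(f"SELECT COUNT(*) FROM {table};")
    print(table, cur.fetchone()[0])
conn.close()
```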