This project includes a simple toolkit that builds a star schema and answers analytics questions for the Sparkify company.
The analytical database is built with PostgreSQL from JSON files containing song data and user-interaction logs.
The destination database is based on a star schema. The following diagram describes the dimension and fact tables:
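As a rough reference, the sketch below shows the kind of DDL such a star schema implies; the table and column names are illustrative assumptions, not taken from the diagram. In this project, statements like these would live in `sql_queries.py` as plain strings.

```python
# Illustrative sketch only: assumed table and column names.

songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,  -- fact table: one row per play event
    start_time  TIMESTAMP,
    user_id     INT,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR
);
"""

user_table_create = """
CREATE TABLE IF NOT EXISTS users (
    user_id    INT PRIMARY KEY,      -- dimension table keyed by user
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     VARCHAR,
    level      VARCHAR
);
"""
```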
The ETL process uses two data sources: the first contains the song data, and the second contains log data with the user interactions. Both sources are flat files in JSON format, located in separate directories inside the `data` directory.
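As an illustration, one way to walk both directories and load the files with pandas might look like the sketch below. The paths and the one-object-per-line file layout are assumptions about the dataset, not the actual `etl.py` code:

```python
import glob
import os

import pandas as pd

def get_json_files(base_dir):
    """Collect the paths of every JSON file nested under base_dir."""
    files = []
    for root, _dirs, _names in os.walk(base_dir):
        files.extend(glob.glob(os.path.join(root, "*.json")))
    return files

# Assumed layout: data/song_data and data/log_data.
song_files = get_json_files(os.path.join("data", "song_data"))
log_files = get_json_files(os.path.join("data", "log_data"))

# Assuming each file stores one JSON object per line, lines=True reads both.
songs_df = pd.concat((pd.read_json(f, lines=True) for f in song_files), ignore_index=True)
logs_df = pd.concat((pd.read_json(f, lines=True) for f in log_files), ignore_index=True)
```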
The toolkit includes the following Python scripts and assets:

- `create_tables.py`: Creates the fact and dimension tables, based on an auxiliary module that contains the SQL statements.
- `etl.py`: Reads the JSON files, splits the data by schema, and loads it into the fact and dimension tables.
- `sql_queries.py`: Module with the DDL and DML statements.
- `test.ipynb`: A Python notebook that connects to the database and quickly verifies the data populated in the tables.
- `etl.ipynb`: A Python notebook for interacting with the raw data.
- `data`: Directory with the flat files in JSON format to process.
Any schema change can be made in the `sql_queries.py` module.
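For instance, a hypothetical change could pair an updated CREATE statement with its matching INSERT; the extra `country` column below is invented purely for illustration:

```python
# Hypothetical example: adding a column means touching both the DDL string
# and the DML string that the ETL executes with psycopg2 placeholders.

user_table_create = """
CREATE TABLE IF NOT EXISTS users (
    user_id    INT PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     VARCHAR,
    level      VARCHAR,
    country    VARCHAR  -- hypothetical new column
);
"""

user_table_insert = """
INSERT INTO users (user_id, first_name, last_name, gender, level, country)
VALUES (%s, %s, %s, %s, %s, %s)
ON CONFLICT (user_id) DO UPDATE SET level = EXCLUDED.level;
"""
```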
- A PostgreSQL server must be provisioned beforehand, along with credentials to connect to it (see the connection sketch after this list).
- An environment with Python 3 and the following packages installed:
  - psycopg2
  - pandas
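A quick way to confirm the server and credentials before running the scripts is a short connection check like the one below; the host, database name, and credentials are placeholder assumptions to replace with your own:

```python
import psycopg2

# Placeholder parameters: replace with your own server details.
conn = psycopg2.connect(
    host="127.0.0.1",
    dbname="sparkifydb",
    user="student",
    password="student",
)
print(conn.get_dsn_parameters())  # shows the resolved connection settings
conn.close()
```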
1. Execute the script `create_tables.py` with the following command:

   ```
   python create_tables.py
   ```

2. If step one runs without failures, launch the script `etl.py` with the following command:

   ```
   python etl.py
   ```
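Once both scripts finish, the tables can be sanity-checked from `test.ipynb` or a plain Python shell, for example with a row count like the sketch below; the connection string and table name are assumptions for illustration:

```python
import psycopg2

conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# Assumed fact table name; use any table your schema defines.
cur.execute("SELECT COUNT(*) FROM songplays;")
print("rows loaded:", cur.fetchone()[0])

cur.close()
conn.close()
```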