
Spotify Data Engineering Project README

Overview

This project is a data engineering pipeline on Amazon Web Services (AWS) that processes Spotify data. CSV files containing information about artists, tracks, and albums are loaded into an S3 bucket, joined and transformed by an AWS Glue ETL (Extract, Transform, Load) job, and stored as Parquet files. The processed data is then queried with Amazon Athena and visualized in Power BI.

Components

1. Data Source

CSV Files: Manually upload CSV files containing information about artists, tracks, and albums to an S3 bucket.
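
The upload is described as a manual step, but it can also be scripted. The following is a minimal sketch using boto3, not part of the project itself. The bucket name, the `staging` prefix, and the one-folder-per-file key layout are assumptions for illustration; only the three CSV file names come from the repository.

```python
from pathlib import Path

# Hypothetical names: replace with your own bucket and prefix.
RAW_BUCKET = "my-spotify-raw-bucket"
RAW_PREFIX = "staging"

# CSV files shipped with the repository.
CSV_FILES = ["artist.csv", "albums.csv", "tracks.csv"]


def object_key(filename: str, prefix: str = RAW_PREFIX) -> str:
    """Build an S3 key that puts each CSV under its own folder,
    e.g. 'artist.csv' -> 'staging/artist/artist.csv'."""
    path = Path(filename)
    return f"{prefix}/{path.stem}/{path.name}"


def upload_csv_files(local_dir: str, bucket: str = RAW_BUCKET) -> None:
    """Upload the CSV files from local_dir to S3 (needs AWS credentials)."""
    import boto3  # imported lazily so object_key works without boto3

    s3 = boto3.client("s3")
    for name in CSV_FILES:
        s3.upload_file(str(Path(local_dir) / name), bucket, object_key(name))


if __name__ == "__main__":
    upload_csv_files(".")
```

Keeping each file under its own prefix is one possible layout; a single flat folder works just as well if the ETL job reads each file by its full path.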

2. AWS Services

Amazon S3 (Simple Storage Service): Stores the raw CSV files and the Parquet output of the ETL job.
AWS Glue: Used for ETL operations.
- ETL Script: Performs inner joins on artist, track, and album data.
- Target Data Format: Parquet files.
- IAM Role: The Glue ETL job needs a role with permission to read the raw CSV files from S3 and write the Parquet output to S3.
AWS Glue Crawler: Used to infer the schema of the Parquet files and create metadata tables.
Amazon Athena: Used for querying the data stored in the Parquet files.
- SQL Queries: Run SQL queries to analyze the data.
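
To make the inner-join step concrete, here is a small sketch in plain Python. It is not the project's Glue script: it only illustrates the join logic on in-memory CSV text, and it does not write Parquet. The column names (`artist_id`, `album_id`, and so on) and the sample rows are hypothetical, since the README does not list the actual schema.

```python
import csv
import io


def read_rows(csv_text: str) -> list[dict]:
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def inner_join(left: list[dict], right: list[dict], key: str) -> list[dict]:
    """Inner join two row lists on a shared key column.
    Rows without a match on the other side are dropped."""
    index: dict[str, list[dict]] = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for left_row in left:
        for right_row in index.get(left_row[key], []):
            # On duplicate column names, the left row's value wins.
            joined.append({**right_row, **left_row})
    return joined


def join_spotify(albums_csv: str, artists_csv: str, tracks_csv: str) -> list[dict]:
    """Join albums to artists, then join the result to tracks."""
    album_artist = inner_join(read_rows(albums_csv), read_rows(artists_csv), "artist_id")
    return inner_join(album_artist, read_rows(tracks_csv), "album_id")


# Hypothetical sample data; the real files have their own columns.
ALBUMS = "album_id,album_name,artist_id\nal1,First Album,ar1\nal2,Orphan Album,ar9\n"
ARTISTS = "artist_id,artist_name\nar1,Some Artist\n"
TRACKS = "track_id,track_name,album_id\nt1,Song A,al1\nt2,Song B,al1\nt3,Song C,al3\n"

if __name__ == "__main__":
    for row in join_spotify(ALBUMS, ARTISTS, TRACKS):
        print(row["track_name"], row["album_name"], row["artist_name"])
```

Because both joins are inner joins, the album with no matching artist and the track with no matching album are both absent from the result. That behavior is worth keeping in mind when row counts in Athena are lower than in the source files.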

3. Visualization

Power BI: Connect Power BI to Amazon Athena to create dashboards and visualize the data.

Setup Instructions

Upload CSV Files to S3: Manually upload CSV files containing artist, track, and album data to the specified S3 bucket.
AWS Glue ETL Job:
- Create an ETL job in AWS Glue.
- Write a script to perform inner joins on the artist, track, and album data.
- Save the transformed data as Parquet files in another S3 bucket.
- Make sure to assign the appropriate IAM role to the Glue job.
AWS Glue Crawler:
- Create a crawler to infer the schema of the Parquet files generated by the ETL job.
- Run the crawler to create metadata tables.
Amazon Athena:
- In Amazon Athena, select the database and tables that the crawler created for the Parquet files, and set an S3 location for query results.
- Run SQL queries to analyze the data and gain insights.
Power BI Integration:
- Connect Power BI to Amazon Athena as a data source.
- Create visualizations and dashboards to represent the analyzed Spotify data.
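
The Athena step can also be run from code rather than the console. The sketch below uses the boto3 Athena client to submit a query and poll until it finishes. The database name, table name, and results location are hypothetical placeholders; use the names that your crawler and Athena settings actually produce.

```python
import time

# Hypothetical names: replace with your own database, table, and bucket.
DATABASE = "spotify_db"
OUTPUT_LOCATION = "s3://my-athena-results-bucket/output/"
QUERY = "SELECT COUNT(*) AS row_count FROM spotify_parquet"


def query_request(sql: str, database: str = DATABASE,
                  output: str = OUTPUT_LOCATION) -> dict:
    """Build the keyword arguments for Athena's start_query_execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output},
    }


def run_query(sql: str, poll_seconds: float = 1.0) -> list:
    """Submit a query, wait for it to finish, and return the result rows."""
    import boto3  # imported lazily so query_request works without boto3

    athena = boto3.client("athena")
    query_id = athena.start_query_execution(**query_request(sql))["QueryExecutionId"]
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]


if __name__ == "__main__":
    for row in run_query(QUERY):
        print(row)
```

Athena runs queries asynchronously, which is why the sketch polls for the final state before fetching results.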

Notes

Ensure proper IAM roles and permissions are set up for each AWS service to access S3, Glue, and Athena.
Monitor AWS costs, especially related to data storage and query execution.
Update the ETL script and re-run the crawler when the data structure or requirements change.
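
As an illustration of the IAM note above, the following sketch builds a policy document that lets a Glue job role read from one bucket and write to another. Both bucket names are hypothetical, and the exact set of actions is an assumption for a starting point rather than the project's actual policy. A Glue job role usually also needs the Glue service permissions themselves (for example through the AWS managed policy `AWSGlueServiceRole`).

```python
import json

# Hypothetical bucket names: replace with your own.
RAW_BUCKET = "my-spotify-raw-bucket"
PROCESSED_BUCKET = "my-spotify-processed-bucket"


def glue_s3_policy(raw_bucket: str, processed_bucket: str) -> dict:
    """Return an IAM policy document granting read access to the raw
    bucket and write access to the processed bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadRawCsv",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{raw_bucket}",
                    f"arn:aws:s3:::{raw_bucket}/*",
                ],
            },
            {
                "Sid": "WriteParquet",
                "Effect": "Allow",
                # DeleteObject is included on the assumption that the job
                # may need to clean up temporary output objects.
                "Action": ["s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"arn:aws:s3:::{processed_bucket}/*"],
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(glue_s3_policy(RAW_BUCKET, PROCESSED_BUCKET), indent=2))
```

Scoping each statement to a single bucket keeps the role close to least privilege; widen it only if the job reports access errors.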

Contact

For questions or feedback, please contact riteshojha2002@gmail.com.
