This project is a data engineering pipeline built on Amazon Web Services (AWS) for processing Spotify data. The pipeline loads CSV files containing artist, track, and album information into an S3 bucket, runs an ETL (Extract, Transform, Load) job in AWS Glue, stores the processed data as Parquet files, and makes the result available for querying in Amazon Athena and visualization in Power BI.
- CSV Files: Manually upload CSV files containing information about artists, tracks, and albums to an S3 bucket.
- Amazon S3 (Simple Storage Service): Used for storing the raw CSV files.
- AWS Glue: Used for ETL operations.
  - ETL Script: Performs inner joins on artist, track, and album data.
  - Target Data Format: Parquet files.
  - IAM Role: The Glue ETL job needs permission to read the raw CSV files from S3 and to write the Parquet output back to S3.
- AWS Glue Crawler: Used to infer the schema of the Parquet files and create metadata tables.
- Amazon Athena: Used for querying the data stored in the Parquet files.
  - SQL Queries: Run SQL queries to analyze the data.
- Power BI: Connect Power BI to Amazon Athena to create dashboards and visualize the data.
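
To make the "ETL Script" item above concrete, here is a minimal sketch of what two chained inner joins do to the three datasets. It uses plain Python rather than the Glue or Spark APIs, so it only illustrates the join semantics, not the actual job. The column names (`artist_id`, `id`, `track_id`, and so on) and the sample rows are assumptions for illustration; the real CSV files may use different names.

```python
import csv
import io


def inner_join(left, right, left_key, right_key):
    """Inner join two lists of dict rows, keeping only rows whose keys match."""
    # Index the right-hand rows by key so each left row is matched in O(1).
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)

    joined = []
    for row in left:
        for match in index.get(row[left_key], []):
            # On a column-name clash, the left-hand value wins.
            joined.append({**match, **row})
    return joined


# Hypothetical sample data standing in for the uploaded CSV files.
albums_csv = (
    "album_id,album_name,artist_id,track_id\n"
    "al1,First Album,ar1,t1\n"
    "al2,Second Album,ar9,t2\n"
)
artists_csv = "id,name\nar1,Example Artist\n"
tracks_csv = "track_id,track_popularity\nt1,50\nt2,70\n"

albums = list(csv.DictReader(io.StringIO(albums_csv)))
artists = list(csv.DictReader(io.StringIO(artists_csv)))
tracks = list(csv.DictReader(io.StringIO(tracks_csv)))

# Join albums to artists, then join that result to tracks.
album_artist = inner_join(albums, artists, "artist_id", "id")
result = inner_join(album_artist, tracks, "track_id", "track_id")

for row in result:
    print(row["album_name"], row["name"], row["track_popularity"])
```

Because both joins are inner joins, the second album is dropped: its `artist_id` has no matching artist row. The same effect applies in the Glue job, so unmatched records will not appear in the Parquet output.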
- Upload CSV Files to S3: Manually upload CSV files containing artist, track, and album data to the specified S3 bucket.
- AWS Glue ETL Job:
  - Create an ETL job in AWS Glue.
  - Write a script to perform inner joins on the artist, track, and album data.
  - Save the transformed data as Parquet files in another S3 bucket.
  - Make sure to assign the appropriate IAM role to the Glue job.
- AWS Glue Crawler:
  - Create a crawler to infer the schema of the Parquet files generated by the ETL job.
  - Run the crawler to create metadata tables.
- Amazon Athena:
  - Point Amazon Athena at the metadata tables created by the crawler, which reference the S3 bucket where the Parquet files are stored.
  - Run SQL queries to analyze the data and gain insights.
- Power BI Integration:
  - Connect Power BI to Amazon Athena as a data source.
  - Create visualizations and dashboards to represent the analyzed Spotify data.
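
The Athena step above can also be scripted instead of run from the console. The following is a minimal sketch, assuming a boto3 Athena client is passed in; it starts a query and polls until the query reaches a terminal state. The table name `spotify_joined`, its columns, the database name, and the results location are hypothetical placeholders, not names taken from this project.

```python
import time

# Hypothetical query; the table and column names are assumptions.
EXAMPLE_SQL = """
SELECT name AS artist_name, COUNT(*) AS track_count
FROM spotify_joined
GROUP BY name
ORDER BY track_count DESC
LIMIT 10
"""


def run_athena_query(client, sql, database, output_location, poll_seconds=1.0):
    """Start an Athena query and wait until it succeeds, fails, or is cancelled.

    `client` is expected to behave like boto3.client("athena").
    Returns a (query_execution_id, final_state) tuple.
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    while True:
        status = client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)


# Example usage (requires boto3 and AWS credentials, so it is left commented out):
#   import boto3
#   athena = boto3.client("athena")
#   query_id, state = run_athena_query(
#       athena, EXAMPLE_SQL, "example_database", "s3://example-athena-results/"
#   )
```

Passing the client in as an argument keeps the function easy to test with a stub and avoids hard-coding a region or credentials.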
- Ensure proper IAM roles and permissions are set up for each AWS service to access S3, Glue, and Athena.
- Monitor AWS costs, especially related to data storage and query execution.
- Regularly update the pipeline to accommodate changes in data structure or requirements.
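
As an illustration of the IAM note above, the sketch below builds a least-privilege S3 policy document for the Glue job: list access on both buckets, read access on the raw bucket, and write access on the processed bucket. The bucket names are hypothetical, and this covers only S3 object access. A real Glue role will likely need more, for example the Glue service permissions (such as the AWS managed `AWSGlueServiceRole` policy) and possibly `s3:DeleteObject` on the output bucket.

```python
import json


def glue_s3_policy(raw_bucket, processed_bucket):
    """Build an IAM policy document granting S3 access for the ETL job."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself, not to its objects.
                "Sid": "ListBuckets",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{raw_bucket}",
                    f"arn:aws:s3:::{processed_bucket}",
                ],
            },
            {
                # Read the raw CSV files.
                "Sid": "ReadRawCsv",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{raw_bucket}/*"],
            },
            {
                # Write the Parquet output.
                "Sid": "WriteParquet",
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{processed_bucket}/*"],
            },
        ],
    }


# Hypothetical bucket names, used only for illustration.
policy = glue_s3_policy("example-raw-bucket", "example-processed-bucket")
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM console as an inline policy or adapted for infrastructure-as-code tooling.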
For questions or feedback, please contact riteshojha2002@gmail.com.