This project showcases the development and execution of a data engineering and analytics pipeline using the IPL (Indian Premier League) dataset. The primary objective is to demonstrate data storage, transformation, analysis, and visualization using Amazon S3, Apache Spark, Databricks, SQL, and Python.
- Data Storage 📦
- Data Transformation and Analysis 🔄
- Data Visualization 📊
  - Matplotlib and Seaborn: Used for creating insightful visualizations of the analyzed data.
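As an illustration of the visualization step, here is a minimal Matplotlib sketch that charts per-season run totals. The season labels and run values are made up for illustration, not taken from the dataset:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs headless
import matplotlib.pyplot as plt

# Hypothetical per-season aggregates (illustrative values only)
seasons = ["2014", "2015", "2016", "2017"]
total_runs = [21000, 22000, 23500, 24500]

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(seasons, total_runs, color="steelblue")
ax.set_xlabel("Season")
ax.set_ylabel("Total runs")
ax.set_title("Total runs per IPL season (illustrative data)")
fig.tight_layout()
fig.savefig("runs_per_season.png")
```

In a Databricks notebook the figure renders inline; Seaborn (e.g. `sns.barplot`) offers a higher-level API on top of the same Matplotlib figure.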
The IPL dataset used in this project contains ball-by-ball data of all IPL matches up to the 2017 season. The dataset is sourced from data.world and can be accessed here.
- AWS account with access to S3.
- Databricks account.
- Python 3.x installed.
- Clone the repository:

  git clone https://github.com/mayurasandakalum/ipl-data-engineering-spark-databricks.git
- Navigate to the project directory:

  cd ipl-data-engineering-spark-databricks
- Upload the IPL dataset files to your S3 bucket.
- Ensure the bucket is publicly accessible if required.
- Set up Databricks and attach your notebook to a running cluster.
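The upload step above can be done from the AWS CLI; a one-line sketch, where the local path and bucket name are placeholders to replace with your own:

```shell
# Copy the dataset files into your bucket (placeholders: ./data/ and your-bucket-name)
aws s3 cp ./data/ s3://your-bucket-name/ --recursive
```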
Data Transformation Code:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, avg

spark = SparkSession.builder.appName("IPL Data Analysis").getOrCreate()

# Read data from S3
df = spark.read.csv("s3://your-bucket-name/ipl-data.csv", header=True, inferSchema=True)

# Data transformation: total and average runs per season
transformed_df = df.groupBy("season").agg(
    sum("runs").alias("total_runs"),
    avg("runs").alias("avg_runs"),
)
transformed_df.show()
```
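The `groupBy`/`agg` step computes the sum and mean of runs per season. A plain-Python sketch of the same logic on a hypothetical mini-dataset makes the semantics concrete:

```python
from collections import defaultdict

# Hypothetical ball-by-ball rows: (season, runs) -- illustrative values only
rows = [(2016, 4), (2016, 6), (2017, 1), (2017, 2), (2017, 0)]

# Group runs by season, like df.groupBy("season")
by_season = defaultdict(list)
for season, runs in rows:
    by_season[season].append(runs)

# Aggregate, like .agg(sum(...), avg(...))
summary = {
    season: {"total_runs": sum(r), "avg_runs": sum(r) / len(r)}
    for season, r in by_season.items()
}
print(summary)
# → {2016: {'total_runs': 10, 'avg_runs': 5.0}, 2017: {'total_runs': 3, 'avg_runs': 1.0}}
```

Spark performs the same grouping and aggregation, but distributed across the cluster's partitions.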
SQL Queries:

```sql
SELECT player_name, SUM(runs) AS total_runs
FROM ball_by_ball
GROUP BY player_name
ORDER BY total_runs DESC
LIMIT 10;
```
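The query is portable standard SQL, so it can be tried without Databricks. A sketch using an in-memory SQLite table, where the player names and run values are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ball_by_ball (player_name TEXT, runs INTEGER)")
conn.executemany(
    "INSERT INTO ball_by_ball VALUES (?, ?)",
    [("Kohli", 4), ("Kohli", 6), ("Dhoni", 6), ("Dhoni", 1), ("Raina", 2)],
)

# Same top-batsmen query as above
top = conn.execute(
    """SELECT player_name, SUM(runs) AS total_runs
       FROM ball_by_ball
       GROUP BY player_name
       ORDER BY total_runs DESC
       LIMIT 10"""
).fetchall()
print(top)
# → [('Kohli', 10), ('Dhoni', 7), ('Raina', 2)]
```

In Databricks, the same statement runs against a table or temp view registered from the Spark DataFrame.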
Feel free to fork this repository, make enhancements, and submit pull requests. Your contributions are welcome!
Special thanks to the creators of the datasets and the open-source tools used in this project.