This project showcases the development and execution of a data engineering and analytics pipeline using the IPL (Indian Premier League) dataset. The primary objective is to demonstrate data storage, transformation, analysis, and visualization using Amazon S3, Apache Spark, Databricks, SQL, and Python.
- Data Storage 📦
- Data Transformation and Analysis 🔄
- Data Visualization 📊
  - Matplotlib and Seaborn: Used for creating insightful visualizations of the analyzed data.
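As an illustration of the visualization step, here is a minimal Matplotlib sketch that charts per-season run totals. The season labels and run values are made up for illustration, not taken from the dataset:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs headless
import matplotlib.pyplot as plt

# Hypothetical per-season aggregates (illustrative values only)
seasons = ["2014", "2015", "2016", "2017"]
total_runs = [21000, 22000, 23500, 24500]

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(seasons, total_runs, color="steelblue")
ax.set_xlabel("Season")
ax.set_ylabel("Total runs")
ax.set_title("Total runs per IPL season (illustrative data)")
fig.tight_layout()
fig.savefig("runs_per_season.png")
```

In a Databricks notebook the figure renders inline; Seaborn (e.g. `sns.barplot`) offers a higher-level API on top of the same Matplotlib figure.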
The IPL dataset used in this project contains ball-by-ball data of all IPL matches up to the 2017 season. The dataset is sourced from data.world and can be accessed here.
- AWS account with access to S3.
- Databricks account.
- Python 3.x installed.
- Clone the repository:

  git clone https://github.com/mayurasandakalum/ipl-data-engineering-spark-databricks.git
- Navigate to the project directory:

  cd ipl-data-engineering-spark-databricks
- Upload the IPL dataset files to your S3 bucket.
- Ensure the bucket is publicly accessible if required.
- Set up Databricks and attach your notebook to a running cluster.
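The upload step above can be done from the AWS CLI; a one-line sketch, where the local path and bucket name are placeholders to replace with your own:

```shell
# Copy the dataset files into your bucket (placeholders: ./data/ and your-bucket-name)
aws s3 cp ./data/ s3://your-bucket-name/ --recursive
```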
Data Transformation Code:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, avg

spark = SparkSession.builder.appName("IPL Data Analysis").getOrCreate()

# Read data from S3
df = spark.read.csv("s3://your-bucket-name/ipl-data.csv", header=True, inferSchema=True)

# Data transformation: total and average runs per season
transformed_df = df.groupBy("season").agg(
    sum("runs").alias("total_runs"),
    avg("runs").alias("avg_runs"),
)
transformed_df.show()
```
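The `groupBy`/`agg` step computes the sum and mean of runs per season. A plain-Python sketch of the same logic on a hypothetical mini-dataset makes the semantics concrete:

```python
from collections import defaultdict

# Hypothetical ball-by-ball rows: (season, runs) -- illustrative values only
rows = [(2016, 4), (2016, 6), (2017, 1), (2017, 2), (2017, 0)]

# Group runs by season, like df.groupBy("season")
by_season = defaultdict(list)
for season, runs in rows:
    by_season[season].append(runs)

# Aggregate, like .agg(sum(...), avg(...))
summary = {
    season: {"total_runs": sum(r), "avg_runs": sum(r) / len(r)}
    for season, r in by_season.items()
}
print(summary)
# → {2016: {'total_runs': 10, 'avg_runs': 5.0}, 2017: {'total_runs': 3, 'avg_runs': 1.0}}
```

Spark performs the same grouping and aggregation, but distributed across the cluster's partitions.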
SQL Queries:

```sql
SELECT player_name, SUM(runs) AS total_runs
FROM ball_by_ball
GROUP BY player_name
ORDER BY total_runs DESC
LIMIT 10;
```
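The query is portable standard SQL, so it can be tried without Databricks. A sketch using an in-memory SQLite table, where the player names and run values are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ball_by_ball (player_name TEXT, runs INTEGER)")
conn.executemany(
    "INSERT INTO ball_by_ball VALUES (?, ?)",
    [("Kohli", 4), ("Kohli", 6), ("Dhoni", 6), ("Dhoni", 1), ("Raina", 2)],
)

# Same top-batsmen query as above
top = conn.execute(
    """SELECT player_name, SUM(runs) AS total_runs
       FROM ball_by_ball
       GROUP BY player_name
       ORDER BY total_runs DESC
       LIMIT 10"""
).fetchall()
print(top)
# → [('Kohli', 10), ('Dhoni', 7), ('Raina', 2)]
```

In Databricks, the same statement runs against a table or temp view registered from the Spark DataFrame.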
Feel free to fork this repository, make enhancements, and submit pull requests. Your contributions are welcome!
Special thanks to the creators of the datasets and the open-source tools used in this project.