A compilation of big data cleaning and transformation projects built with PySpark to prepare data for analytics consumption, spanning the sports, e-commerce, and object detection domains. The projects demonstrate essential data engineering skills, such as data cleaning, data transformation, and feature engineering, in preparation for insights reporting and model building on the cleansed data.
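To give a flavor of the workflows in these projects, here is a minimal, hypothetical sketch of a typical PySpark cleaning and feature-engineering pass. The column names (`order_date`, `price`) and sample values are illustrative only and do not come from any specific dataset in this repository.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session
spark = SparkSession.builder.appName("cleaning-sketch").getOrCreate()

# Hypothetical raw data: prices arrive as strings and some rows are incomplete
df = spark.createDataFrame(
    [("2024-01-03", "19.99"), ("2024-01-04", None), ("2024-01-05", "5.50")],
    ["order_date", "price"],
)

cleaned = (
    df
    # Cast string columns to proper types
    .withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))
    .withColumn("price", F.col("price").cast("double"))
    # Drop rows with missing values after casting
    .dropna(subset=["price"])
    # Simple feature engineering: derive the day of week from the order date
    .withColumn("order_dow", F.dayofweek("order_date"))
)

cleaned.show()
spark.stop()
```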
For starters:
- Read here on why PySpark is currently among the most essential tools data engineers use to handle big data, thanks to its distributed computing model that replaces traditional single-machine processing.
- Install PySpark here or follow the DataCamp tutorial here. Alternatively, you may clone this repository by jplane to obtain a VS Code devcontainer setup for a local single-node PySpark development environment.
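Whichever installation route you take, a quick sanity check is to spin up a local SparkSession and run a trivial job. This snippet is only a setup check under the assumption of a local single-node install, not part of any project:

```python
from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession using all available cores
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("install-check")
    .getOrCreate()
)

# A trivial job: count a small range to confirm tasks actually execute
print(spark.range(1000).count())  # expected output: 1000
print(spark.version)              # prints the installed Spark version

spark.stop()
```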
Skills: Data Engineering, Data Analytics
Tools: PySpark, Jupyter Notebook
Projects:
- Cleaning the Stanford Dogs Annotation dataset (Posting soon)
- Processing and analyzing Earthquake data (Posting soon)
- E-commerce sales data cleaning and analysis (Posting soon)
- Feature Engineering and Price Prediction of the 2017 St Paul MN Real Estate Data (Posting soon)
- Bundesliga (Germany Football League) analysis (Posting soon)
- Netflix movie cleaning and analysis (Posting soon)
Last updated August 12, 2024