A compilation of big data cleaning and transformation projects performed using PySpark for preparation towards analytics consumption covering the following domains of analytics: Sports, e-commerce, and object detection.

20100215/PySpark_Data_Management_Showcase

PySpark Projects Showcase

A compilation of big data cleaning and transformation projects built with PySpark to prepare data for analytics. The projects span three analytics domains: sports, e-commerce, and object detection. They demonstrate essential data engineering skills such as data cleaning, data transformation, and feature engineering, in preparation for insights reporting and model building on the cleansed data.

For starters:

  1. Read here on why PySpark is currently among the most essential tools data engineers use to handle big data: its distributed computing model spreads work across many cores or machines, as an alternative to traditional single-machine computing.
  2. Install PySpark here OR refer to the DataCamp tutorial here. Alternatively, you may clone this repository by jplane to obtain a VS Code devcontainer setup for a local PySpark development environment (1-node cluster).

Skills: Data Engineering, Data Analytics

Tools: PySpark, Jupyter Notebook

Projects:

  1. Cleaning the Stanford Dogs Annotation dataset (Posting soon)
  2. Processing and analyzing Earthquake data (Posting soon)
  3. E-commerce sales data cleaning and analysis (Posting soon)
  4. Feature Engineering and Price Prediction of the 2017 St Paul MN Real Estate Data (Posting soon)
  5. Bundesliga (German football league) analysis (Posting soon)
  6. Netflix movie data cleaning and analysis (Posting soon)

Last updated August 12, 2024
