A compilation of big data cleaning and transformation projects built with PySpark to prepare data for analytics consumption, spanning the sports, e-commerce, and object detection domains. The projects demonstrate essential data engineering skills, such as data cleaning, data transformation, and feature engineering, in preparation for insights reporting and model building on the cleansed data.
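To give a flavor of the workflows in these projects, here is a minimal, hypothetical sketch of a typical PySpark cleaning and feature-engineering pass. The column names (`order_date`, `price`) and sample values are illustrative only and do not come from any specific dataset in this repository.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session
spark = SparkSession.builder.appName("cleaning-sketch").getOrCreate()

# Hypothetical raw data: prices arrive as strings and some rows are incomplete
df = spark.createDataFrame(
    [("2024-01-03", "19.99"), ("2024-01-04", None), ("2024-01-05", "5.50")],
    ["order_date", "price"],
)

cleaned = (
    df
    # Cast string columns to proper types
    .withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))
    .withColumn("price", F.col("price").cast("double"))
    # Drop rows with missing values after casting
    .dropna(subset=["price"])
    # Simple feature engineering: derive the day of week from the order date
    .withColumn("order_dow", F.dayofweek("order_date"))
)

cleaned.show()
spark.stop()
```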
For starters:
- Read here on why PySpark is currently among the most essential tools data engineers use to handle big data, thanks to its distributed computing model that replaces traditional single-machine processing.
- Install PySpark here or follow the DataCamp tutorial here. Alternatively, you may clone this repository by jplane to obtain a VS Code devcontainer setup for a local single-node PySpark development environment.
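Whichever installation route you take, a quick sanity check is to spin up a local SparkSession and run a trivial job. This snippet is only a setup check under the assumption of a local single-node install, not part of any project:

```python
from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession using all available cores
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("install-check")
    .getOrCreate()
)

# A trivial job: count a small range to confirm tasks actually execute
print(spark.range(1000).count())  # expected output: 1000
print(spark.version)              # prints the installed Spark version

spark.stop()
```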
Skills: Data Engineering, Data Analytics
Tools: PySpark, Jupyter Notebook
Projects:
- Cleaning the Stanford Dogs Annotation dataset (Posting soon)
- Processing and analyzing Earthquake data (Posting soon)
- E-commerce sales data cleaning and analysis (Posting soon)
- Feature Engineering and Price Prediction of the 2017 St Paul MN Real Estate Data (Posting soon)
- Bundesliga (Germany Football League) analysis (Posting soon)
- Netflix movie cleaning and analysis (Posting soon)
Last updated August 12, 2024