This GitHub repository contains code and documentation for building a data warehouse with PySpark SQL, carrying out a comprehensive data analysis, and building predictive models. The project covers data preprocessing, exploratory data analysis (EDA), segmentation analysis, and classification analysis, plus a predictive modeling component based on linear regression.
- Introduction
- Installation
- Data Processing
- Exploratory Data Analysis (EDA)
- Segmentation Analysis
- Classification Analysis
- Predictive Modeling
- Conclusion
This project demonstrates data analysis and modeling techniques using PySpark and Python libraries. It builds a data warehouse from fact and dimension tables and covers extract, transform, and load (ETL) steps, exploratory data analysis, customer segmentation, classification analysis, and predictive modeling.
To run the code in this repository, you need to install the required dependencies. Here are the installation instructions for some key components:
- Installing PySpark (the leading ! is notebook syntax; drop it when running in a terminal):
!pip install pyspark
- Installing additional Python libraries: Pandas, NumPy, Matplotlib, Plotly Express, Seaborn, and other necessary libraries.
- Importing libraries: SparkSession, Pandas, Seaborn, Matplotlib, and NumPy.
- Setting up SparkSession for data processing.
- Importing fact and dimension tables.
- Reading imported tables and performing necessary data transformations.
- Conducting EDA to understand data distribution, relationships, and trends.
- Calculating customer age and performing data visualizations to gain insights.
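
One way to derive customer age for the EDA step is shown below. It is a sketch under assumptions: the `birth_date` column name, the sample dates, and the `compute_age` helper are hypothetical, and Pandas is used here although the same logic could be written in PySpark.

```python
import pandas as pd


def compute_age(birth_dates: pd.Series, as_of: pd.Timestamp) -> pd.Series:
    """Return age in whole years on the `as_of` date (hypothetical helper)."""
    # A customer has had this year's birthday if the (month, day) pair
    # is on or before the reference date's (month, day).
    had_birthday = (birth_dates.dt.month < as_of.month) | (
        (birth_dates.dt.month == as_of.month) & (birth_dates.dt.day <= as_of.day)
    )
    # Subtract one year for customers whose birthday is still ahead.
    return as_of.year - birth_dates.dt.year - (~had_birthday).astype(int)


# Illustrative data only.
customers = pd.DataFrame(
    {"birth_date": pd.to_datetime(["1990-06-15", "2000-01-01", "1985-12-31"])}
)
customers["age"] = compute_age(customers["birth_date"], pd.Timestamp("2024-06-14"))
ages = customers["age"].tolist()
```

Comparing month and day explicitly avoids the off-by-one errors that come from dividing a day count by 365.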
- Performing customer segmentation using K-means clustering.
- Identifying and interpreting clusters based on customer demographics and behavior.
- Visualizing segmentation results.
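
The segmentation steps above might look like the following sketch. scikit-learn is an assumption here (the README does not name the K-means implementation), and the features `age` and `annual_spend` and the synthetic customers are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic customers: a younger, low-spend group and an older, high-spend group.
rng = np.random.default_rng(0)
customers = pd.DataFrame(
    {
        "age": np.concatenate([rng.normal(25, 2, 50), rng.normal(60, 2, 50)]),
        "annual_spend": np.concatenate(
            [rng.normal(200, 20, 50), rng.normal(1000, 20, 50)]
        ),
    }
)

# Scale features so that spend (large values) does not dominate the distance.
features = StandardScaler().fit_transform(customers[["age", "annual_spend"]])

# Fit K-means with a fixed seed so the assignment is reproducible.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
customers["segment"] = kmeans.fit_predict(features)

# Per-segment averages are a simple starting point for interpreting clusters.
segment_profile = customers.groupby("segment")[["age", "annual_spend"]].mean()
```

In practice the number of clusters would be chosen with a diagnostic such as the elbow method or silhouette score rather than fixed in advance.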
- Implementing classification analysis using Decision Tree and Random Forest classifiers.
- Handling class imbalance by oversampling the minority class with SMOTE.
- Evaluating and visualizing classification model performance.
- Building a predictive model using Linear Regression.
- Preprocessing data and training the model.
- Evaluating the model's performance and visualizing the results.
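
A minimal sketch of the predictive modeling steps above, assuming scikit-learn's `LinearRegression`. The README does not say which implementation the project uses, and the data here is synthetic with a known linear relationship.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data following y = 3x + 2 plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.5, size=200)

# Hold out a test set so the evaluation reflects unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# R^2 on the held-out set measures how much variance the model explains.
r2 = r2_score(y_test, model.predict(X_test))
```

Plotting predictions against actual values on the test set is a common way to visualize the fit alongside the R^2 score.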
- Summarizing key findings and insights from the analysis.
- Highlighting the success of the predictive model.
- Providing insights into the dataset and areas for further exploration.
This GitHub repository serves as a comprehensive guide to the analysis and modeling process, providing both code and documentation to replicate the results and gain a deeper understanding of the dataset.