PySpark Data Warehouse and Predictive Modeling

Overview

This GitHub repository contains code and documentation for building a data warehouse with PySpark SQL, analyzing the resulting data, and building predictive models. The project covers data preprocessing, exploratory data analysis (EDA), segmentation analysis, and classification analysis. It also includes a predictive modeling component based on linear regression.

Introduction

This project demonstrates data analysis and modeling techniques using PySpark and Python libraries. It builds a data warehouse from fact and dimension tables and covers data extraction, transformation, and loading (ETL), exploratory data analysis, customer segmentation, classification analysis, and predictive modeling.

Installation

To run the code in this repository, install the required dependencies. The key components are:

Installing PySpark: run !pip install pyspark in a notebook cell (drop the leading ! when running from a shell).
Importing libraries: import SparkSession, pandas, Seaborn, Matplotlib, and NumPy.
Installing additional Python libraries: pandas, NumPy, Matplotlib, Plotly Express, Seaborn, and any other libraries the notebook imports.
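
Before running the notebook, it can help to confirm that the libraries listed above are importable. The following is a minimal sketch, not part of the repository: the `REQUIRED` list and the `missing_packages` helper are illustrative names, and the list assumes Plotly Express is provided by the `plotly` package.

```python
import importlib.util

# Import names assumed from the libraries listed above; adjust to match the notebook.
REQUIRED = ["pyspark", "pandas", "numpy", "matplotlib", "seaborn", "plotly"]


def missing_packages(names):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

Any name reported as missing can be installed with pip before opening the notebook.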

Data Processing

Setting up SparkSession for data processing.
Importing fact and dimension tables.
Reading imported tables and performing necessary data transformations.

Exploratory Data Analysis (EDA)

Conducting EDA to understand data distribution, relationships, and trends.
Calculating customer age and performing data visualizations to gain insights.
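
One way to derive customer age is shown below. It is a sketch that assumes pandas, a hypothetical `birth_date` column, and a fixed reference date; the helper name `add_age` is illustrative and not taken from the notebook.

```python
import pandas as pd


def add_age(df, birth_col="birth_date", as_of="2024-01-01"):
    """Return a copy of df with an integer 'age' column, in whole years as of `as_of`."""
    as_of = pd.Timestamp(as_of)
    birth = pd.to_datetime(df[birth_col])
    # A customer has had this year's birthday if its month/day is on or before as_of.
    had_birthday = (birth.dt.month < as_of.month) | (
        (birth.dt.month == as_of.month) & (birth.dt.day <= as_of.day)
    )
    out = df.copy()
    # Subtract one year when the birthday has not yet occurred in the reference year.
    out["age"] = as_of.year - birth.dt.year - (~had_birthday).astype(int)
    return out


# Hypothetical sample data for illustration only.
customers = pd.DataFrame(
    {"customer_id": [1, 2], "birth_date": ["1990-06-15", "2000-01-01"]}
)
with_age = add_age(customers, as_of="2024-01-01")
print(with_age)
```

Using a fixed reference date keeps the computed ages reproducible between runs; the resulting column can then feed histograms or other plots.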

Segmentation Analysis

Performing customer segmentation using K-means clustering.
Identifying and interpreting clusters based on customer demographics and behavior.
Visualizing segmentation results.
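
A minimal K-means sketch follows. The README does not say which library the notebook uses for clustering, so this example assumes scikit-learn; the two features (age and total spend), the sample values, and the `segment_customers` helper are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def segment_customers(features, n_clusters=2, seed=42):
    """Scale the features and return one K-means cluster label per row."""
    # Scaling keeps a large-valued feature (spend) from dominating the distance.
    scaled = StandardScaler().fit_transform(features)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(scaled)


# Hypothetical customers: [age, total_spend].
features = np.array(
    [
        [22, 150.0],
        [25, 200.0],
        [24, 180.0],
        [58, 2200.0],
        [61, 2500.0],
        [60, 2400.0],
    ]
)
labels = segment_customers(features)
print(labels)
```

The numeric cluster IDs are arbitrary; interpretation comes from comparing feature averages within each cluster.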

Classification Analysis

Implementing classification analysis using Decision Trees and Random Forest Classification.
Handling class imbalance by oversampling the minority class with SMOTE.
Evaluating and visualizing classification model performance.

Predictive Modeling

Building a predictive model using Linear Regression.
Preprocessing data and training the model.
Evaluating the model's performance and visualizing the results.
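
A compact version of this train-and-evaluate loop is sketched below. The README does not name the regression library, so scikit-learn is assumed; the data is synthetic with known coefficients, which makes it easy to check that the fit behaves as expected.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: y = 3*x0 - 2*x1 + 5 plus a little Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(0, 0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# Evaluate on held-out data only.
r2 = r2_score(y_test, pred)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
print(f"R^2 = {r2:.4f}, RMSE = {rmse:.4f}")
print("Coefficients:", model.coef_, "Intercept:", model.intercept_)
```

On real data, a scatter plot of predicted against actual values is a simple way to visualize the fit.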

Conclusion

Summarizing key findings and insights from the analysis.
Highlighting the success of the predictive model.
Providing insights into the dataset and areas for further exploration.

This GitHub repository serves as a comprehensive guide to the analysis and modeling process, providing both code and documentation to replicate the results and gain a deeper understanding of the dataset.

Adettoun/Pyspark-Data-warehousing
