
PySpark Data Warehousing and Predictive Modeling

Overview

This GitHub repository contains code and documentation for building a data warehouse with PySpark SQL, performing comprehensive data analysis, and developing predictive models. The project covers a wide range of tasks, including data preprocessing, exploratory data analysis (EDA), segmentation analysis, and classification analysis. Additionally, it includes a predictive modeling component using linear regression.

Table of Contents

  1. Introduction
  2. Installation
  3. Data Processing
  4. Exploratory Data Analysis (EDA)
  5. Segmentation Analysis
  6. Classification Analysis
  7. Predictive Modeling
  8. Conclusion

Introduction

This project demonstrates various data analysis and modeling techniques using PySpark and Python libraries. It involves creating a data warehouse from facts and dimensions tables, and covers tasks such as data extraction, transformation, and loading (ETL), exploratory data analysis, customer segmentation, classification analysis, and predictive modeling.

Installation

To run the code in this repository, you need to install the required dependencies. Here are the installation instructions for the key components:

  • Installing PySpark: pip install pyspark (prefix with ! when running in a notebook)
  • Installing additional Python libraries: Pandas, NumPy, Matplotlib, Plotly Express, Seaborn, and scikit-learn
  • Importing libraries: import SparkSession along with the Pandas, Seaborn, Matplotlib, and NumPy libraries

Data Processing

  • Setting up SparkSession for data processing.
  • Importing fact and dimension tables.
  • Reading imported tables and performing necessary data transformations.

Exploratory Data Analysis (EDA)

  • Conducting EDA to understand data distribution, relationships, and trends.
  • Calculating customer age and performing data visualizations to gain insights.
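
The customer-age calculation can be sketched in pandas as follows; the column names and birth dates are hypothetical, and a fixed reference date is used so the result is reproducible:

```python
import pandas as pd

# Hypothetical customer dimension with birth dates.
customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "birth_date": pd.to_datetime(["1990-06-15", "1985-01-02", "2000-11-30"]),
})

# Age in completed years as of a fixed reference date: subtract birth year,
# then subtract one more if the birthday has not yet occurred that year.
as_of = pd.Timestamp("2024-01-01")
customers["age"] = customers["birth_date"].apply(
    lambda b: as_of.year - b.year - ((as_of.month, as_of.day) < (b.month, b.day))
)

print(customers[["customer_id", "age"]])
```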

Segmentation Analysis

  • Performing customer segmentation using K-means clustering.
  • Identifying and interpreting clusters based on customer demographics and behavior.
  • Visualizing segmentation results.
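
A small K-means sketch in the spirit of the steps above, using synthetic (age, annual spend) features as stand-ins for the project's demographic and behavioral variables:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Two synthetic customer groups: younger low spenders and older high spenders.
young_low = rng.normal(loc=[25, 200], scale=[3, 30], size=(50, 2))
older_high = rng.normal(loc=[55, 900], scale=[4, 60], size=(50, 2))
X = np.vstack([young_low, older_high])

# Fit K-means with two clusters; each customer gets a cluster label,
# which can then be interpreted and visualized (e.g. a colored scatter plot).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
```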

Classification Analysis

  • Implementing classification analysis using Decision Trees and Random Forest Classification.
  • Handling oversampling using SMOTE.
  • Evaluating and visualizing classification model performance.

Predictive Modeling

  • Building a predictive model using Linear Regression.
  • Preprocessing data and training the model.
  • Evaluating the model's performance and visualizing the results.
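
A minimal linear-regression sketch following the same shape, on synthetic data (the income/spend relationship and its coefficients are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic data: spend roughly linear in income, plus noise.
income = rng.uniform(20_000, 100_000, size=200).reshape(-1, 1)
spend = 0.3 * income.ravel() + 500 + rng.normal(0, 1_000, size=200)

# Train the model and evaluate its fit with R^2; predictions vs. actuals
# could then be plotted to visualize the results.
model = LinearRegression().fit(income, spend)
r2 = r2_score(spend, model.predict(income))
```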

Conclusion

  • Summarizing key findings and insights from the analysis.
  • Highlighting the success of the predictive model.
  • Providing insights into the dataset and areas for further exploration.

This GitHub repository serves as a comprehensive guide to the analysis and modeling process, providing both code and documentation to replicate the results and gain a deeper understanding of the dataset.
