A research project inspired by Yandex ML trainings. It predicts house prices in Ames, Iowa, combining exploratory data analysis (EDA) with visual comparisons of regression models.
The project includes:
- Preprocessing of continuous and categorical features
- Custom regression models
- One-hot encoding for categorical variables
- Training pipelines
- Visualization of the data and of model performance via metrics such as MAE and RMSLE
This project builds a regression model to predict housing prices in Ames, Iowa.
I use:
- Continuous data transformations: scaling.
- Categorical data encoding via `OneHotEncoder`.
- Custom linear regression using stochastic gradient descent, with optional L2 regularization.
- Hyperparameter tuning: grid search and cross-validation (see the sketch below).
- Logarithmic transformation of the target variable for improved model stability.
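To give a feel for the tuning setup, here is a minimal sketch assuming scikit-learn and a purely numeric feature matrix; the parameter grid and variable names are illustrative, not the project's exact configuration:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Ridge trained on log-prices: fit on log1p(y), map predictions back with expm1.
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), Ridge()),
    func=np.log1p,
    inverse_func=np.expm1,
)

# Hypothetical grid; the real search space lives in notebook.ipynb.
param_grid = {"regressor__ridge__alpha": [0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(model, param_grid, cv=5, scoring="neg_mean_absolute_error")
# search.fit(X_train[numeric_columns], y_train)  # numeric_columns: continuous features
```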
The data is split into training and testing sets. Key columns of the Ames Housing dataset include:
- Continuous features, e.g. `Year_Built`.
- Categorical features, e.g. `Overall_Qual`.
- Target variable: `Sale_Price`.
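For reference, a minimal sketch of the load-and-split step; the CSV path and the test fraction are assumptions, not the project's exact settings:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical path; adjust to wherever the Ames Housing CSV lives locally.
data = pd.read_csv("data/ames_housing.csv")

X = data.drop(columns=["Sale_Price"])
y = data["Sale_Price"]

# Hold out a test set for the final comparison of the pipelines.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```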
- Clone the repository:

  ```bash
  git clone https://github.com/estnafinema0/Housing-Price-Analysis.git
  cd Housing-Price-Analysis
  ```

- Create and activate a virtual environment (commands shown for Linux):

  ```bash
  python -m venv venv
  source venv/bin/activate
  ```

- Install the requirements:

  ```bash
  pip install -r requirements.txt
  ```
The complete analysis is in `notebook.ipynb`. Just open it and run the cells in sequence!
If you want to check particular parts of the project, look at the `scripts/` folder.
- `BaseDataPreprocessor`
  - Picks out the numeric columns you want to use
  - Puts all of them on the same scale
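A minimal sketch of what `BaseDataPreprocessor` roughly does (the real implementation lives in `scripts/`; the constructor argument is an assumption):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class BaseDataPreprocessor(BaseEstimator, TransformerMixin):
    """Select numeric columns and standardize them."""

    def __init__(self, needed_columns=None):
        self.needed_columns = needed_columns

    def fit(self, data, y=None):
        # Default to every numeric column if no explicit list is given.
        if self.needed_columns is not None:
            self.columns_ = list(self.needed_columns)
        else:
            self.columns_ = list(data.select_dtypes(include=np.number).columns)
        self.scaler_ = StandardScaler().fit(data[self.columns_])
        return self

    def transform(self, data):
        return self.scaler_.transform(data[self.columns_])
```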
- `SmartDataPreprocessor`
  - Adds helpful new features, like the distance to the city center
  - Fills missing data with median values
  - Scales the numbers so the model can use them
  - Makes predictions better by using real-world knowledge
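A rough sketch of the idea; the city-center coordinates and column names are assumptions, and the actual feature engineering lives in `scripts/`:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

class SmartDataPreprocessor(BaseEstimator, TransformerMixin):
    """Add a distance-to-center feature, impute medians, and scale."""

    # Hypothetical coordinates for the Ames city center (Latitude, Longitude).
    CENTER = (42.03, -93.62)

    def fit(self, data, y=None):
        features = self._add_distance(data)
        self.imputer_ = SimpleImputer(strategy="median").fit(features)
        self.scaler_ = StandardScaler().fit(self.imputer_.transform(features))
        return self

    def transform(self, data):
        features = self._add_distance(data)
        return self.scaler_.transform(self.imputer_.transform(features))

    def _add_distance(self, data):
        out = data.select_dtypes(include=np.number).copy()
        out["Distance_To_Center"] = np.sqrt(
            (data["Latitude"] - self.CENTER[0]) ** 2
            + (data["Longitude"] - self.CENTER[1]) ** 2
        )
        return out
```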
- `OneHotPreprocessor`
  - Built on top of `BaseDataPreprocessor`
  - Turns text data (like house zones and sale types) into numbers the model can use
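Conceptually it extends the base preprocessor with one-hot encoded categorical columns, along these lines (building on the `BaseDataPreprocessor` sketch above; the categorical column list is an assumption):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

class OneHotPreprocessor(BaseDataPreprocessor):
    """Base numeric preprocessing plus one-hot encoded categorical columns."""

    # Hypothetical categorical columns; the real list is defined in scripts/.
    CATEGORICAL = ["MS_Zoning", "Sale_Type", "Sale_Condition"]

    def fit(self, data, y=None):
        super().fit(data, y)
        self.encoder_ = OneHotEncoder(handle_unknown="ignore").fit(data[self.CATEGORICAL])
        return self

    def transform(self, data):
        numeric = super().transform(data)
        categorical = self.encoder_.transform(data[self.CATEGORICAL]).toarray()
        return np.hstack([numeric, categorical])
```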
- `ExponentialLinearRegression`
  - A special version of the `Ridge` model
  - Moves house prices to a log scale while learning
  - Changes them back when making predictions
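In spirit it is a thin wrapper around `Ridge` that fits on log-prices and exponentiates its predictions back, along the lines of this sketch:

```python
import numpy as np
from sklearn.linear_model import Ridge

class ExponentialLinearRegression(Ridge):
    """Ridge regression trained on log-prices, predicting on the price scale."""

    def fit(self, X, y):
        # Learn on the log scale, where the price distribution is closer to normal.
        return super().fit(X, np.log(y))

    def predict(self, X):
        # Map predictions back to dollars.
        return np.exp(super().predict(X))
```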
- `SGDLinearRegressor`
  - A linear model that learns step by step with stochastic gradient descent
  - Keeps track of the loss while it learns
  - Shows you how it improves over time
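A compressed sketch of the training loop; the default hyperparameters here are assumptions, and the full version is in `scripts/`:

```python
import numpy as np

class SGDLinearRegressor:
    """Mini-batch SGD for linear regression on the MSE loss, with a loss history."""

    def __init__(self, lr=0.01, n_iterations=1000, batch_size=64, l2=0.0, random_state=42):
        self.lr, self.n_iterations, self.batch_size, self.l2 = lr, n_iterations, batch_size, l2
        self.rng = np.random.default_rng(random_state)

    def fit(self, X, y):
        X, y = np.asarray(X, float), np.asarray(y, float)
        self.w_ = np.zeros(X.shape[1])
        self.b_ = 0.0
        self.loss_history_ = []
        for _ in range(self.n_iterations):
            idx = self.rng.choice(len(X), size=self.batch_size, replace=False)
            Xb, yb = X[idx], y[idx]
            error = Xb @ self.w_ + self.b_ - yb
            # Gradient of the MSE loss plus an optional L2 penalty on the weights.
            grad_w = 2 * Xb.T @ error / len(Xb) + 2 * self.l2 * self.w_
            grad_b = 2 * error.mean()
            self.w_ -= self.lr * grad_w
            self.b_ -= self.lr * grad_b
            self.loss_history_.append(float((error ** 2).mean()))
        return self

    def predict(self, X):
        return np.asarray(X, float) @ self.w_ + self.b_
```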
We have several ready-to-use pipelines to make predictions:
- `make_base_pipeline()`
  - The simple version that works with just the numeric features
  - Uses `BaseDataPreprocessor` and a basic `Ridge` model
- `make_onehot_pipeline()`
  - Our best performer!
  - Handles both numbers and categories (like house zones)
  - Uses `OneHotPreprocessor` to turn text into numbers
- `make_smart_pipeline()`
  - Uses `SmartDataPreprocessor` to add helpful new features
  - Good for when you want to use location data
Each pipeline combines data preparation and model training into one easy step. Just use `fit()` and `predict()`! 😊
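For example (the import path is an assumption; check the `scripts/` folder for the actual module layout):

```python
# Hypothetical import path; see scripts/ for where the factory functions live.
from scripts.pipelines import make_onehot_pipeline

pipeline = make_onehot_pipeline()
pipeline.fit(X_train, y_train)          # one call: preprocessing + model training
predictions = pipeline.predict(X_test)  # predictions come back on the price scale
```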
Step-by-step visualizations.
The distribution of house prices shows a clear right-skew pattern. We found that log-transformation makes the data more normally distributed, which helps our models perform better.
Location-based features (`Longitude`, `Latitude`) could be important for `Sale_Price`, and they show only weak correlations with the other features.
We see interesting neighborhood patterns:
- Higher-priced clusters in northern areas
- Price variations more tied to neighborhood than distance to center
Comparing the pipelines on the test set:
- OneHot Pipeline leads with the lowest MAE (~18,000)
- Clear performance ranking: OneHot > Exponential > Base > SGD
The SGD Regressor's training shows:
- Loss stabilization around 800 iterations
- After 800 iterations the model's performance starts to degrade due to overfitting.
The scatter plots show that the OneHot Pipeline follows the ideal prediction line most closely.
💡 Check out `notebook.ipynb` to recreate these visualizations.
| Model | MAE ($) | RMSLE | Notes |
|---|---|---|---|
| OneHot Pipeline | 18,000 | 0.155 | Best performer! Great with categorical features |
| Base Pipeline | 23,000 | 0.190 | Simple but stable baseline |
| Exponential Pipeline | 20,500 | 0.182 | Good with the price distribution |
| SGD Regressor | 26,000 | 0.200 | Shows instability after 800 iterations |
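Both metrics can be reproduced with scikit-learn; a minimal sketch, reusing the variable names from the earlier sketches:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_log_error

predictions = pipeline.predict(X_test)
mae = mean_absolute_error(y_test, predictions)
# RMSLE is the square root of the mean squared logarithmic error.
rmsle = np.sqrt(mean_squared_log_error(y_test, predictions))
print(f"MAE: {mae:,.0f}   RMSLE: {rmsle:.3f}")
```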
💡 Key results:
- OneHot Pipeline shows best results across both metrics
- SGD Regressor needs more tuning to compete with other models
Thanks for checking out the project!