A research project inspired by Yandex ML trainings. It predicts house prices in Ames, Iowa, combining exploratory data analysis (EDA) with visual comparisons of regression models.
The project includes:
- Preprocessing of continuous and categorical features
- Custom regression models
- One-hot encoding for categorical variables
- Training pipelines
- Visualization of the data and of model performance via metrics such as MAE and RMSLE
This project builds a regression model to predict housing prices in Ames, Iowa.
I use:
- Continuous data transformations: scaling.
- Categorical data encoding via `OneHotEncoder`.
- Custom linear regression using stochastic gradient descent, with optional L2 regularization.
- Hyperparameter tuning: grid search and cross-validation (see the sketch below).
- Logarithmic transformation of the target variable for improved model stability.
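To give a feel for the tuning setup, here is a minimal sketch assuming scikit-learn and a purely numeric feature matrix; the parameter grid and variable names are illustrative, not the project's exact configuration:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Ridge trained on log-prices: fit on log1p(y), map predictions back with expm1.
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), Ridge()),
    func=np.log1p,
    inverse_func=np.expm1,
)

# Hypothetical grid; the real search space lives in notebook.ipynb.
param_grid = {"regressor__ridge__alpha": [0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(model, param_grid, cv=5, scoring="neg_mean_absolute_error")
# search.fit(X_train[numeric_columns], y_train)  # numeric_columns: continuous features
```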
The data is split into training and testing sets. Key columns of the Ames Housing dataset include:
- Continuous features, e.g. `Year_Built`.
- Categorical features, e.g. `Overall_Qual`.
- Target variable: `Sale_Price`.
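For reference, a minimal sketch of the load-and-split step; the CSV path and the test fraction are assumptions, not the project's exact settings:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical path; adjust to wherever the Ames Housing CSV lives locally.
data = pd.read_csv("data/ames_housing.csv")

X = data.drop(columns=["Sale_Price"])
y = data["Sale_Price"]

# Hold out a test set for the final comparison of the pipelines.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```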
- Clone the repository:

  ```bash
  git clone https://github.com/estnafinema0/Housing-Price-Analysis.git
  cd Housing-Price-Analysis
  ```

- Create and activate a virtual environment (commands shown for Linux):

  ```bash
  python -m venv venv
  source venv/bin/activate
  ```

- Install the requirements:

  ```bash
  pip install -r requirements.txt
  ```
The complete analysis is in `notebook.ipynb`. Just open it and run the cells in sequence!
If you want to check particular parts of the project, look at the `scripts/` folder.
- `BaseDataPreprocessor`
  - Picks out the numeric columns you want to use
  - Puts all of them on the same scale
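A minimal sketch of what `BaseDataPreprocessor` roughly does (the real implementation lives in `scripts/`; the constructor argument is an assumption):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class BaseDataPreprocessor(BaseEstimator, TransformerMixin):
    """Select numeric columns and standardize them."""

    def __init__(self, needed_columns=None):
        self.needed_columns = needed_columns

    def fit(self, data, y=None):
        # Default to every numeric column if no explicit list is given.
        if self.needed_columns is not None:
            self.columns_ = list(self.needed_columns)
        else:
            self.columns_ = list(data.select_dtypes(include=np.number).columns)
        self.scaler_ = StandardScaler().fit(data[self.columns_])
        return self

    def transform(self, data):
        return self.scaler_.transform(data[self.columns_])
```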
- `SmartDataPreprocessor`
  - Adds helpful new features, like the distance to the city center
  - Fills missing data with median values
  - Scales the numbers so the model can use them
  - Makes predictions better by using real-world knowledge
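A rough sketch of the idea; the city-center coordinates and column names are assumptions, and the actual feature engineering lives in `scripts/`:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

class SmartDataPreprocessor(BaseEstimator, TransformerMixin):
    """Add a distance-to-center feature, impute medians, and scale."""

    # Hypothetical coordinates for the Ames city center (Latitude, Longitude).
    CENTER = (42.03, -93.62)

    def fit(self, data, y=None):
        features = self._add_distance(data)
        self.imputer_ = SimpleImputer(strategy="median").fit(features)
        self.scaler_ = StandardScaler().fit(self.imputer_.transform(features))
        return self

    def transform(self, data):
        features = self._add_distance(data)
        return self.scaler_.transform(self.imputer_.transform(features))

    def _add_distance(self, data):
        out = data.select_dtypes(include=np.number).copy()
        out["Distance_To_Center"] = np.sqrt(
            (data["Latitude"] - self.CENTER[0]) ** 2
            + (data["Longitude"] - self.CENTER[1]) ** 2
        )
        return out
```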
- `OneHotPreprocessor`
  - Built on top of `BaseDataPreprocessor`
  - Turns text data (like house zones and sale types) into numbers the model can use
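Conceptually it extends the base preprocessor with one-hot encoded categorical columns, along these lines (building on the `BaseDataPreprocessor` sketch above; the categorical column list is an assumption):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

class OneHotPreprocessor(BaseDataPreprocessor):
    """Base numeric preprocessing plus one-hot encoded categorical columns."""

    # Hypothetical categorical columns; the real list is defined in scripts/.
    CATEGORICAL = ["MS_Zoning", "Sale_Type", "Sale_Condition"]

    def fit(self, data, y=None):
        super().fit(data, y)
        self.encoder_ = OneHotEncoder(handle_unknown="ignore").fit(data[self.CATEGORICAL])
        return self

    def transform(self, data):
        numeric = super().transform(data)
        categorical = self.encoder_.transform(data[self.CATEGORICAL]).toarray()
        return np.hstack([numeric, categorical])
```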
- `ExponentialLinearRegression`
  - A special version of the `Ridge` model
  - Moves house prices to a log scale while learning
  - Changes them back when making predictions
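In spirit it is a thin wrapper around `Ridge` that fits on log-prices and exponentiates its predictions back, along the lines of this sketch:

```python
import numpy as np
from sklearn.linear_model import Ridge

class ExponentialLinearRegression(Ridge):
    """Ridge regression trained on log-prices, predicting on the price scale."""

    def fit(self, X, y):
        # Learn on the log scale, where the price distribution is closer to normal.
        return super().fit(X, np.log(y))

    def predict(self, X):
        # Map predictions back to dollars.
        return np.exp(super().predict(X))
```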
- `SGDLinearRegressor`
  - A linear model that learns step by step with stochastic gradient descent
  - Keeps track of the loss while it learns
  - Shows you how it improves over time
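A compressed sketch of the training loop; the default hyperparameters here are assumptions, and the full version is in `scripts/`:

```python
import numpy as np

class SGDLinearRegressor:
    """Mini-batch SGD for linear regression on the MSE loss, with a loss history."""

    def __init__(self, lr=0.01, n_iterations=1000, batch_size=64, l2=0.0, random_state=42):
        self.lr, self.n_iterations, self.batch_size, self.l2 = lr, n_iterations, batch_size, l2
        self.rng = np.random.default_rng(random_state)

    def fit(self, X, y):
        X, y = np.asarray(X, float), np.asarray(y, float)
        self.w_ = np.zeros(X.shape[1])
        self.b_ = 0.0
        self.loss_history_ = []
        for _ in range(self.n_iterations):
            idx = self.rng.choice(len(X), size=self.batch_size, replace=False)
            Xb, yb = X[idx], y[idx]
            error = Xb @ self.w_ + self.b_ - yb
            # Gradient of the MSE loss plus an optional L2 penalty on the weights.
            grad_w = 2 * Xb.T @ error / len(Xb) + 2 * self.l2 * self.w_
            grad_b = 2 * error.mean()
            self.w_ -= self.lr * grad_w
            self.b_ -= self.lr * grad_b
            self.loss_history_.append(float((error ** 2).mean()))
        return self

    def predict(self, X):
        return np.asarray(X, float) @ self.w_ + self.b_
```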
We have several ready-to-use pipelines to make predictions:
- `make_base_pipeline()`
  - The simple version that works with just the numeric features
  - Uses `BaseDataPreprocessor` and a basic `Ridge` model
- `make_onehot_pipeline()`
  - Our best performer!
  - Handles both numbers and categories (like house zones)
  - Uses `OneHotPreprocessor` to turn text into numbers
- `make_smart_pipeline()`
  - Uses `SmartDataPreprocessor` to add helpful new features
  - Good for when you want to use location data
Each pipeline combines data preparation and model training into one easy step. Just use `fit()` and `predict()`! 😊
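For example (the import path is an assumption; check the `scripts/` folder for the actual module layout):

```python
# Hypothetical import path; see scripts/ for where the factory functions live.
from scripts.pipelines import make_onehot_pipeline

pipeline = make_onehot_pipeline()
pipeline.fit(X_train, y_train)          # one call: preprocessing + model training
predictions = pipeline.predict(X_test)  # predictions come back on the price scale
```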
Step-by-step visualizations.
The distribution of house prices shows a clear right-skew pattern. We found that log-transformation makes the data more normally distributed, which helps our models perform better.
Location-based features (`Longitude`, `Latitude`) could be important for `Sale_Price`, and they show only weak correlations with the other features.
We see interesting neighborhood patterns:
- Higher-priced clusters in northern areas
- Price variations more tied to neighborhood than distance to center
Comparing the pipelines on the test set:
- OneHot Pipeline leads with the lowest MAE (~18,000)
- Clear performance ranking: OneHot > Exponential > Base > SGD
The SGD Regressor's training shows:
- Loss stabilization around 800 iterations
- After 800 iterations the model's performance starts to degrade due to overfitting.
The scatter plots show that the OneHot Pipeline follows the ideal prediction line most closely.
💡 Check out `notebook.ipynb` to recreate these visualizations.
| Model | MAE ($) | RMSLE | Notes |
|---|---|---|---|
| OneHot Pipeline | 18,000 | 0.155 | Best performer! Great with categorical features |
| Base Pipeline | 23,000 | 0.190 | Simple but stable baseline |
| Exponential Pipeline | 20,500 | 0.182 | Good with the price distribution |
| SGD Regressor | 26,000 | 0.200 | Shows instability after 800 iterations |
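Both metrics can be reproduced with scikit-learn; a minimal sketch, reusing the variable names from the earlier sketches:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_log_error

predictions = pipeline.predict(X_test)
mae = mean_absolute_error(y_test, predictions)
# RMSLE is the square root of the mean squared logarithmic error.
rmsle = np.sqrt(mean_squared_log_error(y_test, predictions))
print(f"MAE: {mae:,.0f}   RMSLE: {rmsle:.3f}")
```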
💡 Key results:
- OneHot Pipeline shows best results across both metrics
- SGD Regressor needs more tuning to compete with other models
Thanks for checking out the project!