
House Prices | Machine Learning project

Predict house sale prices with a full preprocessing pipeline and advanced regression techniques (KNN, Decision Trees, Random Forest, Boosted Trees, Stacking, XGBoost)

Data: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques

My attempt: https://github.com/nganguyenny/Kaggle-House-Predict

1. Project Setup

  • Load libraries
  • Load datasets (uploaded to GCP storage)
  • Create the scoring metric (RMSLE); see the sketch below
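
A minimal sketch of how such an RMSLE scorer can be built with scikit-learn's make_scorer; the exact implementation in the notebook may differ.

```python
import numpy as np
from sklearn.metrics import make_scorer, mean_squared_log_error


def rmsle(y_true, y_pred):
    """Root Mean Squared Log Error; negative predictions are clipped to 0 before taking logs."""
    return np.sqrt(mean_squared_log_error(y_true, np.maximum(y_pred, 0)))


# greater_is_better=False because scikit-learn maximises scores during model selection
rmsle_scorer = make_scorer(rmsle, greater_is_better=False)
```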

2. Model Testing for Numeric Features

  • Baseline scores, then RandomizedSearchCV to find the best hyperparameters for:
  • DecisionTreeRegressor, AdaBoostRegressor, GradientBoostingRegressor, XGBRegressor
  • RandomForestRegressor (also extracting feature importances)
  • Stacking of the tuned models (see the search sketch below)
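
As an illustration, a hedged sketch of the RandomizedSearchCV step for one of these models (GradientBoostingRegressor); the parameter grid and the numeric-only training frames `X_train_num` / `y_train` are assumptions, not the notebook's exact values, and `rmsle_scorer` is the scorer sketched in section 1.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical search space; the notebook's actual grids may differ.
param_distributions = {
    "n_estimators": [100, 300, 500],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [2, 3, 4],
    "subsample": [0.6, 0.8, 1.0],
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=20,
    scoring=rmsle_scorer,  # scorer defined in section 1
    cv=5,
    random_state=42,
)
# search.fit(X_train_num, y_train)            # numeric features only at this stage
# search.best_params_, -search.best_score_
```

The same pattern repeats for the other estimators; for the Random Forest, the fitted model's `feature_importances_` attribute provides the feature ranking mentioned above.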

3. Baseline Pipeline

  • Feature overview (numerical vs categorical features)
  • Preprocessing (impute, scale, encode)
  • Estimator (with DecisionTreeRegressor)
  • Cross-validation
  • Prediction (see the pipeline sketch below)
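
A minimal sketch of what such a baseline pipeline can look like; the column lists are hypothetical examples (the notebook derives them from the dataframe dtypes), and the hyperparameters are illustrative.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Hypothetical column subsets, for illustration only.
numeric_features = ["LotArea", "GrLivArea", "OverallQual"]
categorical_features = ["Neighborhood", "HouseStyle"]

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_features),
    ("cat", categorical_pipe, categorical_features),
])

baseline = Pipeline([
    ("preprocess", preprocess),
    ("model", DecisionTreeRegressor(random_state=42)),
])

# scores = cross_val_score(baseline, X_train, y_train, scoring=rmsle_scorer, cv=5)
# baseline.fit(X_train, y_train); predictions = baseline.predict(X_test)
```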

4. Features Preprocessing

  • Preprocessing: encoding ordinal and nominal categorical features, numerical features, and cyclical features, then assembling the full preprocessing pipeline (see the encoding sketch below)
  • Feature investigation: numerical features (VIF to measure multicollinearity), ordinal features (Spearman correlation), univariate feature selection (SelectPercentile)
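
A hedged sketch of the ordinal and cyclical encoders mentioned above; the quality scale, the column choices, and the month-to-sin/cos mapping are illustrative assumptions rather than the notebook's exact code.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer, OrdinalEncoder

# Hypothetical ordered quality scale applied to two example columns (e.g. ExterQual, KitchenQual).
quality_scale = ["Po", "Fa", "TA", "Gd", "Ex"]
ordinal_encoder = OrdinalEncoder(categories=[quality_scale, quality_scale])


def month_to_sin_cos(X):
    """Encode a 1-12 month column as sin/cos pairs so December and January stay adjacent."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([np.sin(2 * np.pi * X / 12), np.cos(2 * np.pi * X / 12)])


cyclical_encoder = FunctionTransformer(month_to_sin_cos)  # e.g. for MoSold
```

For the multicollinearity check, `statsmodels.stats.outliers_influence.variance_inflation_factor` computes a VIF per numeric column of the design matrix.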

5. Model Testing with All Features

  • Data loading and target engineering
  • Apply preprocessing pipeline
  • Create an RMSE scorer
  • Linear models (LinearRegression, Ridge)
  • KNeighborsRegressor, SVR
  • Trees (DecisionTreeRegressor)
  • Random Forest (RandomForestRegressor)
  • Boosted Trees (AdaBoostRegressor, GradientBoostingRegressor)
  • Stacking (VotingRegressor, StackingRegressor); see the sketch below
  • XGBRegressor
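
A hedged sketch of the stacking step, assuming the base learners have already been tuned and the target has been log-transformed (log1p) during target engineering; the exact estimators, hyperparameters, and meta-learner are illustrative.

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import Ridge
from xgboost import XGBRegressor

# Hypothetical base learners and meta-learner; the notebook tunes each model first.
stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge(alpha=10.0)),
        ("rf", RandomForestRegressor(n_estimators=300, random_state=42)),
        ("gbr", GradientBoostingRegressor(random_state=42)),
        ("xgb", XGBRegressor(n_estimators=500, learning_rate=0.05, random_state=42)),
    ],
    final_estimator=Ridge(),
    cv=5,
)

# full_model = Pipeline([("preprocess", preprocess), ("stack", stack)])
# full_model.fit(X_train, np.log1p(y_train))   # assumed log1p target engineering
```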

6. Kaggle Submission

  • Best local score (on a held-out test set) using StackingRegressor: 0.12119
  • Actual score evaluated on Kaggle (top 12% on the leaderboard): 0.12396
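
A minimal sketch of producing the submission file, assuming `full_model` is the fitted preprocessing + StackingRegressor pipeline from section 5 and that the log1p target transform was used.

```python
import numpy as np
import pandas as pd

test = pd.read_csv("test.csv")                      # Kaggle test set
preds = np.expm1(full_model.predict(test))          # invert the assumed log1p target transform
submission = pd.DataFrame({"Id": test["Id"], "SalePrice": preds})
submission.to_csv("submission.csv", index=False)
```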
