Determine which variables are significant in predicting the price of a house, and how well those variables describe the price.
Read Data
- Import important libraries
- Read housing data into dataframe
- Quick review of dataframe
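The reading and quick-review steps above might look roughly like the sketch below. It assumes pandas and, so that it runs on its own, reads a small inline CSV with made-up values instead of the real housing file; the variable name `housing` and the sample rows are illustrative only.

```python
import io

import pandas as pd

# Inline stand-in for the housing CSV (values are made up for illustration).
# In the notebook, a file path would be passed to read_csv instead.
csv_text = """Id,LotFrontage,FireplaceQu,SalePrice
1,60,,200000
2,80,TA,180000
3,,TA,220000
"""

housing = pd.read_csv(io.StringIO(csv_text))

# Quick review: dimensions, first rows, dtypes, and missing-value counts.
print(housing.shape)
print(housing.head())
housing.info()
print(housing.isnull().sum())
```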
Data Preparation
- Check missing values in dataframe
- Drop columns with more than 80% missing values
- Impute LotFrontage
- Impute FireplaceQu
- Impute Garage related fields
- Impute Basement related fields
- Impute missing categorical variables with mode
- Impute missing quantitative variables with median
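A minimal sketch of the preparation steps listed above, on a tiny synthetic frame. The outline does not say how each field is imputed, so the choices here are assumptions: a missing `FireplaceQu` or `GarageType` is taken to mean the feature is absent and is filled with the string "None", and `LotFrontage` simply falls under the general median rule.

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame standing in for the housing data (values are made up).
df = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
    "FireplaceQu": ["Gd", None, "TA", None],
    "GarageType": ["Attchd", None, "Detchd", "Attchd"],
    "MSZoning": ["RL", "RL", None, "RM"],
    "MasVnrArea": [100.0, np.nan, 0.0, 50.0],
    "PoolQC": [None, None, None, None],
})

# Drop columns with more than 80% missing values.
missing_share = df.isnull().mean()
df = df.drop(columns=missing_share[missing_share > 0.8].index)

# Assumption: a missing value in these columns means "no fireplace/garage".
for col in ["FireplaceQu", "GarageType"]:
    df[col] = df[col].fillna("None")

# Remaining gaps: median for quantitative columns, mode for categorical ones.
for col in df.columns:
    if df[col].isnull().any():
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode()[0])
```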
Feature Engineering
- Calculate age of house when sold
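The age feature can be sketched as a simple column difference. The column names `YearBuilt` and `YrSold` and the new name `HouseAge` are assumptions; the outline does not name them.

```python
import pandas as pd

# Made-up rows; YearBuilt and YrSold are assumed column names.
df = pd.DataFrame({
    "YearBuilt": [2003, 1976, 2001],
    "YrSold": [2008, 2007, 2008],
})

# Age of the house at the time of sale, in years.
df["HouseAge"] = df["YrSold"] - df["YearBuilt"]
```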
Data Analysis
- Check distribution of target variable
- Transform target variable (log transformation)
- Create list of numeric and non-numeric columns
- Analyze outliers from quantitative variables
- Remove outliers from numerical data
- Bar plots of quantitative variables vs SalePrice
- Analyze impact of categorical values on price of house
- Checking correlation of quantitative variables in housing
- Pairplots for numerical variables to understand linear relationship
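Two of the analysis steps above, the log transformation of the target and outlier removal, could be sketched as follows. The outline does not state which outlier rule is used; the 1.5 × IQR fence below is one common choice and is an assumption here, as are the column `GrLivArea` and the sample values.

```python
import numpy as np
import pandas as pd

# Made-up data with one obvious outlier in GrLivArea.
df = pd.DataFrame({
    "SalePrice": [100000, 150000, 200000, 250000, 300000, 350000],
    "GrLivArea": [1000, 1200, 1500, 1700, 1800, 9000],
})

# Log-transform the target to reduce right skew.
df["SalePrice"] = np.log(df["SalePrice"])

# Assumed rule: keep rows within 1.5 * IQR of the quartiles.
q1, q3 = df["GrLivArea"].quantile([0.25, 0.75])
iqr = q3 - q1
within_fence = df["GrLivArea"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[within_fence]
```

Predictions from a model trained on the log target are on the log scale, so `np.exp` would be needed to convert them back to prices.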
Data Preparation for Modeling
- Dummy variable encoding (one-hot) for other categorical variables
- Splitting the Data into Training and Testing Sets
- Create X and y sets
- Scaling the variables using StandardScaler (standardization)
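The modeling preparation above can be sketched with pandas and scikit-learn. The 70/30 split, `random_state`, `drop_first=True`, and the sample columns are assumptions for illustration. The scaler is fitted on the training set only and then applied to the test set, so no test-set statistics leak into training.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up data; SalePrice is assumed to be log-transformed already.
df = pd.DataFrame({
    "GrLivArea": [1000, 1200, 1500, 1700, 1800, 2000, 2200, 2400, 2600, 2800],
    "MSZoning": ["RL", "RM", "RL", "RL", "RM", "RL", "FV", "RL", "RM", "FV"],
    "SalePrice": [11.5, 11.7, 11.9, 12.0, 12.1, 12.2, 12.3, 12.4, 12.5, 12.6],
})

# One-hot encode the categorical column, dropping one level as the baseline.
df = pd.get_dummies(df, columns=["MSZoning"], drop_first=True, dtype=int)

# Split into training and testing sets (assumed 70/30).
df_train, df_test = train_test_split(df, train_size=0.7, random_state=42)

# Create X and y sets.
X_train = df_train.drop(columns="SalePrice")
y_train = df_train["SalePrice"]
X_test = df_test.drop(columns="SalePrice")
y_test = df_test["SalePrice"]

# Standardize: fit on the training set, then transform both sets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```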
Ridge Regression
- Tune hyperparameter using GridSearchCV
- Plotting scores to determine optimal alpha
- Build Ridge regression using best alpha
- Prediction using ridge regression
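A rough sketch of the ridge steps above, using synthetic data in place of the scaled housing features. The alpha grid, 5-fold cross-validation, and R² scoring are assumptions; the outline does not list the values actually searched.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled training data.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
true_coef = np.array([1.5, -2.0, 0.0, 0.5, 0.0])
y_train = X_train @ true_coef + rng.normal(scale=0.1, size=100)

# Tune alpha with cross-validated grid search (grid values are assumed).
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
ridge_cv = GridSearchCV(
    Ridge(), param_grid, scoring="r2", cv=5, return_train_score=True
)
ridge_cv.fit(X_train, y_train)

# cv_results_ holds mean_train_score and mean_test_score per alpha,
# which are the values one would plot to pick the optimal alpha.
best_alpha = ridge_cv.best_params_["alpha"]

# Build the final ridge model with the best alpha and predict.
ridge = Ridge(alpha=best_alpha).fit(X_train, y_train)
y_pred = ridge.predict(X_train)
```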
Lasso Regression
- Tune hyperparameter using GridSearchCV
- Plotting scores to determine optimal alpha
- Build Lasso regression model using best alpha
- Prediction using lasso regression
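The lasso steps above follow the same pattern as ridge. The sketch below, again on synthetic data with an assumed alpha grid, also counts the non-zero coefficients, since lasso's L1 penalty can shrink some coefficients exactly to zero and thereby select a subset of variables.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled training data.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
true_coef = np.array([1.5, -2.0, 0.0, 0.5, 0.0])
y_train = X_train @ true_coef + rng.normal(scale=0.1, size=100)

# Tune alpha with cross-validated grid search (grid values are assumed).
param_grid = {"alpha": [0.0001, 0.001, 0.01, 0.1, 1.0]}
lasso_cv = GridSearchCV(Lasso(max_iter=10000), param_grid, scoring="r2", cv=5)
lasso_cv.fit(X_train, y_train)

# best_estimator_ is the model refitted on all training data with the best alpha.
lasso = lasso_cv.best_estimator_
y_pred = lasso.predict(X_train)

# Number of variables the model keeps (non-zero coefficients).
n_selected = int(np.sum(lasso.coef_ != 0))
print(lasso_cv.best_params_, n_selected)
```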
Model Conclusion
- We will use lasso for final model prediction since:
  - Its scores are higher than ridge's and consistent between the training and test sets
  - The model is simpler than ridge (fewer variables)
- Final score of model:
  - Lasso regression train r2: 0.9281
  - Lasso regression test r2: 0.9122