This repository aims to predict housing prices in Boston by generating an optimal regression model.
The question source is the following link: https://docs.google.com/document/d/1_K2pjLJ15c4kRcyNdQ6kBvOjc-pcdBfP6v88JYtt17Y/pub?embedded=true
The objective is to develop an optimal model that can predict Boston housing prices given the training data. For this work, the following software and associated packages were used:
- Python 3.5
- Jupyter Notebook
- Numpy
- Scikit-Learn
- Matplotlib
The data is loaded through the load_boston() function of scikit-learn:

```python
from sklearn.datasets import load_boston

bostonData = load_boston()
```
The various statistical measures of the data are obtained using the relevant functions in NumPy:
- Number of houses: 506
- Number of features per house: 13
- Maximum price: $500,000.00
- Minimum price: $50,000.00
- Mean price: $225,328.06
- Standard deviation of prices: $91,880.12
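The statistics above can be sketched with plain NumPy. The `prices` array below is hypothetical sample data standing in for the (scaled) Boston target values, so the printed numbers are illustrative only:

```python
import numpy as np

# Hypothetical stand-in for the scaled Boston house prices;
# the real values would come from the loaded dataset's target array.
prices = np.array([210000.0, 185000.0, 500000.0, 50000.0, 225000.0])

num_houses = prices.shape[0]      # number of houses
max_price = np.max(prices)        # maximum price
min_price = np.min(prices)        # minimum price
mean_price = np.mean(prices)      # average price
std_price = np.std(prices)        # population standard deviation

print(num_houses, max_price, min_price, mean_price, std_price)
```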
Model complexity curves are generated by recording the training and test error as the complexity of the model is varied; the resulting curves are shown below. The parameter is varied from 2 up to the limit specified by the user, since decision trees require a minimum depth of 2.
As can be observed for the Decision Tree regressor, the model starts to overfit the data once the max_depth parameter is increased beyond 5: the test score begins to decline while the training score approaches an R^2 of 1. The second image shows the model complexity curve for the kNN regressor; in this case, when the number of neighbors n_neighbors is increased beyond 3, the training and test scores begin to decrease. For the AdaBoost regressor, the training and test scores settle and oscillate around ~0.9 once the number of base learners n_estimators reaches 10. Thus, it can be concluded that the best parameters for the different regressors are around:
- Decision Tree regressor: max_depth = 5
- kNN regressor: n_neighbors = 3
- AdaBoost regressor: n_estimators = 10
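A model complexity curve of this kind can be sketched with scikit-learn's validation_curve, shown here for the decision tree. Synthetic data from make_regression stands in for the Boston dataset (load_boston is deprecated in recent scikit-learn versions), so the scores are illustrative, not the ones plotted above:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the Boston data: 506 samples, 13 features.
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)

# Vary max_depth from 2 up to a user-chosen limit (here 10).
depths = np.arange(2, 11)
train_scores, test_scores = validation_curve(
    DecisionTreeRegressor(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5,
)

# Mean R^2 across the cross-validation folds for each depth;
# plotting these against `depths` gives the complexity curve.
print(train_scores.mean(axis=1))
print(test_scores.mean(axis=1))
```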
Learning curves are generated by fixing the best parameter deduced from the model complexity curves and varying the training set size while observing the training and test error. The generated learning curves are shown below:
The learning curves show that the training and test error come close to each other when the training set size is about ~250. However, the difference between the curves is smallest when the full training set is used with the best parameter.
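The procedure above can be sketched with scikit-learn's learning_curve, with the best parameter fixed (max_depth = 5 for the decision tree). As before, synthetic make_regression data is an assumed stand-in for the Boston dataset:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the Boston data.
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)

# Fix the best parameter and vary the training set size.
sizes, train_scores, test_scores = learning_curve(
    DecisionTreeRegressor(max_depth=5, random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)

# Mean R^2 per training set size; plotting these gives the learning curve.
for n, tr, te in zip(sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(n, tr, te)
```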
The task is to find the best-guess price for the given feature set: [11.95, 0.00, 18.100, 0, 0.6590, 5.6090, 90.00, 1.385, 24, 680.0, 20.20, 332.09, 12.13].
The best-guess price varies with the regressor used. For each regressor, GridSearchCV() is used to perform a grid search over an appropriate grid of parameters. The following results were obtained:
- Decision tree regressor: recommended selling price is $216,297.44
- kNN regressor: recommended selling price is $204,000.00
- AdaBoost regressor: recommended selling price is $203,452.99
Averaging the three recommendations, the price the seller can expect for the house is $207,916.81.
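The grid search step can be sketched as follows, shown for the decision tree only. Synthetic data is an assumed stand-in for the Boston dataset, so the fitted model and its prediction for the quoted feature vector are illustrative, not the prices reported above:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the Boston data.
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)

# Grid search over max_depth, scored by R^2 with 5-fold cross-validation.
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": list(range(2, 11))},
    cv=5, scoring="r2",
)
grid.fit(X, y)

# The feature vector quoted in the text above.
client = np.array([[11.95, 0.00, 18.100, 0, 0.6590, 5.6090, 90.00,
                    1.385, 24, 680.0, 20.20, 332.09, 12.13]])
print(grid.best_params_)
print(grid.predict(client))
```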