Skip to content

Regression by diviging data into bins and fitting different degree of polynomials on each bin.

Notifications You must be signed in to change notification settings

aps19/B-Spline-Regression

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 

Repository files navigation

B-Spline-Regression

To overcome the limitations of Linear Regression we can use an improved technique in which we can split the training data into multiple bins and fit each bin with a different model. This technique is known as the Regression spline.

Knots and Bins

The points where division occurs are called Knots and the intervals or pieces of data are called bins. The functions used to fit each bin are called Piecewise functions. These piecewise functions can be of linear of polynomial types, here we have used a cubic polyomial.

Piecewise Function

C0(X) = I(X < k0)
C1(X) = I(k0 < k1)
C2(X) = I(k1 < k2)
.
.
.
Cm(X) = I(km)
where k0, k1, k2......km are the cut points and I(condition) is the function that return 1 when condition=True.
Instead of fitting the same function over different bins of X. We fitted a cubic Piecewise Polynomial over 5 bins.

Basis Function

Idea behind using low degree polynomial is to avoid high oscillations of the curve. Instead of treating the functions that are applied to the bins as linear, it would be even more efficient to treat them as non-linear. To do this, a very general family of functions is applied to the variable. This family should nor be too flexible to over-fit neither be too rigid to under-fit. These families of functions are called basic functions.

Dataset

The Dataset used in this project is called Boston House Price Prediction(easily available on Kaggle). This dataset contains 506 Observations and 14 Variables.

  1. CRIM (per capital crime rate by town)
  2. ZN (proportion of residential land zoned for lots over 25,000 sq.ft.)
  3. INDUS (proportion of non-retail business acres per town)
  4. CHAS (Charles River dummy variable (= 1 if tract bounds river; 0 otherwise))
  5. NOX (nitric oxides concentration (parts per 10 million)
  6. RM (average number of rooms per dwelling)
  7. AGE (proportion of owner-occupied units built prior to 1940)
  8. DIS (weighted distances to five Boston employment centers)
  9. RAD (index of accessibility to radial highways)
  10. TAX (full-value property-tax rate per 10,000 USD)
  11. PTRATIO (pupil-teacher ratio by town)
  12. B (1000(Bk — 0.63)2 where Bk is the proportion of blacks by town)
  13. LSTAT (percent lower status of the population)
  14. MEDV (Housing Prices)

About

Regression by diviging data into bins and fitting different degree of polynomials on each bin.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published