Do You Need Regression Model Selection? The ncpen R Package.
(Note: In the previous post, we announced the release of the ncpen R package. This article explains it without going into too much statistical detail.)
Do you have a big dataset with many variables and need to make a prediction? No worries! Try the ncpen R package: throw all the data at ncpen, and it will choose the best model for you.
First things first: in statistics, a dataset with many variables (100, 1,000, 10,000 or more) is called high-dimensional data. Now, let's say you have residential mortgage loan data on 1 million borrowers with 200 variables. You want to predict each loan's probability of prepayment next year (paying off the loan earlier than the contract requires), so you want to find a regression model for prediction. Having many variables may sound like a good thing, but the potential issue with high-dimensional data is that it is difficult to decide which variables are important and which are not. With only a few variables, maybe 10, it may be relatively easy to find a good model: we could take a trial-and-error approach, adding and removing variables from the regression model. With 200 variables, however, trial and error is no longer feasible; think about all the possible combinations of variables. We may also want to consider interactions among the variables, and then the number of candidate variables grows exponentially.
Basically, ncpen solves this too-many-variables problem. With ncpen, you don't need to worry about which variables to include. The only thing you need to do is indicate which variable you want to predict (y) and which variables you will use for the prediction (X). In the mortgage example, y is "prepaid" (1 if prepaid and 0 otherwise), and the X variables are all the others (loan size, FICO, loan age, loan purpose, etc.). Of course, you can also add all the possible interactions among them to X. Then, finally, we have 200 variables including interactions. So this is the setup:
Prepaid or not = prediction with 200 variables.
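In R, this y/X setup with interactions takes only a couple of lines. Here is a minimal sketch using a tiny simulated stand-in for the mortgage data (the variable names and sizes are illustrative, not from the actual dataset in the post):

```r
# Hypothetical toy data mimicking the mortgage setup (names are
# illustrative only: the real data would have ~200 variables).
set.seed(1)
loans <- data.frame(
  prepaid   = rbinom(100, 1, 0.3),   # y: 1 if prepaid, 0 otherwise
  loan_size = rnorm(100, 200, 50),
  fico      = rnorm(100, 700, 40),
  loan_age  = rpois(100, 24)
)

# y is the outcome to predict; X is everything else, expanded with
# all pairwise interactions via R's formula interface. The "- 1"
# drops the intercept column from the design matrix.
y <- loans$prepaid
X <- model.matrix(~ (loan_size + fico + loan_age)^2 - 1, data = loans)

ncol(X)  # 3 main effects + 3 pairwise interactions = 6 columns
```

With 200 base variables the same `(...)^2` trick expands X the same way; you never list the combinations by hand.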
There is no specific model as in classical regression. Just throw all the data at ncpen, and it will suggest the best model by choosing only the important variables out of the 200. This is the main idea of ncpen. I was a software engineer (mostly programming in C++ and Java) before I pursued a career in real estate finance, and I like simplicity: users should not have to worry about what is going on under the hood. Simply provide a few inputs, press the button, and get the results. My two statistician co-authors came to me with this algorithm and wanted to turn it into software. They explained the algorithm; it sounded great but, at the same time, way too complicated. I told them, "Okay, this is a great idea, but no one is going to use it if it is this complicated. Let's make ncpen simple to use and simple to understand." That is our goal.
What does ncpen mean? Non-convex penalized estimation. Here we need to talk a little about the statistics. Penalized estimation is an algorithm for selecting only the important variables from high-dimensional data. If ncpen is non-convex penalized estimation, is there convex penalized estimation as well? Yes. The convex penalty has been more popular than the non-convex penalty (probably because the convex type is easier to implement). However, the convex penalty has some shortcomings. First, its estimated coefficients are biased, so interpreting the results can be misleading; the convex penalty cares only about prediction power. Also, the convex penalty picks more variables than necessary. Non-convex penalties resolve these problems. Non-convex penalized estimators are unbiased, so the results can be interpreted without that distortion, and at the same time they select only the necessary variables, producing a more parsimonious model than the convex penalty. Further, in our unofficial tests, ncpen's predictive power is as good as the convex penalty's (and in some cases even better). Our ncpen is a unified algorithm for non-convex penalties: it supports 7 non-convex penalty algorithms (and 1 convex penalty) as of now.
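For readers who want slightly more detail, here is a rough sketch of the idea in formulas (the notation is generic textbook notation, not taken from ncpen's documentation):

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\; L(\beta;\, y, X) \;+\; \sum_{j=1}^{p} p_{\lambda}\!\left(|\beta_j|\right)
```

Here \(L\) is a loss function (for the prepayment example, the negative log-likelihood of a logit model) and \(p_{\lambda}\) is the penalty. The lasso, the standard convex choice, uses \(p_{\lambda}(t) = \lambda t\), which keeps shrinking coefficients no matter how large they are; that continual shrinkage is the source of the bias mentioned above. Non-convex penalties such as SCAD or MCP also grow like \(\lambda t\) near zero but level off for large \(t\), so large coefficients are left essentially unpenalized, which reduces the bias and tends to select fewer variables.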
We posted the mortgage loan example above here. In the example, the data has 18 variables and 142,343 observations (reduced from the original data for demonstration purposes). After including all the interactions among the variables, the X set has 149 variables in total. If we use a classical logit regression, we of course need to use all 149 variables. We predict the prepayment probability and compare it with the actual outcome; a simple prediction error measure for the classical logit regression was 39.69%. Then we use ncpen. ncpen selects only 32 of the 149 variables as relevant, and the prediction error was 38.57%. So the error is slightly lower with ncpen (38.57% vs. 39.69%). One can still argue there is almost no difference! However, one big difference is the number of variables used for prediction: with ncpen, we need to carry around only 32 variables instead of all 149. If you need to run heavy simulations with this regression model, the number of variables makes a huge difference.
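A minimal sketch of that workflow, using simulated data in place of the mortgage dataset (the sizes and names are illustrative; the calls assume the `ncpen()` and `gic.ncpen()` interface from the package's CRAN documentation, so check `?ncpen` if your installed version differs):

```r
# install.packages("ncpen")  # once, from CRAN
library(ncpen)

# Simulated stand-in for the (reduced) mortgage data: a binary outcome
# and a design matrix with many columns, only a few of which matter.
set.seed(42)
n <- 1000; p <- 50
x.mat <- matrix(rnorm(n * p), n, p)
true.beta <- c(1.5, -2, 1, rep(0, p - 3))       # only 3 relevant variables
prob <- 1 / (1 + exp(-(x.mat %*% true.beta)))
y.vec <- rbinom(n, 1, prob)                     # 1 = prepaid, 0 = not

# Fit a logit model with the SCAD (non-convex) penalty over a
# whole path of tuning-parameter (lambda) values.
fit <- ncpen(y.vec = y.vec, x.mat = x.mat,
             family = "binomial", penalty = "scad")

# Pick one model on the path, e.g. by the generalized information
# criterion (GIC); the opt.beta component name is from the docs.
opt <- gic.ncpen(fit)
beta.hat <- opt$opt.beta

# Most coefficients should be exactly zero: ncpen has done the
# variable selection for us.
sum(beta.hat != 0)
```

The same three steps (build `x.mat`, call `ncpen()`, pick a model on the path) are all the mortgage example needs; cross-validation via `cv.ncpen()` is an alternative to GIC for the last step.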
In conclusion, ncpen automatically selects the best parsimonious regression model when a dataset has many variables. The ncpen package can be applied to any area: finance, marketing, bioengineering... Do you have high-dimensional data in your field? Please give it a try in R: install.packages("ncpen").