-
Notifications
You must be signed in to change notification settings - Fork 31
Usage
This Wiki page is for a deprecated (beta) version of L0Learn. For the latest CRAN version please visit this link
The Reference Manual contains a detailed description of all the functions available in L0Learn. In what follows, we give a brief tour of the main functions and settings.
The main function in L0Learn is the fit
function which has the following interface and default values:
L0Learn.fit(X,y, Loss="SquaredError", Penalty="L0", Algorithm="CD", MaxSuppSize=100, NLambda=100)
- X is the data matrix and y is the response vector
- Loss specifies the loss function. Currently, we only officially support "SquaredError". We will be adding support for other loss functions soon.
-
Penalty: The following three penalties are possible "L0", "L0L2", and "L0L1". Note: When using "L0L2" and "L0L1", the number of grid points for the parameter gamma should be specified using the parameter
Ngamma
. Moreover, for "L0L2", the maximum and minimum values of gamma should be specified usingGammaMax
andGammaMin
, respectively. -
Algorithm: The choice of the optimization algorithm can have significant effect on the quality of the solutions. Currently, we support the following two algorithms:
- CD: A Coordinate Descent-type algorithm with all the tweaks and heuristics discussed in our paper.
- CDPSI: This algorithm combines Coordinate Descent and Local Combinatorial Search to escape weak local minima. It typically leads to higher-quality solutions compared to CD, at the cost of additional running time. This is the CD-PSI(1) algorithm introduced in our paper.
- MaxSuppSize specifies the maximum support size in the regularization path after which the algorithm terminates. The toolkit's internals optimize the running time based on this parameter (this choice can affect the type of optimization algorithm used). We recommend experimenting with small values first (e.g., 5% of p) as L0-regularization typically selects a small portion of the features.
- NLambda is the number of Lambda grid points. Note: The actual values of Lambda are data-dependent and are computed automatically by the algorithm.
To demonstrate how L0Learn works we will generate the following synthetic dataset
- A 500x1000 design matrix X with iid standard normal entries
- A 1000x1 vector B with the first 10 entries set to 1 and the rest are zeros.
- A 500x1 vector e with iid standard normal entries
- Set y = XB + e
set.seed(1) # fix the seed to get a reproducible result
X = matrix(rnorm(500*1000),nrow=500,ncol=1000)
B = c(rep(1,10),rep(0,990))
e = rnorm(500)
y = X%*%B + e
We will use L0Learn to estimate B from the data (y,X). First we load L0Learn:
library(L0Learn)
L0Learn can handle the following three types of regularization:
We will start by fitting a simple L0 model and then proceed to the case of L0L2 and L0L1.
To fit a path of solutions for the L0-regularized model with maximal support size of 50 using Algorithm CD we use the command:
fit = L0Learn.fit(X, y, Loss="SquaredError", Penalty="L0", Algorithm="CD", MaxSuppSize=50)
This will generate solutions for a sequence of Lambda values (chosen automatically by the algorithm). To view the sequence of Lambda values along with the associated support sizes, we use:
print(fit)
and we get the following output:
lambda suppsize
1 0.068285500 1
2 0.055200200 2
3 0.049032300 3
4 0.040072500 6
5 0.038602800 7
6 0.037265300 8
7 0.032514200 10
8 0.001142920 11
9 0.000821221 13
10 0.000702287 14
11 0.000669519 15
12 0.000489943 17
13 0.000412565 22
14 0.000404252 24
15 0.000369975 27
16 0.000357211 31
17 0.000331164 40
18 0.000284271 42
19 0.000240881 50
The sequence of lambda values is a member of the fit object and can be accessed using fit$lambda
. For example, the lambda corresponding to the first solution in the path can be retrieved using fit$lambda[1]
.
To print the estimated B for a particular value of lambda, we use the function coef(fit,lambda)
which takes the object fit as the first parameter and the value of lambda
corresponding to the solution as the second parameter. Note that (in this example) the solution at index 7 has a support size of 10. We can retrieve this solution using the coef
function as follows:
coef(fit,lambda=fit$lambda[7])
to get the following output:
1001 x 1 sparse Matrix of class "dgCMatrix"
Intercept 0.01052402
V1 1.01601044
V2 1.01830944
V3 1.00606875
V4 0.98309180
V5 0.97389883
V6 0.96148076
V7 1.00990714
V8 1.08535507
V9 1.02686930
V10 0.94235619
V11 .
V12 .
V13 .
V14 .
V15 .
V16 .
V17 .
V18 .
V19 .
V20 .
.
.
.
The output is a sparse vector of type dgCMatrix
. The first element in the vector is intercept and the rest are the B coefficients. Aside from the intercept, the only non-zeros in the above solution are coordinates V1, V2, V3, ..., V10, which are the non-zero coordinates in the true support (used to generated the data). Thus, this solution successfully recovers the true support. We can also make predictions using a specific solution in the grid using the function predict(fit,newx,lambda)
where newx is a testing sample (vector or matrix). For example, to predict the response for the samples in the data matrix X using the solution at index 7 we call the prediction function
predict(fit,X,lambda=fit$lambda[7])
We have demonstrated the simple case of using an L0 penalty alone. We can also run L0Learn using the L0L2 and L0L1 penalties, which typically leads to better predictive models due to the shrinkage component. For example, to fit a model using the L0L2 penalty for a two-dimensional grid of lambda and gamma values, we call L0Learn.fit
with Penalty="L0L2"
as follows:
fit = L0Learn.fit(X, y, Loss="SquaredError", Penalty="L0L2", NGamma = 10, GammaMin = 0.0001, GammaMax = 10, Algorithm="CD", MaxSuppSize=50)
Note that in the above call, we set the number of gamma
points in the grid to 10 using the NGamma
parameter. We also specified the values for maximum and minimum values of gamma
using GammaMax
and GammaMin
. L0Learn will generate a grid of 10 gamma values equi-spaced on the logarithmic scale between GammaMax
and GammaMin
. Similar to the case of L0, we can print a summary of the regularization path using print(fit)
, which leads to the following output:
lambda gamma suppsize
1 0.003251690 1.000000e+01 1
2 0.002776000 1.000000e+01 2
3 0.002687010 1.000000e+01 3
4 0.002577800 1.000000e+01 3
5 0.002062240 1.000000e+01 6
6 0.001861720 1.000000e+01 7
7 0.001710400 1.000000e+01 8
8 0.001644570 1.000000e+01 9
9 0.001153900 1.000000e+01 10
10 0.000482462 1.000000e+01 10
11 0.000385970 1.000000e+01 12
12 0.000374391 1.000000e+01 14
13 0.000363159 1.000000e+01 15
14 0.000352264 1.000000e+01 15
15 0.000281811 1.000000e+01 19
16 0.000273357 1.000000e+01 19
17 0.000218686 1.000000e+01 29
18 0.000212125 1.000000e+01 31
19 0.000205761 1.000000e+01 35
20 0.000199589 1.000000e+01 38
21 0.000193601 1.000000e+01 38
22 0.000154881 1.000000e+01 66
23 0.010401300 2.782559e+00 1
24 0.008879650 2.782559e+00 2
25 0.008594990 2.782559e+00 2
26 0.006875990 2.782559e+00 6
27 0.005955120 2.782559e+00 7
.
.
.
The sequence of gamma values is a member of the fit object, and it can extracted using fit$gamma
. Moreover, in this case, the ith item in the list fit$lambda
is the sequence of lambda values generated for fit$gamma[i]
--- it can be accessed using fit$lambda[[i]]
. We can extract the solution(s) at a specific sequence of lambda(s) and gamma using the coef(fit,lambda,gamma)
function. For example, the solution with index 9 in the above output can be extracted using coef(fit,lambda=0.0011539, gamma=10)
or equivalently using coef(fit,lambda=fit$lambda[[1]][9], gamma=fit$gamma[1])
. Similarly, predictions at a specific solution can be made using predict(fit, newx, lambda, gamma)
. We note that the previous discussion on L0L2
also applies to L0L1
by just changing the Penalty parameter in L0Learn.fit
to L0L1
.
The L0Learn.fit
function in the previous examples was executed with the parameter Algorithm="CD"
, which indicates that coordinate descent (CD) should be used. A more elaborate algorithm based on combinatorial search can be used by setting the parameter Algorithm="CDPSI"
in the call to L0Learn.fit
. CDPSI typically leads to higher-quality solutions compared to CD, especially when the features are highly correlated. CDPSI is slower than CD, however, for typical applications it terminates in the order of seconds.
By default, L0Learn selects the sequence of lambda values in an efficient manner to avoid wasted computation (as close lambda values can typically lead to the same solution). Advanced users can change this default behavior and supply their own sequence of lambda values. This can be done by setting the AutoLambda
parameter to FALSE
and supplying the lambda values through the parameter LambdaGrid
. Specifically, the value assigned to LambdaGrid
should be a list with the same length as NGamma
in case of L0L2/L0L1 and of length 1 in case of L0. The ith element in LambdaGrid
should be a decreasing sequence of lambda values which are used by the algorithm for the ith value of gamma. For example, to fit an L0 model with the sequence of user-specified lambda values: 1, 1e-1, 1e-2, 1e-3, 1e-4, we run the following:
UserLambda = list()
UserLambda[[1]] = c(1, 1e-1, 1e-2, 1e-3, 1e-4)
fit = L0Learn.fit(X, y, Loss="SquaredError", Penalty="L0", Algorithm="CD", AutoLambda=FALSE, LambdaGrid=UserLambda)
Running print(fit)
we get
lambda suppsize
1 1e+00 0
2 1e-01 0
3 1e-02 10
4 1e-03 11
5 1e-04 129
Note that the lambda values above are the desired values. For L0L2 and L0L1 penalties, the same can be done where the LambdaGrid parameter should be assigned a list of size NGamma
as discussed previously (unlike this case where we used a list of size 1).
We will demonstrate how to use K-fold cross-validation (CV) to select the optimal values of the tuning parameters lambda
and gamma
. To perform CV, we call the L0Learn.cvfit
function, which takes the same parameters as L0Learn.fit
, in addition to the number of folds using the Nfolds
parameter and a seed value using the Seed
parameter (this is used when randomly shuffling the data before performing CV).
For example, to perform 5-fold CV using the L0L2
penalty (over a range of 5 gamma
values), we run:
cvfit = L0Learn.cvfit(X, y, Nfolds=5, Seed=1, Loss="SquaredError", Penalty="L0L2", NGamma=5, GammaMin=0.0001, GammaMax=0.1, Algorithm="CD", MaxSuppSize=50)
The cross-validation errors can be accessed using the cvmeans
attribute of cvfit
: cvfit$cvmeans
is a list where the ith element, cvfit$cvmeans[[i]]
, stores the cross-validation errors for the ith value of gamma (cvfit$gamma[i]
). To find the minimum cross-validation error for every gamma
, we call the min
function for every element in the list cvfit$cvmeans
, as follows
lapply(cvfit$cvmeans, min)
This leads to the following output:
[[1]]
[1] 1.13005
[[2]]
[1] 1.000699
[[3]]
[1] 0.9900166
[[4]]
[1] 0.9899542
[[5]]
[1] 0.9900055
The above output indicates that the 4th value of gamma achieves the lowest CV error (=0.9899542
). We can plot the CV errors against log(lambda)
for the 4th value of gamma, i.e., gamma = cvfit$gamma[4]
, using:
plot(cvfit, gamma= cvfit$gamma[4])
which generates:
To extract the optimal lambda
(i.e., the one with minimum CV error) in this plot:
GammaIndex = 4 # The index of the optimal gamma
OptimalIndex = which.min(cvfit$cvmeans[[GammaIndex]])
OptimalLambda = cvfit$lambda[[GammaIndex]][OptimalIndex]
To print the solution corresponding to the optimal gamma/lambda pair:
coef(cvfit, lambda=OptimalLambda, gamma=cvfit$gamma[4])
which leads to the following output:
1001 x 1 sparse Matrix of class "dgCMatrix"
Intercept 0.01044999
V1 1.01472609
V2 1.01711151
V3 1.00494628
V4 0.98208007
V5 0.97274702
V6 0.96057358
V7 1.00895158
V8 1.08392095
V9 1.02580618
V10 0.94108163
V11 .
V12 .
The optimal solution (above) selected by cross-validation correctly recovers the true vector of coefficients used to generated the model. Finally, we note that if the penalty is L0
(instead of L0L2
), then cvfit$cvmeans
will be a sequence cross-validation errors, where cvfit$cvmeans[i]
corresponds to cvfit$lamnbda[i]
(refer to the Reference Manual for more details).