`biglasso` extends lasso and elastic-net model fitting to ultrahigh-dimensional, multi-gigabyte data sets that cannot be loaded into memory. It uses memory-mapped files to store the data on disk and reads it into memory only when necessary during model fitting. In addition, several advanced feature screening rules are proposed and implemented to accelerate model fitting. As a result, this package is much more memory- and computation-efficient than existing lasso-fitting packages such as `glmnet` and `ncvreg`, making big data analysis feasible even on an ordinary laptop.
- It uses memory-mapped files to store the data on disk and reads it into memory only when necessary during model fitting, so it can handle data-larger-than-RAM cases.
- It builds upon the pathwise coordinate descent algorithm combined with "warm start", "active set cycling", and feature screening strategies, a combination that has proven to be one of the most powerful lasso solvers.
- It incorporates several efficient feature screening rules, such as the sequential strong rule (SSR), the sequential EDPP rule (SEDPP), and our newly proposed and more powerful rules, SSR-Dome and SSR-BEDPP.
- The underlying algorithm is implemented in C++ and is optimized to be memory- and computation-efficient. Parallel computing via OpenMP is also supported.
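
The features above can be illustrated with a minimal usage sketch. It assumes the `bigmemory` and `biglasso` packages are installed and that `biglasso()` accepts a `big.matrix` design matrix together with `family`, `screen`, and `ncores` arguments. The data, the coefficient vector, and the file names `X.bin` and `X.desc` are made up for illustration. Accepted values of `screen` may differ between package versions, so check the documentation of the installed release.

```r
# Sketch only: assumes the bigmemory and biglasso packages are installed.
library(bigmemory)
library(biglasso)

set.seed(1)
n <- 200
p <- 500                                  # small illustrative sizes
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1, 10), rep(0, p - 10))     # hypothetical sparse coefficients
y <- as.numeric(X %*% beta + 0.1 * rnorm(n))

# Store the design matrix in a memory-mapped (file-backed) big.matrix on disk.
# The file names are arbitrary and chosen for this example.
backing_dir <- tempdir()
X.bm <- as.big.matrix(X,
                      backingfile = "X.bin",
                      descriptorfile = "X.desc",
                      backingpath = backing_dir)

# Fit the lasso path with the sequential strong rule on a single core.
# A larger ncores value would request more OpenMP threads.
fit <- biglasso(X.bm, y, family = "gaussian", screen = "SSR", ncores = 1)

# Number of lambda values on the fitted path
print(length(fit$lambda))
```

For data that genuinely exceed RAM, the in-memory matrix `X` would not be created first; the file-backed matrix would instead be built directly from a file on disk.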
- Packages to be compared: `biglasso (1.2-3)` (with `screen = "SSR-BEDPP"`), `glmnet (2.0-5)`, `ncvreg (3.7-0)`, and `picasso (0.5-4)`.
- Platform: MacBook Pro with Intel Core i7 @ 2.3 GHz and 16 GB RAM.
- Experiments: solving lasso-penalized linear regression over the entire path of 100 $\lambda$ values equally spaced on the scale of `lambda / lambda_max` from 0.1 to 1; varying the number of observations `n` and the number of features `p`; 20 replications, with the mean (SE) computing time (in seconds) reported.
- Data generating model: `y = X * beta + 0.1 eps`, where `X` and `eps` are i.i.d. sampled from `N(0, 1)`.
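
The simulation setup above can be sketched in base R as follows. The sizes `n` and `p`, the seed, and the sparse `beta` are assumptions, since the setup does not state them. `lambda_max` is computed under the common convention of a centered response and columns standardized to unit mean square, with the loss scaled by `1 / (2n)`; the packages may use a different internal scaling.

```r
set.seed(2017)
n <- 100
p <- 1000                                 # illustrative sizes only
X <- matrix(rnorm(n * p), n, p)           # entries i.i.d. N(0, 1)
eps <- rnorm(n)                           # noise i.i.d. N(0, 1)
beta <- c(rep(1, 20), rep(0, p - 20))     # hypothetical: beta is not given in the text
y <- as.numeric(X %*% beta + 0.1 * eps)   # y = X * beta + 0.1 eps

# Standardize columns to mean 0 and mean square 1, and center y.
Xs <- scale(X) * sqrt(n / (n - 1))
yc <- y - mean(y)

# Smallest lambda for which all coefficients are zero (assumed convention).
lambda_max <- max(abs(crossprod(Xs, yc))) / n

# 100 values equally spaced on the lambda / lambda_max scale, from 1 down to 0.1.
ratio <- seq(1, 0.1, length.out = 100)
lambda <- ratio * lambda_max
```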
The performance of the packages is also tested using diverse real data sets:
- Breast cancer gene expression data (GENE);
- MNIST handwritten image data (MNIST);
- Cardiac fibrosis genome-wide association study data (GWAS);
- Subset of New York Times bag-of-words data (NYT).
The following table summarizes the mean (SE) computing time (in seconds) of solving the lasso along the entire path of 100 `lambda` values equally spaced on the scale of `lambda / lambda_max` from 0.1 to 1 over 20 replications.
Package | GENE | MNIST | GWAS | NYT |
---|---|---|---|---|
| | n=536 | n=784 | n=313 | n=5,000 |
| | p=17,322 | p=60,000 | p=660,495 | p=55,000 |
picasso | 1.50 (0.01) | 6.86 (0.06) | 34.00 (0.47) | 44.24 (0.46) |
ncvreg | 1.14 (0.02) | 5.60 (0.06) | 31.55 (0.18) | 32.78 (0.10) |
glmnet | 1.02 (0.01) | 5.63 (0.05) | 23.23 (0.19) | 33.38 (0.08) |
biglasso | 0.54 (0.01) | 1.48 (0.10) | 17.17 (0.11) | 14.35 (1.29) |
To demonstrate the out-of-core computing capability of `biglasso`, a 31 GB real data set from a large-scale genome-wide association study is analyzed. The dimensionality of the design matrix is `n = 2898, p = 1,339,511`. Note that the data set is nearly twice the size of the installed 16 GB of RAM.
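
As a quick check of the figures above, the size of the design matrix can be estimated in R. This sketch assumes dense storage with 8 bytes per entry (double precision) and decimal gigabytes; the actual on-disk format is not specified here.

```r
n <- 2898
p <- 1339511

bytes <- n * p * 8        # assumption: 8 bytes per double-precision entry
gb <- bytes / 1e9         # decimal gigabytes
ram_gb <- 16              # installed RAM on the benchmark machine

ratio <- gb / ram_gb      # data size relative to RAM
print(round(gb, 1))
print(round(ratio, 2))
```

Under these assumptions the estimate is about 31 GB, a little under twice the available RAM.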
Since the other three packages cannot handle this data-larger-than-RAM case, we compare the performance of the screening rules `SSR` and `SSR-BEDPP` within our package `biglasso`. Again, the entire solution path with 100 `lambda` values is obtained. The table below summarizes the overall computing time (in minutes) for the screening rule `SSR` (which is what the other three packages use) and our new rule `SSR-BEDPP`. (Only 1 trial is conducted.)
Rule | 1 core | 4 cores |
---|---|---|
SSR | 284.56 | 142.55 |
SSR-BEDPP | 189.21 | 93.74 |
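
The relative speedups implied by the table can be computed directly. The sketch below only rearranges the four reported timings; since each comes from a single trial, the ratios are indicative rather than precise.

```r
# Timings (minutes) copied from the table above; a single trial each.
time <- matrix(c(284.56, 142.55,
                 189.21,  93.74),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("SSR", "SSR-BEDPP"),
                               c("1 core", "4 cores")))

# Speedup of SSR-BEDPP over SSR at each core count
rule_speedup <- time["SSR", ] / time["SSR-BEDPP", ]

# Speedup from moving from 1 core to 4 cores, for each rule
core_speedup <- time[, "1 core"] / time[, "4 cores"]

print(round(rule_speedup, 2))
print(round(core_speedup, 2))
```

On these numbers, `SSR-BEDPP` is about 1.5 times faster than `SSR` at either core count, and four cores roughly halve the computing time for both rules.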
- The stable version: `install.packages("biglasso")`
- The latest version: `devtools::install_github("YaohuiZeng/biglasso")`
- Open an issue or send an email to Yaohui Zeng at yaohui-zeng@uiowa.edu.
- This package on GitHub has been updated to Version 1.2-4. See details in NEWS.
- The newest stable version will be submitted to CRAN soon after testing.