forked from pbreheny/biglasso

Big Lasso: Extending Lasso Model Fitting to Big Data in R

CY-dev/biglasso


Big Lasso: Extend Lasso Model Fitting to Big Data in R

biglasso extends lasso and elastic-net model fitting to ultrahigh-dimensional, multi-gigabyte data sets that cannot be loaded into memory. It uses memory-mapped files to store the data on disk and reads only the parts it needs into memory during model fitting. It also implements several feature screening rules, including newly proposed ones, that accelerate model fitting. As a result, the package is considerably more memory- and computation-efficient than existing lasso-fitting packages such as glmnet and ncvreg, which makes large-scale analyses feasible even on an ordinary laptop.
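
A minimal usage sketch of this workflow, assuming the bigmemory and biglasso packages are installed. The data are small and simulated purely for illustration, the backing-file names are hypothetical, and the default screening rule is used because the accepted `screen` values differ between package versions.

```r
# Sketch only: runs when both packages are available, otherwise does nothing.
if (requireNamespace("bigmemory", quietly = TRUE) &&
    requireNamespace("biglasso", quietly = TRUE)) {
  library(bigmemory)
  library(biglasso)

  set.seed(1)
  n <- 100; p <- 200                      # illustrative sizes
  X <- matrix(rnorm(n * p), n, p)
  y <- drop(X[, 1:5] %*% rep(1, 5) + 0.1 * rnorm(n))

  # Store the design matrix in a memory-mapped, file-backed big.matrix.
  # A fresh directory avoids clashing with an existing backing file.
  backing_dir <- tempfile("biglasso_demo_")
  dir.create(backing_dir)
  X.bm <- as.big.matrix(X,
                        backingfile = "X_demo.bin",      # hypothetical name
                        descriptorfile = "X_demo.desc",  # hypothetical name
                        backingpath = backing_dir)

  # Fit the lasso path; the screening rule is left at the package default.
  fit <- biglasso(X.bm, y, family = "gaussian")
  print(length(fit$lambda))               # number of lambda values on the path
}
```

For data that are genuinely larger than RAM, the file-backed matrix would be created once from a file on disk and re-attached in later sessions rather than converted from an in-memory matrix as done here.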

Features:

  1. It uses memory-mapped files to store the data on disk and reads them into memory only when needed during model fitting, so it can handle data sets larger than RAM.
  2. It builds on the pathwise coordinate descent algorithm combined with "warm start", "active set cycling", and feature screening strategies, a combination that has proven to be one of the most powerful lasso solvers.
  3. It incorporates efficient feature screening rules: the sequential strong rule (SSR), the sequential EDPP rule (SEDPP), and our newly proposed, more powerful rules SSR-Dome and SSR-BEDPP.
  4. The underlying algorithm is implemented in C++ and optimized for memory and computation efficiency. Parallel computing via OpenMP is also supported.
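
The ingredients named in items 2 and 3 can be sketched in plain R. The function below is an illustration, not the package's C++ implementation: `lasso_path_ssr` and `soft_threshold` are hypothetical names, the objective is assumed to be (1/(2n)) * ||y - X beta||^2 + lambda * ||beta||_1 with centered y and columns of X scaled to mean square 1, and only warm starts and the sequential strong rule (with a KKT check) are shown. Active set cycling and the other screening rules are omitted.

```r
# Soft-thresholding operator used by the lasso coordinate update.
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# Hypothetical helper: lasso path via coordinate descent with warm starts and
# sequential strong rule (SSR) screening. `lambda` must be decreasing.
lasso_path_ssr <- function(X, y, lambda, tol = 1e-8, max_iter = 1000) {
  n <- nrow(X); p <- ncol(X)
  beta <- numeric(p)                  # warm start: carried from one lambda to the next
  r <- y                              # current residual y - X %*% beta
  path <- matrix(0, p, length(lambda))
  lambda_prev <- max(abs(crossprod(X, y))) / n   # lambda_max
  for (k in seq_along(lambda)) {
    lam <- lambda[k]
    # SSR: keep feature j only if |x_j' r| / n >= 2 * lam - lambda_prev,
    # plus anything that is already nonzero.
    score <- abs(drop(crossprod(X, r))) / n
    keep <- which(score >= 2 * lam - lambda_prev | beta != 0)
    repeat {
      for (iter in seq_len(max_iter)) {
        max_change <- 0
        for (j in keep) {
          z <- sum(X[, j] * r) / n + beta[j]
          b_new <- soft_threshold(z, lam)
          delta <- b_new - beta[j]
          if (delta != 0) {
            r <- r - X[, j] * delta   # keep the residual in sync
            beta[j] <- b_new
            max_change <- max(max_change, abs(delta))
          }
        }
        if (max_change < tol) break
      }
      # SSR is not a safe rule, so check the KKT conditions on discarded features
      # and re-solve if any of them should have been kept.
      score <- abs(drop(crossprod(X, r))) / n
      violators <- setdiff(which(score > lam), keep)
      if (length(violators) == 0) break
      keep <- union(keep, violators)
    }
    path[, k] <- beta
    lambda_prev <- lam
  }
  path
}

# Small simulated example (sizes and coefficients are arbitrary).
set.seed(1)
n <- 100; p <- 40
X <- scale(matrix(rnorm(n * p), n, p)) * sqrt(n / (n - 1))
y <- drop(X[, 1:3] %*% c(2, -1, 0.5) + 0.1 * rnorm(n))
y <- y - mean(y)
lambda_max <- max(abs(crossprod(X, y))) / n
lambda <- lambda_max * seq(1, 0.1, length.out = 100)
path <- lasso_path_ssr(X, y, lambda)
print(colSums(path != 0))   # number of selected features at each lambda
```

Screening pays off because the inner loop only cycles over `keep`, which is typically far smaller than p when p is large; the KKT check restores correctness in the rare cases where the strong rule discards a feature it should not have.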

Benchmarks:

Simulated data:

  • Packages to be compared: biglasso (1.2-3) (with screen = "SSR-BEDPP"), glmnet (2.0-5), ncvreg (3.7-0), and picasso (0.5-4).
  • Platform: MacBook Pro with Intel Core i7 @ 2.3 GHz with 16 GB RAM.
  • Experiments: solving lasso-penalized linear regression over the entire path of 100 lambda values equally spaced on the scale of lambda / lambda_max from 0.1 to 1, for varying numbers of observations n and features p. The mean (SE) computing time in seconds over 20 replications is reported.
  • Data generating model: y = X * beta + 0.1 * eps, where the entries of X and eps are sampled i.i.d. from N(0, 1).
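
The setup above can be reproduced on a small scale as follows. This is a sketch under stated assumptions: the sizes are illustrative and much smaller than in the benchmark, the number and values of the nonzero coefficients (`n_true`, `beta`) are hypothetical because they are not specified here, and lambda_max is computed for centered y and standardized X, which is assumed to match how the packages scale the problem.

```r
set.seed(2024)
n <- 200; p <- 1000            # illustrative sizes only
n_true <- 20                   # hypothetical number of nonzero coefficients

# Data generating model: y = X * beta + 0.1 * eps, entries of X and eps ~ N(0, 1).
X <- matrix(rnorm(n * p), n, p)
eps <- rnorm(n)
beta <- numeric(p)
beta[seq_len(n_true)] <- runif(n_true, -1, 1)   # hypothetical coefficient values
y <- drop(X %*% beta + 0.1 * eps)

# lambda_max: the smallest lambda at which every coefficient is zero, for the
# objective (1/(2n)) * ||y - X b||^2 + lambda * ||b||_1 (assumed scaling).
Xs <- scale(X) * sqrt(n / (n - 1))
yc <- y - mean(y)
lambda_max <- max(abs(crossprod(Xs, yc))) / n

# 100 values equally spaced on the scale of lambda / lambda_max, from 1 down to 0.1.
ratio <- seq(1, 0.1, length.out = 100)
lambda <- ratio * lambda_max
```

The grid runs from large to small lambda so that each fit can be warm-started from the previous, sparser solution.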

Real data:

The performance of the packages is also tested on four real data sets: GENE, MNIST, GWAS, and NYT.

The following table summarizes the mean (SE) computing time of solving the lasso along the entire path of 100 lambda values equally spaced on the scale of lambda / lambda_max from 0.1 to 1 over 20 replications.

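| Package  | GENE (n=536, p=17,322) | MNIST (n=784, p=60,000) | GWAS (n=313, p=660,495) | NYT (n=5,000, p=55,000) |
|----------|------------------------|-------------------------|-------------------------|-------------------------|
| picasso  | 1.50 (0.01)            | 6.86 (0.06)             | 34.00 (0.47)            | 44.24 (0.46)            |
| ncvreg   | 1.14 (0.02)            | 5.60 (0.06)             | 31.55 (0.18)            | 32.78 (0.10)            |
| glmnet   | 1.02 (0.01)            | 5.63 (0.05)             | 23.23 (0.19)            | 33.38 (0.08)            |
| biglasso | 0.54 (0.01)            | 1.48 (0.10)             | 17.17 (0.11)            | 14.35 (1.29)            |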
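
To read the table above as relative speedups, the mean times can be divided by the biglasso row. The snippet only re-uses the means reported in the table; standard errors are ignored, so the ratios are a rough summary rather than a statistical comparison.

```r
# Mean computing times copied from the table above.
times <- rbind(
  picasso  = c(1.50, 6.86, 34.00, 44.24),
  ncvreg   = c(1.14, 5.60, 31.55, 32.78),
  glmnet   = c(1.02, 5.63, 23.23, 33.38),
  biglasso = c(0.54, 1.48, 17.17, 14.35)
)
colnames(times) <- c("GENE", "MNIST", "GWAS", "NYT")

# Time of each competing package divided by the biglasso time on the same data set.
others <- c("picasso", "ncvreg", "glmnet")
speedup <- sweep(times[others, ], 2, times["biglasso", ], "/")
print(round(speedup, 2))
```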

Big data:

To demonstrate the out-of-core computing capability of biglasso, a 31 GB real data set from a large-scale genome-wide association study is analyzed. The design matrix has n = 2,898 observations and p = 1,339,511 features. Note that the data set is nearly twice the size of the installed 16 GB of RAM.
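
A back-of-the-envelope check of the stated size, assuming the design matrix is stored as 8-byte doubles (an assumption; the storage type is not stated here) and using decimal gigabytes:

```r
n <- 2898
p <- 1339511
bytes_per_value <- 8                 # assumption: double-precision storage

size_gb <- n * p * bytes_per_value / 1e9
ram_gb <- 16

print(round(size_gb, 2))             # size of the design matrix in GB
print(round(size_gb / ram_gb, 2))    # ratio of data size to installed RAM
```

Under that assumption the matrix comes to roughly 31 GB, a little under twice the 16 GB of RAM, which matches the figures quoted above.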

Since the other three packages cannot handle this larger-than-RAM case, we compare the screening rules SSR and SSR-BEDPP within biglasso. Again, the entire solution path with 100 lambda values is computed. The table below summarizes the overall computing time (in minutes) for SSR (the rule used by the other three packages) and for our new rule SSR-BEDPP. Only one trial was conducted.

| Rule      | 1 core | 4 cores |
|-----------|--------|---------|
| SSR       | 284.56 | 142.55  |
| SSR-BEDPP | 189.21 | 93.74   |
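
The table above separates two effects, the screening rule and the number of cores, which can be quantified directly from the reported times. Because only a single trial was run, these ratios should be read as indicative only.

```r
# Computing times in minutes, copied from the table above.
minutes <- matrix(c(284.56, 142.55,
                    189.21,  93.74),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("SSR", "SSR-BEDPP"), c("cores1", "cores4")))

# Gain from the screening rule at a fixed number of cores.
rule_speedup <- minutes["SSR", ] / minutes["SSR-BEDPP", ]

# Gain from going from 1 to 4 cores for a fixed rule, and the implied
# parallel efficiency (speedup divided by the number of cores).
core_speedup <- minutes[, "cores1"] / minutes[, "cores4"]
efficiency <- core_speedup / 4

print(round(rule_speedup, 2))
print(round(core_speedup, 2))
print(round(efficiency, 2))
```

In this run SSR-BEDPP is roughly 1.5 times faster than SSR at either core count, and four cores roughly halve the computing time for both rules.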

Installation:

  • The stable version: install.packages("biglasso")
  • The latest version: devtools::install_github("YaohuiZeng/biglasso")

Report bugs:

News:

  • This package on GitHub has been updated to Version 1.2-4. See details in NEWS.
  • The newest stable version will be submitted to CRAN soon after testing.
