some cool R functions that I either wrote, stole, or modified
rfunctions is not available on CRAN, but it can be installed from GitHub using the devtools package:
devtools::install_github("jaredhuling/rfunctions")
library(rfunctions)
A project I've been working on requires fast evaluation of X'X for a design matrix X. I found a great example for just such a thing in the RcppEigen paper by Douglas Bates and Dirk Eddelbuettel. RcppEigen provides a simple and effective interface between R and the blazing-fast Eigen C++ library for numerical linear algebra. Their example uses inline, a nice tool for writing inline C++ code in R, and I made a proper R function from it. The following showcases the speed of Eigen. Note that since X'X is symmetric, we only need to compute half of the entries, which further reduces computation time.
n.obs <- 10000
n.vars <- 100
x <- matrix(rnorm(n.obs * n.vars), n.obs, n.vars)
library(microbenchmark)
microbenchmark(crossprodcpp(x), crossprod(x), times = 25L)
## Unit: milliseconds
##            expr   min    lq median    uq   max neval
## crossprodcpp(x) 11.88 13.09  15.33 16.82 19.34    25
##    crossprod(x) 51.32 56.07  60.98 62.41 72.67    25
all.equal(crossprodcpp(x), crossprod(x))
## [1] TRUE
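For the curious, the heart of crossprodcpp is essentially the symmetric rank-update idiom from the Bates and Eddelbuettel paper: accumulate only the lower triangle of X'X and let Eigen mirror it. A minimal sketch of that kernel, compiled on the fly with Rcpp::cppFunction (a simplified illustration, not the package's exact code):
library(Rcpp)
cppFunction(depends = "RcppEigen", code = '
Eigen::MatrixXd crossprod_sketch(const Eigen::Map<Eigen::MatrixXd> X) {
    const int p(X.cols());
    // rank update fills the lower triangle; constructing the MatrixXd
    // from the self-adjoint view copies it to a full symmetric matrix
    Eigen::MatrixXd XtX(Eigen::MatrixXd(p, p).setZero()
        .selfadjointView<Eigen::Lower>().rankUpdate(X.adjoint()));
    return XtX;
}')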
crossprodcpp can also compute a weighted cross product X'WX, where W is a diagonal weight matrix:
ru <- runif(n.obs)
weights <- ru * (1 - ru)
microbenchmark(crossprodcpp(x, weights), crossprod(x, weights * x), times = 25L)
## Unit: milliseconds
##                      expr    min     lq median    uq    max neval
##  crossprodcpp(x, weights)  15.93  16.34  16.68  17.2  24.61    25
## crossprod(x, weights * x) 105.36 106.59 110.08 113.9 134.62    25
all.equal(crossprodcpp(x, weights), crossprod(x, weights * x))
## [1] TRUE
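The trick in the R comparison above: multiplying x row-wise by the weights is the same as applying the diagonal matrix W, without ever forming the n-by-n matrix diag(weights). A quick sanity check of that identity (the small objects here are purely illustrative):
w.small <- runif(5)
x.small <- matrix(rnorm(15), 5, 3)
all.equal(t(x.small) %*% diag(w.small) %*% x.small,
          crossprod(x.small, w.small * x.small))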
The Lanczos algorithm is a well-known method for fast computation of extremal eigenvalues. The Golub-Kahan-Lanczos (GKL) bidiagonalization algorithm extends it to approximate the largest singular values of a matrix X from below. The function gklBidiag uses this to approximate the largest singular value of a matrix, and since GKL bidiagonalization is initialized from a random vector, we can also compute a probabilistic upper bound for the singular value. The following compares the speed of gklBidiag with the implementation in the popular Fortran library PROPACK, found in the svd package:
library(svd)
v <- runif(ncol(x))  # initial vector for GKL bidiagonalization
opts <- list(kmax = 30L)
microbenchmark(gklBidiag(x, v, maxit = 30L), propack.svd(x, neig = 1L, opts = opts))
## Unit: milliseconds
##                                    expr    min     lq median     uq    max neval
##            gklBidiag(x, v, maxit = 30L)  31.61  32.08  32.43  35.12  67.78   100
## propack.svd(x, neig = 1L, opts = opts) 230.24 234.09 237.05 242.29 329.95   100
gklBidiag(x, v, maxit = 30L)$d - propack.svd(x, neig = 1L, opts = opts)$d
## [1] -2.52e-11
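To give a flavor of what gklBidiag is doing, here is a bare-bones pure-R version of GKL bidiagonalization (a sketch without reorthogonalization, not the package's compiled implementation). After k steps it produces a small upper-bidiagonal matrix B whose largest singular value approximates that of X from below:
gklSketch <- function(X, v, k) {
    v <- v / sqrt(sum(v^2))
    alpha <- beta <- numeric(k)
    u <- X %*% v
    alpha[1] <- sqrt(sum(u^2)); u <- u / alpha[1]
    for (j in seq_len(k - 1)) {
        w <- crossprod(X, u) - alpha[j] * v
        beta[j] <- sqrt(sum(w^2)); v <- w / beta[j]
        u <- X %*% v - beta[j] * u
        alpha[j + 1] <- sqrt(sum(u^2)); u <- u / alpha[j + 1]
    }
    # B has the alphas on its diagonal and the betas on its superdiagonal
    B <- diag(alpha)
    B[cbind(seq_len(k - 1), seq_len(k - 1) + 1)] <- beta[seq_len(k - 1)]
    svd(B, nu = 0, nv = 0)$d[1]
}
Calling gklSketch(x, v, 30L) should land close to the gklBidiag value above.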
As gklBidiag also works on sparse matrices (of the sparseMatrix class from the Matrix package), I can showcase another function in rfunctions, simSparseMatrix, which unsurprisingly simulates matrices with very few nonzero values. The nonzero values can either be all 1's or generated from a normal distribution, and the level of sparsity of the simulated matrix can be specified.
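The idea behind such a simulator is straightforward: sample nonzero positions uniformly at random, then fill them with 1's or normal draws. A hypothetical sketch using Matrix::sparseMatrix (my illustration, not the package's actual implementation):
library(Matrix)
simSparseSketch <- function(sparsity, dim, boolean = FALSE) {
    nnz <- round((1 - sparsity) * prod(dim))
    idx <- sample(prod(dim), nnz)  # linear indices of the nonzero entries
    sparseMatrix(i = (idx - 1) %% dim[1] + 1,
                 j = (idx - 1) %/% dim[1] + 1,
                 x = if (boolean) rep(1, nnz) else rnorm(nnz),
                 dims = dim)
}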
n.obs <- 1e+05
n.vars <- 1000
# simulate two very sparse matrices: one whose nonzero entries are all 1's
# and one whose nonzero entries are normally distributed
x.s.b <- simSparseMatrix(sparsity = 0.99, dim = c(n.obs, n.vars), boolean = TRUE)
x.s.c <- simSparseMatrix(sparsity = 0.99, dim = c(n.obs, n.vars), boolean = FALSE)
v <- runif(n.vars)
# reorthogonalization (the final argument, here 0L) sometimes leads to higher
# accuracy, as it helps correct for floating-point errors
microbenchmark(gklBidiag(x.s.b, v, maxit = 10L, 0L),
               gklBidiag(x.s.c, v, maxit = 10L, 0L))
## Unit: milliseconds
##                                 expr   min    lq median     uq   max neval
## gklBidiag(x.s.b, v, maxit = 10L, 0L) 82.95 84.54  85.59  92.48 246.8   100
## gklBidiag(x.s.c, v, maxit = 10L, 0L) 83.21 85.00  87.00 101.99 243.1   100
gklBidiag(x.s.b, v, maxit = 10L, 0L)$d
## [1] 104.9
gklBidiag(x.s.c, v, maxit = 10L, 0L)$d
## [1] 35.29
The speed of MASS::ginv() leaves much to be desired, as it calls svd() in order to compute the Moore-Penrose generalized inverse of a matrix. I came across this paper, which provides a faster algorithm for the M-P inverse, and I've implemented it using RcppEigen. This is useful for least-squares problems when the matrix is less than full rank, as in the example below.
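For reference, the SVD route that ginv takes boils down to inverting the singular values that are not numerically zero; a bare-bones sketch of that approach (my simplification, not MASS's exact code):
mpinv_svd <- function(A, tol = sqrt(.Machine$double.eps)) {
    s <- svd(A)
    keep <- s$d > tol * s$d[1]  # drop singular values that are numerically zero
    s$v[, keep, drop = FALSE] %*% (t(s$u[, keep, drop = FALSE]) / s$d[keep])
}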
n <- 1000
p <- 500
x <- matrix(rnorm(n * (p - 1)), n, p - 1)
x <- cbind(x, rowMeans(x))
## compute X'X
xpx <- crossprodcpp(x)
library(MASS)
## compute generalized inverse of X'X
microbenchmark(geninv(xpx), ginv(xpx))
## Unit: milliseconds
##        expr   min    lq median    uq   max neval
## geninv(xpx) 185.9 190.4  192.1 199.6 256.2   100
##   ginv(xpx) 498.7 511.6  522.2 534.7 711.0   100
inv <- geninv(xpx)
inv2 <- ginv(xpx)
## check that we have a generalized inverse (first Penrose condition)
all.equal(xpx, xpx %*% inv %*% xpx)
## [1] TRUE
all.equal(inv, inv2)
## [1] TRUE
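Strictly speaking, the first check above only certifies a generalized inverse; the Moore-Penrose inverse is the unique matrix satisfying all four Penrose conditions, and the remaining three can be verified the same way:
all.equal(inv, inv %*% xpx %*% inv)     # G (X'X) G = G
all.equal(xpx %*% inv, t(xpx %*% inv))  # (X'X) G is symmetric
all.equal(inv %*% xpx, t(inv %*% xpx))  # G (X'X) is symmetric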
This may seem pointless, but I wrote functions to add and subtract matrices. It turns out my functions are faster than the + and - operators. I'm sure someone will be quick to point out why using my add() and subtract() functions is silly and a bad idea.
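My guess at why a dedicated function can win here: it hands the dgCMatrix objects straight to Eigen's sparse arithmetic, skipping the Matrix package's method dispatch. A sketch of that kind of kernel (an illustration of the idea, not necessarily how add() is actually implemented):
library(Rcpp)
cppFunction(depends = "RcppEigen", code = '
Eigen::SparseMatrix<double> add_sketch(const Eigen::MappedSparseMatrix<double> A,
                                       const Eigen::MappedSparseMatrix<double> B) {
    return A + B;  // Eigen evaluates the sparse sum directly
}')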
n.obs <- 1e+05
n.vars <- 500
A <- simSparseMatrix(sparsity = 0.99, dim = c(n.obs, n.vars), boolean = FALSE)
B <- simSparseMatrix(sparsity = 0.99, dim = c(n.obs, n.vars), boolean = FALSE)
microbenchmark(add(A, B), A + B, times = 25L)
## Unit: milliseconds
##      expr   min     lq median     uq   max neval
## add(A, B) 37.59  40.08  40.93  44.32 161.2    25
##     A + B 92.69 100.69 105.72 110.32 218.2    25
microbenchmark(subtract(A, B), A - B, times = 25L)
## Unit: milliseconds
##           expr   min    lq median    uq   max neval
## subtract(A, B) 37.42 39.22  40.22  45.8 155.2    25
##          A - B 94.09 97.81 108.23 112.1 120.9    25
all.equal(add(A, B), A + B)
## [1] TRUE
all.equal(subtract(A, B), A - B)
## [1] TRUE
The add() and subtract() methods for dense matrices are slower than the corresponding operators, so they're only worth using when you have sparse matrices.
n.obs <- 1000
n.vars <- 1000
A <- matrix(rnorm(n.obs * n.vars), n.obs, n.vars)
B <- matrix(rnorm(n.obs * n.vars), n.obs, n.vars)
microbenchmark(add(A, B), A + B)
## Unit: milliseconds
##      expr   min    lq median    uq   max neval
## add(A, B) 4.801 6.286  6.432 7.201 24.76   100
##     A + B 1.677 3.018  3.207 3.472 19.32   100
microbenchmark(subtract(A, B), A - B)
## Unit: milliseconds
##           expr   min    lq median    uq   max neval
## subtract(A, B) 4.838 6.260  6.347 6.512 19.38   100
##          A - B 1.845 2.996  3.054 3.198 20.56   100
all.equal(add(A, B), A + B)
## [1] TRUE
all.equal(subtract(A, B), A - B)
## [1] TRUE