Experiences from trying to get GLMPCA to run with the book #7

Open · LTLA opened this issue Jun 1, 2020 · 11 comments
@LTLA commented Jun 1, 2020

Consider the following stretch of code:

library(scRNAseq)
sce <- ZeiselBrainData()           # mouse brain dataset

library(scuttle)
sce <- logNormCounts(sce)          # size-factor normalization + log-transform

library(scran)
dec <- modelGeneVar(sce)
hvgs <- getTopHVGs(dec, n=1000)    # top 1000 highly variable genes

library(scry)
out <- GLMPCA(sce[hvgs,], L=50, sz=sizeFactors(sce))

The most obvious issue is that I can't get it to work; GLMPCA conks out with:

Error in glmpca(Y = assay(object, assay), L = L, fam = fam, ctl = ctl,  : 
  Numerical divergence (deviance no longer finite), try increasing the penalty to improve stability of optimization.

Increasing the penalty to 100 causes it to run for a while; it's been at least 10 minutes so far. By comparison, running the usual approximate PCA on the equivalent log-expression matrix would take 1 second, maybe 2. I know it's doing more computational work than a vanilla PCA, but this seems even slower than zinbwave. It's only a 3k x 1k matrix here; anything above 30 seconds would be too much.
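For reference, this is the call after increasing the penalty (I'm assuming here that scry forwards penalty= through to glmpca(), as the error message suggests):

out <- GLMPCA(sce[hvgs,], L=50, sz=sizeFactors(sce), penalty=100)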

There are also some other quality-of-life issues that you might consider addressing:

Session information
R version 4.0.0 Patched (2020-05-01 r78341)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS:   /home/luna/Software/R/R-4-0-branch-dev/lib/libRblas.so
LAPACK: /home/luna/Software/R/R-4-0-branch-dev/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] scRNAseq_2.3.2              scuttle_0.99.8             
 [3] scran_1.17.0                SingleCellExperiment_1.11.2
 [5] SummarizedExperiment_1.19.4 DelayedArray_0.15.1        
 [7] matrixStats_0.56.0          Biobase_2.49.0             
 [9] GenomicRanges_1.41.1        GenomeInfoDb_1.25.0        
[11] IRanges_2.23.5              S4Vectors_0.27.7           
[13] BiocGenerics_0.35.2         scry_1.1.0                 

loaded via a namespace (and not attached):
 [1] httr_1.4.1                    edgeR_3.31.0                 
 [3] BiocSingular_1.5.0            bit64_0.9-7                  
 [5] AnnotationHub_2.21.0          DelayedMatrixStats_1.11.0    
 [7] shiny_1.4.0.2                 assertthat_0.2.1             
 [9] interactiveDisplayBase_1.27.2 statmod_1.4.34               
[11] BiocManager_1.30.10           BiocFileCache_1.13.0         
[13] dqrng_0.2.1                   blob_1.2.1                   
[15] GenomeInfoDbData_1.2.3        yaml_2.2.1                   
[17] BiocVersion_3.12.0            pillar_1.4.4                 
[19] RSQLite_2.2.0                 lattice_0.20-41              
[21] glue_1.4.1                    limma_3.45.0                 
[23] digest_0.6.25                 promises_1.1.0               
[25] XVector_0.29.1                htmltools_0.4.0              
[27] httpuv_1.5.2                  Matrix_1.2-18                
[29] pkgconfig_2.0.3               zlibbioc_1.35.0              
[31] purrr_0.3.4                   xtable_1.8-4                 
[33] glmpca_0.1.0                  later_1.0.0                  
[35] BiocParallel_1.23.0           tibble_3.0.1                 
[37] ellipsis_0.3.1                magrittr_1.5                 
[39] crayon_1.3.4                  mime_0.9                     
[41] memoise_1.1.0                 tools_4.0.0                  
[43] lifecycle_0.2.0               locfit_1.5-9.4               
[45] irlba_2.3.3                   AnnotationDbi_1.51.0         
[47] compiler_4.0.0                rsvd_1.0.3                   
[49] rlang_0.4.6                   grid_4.0.0                   
[51] RCurl_1.98-1.2                BiocNeighbors_1.7.0          
[53] rappdirs_0.3.1                igraph_1.2.5                 
[55] bitops_1.0-6                  ExperimentHub_1.15.0         
[57] DBI_1.1.0                     curl_4.3                     
[59] R6_2.4.1                      dplyr_0.8.5                  
[61] fastmap_1.0.1                 bit_1.1-15.2                 
[63] Rcpp_1.0.4.6                  vctrs_0.3.0                  
[65] dbplyr_1.4.3                  tidyselect_1.1.0             
@willtownes (Collaborator)

Hi Aaron, thanks for the feedback. The numerical instability for large L, the tuning of the penalty, and the computational speed are all known issues with glmpca that we are actively working on. I am pretty close to having something ready to release that scales better for larger L values, and I am now testing whether it can handle large numbers of cells as well. I will post something here when I release it to CRAN, and any improvements should pass through to the scry interface automatically. Also, I agree with all of your other suggestions. Sorry for the frustrations!

@willtownes (Collaborator)

By the way, we would recommend using the deviance for gene filtering rather than HVGs.

@LTLA (Author) commented Jun 1, 2020

On what basis?

@willtownes (Collaborator)

It's more consistent with the multinomial model behind GLM-PCA. It also gave better empirical results, e.g. for clustering: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1861-6/figures/5. See also the more comprehensive comparison by Germain et al., where they recommend deviance for filtering: https://www.biorxiv.org/content/10.1101/2020.02.02.930578v3

However, this is totally optional and doesn't detract from your main points: you can use any filtering you choose as input to GLM-PCA. It's a separate issue from the speed and numerical problems.
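For example, a minimal sketch with scry's devianceFeatureSelection() (the cutoff of 1000 genes is only illustrative, to mirror your getTopHVGs() call):

library(scry)
# Rank genes by deviance from a constant-expression null model;
# sorted=TRUE reorders the SCE by decreasing deviance.
sce <- devianceFeatureSelection(sce, assay="counts", sorted=TRUE)
top.dev <- head(rownames(sce), 1000)
out <- GLMPCA(sce[top.dev,], L=50, sz=sizeFactors(sce))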

@LTLA (Author) commented Jun 1, 2020

Seems like it would be tricky to choose an appropriate dispersion for the deviance. If you used a Poisson GLM, you'd just end up with a strong mean-dependent trend if the true dispersion were a constant non-zero value across all genes. Unless you're doing your own dispersion estimation; but then how would you do that before you've accounted for the structure in the GLM-PCA?
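To illustrate the concern with a toy simulation of my own (not from any package): if every gene has the same non-zero NB dispersion, the per-gene Poisson deviance from an intercept-only fit largely tracks the mean.

# Constant NB dispersion across genes, intercept-only Poisson deviance:
set.seed(1)
ngenes <- 2000; ncells <- 100
mu <- rexp(ngenes)                                  # true gene means
y <- matrix(rnbinom(ngenes*ncells, mu=mu, size=2), nrow=ngenes)
m <- rowMeans(y)                                    # fitted Poisson mean per gene
ll <- ifelse(y == 0, 0, y*log(y/m))                 # y*log(y/mu), zero when y==0
dev <- rowSums(2*(ll - (y - m)))                    # per-gene Poisson deviance
keep <- m > 0
plot(m[keep], dev[keep], log="xy")                  # strong mean-deviance trend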

Yes, I also read that paper. But they recommended imputation as well, and I didn't know what to make of that, so I just left it alone and moved on.

@willtownes (Collaborator)

Hi Aaron, glmpca 0.2.0 has now been released. It should be faster and more numerically stable. Please try running your experiments again and let me know if things improve or if there are further problems. Thanks!

@LTLA (Author) commented Jul 20, 2020

Excellent. I will give it a spin sometime during the week.

@LTLA (Author) commented Jul 27, 2020

It finishes, which is good. The example in my original post took 262 seconds, which is... tolerable for the purposes of putting it in the book. I haven't actually tested the output yet to see whether it solves the problem I was hoping it would solve.

I'll test out the current state, which is definitely better than it was before. Hopefully it turns out to do what I wanted; even if it does, though, it'll need to be a fair bit faster for mainstream use. Maybe a minute for 10k cells, which sounds pretty reasonable.

There's probably some low-hanging fruit with BiocParallel-ization and DelayedArray-ing, especially if you can make use of the parallelized crossprod provided by DelayedArray (via the Matrix generic). Or maybe there's some minibatching or some other approximation that I'm missing? I did notice a glmpca:::create_minibatches when I was poking around in there.
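Something like this, for instance (just a sketch; it assumes the crossprod method for DelayedMatrix mentioned above, with block processing parallelized via setAutoBPPARAM()):

library(DelayedArray)
library(BiocParallel)
setAutoBPPARAM(MulticoreParam(4))                   # parallelize block processing
X <- DelayedArray(matrix(rpois(3000*1000, 1), nrow=3000))
cp <- crossprod(X)                                  # t(X) %*% X, computed block-wise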

@willtownes (Collaborator)

Yes, minibatch is now supported. You can set minibatch='memoized' to use cached sufficient statistics to compute gradients, or minibatch='stochastic' for stochastic gradients. The advantage of those methods is to save on memory, not necessarily to be faster, although anecdotally I have seen the stochastic gradient converge faster than the full gradient (the default, minibatch='none'). The problem with the stochastic gradient is that it's hard to assess convergence due to the noise in the estimator, which often causes it to run for many excess iterations after it has already effectively converged. To set the minibatch size, use ctl=list(batch_size=1000).
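For example (assuming the scry wrapper passes minibatch= and ctl= through to glmpca()):

# Stochastic-gradient minibatches with a batch size of 1000 cells:
out <- GLMPCA(sce[hvgs,], L=50, sz=sizeFactors(sce),
    minibatch="stochastic", ctl=list(batch_size=1000))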

Also, for the Zeisel dataset it might be worth trying fam='nb' instead of the default fam='poi', because a couple of genes have really huge counts compared to all the others, and the negative binomial likelihood is a bit more numerically stable than the Poisson in that scenario.

Adding support for DelayedArray/HDF5Array is on my to-do list (willtownes/glmpca#18). Thanks for the suggestion about parallelizing the crossprod operation (now tracked as willtownes/glmpca#22).

@LTLA (Author) commented Aug 15, 2020

This is the call I've currently got in the book:

sce.scry <- GLMPCA(sce.zeisel[top.hvgs,], fam="nb", 
    L=20, sz=sizeFactors(sce.zeisel))

One improvement would be to offer a subsetting argument so that I don't have to subset the SCE on input. This would ensure that the output SCE has the same dimensionality as the input, with the only difference being a new reducedDim entry.

Another improvement would be to provide an SCE method where sz=sizeFactors(object) by default.
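In other words, something like this hypothetical wrapper (GLMPCA_full() is made up; it assumes the scry output stores its result under reducedDim(., "GLMPCA")):

# Run on a gene subset but attach the result to the full-dimension SCE.
GLMPCA_full <- function(object, subset_row, L=20, ...) {
    sub <- scry::GLMPCA(object[subset_row,], L=L, sz=sizeFactors(object), ...)
    reducedDim(object, "GLMPCA") <- reducedDim(sub, "GLMPCA")
    object
}
sce.zeisel <- GLMPCA_full(sce.zeisel, top.hvgs, L=20, fam="nb")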

The function is still achingly slow, though. I wonder whether more can be done to speed it up. For example, using the PCA on the log-values to provide the initial U and V when init= is not supplied; maybe that would start you closer to the target space. Or sharing information across iterations with different learning rates.

(I don't really understand why multiple learning rates are even required. I'm assuming the learning rate is somehow related to the step size per iteration, but surely this could be dynamically adjusted within a run rather than restarting the run with a different rate? We use a similar approach in edgeR to guarantee that each step decreases the deviance.)
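The warm start would look something like this (a sketch; glmpca() does advertise an init= list of factors and loadings, which I'm pairing with BiocSingular's PCA here):

library(BiocSingular)
# PCA on the log-values to initialize the GLM-PCA factors (U) and loadings (V).
pcs <- runPCA(t(logcounts(sce.zeisel)[top.hvgs,]), rank=20, BSPARAM=IrlbaParam())
fit <- glmpca::glmpca(as.matrix(counts(sce.zeisel)[top.hvgs,]), L=20,
    fam="nb", sz=sizeFactors(sce.zeisel),
    init=list(factors=pcs$x, loadings=pcs$rotation))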

@willtownes (Collaborator)

Thanks for these good suggestions.
