Experiences from trying to get GLMPCA to run with the book #7

Open · LTLA opened this issue Jun 1, 2020 · 11 comments
@LTLA commented Jun 1, 2020

Consider the following stretch of code:

library(scRNAseq)
sce <- ZeiselBrainData()           # mouse brain dataset

library(scuttle)
sce <- logNormCounts(sce)          # size-factor normalization + log-transform

library(scran)
dec <- modelGeneVar(sce)
hvgs <- getTopHVGs(dec, n=1000)    # top 1000 highly variable genes

library(scry)
out <- GLMPCA(sce[hvgs,], L=50, sz=sizeFactors(sce))

The most obvious issue is that I can't get it to work; GLMPCA conks out with:

Error in glmpca(Y = assay(object, assay), L = L, fam = fam, ctl = ctl,  : 
  Numerical divergence (deviance no longer finite), try increasing the penalty to improve stability of optimization.

Increasing the penalty to 100 causes it to run for a while; it's been at least 10 minutes so far. By comparison, running the usual approximate PCA on the equivalent log-expression matrix would take 1 second, maybe 2. I know it's doing more computational work than a vanilla PCA, but this seems even slower than zinbwave. It's only a 3k x 1k matrix here; anything above 30 seconds would be too much.
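For reference, this is the call after increasing the penalty (I'm assuming here that scry forwards penalty= through to glmpca(), as the error message suggests):

out <- GLMPCA(sce[hvgs,], L=50, sz=sizeFactors(sce), penalty=100)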

There are also some other quality-of-life issues that you might consider addressing:

Session information
R version 4.0.0 Patched (2020-05-01 r78341)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS:   /home/luna/Software/R/R-4-0-branch-dev/lib/libRblas.so
LAPACK: /home/luna/Software/R/R-4-0-branch-dev/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] scRNAseq_2.3.2              scuttle_0.99.8             
 [3] scran_1.17.0                SingleCellExperiment_1.11.2
 [5] SummarizedExperiment_1.19.4 DelayedArray_0.15.1        
 [7] matrixStats_0.56.0          Biobase_2.49.0             
 [9] GenomicRanges_1.41.1        GenomeInfoDb_1.25.0        
[11] IRanges_2.23.5              S4Vectors_0.27.7           
[13] BiocGenerics_0.35.2         scry_1.1.0                 

loaded via a namespace (and not attached):
 [1] httr_1.4.1                    edgeR_3.31.0                 
 [3] BiocSingular_1.5.0            bit64_0.9-7                  
 [5] AnnotationHub_2.21.0          DelayedMatrixStats_1.11.0    
 [7] shiny_1.4.0.2                 assertthat_0.2.1             
 [9] interactiveDisplayBase_1.27.2 statmod_1.4.34               
[11] BiocManager_1.30.10           BiocFileCache_1.13.0         
[13] dqrng_0.2.1                   blob_1.2.1                   
[15] GenomeInfoDbData_1.2.3        yaml_2.2.1                   
[17] BiocVersion_3.12.0            pillar_1.4.4                 
[19] RSQLite_2.2.0                 lattice_0.20-41              
[21] glue_1.4.1                    limma_3.45.0                 
[23] digest_0.6.25                 promises_1.1.0               
[25] XVector_0.29.1                htmltools_0.4.0              
[27] httpuv_1.5.2                  Matrix_1.2-18                
[29] pkgconfig_2.0.3               zlibbioc_1.35.0              
[31] purrr_0.3.4                   xtable_1.8-4                 
[33] glmpca_0.1.0                  later_1.0.0                  
[35] BiocParallel_1.23.0           tibble_3.0.1                 
[37] ellipsis_0.3.1                magrittr_1.5                 
[39] crayon_1.3.4                  mime_0.9                     
[41] memoise_1.1.0                 tools_4.0.0                  
[43] lifecycle_0.2.0               locfit_1.5-9.4               
[45] irlba_2.3.3                   AnnotationDbi_1.51.0         
[47] compiler_4.0.0                rsvd_1.0.3                   
[49] rlang_0.4.6                   grid_4.0.0                   
[51] RCurl_1.98-1.2                BiocNeighbors_1.7.0          
[53] rappdirs_0.3.1                igraph_1.2.5                 
[55] bitops_1.0-6                  ExperimentHub_1.15.0         
[57] DBI_1.1.0                     curl_4.3                     
[59] R6_2.4.1                      dplyr_0.8.5                  
[61] fastmap_1.0.1                 bit_1.1-15.2                 
[63] Rcpp_1.0.4.6                  vctrs_0.3.0                  
[65] dbplyr_1.4.3                  tidyselect_1.1.0             
@willtownes (Collaborator)

Hi Aaron, thanks for the feedback. The numerical instability for large L, the tuning of the penalty, and the computational speed are all known issues with glmpca that we are actively working on. I am pretty close to having something ready to release that scales better for larger L values, and I am now testing whether it can handle large numbers of cells as well. I will post something here when I release it to CRAN, and any improvements should pass through to the scry interface automatically. Also, I agree with all of your other suggestions. Sorry for the frustrations!

@willtownes (Collaborator)

By the way, we would recommend using the deviance for gene filtering rather than HVGs.

@LTLA (Author) commented Jun 1, 2020

On what basis?

@willtownes (Collaborator)

It's more consistent with the multinomial model behind GLM-PCA. It also gave better empirical results, e.g. for clustering: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1861-6/figures/5. See also the more comprehensive comparison by Germain et al., where they recommend deviance for filtering: https://www.biorxiv.org/content/10.1101/2020.02.02.930578v3

However, this is totally optional and doesn't detract from your main points: you can use any filtering you choose as input to GLM-PCA. It's a separate issue from the speed and numerical problems.
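For example, a minimal sketch with scry's devianceFeatureSelection() (the cutoff of 1000 genes is only illustrative, to mirror your getTopHVGs() call):

library(scry)
# Rank genes by deviance from a constant-expression null model;
# sorted=TRUE reorders the SCE by decreasing deviance.
sce <- devianceFeatureSelection(sce, assay="counts", sorted=TRUE)
top.dev <- head(rownames(sce), 1000)
out <- GLMPCA(sce[top.dev,], L=50, sz=sizeFactors(sce))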

@LTLA (Author) commented Jun 1, 2020

Seems like it would be tricky to choose an appropriate dispersion for the deviance. If you used a Poisson GLM, you'd just end up with a strong mean-dependent trend if the true dispersion were a constant non-zero value across all genes. Unless you're doing your own dispersion estimation; but then how would you do that before you've accounted for the structure in the GLM-PCA?
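To illustrate the concern with a toy simulation of my own (not from any package): if every gene has the same non-zero NB dispersion, the per-gene Poisson deviance from an intercept-only fit largely tracks the mean.

# Constant NB dispersion across genes, intercept-only Poisson deviance:
set.seed(1)
ngenes <- 2000; ncells <- 100
mu <- rexp(ngenes)                                  # true gene means
y <- matrix(rnbinom(ngenes*ncells, mu=mu, size=2), nrow=ngenes)
m <- rowMeans(y)                                    # fitted Poisson mean per gene
ll <- ifelse(y == 0, 0, y*log(y/m))                 # y*log(y/mu), zero when y==0
dev <- rowSums(2*(ll - (y - m)))                    # per-gene Poisson deviance
keep <- m > 0
plot(m[keep], dev[keep], log="xy")                  # strong mean-deviance trend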

Yes, I also read that paper. But they recommended imputation as well, and I didn't know what to make of that, so I just left it alone and moved on.

@willtownes (Collaborator)

Hi Aaron, glmpca 0.2.0 has now been released. It should be faster and more numerically stable. Please try running your experiments again and let me know if things improve or if there are further problems. Thanks!

@LTLA (Author) commented Jul 20, 2020

Excellent. I will give it a spin sometime during the week.

@LTLA (Author) commented Jul 27, 2020

It finishes, which is good. The example in my original post took 262 seconds, which is... tolerable for the purposes of putting it in the book. I haven't actually tested the output yet to see whether it solves the problem I was hoping it would solve.

I'll test out the current state, which is definitely better than it was before. Hopefully it turns out to do what I wanted; even if it does, though, it'll need to be a fair bit faster for mainstream use. Maybe a minute for 10k cells, which sounds pretty reasonable.

There's probably some low-hanging fruit with BiocParallel-ization and DelayedArray-ing, especially if you can make use of the parallelized crossprod provided by DelayedArray (via the Matrix generic). Or maybe there's some minibatching or some other approximation that I'm missing? I did notice a glmpca:::create_minibatches when I was poking around in there.
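Something like this, for instance (just a sketch; it assumes the crossprod method for DelayedMatrix mentioned above, with block processing parallelized via setAutoBPPARAM()):

library(DelayedArray)
library(BiocParallel)
setAutoBPPARAM(MulticoreParam(4))                   # parallelize block processing
X <- DelayedArray(matrix(rpois(3000*1000, 1), nrow=3000))
cp <- crossprod(X)                                  # t(X) %*% X, computed block-wise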

@willtownes (Collaborator)

Yes, minibatch is now supported. You can set minibatch='memoized' to use cached sufficient statistics to compute gradients, or minibatch='stochastic' for stochastic gradients. The advantage of those methods is to save on memory, not necessarily to be faster, although anecdotally I have seen the stochastic gradient converge faster than the full gradient (the default, minibatch='none'). The problem with the stochastic gradient is that it's hard to assess convergence due to the noise in the estimator, which often causes it to run for many excess iterations after it has already effectively converged. To set the minibatch size, use ctl=list(batch_size=1000).
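For example (assuming the scry wrapper passes minibatch= and ctl= through to glmpca()):

# Stochastic-gradient minibatches with a batch size of 1000 cells:
out <- GLMPCA(sce[hvgs,], L=50, sz=sizeFactors(sce),
    minibatch="stochastic", ctl=list(batch_size=1000))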

Also, for the Zeisel dataset it might be worth trying fam='nb' instead of the default fam='poi', because a couple of genes have really huge counts compared to all the others, and the negative binomial likelihood is a bit more numerically stable than the Poisson in that scenario.

Adding support for DelayedArray/HDF5Array is on my to-do list (willtownes/glmpca#18). Thanks for the suggestion about parallelizing the crossprod operation (now tracked as willtownes/glmpca#22).

@LTLA (Author) commented Aug 15, 2020

This is the call I've currently got in the book:

sce.scry <- GLMPCA(sce.zeisel[top.hvgs,], fam="nb", 
    L=20, sz=sizeFactors(sce.zeisel))

One improvement would be to offer a subsetting argument so that I don't have to subset the SCE on input. This would ensure that the output SCE has the same dimensionality as the input, with the only difference being a new reducedDim entry.

Another improvement would be to provide an SCE method where sz=sizeFactors(object) by default.
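In other words, something like this hypothetical wrapper (GLMPCA_full() is made up; it assumes the scry output stores its result under reducedDim(., "GLMPCA")):

# Run on a gene subset but attach the result to the full-dimension SCE.
GLMPCA_full <- function(object, subset_row, L=20, ...) {
    sub <- scry::GLMPCA(object[subset_row,], L=L, sz=sizeFactors(object), ...)
    reducedDim(object, "GLMPCA") <- reducedDim(sub, "GLMPCA")
    object
}
sce.zeisel <- GLMPCA_full(sce.zeisel, top.hvgs, L=20, fam="nb")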

The function is still achingly slow, though. I wonder whether more can be done to speed it up. For example, using the PCA on the log-values to provide the initial U and V when init= is not supplied; maybe that would start you closer to the target space. Or sharing information across iterations with different learning rates.

(I don't really understand why multiple learning rates are even required. I'm assuming the learning rate is somehow related to the step size per iteration, but surely this could be dynamically adjusted within a run rather than restarting the run with a different rate? We use a similar approach in edgeR to guarantee that each step decreases the deviance.)
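The warm start would look something like this (a sketch; glmpca() does advertise an init= list of factors and loadings, which I'm pairing with BiocSingular's PCA here):

library(BiocSingular)
# PCA on the log-values to initialize the GLM-PCA factors (U) and loadings (V).
pcs <- runPCA(t(logcounts(sce.zeisel)[top.hvgs,]), rank=20, BSPARAM=IrlbaParam())
fit <- glmpca::glmpca(as.matrix(counts(sce.zeisel)[top.hvgs,]), L=20,
    fam="nb", sz=sizeFactors(sce.zeisel),
    init=list(factors=pcs$x, loadings=pcs$rotation))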

@willtownes (Collaborator)

Thanks for these good suggestions.
