Experiences from trying to get GLMPCA to run with the book #7
Comments
Hi Aaron, thanks for the feedback. The problems with numerical stability for large L, the tuning issues with the penalty, and the computational speed are all known issues with glmpca that we are actively working on. I am pretty close to having something ready to release that scales better for larger L values, and I am now testing it to see if it can work with large numbers of cells as well. I will post something here when I release it to CRAN, and any improvements should pass through to the scry interface automatically. I also agree with all of your other suggestions. Sorry for the frustrations!
By the way, we would recommend using the deviance for gene filtering rather than HVGs.
On what basis?
It's more consistent with the multinomial model behind GLM-PCA. It also gave better empirical results for, e.g., clustering: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1861-6/figures/5. See also the more comprehensive comparison by Germain et al., who recommend deviance for filtering: https://www.biorxiv.org/content/10.1101/2020.02.02.930578v3. However, this is totally optional and doesn't detract from your main points; you can use any filtering you choose as input to GLM-PCA. It's a separate issue from the speed and numerical problems.
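For concreteness, a minimal sketch of deviance-based filtering with scry, assuming a SingleCellExperiment `sce` with a counts assay and that `devianceFeatureSelection()` writes the per-gene binomial deviance into `rowData()` (check your scry version for the exact column name):

```r
library(scry)

# Compute per-gene deviance under a multinomial/binomial null model.
sce <- devianceFeatureSelection(sce, assay = "counts", fam = "binomial")

# Rank genes by deviance and keep the top 2000 as input to GLM-PCA.
dev <- rowData(sce)$binomial_deviance
top.dev <- rownames(sce)[order(dev, decreasing = TRUE)][1:2000]
```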
Seems like it would be tricky to choose an appropriate dispersion for the deviance. If you used a Poisson GLM, you'd just end up with a strong mean-dependent trend if the true dispersion was a constant non-zero value across all genes. Unless you're doing your own dispersion estimation, but then how would you do that before you've accounted for the structure in the GLM-PCA? Yes, I also read that paper, but they recommended imputation as well and I didn't know what to make of that, so I just left it alone and moved on.
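To illustrate the concern, a toy simulation (all numbers made up): genes simulated with a constant NB dispersion but different means, then scored by Poisson deviance against a flat per-gene mean, show a strong mean-dependent trend.

```r
set.seed(1)
ngenes <- 2000; ncells <- 200
mu <- 2^runif(ngenes, -5, 5)                                    # gene means over a wide range
counts <- matrix(rnbinom(ngenes * ncells, mu = mu, size = 2),   # constant dispersion of 0.5
                 nrow = ngenes)

fitted <- rowMeans(counts)                                      # flat fitted mean per gene
keep <- fitted > 0
counts <- counts[keep, ]; fitted <- fitted[keep]

# Per-gene Poisson deviance: 2 * sum[ y*log(y/mu) - (y - mu) ], with 0*log(0) treated as 0.
ylogy <- counts * log(counts / fitted)
ylogy[counts == 0] <- 0
pois.dev <- rowSums(2 * (ylogy - (counts - fitted)))

plot(fitted, pois.dev, log = "xy",
     xlab = "mean count", ylab = "Poisson deviance")            # deviance rises with the mean
```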
Hi Aaron, glmpca 0.2.0 has now been released. It should be faster and more numerically stable. Please try running your experiments again and let me know if things improve or if there are further problems. Thanks!
Excellent. I will give it a spin sometime during the week.
It finishes, which is good. The example in my original post took 262 seconds, which is... tolerable, for the purposes of putting it in the book. I haven't actually tested the output yet to see whether it solves the problem that I was hoping it would solve. I'll test out the current state, which is definitely better than what it was before. Hopefully it turns out to do what I wanted; but even if it does, it'll need to be a fair bit faster for mainstream use. Maybe a minute for 10k cells, which sounds pretty reasonable. There's probably some low-hanging fruit with BiocParallelization and DelayedArraying, especially if you can make use of a parallelized crossprod.
Yes, minibatch is now supported; you can enable it via the `minibatch=` argument, and for the Zeisel dataset it might be worth trying. Adding support for DelayedArray/HDF5Array is on my to-do list (willtownes/glmpca#18). Thanks for the suggestion about parallelizing the crossprod operation (now included as willtownes/glmpca#22).
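Something along these lines, assuming the 0.2.0 interface where `glmpca()` gained a `minibatch` argument (argument names are from my reading of the docs, so double-check `?glmpca` in your installed version):

```r
library(glmpca)

# Raw counts for the selected genes from the thread's Zeisel example.
Y <- counts(sce.zeisel)[top.hvgs, ]

res <- glmpca(Y, L = 20, fam = "nb",
              sz = sizeFactors(sce.zeisel),
              minibatch = "stochastic")   # stochastic gradient updates over cell minibatches

dim(res$factors)                          # cells x L matrix of latent factors
```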
This is the call I've currently got in the book:

```r
sce.scry <- GLMPCA(sce.zeisel[top.hvgs,], fam="nb",
    L=20, sz=sizeFactors(sce.zeisel))
```

One improvement would be to offer a subsetting argument so that I don't have to subset the SCE as input. This would ensure that the output SCE has the same dimensionality, with the only difference being a new reducedDim entry. Another improvement would be to provide an SCE method where `sz=` defaults to `sizeFactors(x)`.

The function is still achingly slow, though. I wonder whether there is more that can be done to speed it up, for example by using the PCA on the log-values to provide the initialization (see the sketch below).

(I don't really understand why multiple learning rates are even required; I'm assuming the learning rate is somehow related to the step size per iteration, but surely this could be dynamically adjusted within a run rather than restarting the run with a different rate? We use a similar approach in edgeR to guarantee that each step decreases the deviance.)
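As a sketch of the initialization idea, using the underlying `glmpca()` directly and assuming its `init=` argument accepts a list with a `factors` matrix (cells x L); the exact form may differ across versions:

```r
library(glmpca)

# PCA on the log-expression values of the same genes, keeping 20 PCs.
logx <- as.matrix(logcounts(sce.zeisel)[top.hvgs, ])
pca.init <- prcomp(t(logx), rank. = 20)$x              # cells x 20 starting factors

res <- glmpca(counts(sce.zeisel)[top.hvgs, ], L = 20, fam = "nb",
              sz = sizeFactors(sce.zeisel),
              init = list(factors = pca.init))         # warm start from the log-PCA
```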
Thanks for these good suggestions.
Consider the following stretch of code:
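(The original snippet was presumably along these lines, assuming the QC'ed and log-normalized Zeisel dataset and HVG selection from the book; the exact arguments may have differed:)

```r
library(scry)
library(scran)

# Assumes sce.zeisel has already been QC'ed and log-normalized as in the book.
dec <- modelGeneVar(sce.zeisel)
top.hvgs <- getTopHVGs(dec, n = 3000)

sce.scry <- GLMPCA(sce.zeisel[top.hvgs, ], fam = "nb",
                   L = 20, sz = sizeFactors(sce.zeisel))
```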
The most obvious issue is that I can't get it to work; `GLMPCA` conks out with an error. Increasing the `penalty` to 100 causes it to run for a while. It's been 10 minutes at least; by comparison, running the usual approximate PCA on the equivalent log-expression matrix would take 1 second, maybe 2. I know it's doing more computational work than a vanilla PCA, but this seems even slower than zinbwave. It's only a 3k x 1k matrix here; anything above 30 seconds would be too much.

There are also some other quality-of-life issues that you might consider addressing:
- A `subset=` argument to `GLMPCA()`, so that we don't have to row-subset the entire SCE in the input, especially when only one of the matrices is of interest (see the sketch below).
- A `SingleCellExperiment` method that defaults `sz=sizeFactors(x)`.
- Use of `counts` in `SummarizedExperiment` objects (#5).
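A purely hypothetical sketch of the first two suggestions combined: a wrapper (not part of scry; the name and behaviour are illustrative) that accepts a `subset=` argument, defaults `sz` to `sizeFactors(x)`, and returns the full-size object with a new reducedDim entry, assuming `GLMPCA()` stores its result under `reducedDim(., "GLMPCA")`:

```r
GLMPCA_with_subset <- function(x, subset = NULL, L = 20, fam = "nb",
                               sz = sizeFactors(x), ...) {
    y <- if (is.null(subset)) x else x[subset, ]        # row-subset internally
    y <- GLMPCA(y, L = L, fam = fam, sz = sz, ...)
    reducedDim(x, "GLMPCA") <- reducedDim(y, "GLMPCA")  # copy per-cell factors back (same cells)
    x                                                   # original dimensions preserved
}

# e.g. sce.zeisel <- GLMPCA_with_subset(sce.zeisel, subset = top.hvgs)
```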