-
-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Benchmark salmon/alevin and kallisto #9
Comments
As discussed #4 (comment), for benchmarking purposes we can use prebuilt indexes. I have downloaded these (partial and full decoy) from http://refgenomes.databio.org and placed them in |
Initial results are presented in #18. Most interesting at first are the results in trace.txt, which presents the memory and CPU usage of each process. A description of the fields can be found here https://www.nextflow.io/docs/latest/tracing.html#trace-report alsf-scpca/workflows/alevin-quant/trace.txt Lines 1 to 13 in 11badd9
Notably, the full_sa jobs use much more memory, but do not seem to take much longer to run. However, the memory requirements are within the I have not yet looked at mapping rate comparisons for kmer size cDNA vs transcriptome indexes or the number of ncRNA that appear in the samples. The full mapping results can be found at s3://nextflow-ccdl-results/scpca-benchmark/alevin-quant |
I have not gotten a formal analysis done, but I wanted to share some insights that I have gotten so far (some of which were in slack, but deserve posting here). I am analyzing the results in https://github.com/AlexsLemonade/alsf-scpca/blob/jashapiro/benchmark-analysis/workflows/alevin-quant/benchmark-analysis.Rmd but I have not yet filed a PR from that branch. The overall mapping rates for these samples seem low, in the 13-20% range for the two samples that I looked at. Looking at the mapped data though, things look pretty normal, with good numbers of mapped reads per cell and no over-abundance of mitochondrial reads. We did not do any kind of trimming on these data; I may give them a pass through Mapping rates are somewhat higher (~2%) with the txome that includes ncRNA; for a net of about 10% of mapped transcripts being noncoding. Based on (ongoing) preliminary analysis; the mapping for coding DNA is unaffected by the inclusion of these transcripts, as would be expected. They might of course have some effect on normalization down the line. Also of note, the cDNA set includes pseudogenes (some of which are consistently expressed at the RNA level in these data), which is something I hadn't previously appreciated. The ncRNA data does include lncRNAs, at least one of which came up as the highest expressed ncRNA gene in the first sample I looked at: MALAT1, which does seem to have some cancer association in the literature. The next two most common ncRNA were mitochondrial rRNA, which is not surprising! My preliminary leaning is to go with the full transcriptome (cDNA + ncRNA), using a full SA index, as neither decision seems to have much real cost (the instances we use have enough memory to handle it) and should improve accuracy overall. |
Opened #63 to discuss updated benchmarking. |
With indexes for salmon and kallisto made, we can start benchmarking the two on real data. We are interested in memory usage and speed, as well as concordence between the different mapping methods.
For salmon, we should also look at the basic mapping vs decoy-aware mapping. I (and apparently developers of salmon) would prefer decoy-aware, but if it uses much more memory and/or time this may not be feasible.
The text was updated successfully, but these errors were encountered: