Benchmark salmon/alevin and kallisto #9

Closed
jashapiro opened this issue Aug 20, 2020 · 4 comments


@jashapiro
Member

With indexes for salmon and kallisto made, we can start benchmarking the two on real data. We are interested in memory usage and speed, as well as concordance between the different mapping methods.

For salmon, we should also compare basic mapping with decoy-aware mapping. I (and apparently the developers of salmon) would prefer decoy-aware, but if it uses much more memory and/or time, this may not be feasible.

@jashapiro
Member Author

As discussed in #4 (comment), for benchmarking purposes we can use prebuilt indexes. I have downloaded these (partial and full decoy) from http://refgenomes.databio.org and placed them in s3://nextflow-ccdl-data/reference/homo_sapiens/refgenomes-hg38/

@jashapiro
Member Author

jashapiro commented Sep 1, 2020

Initial results are presented in #18. Most interesting at first are the results in trace.txt, which reports the memory and CPU usage of each process. The fields are described at https://www.nextflow.io/docs/latest/tracing.html#trace-report

| task_id | hash | native_id | name | status | exit | submit | duration | realtime | %cpu | peak_rss | peak_vmem | rchar | wchar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 2a/19fe3a | 090ebaa5-66e2-48f4-8959-beea91bc23f7 | alevin (905_3-cdna_k31_no_sa) | COMPLETED | 0 | 2020-08-31 20:43:48.958 | 59m 30s | 33m 1s | 529.4% | 1.9 GB | 3 GB | 28.4 GB | 58.3 MB |
| 9 | a8/0a2a33 | 91899763-c8dd-4e55-9662-1b4215a2cefc | alevin (905_3-txome_k31_no_sa) | COMPLETED | 0 | 2020-08-31 20:43:49.094 | 1h 29s | 34m 19s | 535.4% | 2.2 GB | 3 GB | 28.6 GB | 59.3 MB |
| 12 | 83/708fcd | 2b86c8b5-7469-47ec-bef5-f72939bdd699 | alevin (905_3-cdna_k31_full_sa) | COMPLETED | 0 | 2020-08-31 20:43:49.239 | 1h 5m 51s | 37m 59s | 528.9% | 18 GB | 21.5 GB | 44.3 GB | 58.2 MB |
| 11 | 02/7fbd19 | f766e8e5-ab4d-4a15-8624-7ffb3fa0b47a | alevin (905_3-cdna_k31_partial_sa) | COMPLETED | 0 | 2020-08-31 20:43:49.217 | 1h 8m 21s | 42m 54s | 551.5% | 2.3 GB | 3.4 GB | 28.7 GB | 58.3 MB |
| 8 | 6d/289107 | cb4c2738-05cd-4ea5-abda-fb5ddc30fb78 | alevin (905_3-cdna_k23_no_sa) | COMPLETED | 0 | 2020-08-31 20:43:48.905 | 1h 13m 31s | 48m 55s | 557.8% | 2.1 GB | 3.1 GB | 28.4 GB | 58.3 MB |
| 1 | 68/9634ca | c0e7c168-179a-4cdb-9ac4-2d81bf94b92b | alevin (834-cdna_k31_no_sa) | COMPLETED | 0 | 2020-08-31 20:43:48.712 | 1h 14m 20s | 1h 4m 57s | 564.3% | 3 GB | 4.2 GB | 30.6 GB | 68.1 MB |
| 10 | 4b/088f05 | 5819be1a-a843-407f-a378-6531101524f2 | alevin (905_3-txome_k23_no_sa) | COMPLETED | 0 | 2020-08-31 20:43:49.215 | 1h 14m 20s | 50m 31s | 557.6% | 2.4 GB | 3.4 GB | 28.6 GB | 59.3 MB |
| 4 | 9f/d04d76 | 47802779-1c74-491f-8565-35472f2d4956 | alevin (834-txome_k31_no_sa) | COMPLETED | 0 | 2020-08-31 20:43:48.747 | 1h 32m 42s | 1h 10m 18s | 565.2% | 3.4 GB | 4.4 GB | 30.8 GB | 70.5 MB |
| 6 | e3/e71cf9 | 2a27a3a6-ef4a-4317-a4c4-db396f693d2c | alevin (834-cdna_k31_full_sa) | COMPLETED | 0 | 2020-08-31 20:43:48.793 | 1h 39m 11s | 1h 13m 36s | 571.3% | 19.1 GB | 22.6 GB | 46.5 GB | 67.4 MB |
| 5 | 17/3adaca | 72b2bcc5-50ab-4124-82e6-90e20e3de94a | alevin (834-cdna_k31_partial_sa) | COMPLETED | 0 | 2020-08-31 20:43:48.763 | 1h 47m 2s | 1h 33m 33s | 579.3% | 3.4 GB | 4.7 GB | 30.9 GB | 67.9 MB |
| 2 | 8a/7e105e | 6b1a6262-f0ee-4167-ab1d-50b41c120b6e | alevin (834-cdna_k23_no_sa) | COMPLETED | 0 | 2020-08-31 20:43:48.700 | 1h 55m 33s | 1h 40m 12s | 581.0% | 3.2 GB | 4.6 GB | 30.6 GB | 68.1 MB |
| 3 | 70/d5ef01 | 3a8978af-c569-4e57-814f-e8b0e3cb0fb4 | alevin (834-txome_k23_no_sa) | COMPLETED | 0 | 2020-08-31 20:43:48.704 | 1h 59m 42s | 1h 48m 22s | 581.9% | 3.7 GB | 4.8 GB | 30.8 GB | 70.5 MB |

Notably, the full_sa jobs use much more memory (18-19 GB peak RSS, compared with roughly 2-4 GB for the other indexes), but do not seem to take much longer to run. Their memory requirements still fit within an m4.2xlarge instance when run with that instance's 8 CPUs, so using the full selective alignment index seems likely to be worth doing.

I have not yet looked at how mapping rates compare across k-mer sizes or between the cDNA and transcriptome indexes, or at the number of ncRNAs that appear in the samples. The full mapping results can be found at s3://nextflow-ccdl-results/scpca-benchmark/alevin-quant
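For anyone who wants to redo this comparison from the trace report, here is a minimal sketch that converts the human-readable `realtime` and `peak_rss` values into seconds and bytes and checks each job against a memory ceiling. The helper names are hypothetical, the 32 GiB ceiling for an m4.2xlarge is an assumption, and the 1024-based reading of "GB" is also an assumption about how Nextflow reports sizes. The four rows are copied from the trace above.

```python
import re

# Assumption: Nextflow reports sizes with 1024-based units.
SIZE_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}
DURATION_UNITS = {"d": 86400, "h": 3600, "m": 60, "s": 1}

# Assumption: an m4.2xlarge instance provides 32 GiB of memory.
M4_2XLARGE_BYTES = 32 * 1024**3


def parse_size(text):
    """Convert a string such as '19.1 GB' to a number of bytes."""
    value, unit = text.split()
    return float(value) * SIZE_UNITS[unit]


def parse_duration(text):
    """Convert a string such as '1h 5m 51s' to seconds."""
    total = 0.0
    for value, unit in re.findall(r"([\d.]+)\s*(ms|[dhms])", text):
        if unit == "ms":
            total += float(value) / 1000
        else:
            total += float(value) * DURATION_UNITS[unit]
    return total


def summarize(rows, limit_bytes):
    """Map each job name to (realtime seconds, peak RSS bytes, fits in limit)."""
    summary = {}
    for name, (realtime, peak_rss) in rows.items():
        rss = parse_size(peak_rss)
        summary[name] = (parse_duration(realtime), rss, rss <= limit_bytes)
    return summary


# (realtime, peak_rss) pairs taken from the trace report above.
rows = {
    "905_3-cdna_k31_no_sa": ("33m 1s", "1.9 GB"),
    "905_3-cdna_k31_full_sa": ("37m 59s", "18 GB"),
    "834-cdna_k31_no_sa": ("1h 4m 57s", "3 GB"),
    "834-cdna_k31_full_sa": ("1h 13m 36s", "19.1 GB"),
}

summary = summarize(rows, M4_2XLARGE_BYTES)
for name, (seconds, rss, fits) in summary.items():
    print(f"{name}: {seconds / 60:.1f} min, {rss / 1024**3:.1f} GB, fits={fits}")
```

Comparing each full_sa row with its no_sa counterpart this way puts the memory increase and the modest change in run time side by side for the same sample.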

@jashapiro
Member Author

I have not gotten a formal analysis done, but I wanted to share some insights that I have so far (some of which were in Slack, but deserve posting here). I am analyzing the results in https://github.com/AlexsLemonade/alsf-scpca/blob/jashapiro/benchmark-analysis/workflows/alevin-quant/benchmark-analysis.Rmd but I have not yet filed a PR from that branch.

The overall mapping rates for these samples seem low, in the 13-20% range for the two samples that I looked at. Looking at the mapped data though, things look pretty normal, with good numbers of mapped reads per cell and no over-abundance of mitochondrial reads. We did not do any kind of trimming on these data; I may give them a pass through fastp just to see if there is generally low quality sequence that gets removed that way and might explain the low mapping rate.

Mapping rates are somewhat higher (~2%) with the txome that includes ncRNA, for a net of about 10% of mapped transcripts being noncoding. Based on (ongoing) preliminary analysis, the mapping for coding DNA is unaffected by the inclusion of these transcripts, as would be expected. They might of course have some effect on normalization down the line. Also of note, the cDNA set includes pseudogenes (some of which are consistently expressed at the RNA level in these data), which is something I hadn't previously appreciated.

The ncRNA data does include lncRNAs, at least one of which came up as the highest expressed ncRNA gene in the first sample I looked at: MALAT1, which does seem to have some cancer association in the literature. The next two most common ncRNAs were mitochondrial rRNAs, which is not surprising!
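As a rough check on how a ~2% gain relates to about 10% of the mapped fraction being noncoding, here is a small sketch of the arithmetic. It assumes the "~2%" is in percentage points of total reads and that coding mappings are unchanged when ncRNA is added. The function name and the example rates (taken from the 13-20% range mentioned earlier) are illustrative, not measured per-sample values.

```python
def noncoding_fraction(rate_cdna, rate_txome):
    """Share of the mapped fraction attributable to the added ncRNA transcripts.

    Assumes mappings to coding transcripts are unchanged between the two
    indexes, so the whole gain in mapping rate comes from ncRNA.
    """
    if not 0 < rate_cdna <= rate_txome <= 1:
        raise ValueError("expected 0 < rate_cdna <= rate_txome <= 1")
    return (rate_txome - rate_cdna) / rate_txome


# Illustrative cDNA-only mapping rates spanning the 13-20% range,
# each raised by an assumed 2 percentage points with the full txome.
for rate_cdna in (0.13, 0.18, 0.20):
    rate_txome = rate_cdna + 0.02
    share = noncoding_fraction(rate_cdna, rate_txome)
    print(f"cDNA {rate_cdna:.0%} -> txome {rate_txome:.0%}: noncoding share {share:.1%}")
```

Under these assumptions the noncoding share lands between roughly 9% and 13%, in line with the "about 10%" figure.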

My preliminary leaning is to go with the full transcriptome (cDNA + ncRNA), using a full SA index, as neither decision seems to have much real cost (the instances we use have enough memory to handle it) and should improve accuracy overall.

@allyhawkins
Member

allyhawkins commented Mar 31, 2021

Opened #63 to discuss updated benchmarking.
