Benchmark salmon/alevin and kallisto #9

jashapiro · 2020-08-20T15:31:23Z

With indexes for salmon and kallisto made, we can start benchmarking the two on real data. We are interested in memory usage and speed, as well as concordance between the different mapping methods.

For salmon, we should also compare basic mapping with decoy-aware mapping. I (and apparently the salmon developers) would prefer decoy-aware mapping, but it may not be feasible if it uses much more memory and/or time.

jashapiro · 2020-08-21T19:59:51Z

As discussed in #4 (comment), for benchmarking purposes we can use prebuilt indexes. I have downloaded these (partial and full decoy) from http://refgenomes.databio.org and placed them in s3://nextflow-ccdl-data/reference/homo_sapiens/refgenomes-hg38/

jashapiro · 2020-09-01T12:59:56Z

Initial results are presented in #18. The most interesting results at first are in trace.txt, which reports the memory and CPU usage of each process. A description of the fields can be found here: https://www.nextflow.io/docs/latest/tracing.html#trace-report

alsf-scpca/workflows/alevin-quant/trace.txt

Lines 1 to 13 in 11badd9

    task_id	hash	native_id	name	status	exit	submit	duration	realtime	%cpu	peak_rss	peak_vmem	rchar	wchar
    7	2a/19fe3a	090ebaa5-66e2-48f4-8959-beea91bc23f7	alevin (905_3-cdna_k31_no_sa)	COMPLETED	0	2020-08-31 20:43:48.958	59m 30s	33m 1s	529.4%	1.9 GB	3 GB	28.4 GB	58.3 MB
    9	a8/0a2a33	91899763-c8dd-4e55-9662-1b4215a2cefc	alevin (905_3-txome_k31_no_sa)	COMPLETED	0	2020-08-31 20:43:49.094	1h 29s	34m 19s	535.4%	2.2 GB	3 GB	28.6 GB	59.3 MB
    12	83/708fcd	2b86c8b5-7469-47ec-bef5-f72939bdd699	alevin (905_3-cdna_k31_full_sa)	COMPLETED	0	2020-08-31 20:43:49.239	1h 5m 51s	37m 59s	528.9%	18 GB	21.5 GB	44.3 GB	58.2 MB
    11	02/7fbd19	f766e8e5-ab4d-4a15-8624-7ffb3fa0b47a	alevin (905_3-cdna_k31_partial_sa)	COMPLETED	0	2020-08-31 20:43:49.217	1h 8m 21s	42m 54s	551.5%	2.3 GB	3.4 GB	28.7 GB	58.3 MB
    8	6d/289107	cb4c2738-05cd-4ea5-abda-fb5ddc30fb78	alevin (905_3-cdna_k23_no_sa)	COMPLETED	0	2020-08-31 20:43:48.905	1h 13m 31s	48m 55s	557.8%	2.1 GB	3.1 GB	28.4 GB	58.3 MB
    1	68/9634ca	c0e7c168-179a-4cdb-9ac4-2d81bf94b92b	alevin (834-cdna_k31_no_sa)	COMPLETED	0	2020-08-31 20:43:48.712	1h 14m 20s	1h 4m 57s	564.3%	3 GB	4.2 GB	30.6 GB	68.1 MB
    10	4b/088f05	5819be1a-a843-407f-a378-6531101524f2	alevin (905_3-txome_k23_no_sa)	COMPLETED	0	2020-08-31 20:43:49.215	1h 14m 20s	50m 31s	557.6%	2.4 GB	3.4 GB	28.6 GB	59.3 MB
    4	9f/d04d76	47802779-1c74-491f-8565-35472f2d4956	alevin (834-txome_k31_no_sa)	COMPLETED	0	2020-08-31 20:43:48.747	1h 32m 42s	1h 10m 18s	565.2%	3.4 GB	4.4 GB	30.8 GB	70.5 MB
    6	e3/e71cf9	2a27a3a6-ef4a-4317-a4c4-db396f693d2c	alevin (834-cdna_k31_full_sa)	COMPLETED	0	2020-08-31 20:43:48.793	1h 39m 11s	1h 13m 36s	571.3%	19.1 GB	22.6 GB	46.5 GB	67.4 MB
    5	17/3adaca	72b2bcc5-50ab-4124-82e6-90e20e3de94a	alevin (834-cdna_k31_partial_sa)	COMPLETED	0	2020-08-31 20:43:48.763	1h 47m 2s	1h 33m 33s	579.3%	3.4 GB	4.7 GB	30.9 GB	67.9 MB
    2	8a/7e105e	6b1a6262-f0ee-4167-ab1d-50b41c120b6e	alevin (834-cdna_k23_no_sa)	COMPLETED	0	2020-08-31 20:43:48.700	1h 55m 33s	1h 40m 12s	581.0%	3.2 GB	4.6 GB	30.6 GB	68.1 MB
    3	70/d5ef01	3a8978af-c569-4e57-814f-e8b0e3cb0fb4	alevin (834-txome_k23_no_sa)	COMPLETED	0	2020-08-31 20:43:48.704	1h 59m 42s	1h 48m 22s	581.9%	3.7 GB	4.8 GB	30.8 GB	70.5 MB

Notably, the full_sa jobs use much more memory (18-19 GB peak RSS, versus roughly 2-4 GB for the other indexes) but do not seem to take much longer to run. That memory requirement still fits within an m4.2xlarge instance (32 GiB) when the job is run with that instance's 8 CPUs, so using the full selective alignment index seems likely to be worth doing.

I have not yet compared mapping rates across k-mer sizes or between the cDNA and transcriptome indexes, nor looked at the number of ncRNAs that appear in the samples. The full mapping results can be found at s3://nextflow-ccdl-results/scpca-benchmark/alevin-quant
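To put numbers on the memory and runtime comparison, here is a minimal Python sketch that parses the human-readable `realtime` and `peak_rss` values from the trace report and computes full_sa / no_sa ratios per sample. The four rows are copied from the trace above with most columns dropped. The helper names (`parse_duration`, `parse_size_gb`, `summarize`) are made up for this example, and treating the size units as 1024-based is an assumption.

```python
import csv
import io
import re

# Four rows copied from the trace report (subset of columns, tab-separated).
TRACE = (
    "name\trealtime\tpeak_rss\n"
    "alevin (905_3-cdna_k31_no_sa)\t33m 1s\t1.9 GB\n"
    "alevin (905_3-cdna_k31_full_sa)\t37m 59s\t18 GB\n"
    "alevin (834-cdna_k31_no_sa)\t1h 4m 57s\t3 GB\n"
    "alevin (834-cdna_k31_full_sa)\t1h 13m 36s\t19.1 GB\n"
)

SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}
# Assumption: units are 1024-based; only ratios are used below, so this barely matters.
GB_PER_UNIT = {"KB": 1 / 1024**2, "MB": 1 / 1024, "GB": 1.0, "TB": 1024.0}


def parse_duration(text):
    """Convert a duration string such as '1h 5m 51s' to seconds."""
    total = 0.0
    for value, unit in re.findall(r"([\d.]+)\s*(ms|[dhms])", text):
        if unit == "ms":
            total += float(value) / 1000
        else:
            total += float(value) * SECONDS[unit]
    return total


def parse_size_gb(text):
    """Convert a size string such as '1.9 GB' to a number of GB."""
    value, unit = text.split()
    return float(value) * GB_PER_UNIT[unit]


def summarize(trace_text, baseline="cdna_k31_no_sa", candidate="cdna_k31_full_sa"):
    """Return {sample: (realtime ratio, peak_rss ratio)} for candidate vs. baseline."""
    runs = {}
    for row in csv.DictReader(io.StringIO(trace_text), delimiter="\t"):
        match = re.fullmatch(r"alevin \((.+?)-(.+)\)", row["name"])
        if match is None:
            continue
        sample, index = match.groups()
        runs[(sample, index)] = (
            parse_duration(row["realtime"]),
            parse_size_gb(row["peak_rss"]),
        )
    ratios = {}
    for (sample, index), (seconds, gb) in runs.items():
        if index == candidate and (sample, baseline) in runs:
            base_seconds, base_gb = runs[(sample, baseline)]
            ratios[sample] = (seconds / base_seconds, gb / base_gb)
    return ratios


if __name__ == "__main__":
    for sample, (time_ratio, mem_ratio) in summarize(TRACE).items():
        print(f"{sample}: realtime x{time_ratio:.2f}, peak_rss x{mem_ratio:.1f}")
```

For these rows, the full_sa runs take roughly 13-15% more realtime than the matching no_sa runs while using about 6-9 times the peak RSS.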

jashapiro · 2020-09-02T21:03:39Z

I have not finished a formal analysis, but I wanted to share some observations so far (some of which were in Slack, but deserve posting here). I am analyzing the results in https://github.com/AlexsLemonade/alsf-scpca/blob/jashapiro/benchmark-analysis/workflows/alevin-quant/benchmark-analysis.Rmd but I have not yet filed a PR from that branch.

The overall mapping rates for these samples seem low, in the 13-20% range for the two samples that I looked at. The mapped data look pretty normal though, with good numbers of mapped reads per cell and no overabundance of mitochondrial reads. We did not do any kind of trimming on these data; I may give them a pass through fastp just to see whether there is generally low-quality sequence that gets removed that way and might explain the low mapping rate.

Mapping rates are somewhat higher (~2%) with the txome that includes ncRNA, for a net of about 10% of mapped transcripts being noncoding. Based on (ongoing) preliminary analysis, the mapping for coding DNA is unaffected by the inclusion of these transcripts, as would be expected. They might of course have some effect on normalization down the line. Also of note, the cDNA set includes pseudogenes (some of which are consistently expressed at the RNA level in these data), which is something I hadn't previously appreciated.

The ncRNA data does include lncRNAs, at least one of which came up as the most highly expressed ncRNA gene in the first sample I looked at: MALAT1, which does seem to have some cancer association in the literature. The next two most common ncRNAs were mitochondrial rRNAs, which is not surprising!
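To show how a ~2% gain can correspond to about 10% of mapped transcripts being noncoding, here is a small worked example in Python. It assumes the "~2%" is a difference in percentage points and that mappings to coding transcripts are unchanged when ncRNA is added. The rates 18% and 20% are hypothetical values chosen to sit within the ranges mentioned in this thread; they are not measured results, and the function name is made up for this example.

```python
def noncoding_share(cdna_rate, txome_rate):
    """Share of txome-mapped reads attributable to the added ncRNA.

    Assumes every read that maps to the cDNA index still maps to the same
    coding transcript in the txome index, so the gain is entirely noncoding.
    """
    if not 0 < cdna_rate <= txome_rate <= 1:
        raise ValueError("expected 0 < cdna_rate <= txome_rate <= 1")
    return (txome_rate - cdna_rate) / txome_rate


if __name__ == "__main__":
    # Hypothetical rates: 18% with the cDNA index, 20% with cDNA + ncRNA.
    print(f"{noncoding_share(0.18, 0.20):.0%}")
```

At the low end of the observed range the same 2-point gain would be a larger share, so the 10% figure should be read as approximate.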

My preliminary leaning is to go with the full transcriptome (cDNA + ncRNA), using a full SA index, as neither decision seems to have much real cost (the instances we use have enough memory to handle it) and should improve accuracy overall.

allyhawkins · 2021-03-31T23:07:39Z

Opened #63 to discuss updated benchmarking.

jashapiro mentioned this issue Sep 1, 2020

Initial alevin benchmarking result #18

Merged

jashapiro mentioned this issue Nov 2, 2020

Quantification comparison, PR2 of 2: Plot mapper comparisons #45

Merged

This was referenced Mar 26, 2021

Benchmark Cellranger 6.0.0 #59

Closed

Benchmarking Plan for ScPCA #63

Open

allyhawkins closed this as completed Mar 31, 2021
