Improve/add UMI deduplication metrics #1277

Closed
ppericard opened this issue Apr 2, 2024 · 6 comments

Comments

@ppericard
Contributor

Description of feature

Hello ^^
I'm having difficulty finding easy-to-understand stats on UMI deduplication in the outputs. There seems to be no section for it in the MultiQC output, not even in the statistics table (where I would expect metrics such as number of reads before dedup, number of reads after dedup, and % duplication from umi-tools dedup on alignments).
In the output directory, I'm also not finding a log with easy to understand metrics from umi-tools dedup. I'm probably missing something.
Thanks in advance. Pierre

@MatthiasZepper
Member

Since people complained about the poor performance, the generation of deduplication statistics is off by default now.

You have to set the parameter --umitools_dedup_stats on the command line, or umitools_dedup_stats: true in a params file, to activate that functionality.
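
For anyone who prefers a params file, here is a minimal sketch of the second option in Python. It writes a JSON params file with `umitools_dedup_stats` enabled and assembles a launch command. The file name `params.json` and both helper functions are hypothetical, and a real run would need the usual input, output and profile settings, which are left out here.

```python
import json
from pathlib import Path


def write_params_file(path, enable_dedup_stats=True):
    """Write a params file that switches on UMI dedup statistics.

    Only the parameter named in this thread is set; everything else
    a real run needs is deliberately omitted (assumption: the caller
    adds it).
    """
    params = {"umitools_dedup_stats": enable_dedup_stats}
    path = Path(path)
    path.write_text(json.dumps(params, indent=2) + "\n")
    return path


def build_command(params_path):
    """Return the launch command as an argument list (not executed here)."""
    return ["nextflow", "run", "nf-core/rnaseq", "-params-file", str(params_path)]
```

Passing the result of `write_params_file("params.json")` to `build_command` gives an argument list that could be handed to `subprocess.run`; the sketch itself does not start anything.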

@ppericard
Contributor Author

ppericard commented Apr 24, 2024

Hi @MatthiasZepper,
I'm sorry if I wasn't clear enough in my initial message. All my comments apply to the pipeline with the --umitools_dedup_stats parameter activated.
The *.umi_dedup.transcriptome.filtered.prepare_for_rsem.log files contain no summary of the dedup stats, and the other files are neither very informative nor easy to read: *.umi_dedup.sorted_edit_distance.tsv, *.umi_dedup.sorted_per_umi_per_position.tsv, *.umi_dedup.sorted_per_umi.tsv. There is a real need for an easy-to-read summary of deduplication, such as the one MultiQC can produce by parsing the UMI-tools output, for example (MultiQC/MultiQC#1769).
Right now, as a user I have even less information about deduplication than what I would have in the logs just by running the umi-tools dedup command.
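
To make the request concrete, the summary asked for here boils down to three numbers per sample. A rough sketch in Python, assuming the read counts before and after deduplication are already known. The class, the function and the example counts are made up for illustration and are not output of the pipeline or of umi-tools.

```python
from dataclasses import dataclass


@dataclass
class DedupSummary:
    """Read counts for one sample (hypothetical container)."""
    reads_in: int   # reads before deduplication
    reads_out: int  # reads kept after deduplication

    @property
    def duplicates_removed(self) -> int:
        return self.reads_in - self.reads_out

    @property
    def percent_duplication(self) -> float:
        # Guard against empty input so the ratio is always defined.
        if self.reads_in == 0:
            return 0.0
        return 100.0 * self.duplicates_removed / self.reads_in


def format_summary(sample: str, summary: DedupSummary) -> str:
    """Return one tab-separated row: sample, reads in, reads out, % duplication."""
    return (
        f"{sample}\t{summary.reads_in}\t{summary.reads_out}"
        f"\t{summary.percent_duplication:.2f}"
    )
```

With illustrative counts of 1000 reads in and 750 reads out, the duplication rate comes out at 25%.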

@MatthiasZepper
Member

Apologies for stonewalling on this issue before. While hunting down the cause for issue #1303, it occurred to me that probably a botched MultiQC config is behind this issue as well. For some reason, we explicitly specify the MultiQC modules to be run and UMI-tools is nowhere to be found.

Since we run MultiQC with a custom config outside the pipeline again, we did not notice.

It should be fixed on this branch, but I struggle with testing at the moment.
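
An omission like the one described above can be caught by checking the module list before MultiQC runs. The sketch below assumes the config has already been loaded into a plain dict and that, as with MultiQC's `run_modules` option, only the listed modules run when that list is present. The module name `umitools` and the example lists are assumptions, not copied from the pipeline's config.

```python
def module_will_run(config: dict, module: str) -> bool:
    """Return True if `module` would run under this config.

    Simplification: only the `run_modules` allow-list is considered.
    An absent or empty list is treated as "no restriction".
    """
    run_modules = config.get("run_modules")
    if not run_modules:
        return True
    return module in run_modules


def missing_modules(config: dict, expected: list) -> list:
    """List the expected modules that the allow-list would silently skip."""
    return [m for m in expected if not module_will_run(config, m)]
```

Running `missing_modules` against the tools a pipeline actually executes would flag any module that produces output but is never reported.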

@MatthiasZepper
Member

MatthiasZepper commented Jul 12, 2024

#1308 has been merged to dev and will be released as part of rnaseq 3.15. Please give it a spin to see if it solves this issue @ppericard !

@ppericard
Contributor Author

@MatthiasZepper Thank you for dealing with this issue. I'm currently taking an extended leave from bioinformatics for the foreseeable future, so hopefully someone from the community will be able to test this. Cheers

@idot

idot commented Oct 17, 2024

In v3.16.0 you get a nice overview of the deduplication stats.
I think you can close this ticket.

