Problems with UMI dedup and time when aligned against transcriptome #539
That's very odd. Can you let me know the full command lines you used for both the genome and transcriptome based dedup?
Both use the same command lines, which are very standard:
I wonder if it's something to do with pairing. Try setting --chimeric-pairs=discard. There are three main things that take time/memory in UMI-tools:
Hi Ian, I have tried, but it doesn't seem to improve the performance much. I have added --chimeric-pairs=discard.
What happens if you run without --paired?
Okay, I will try that.
Okay, it worked flawlessly! I don't understand why this is working. I put two FASTQ files into STAR to do the alignment, so shouldn't I use --paired?
Yes, for a valid output you need to use --paired.

When UMI-tools encounters a read1 it wants to keep, it adds its details to a buffer. When it encounters a read2, it checks whether it is in the buffer and outputs it. If it gets to the end of the contig without finding the read2, it goes back to the beginning of the contig and scans it again to catch read2s that were 5' of the read1. All this buffer stores is the read name, contig and location, so it's not much information, but the buffer must be getting so full that it is using all your memory.

That shouldn't happen if you have "--chimeric-pairs=discard", as the buffer should be empty after it has finished dealing with a particular contig (it doesn't allow reads where the mate is on a different contig). But it seems that somehow it is. Perhaps you have one contig that is really, really densely covered, and pairs tend to be a long way apart? Is it possible you have a transcript sequence in your transcript reference that is masked out in the genome reference, like repeat sequences or rRNA sequences? (One quick way to check that is sketched below.)

It depends how deep you want to go on this. I can suggest several ways forward:
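A quick way to check the "one very dense contig" hypothesis, assuming samtools is available and the transcriptome BAM is coordinate-sorted (file names are illustrative):

```
# Index the BAM, then list the contigs carrying the most mapped reads
samtools index transcriptome.bam
samtools idxstats transcriptome.bam | sort -k3,3nr | head
```

If one transcript (for example an rRNA) carries a disproportionate share of the reads, that would fit the picture of the pairing buffer growing very large on a single contig.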
I really want to go deep into this, since all of the samples from my project are like that. I will try using the genome alignment, since it seems the best way to deal with this problem, and I will get back to you. If you want, I can send you the BAM files in case you want to inspect them and figure out what the problem is.
Hi Ian, I'm using mudskipper to convert the genome alignment into a transcriptome alignment. However, this is very difficult to implement, since we are using a public pipeline that is constantly being updated, and it is hard to add this program in the middle of it. I am trying to work out what the problem could be, and if I find it I will let you know. In the meantime I will leave here an example BAM file aligned against the transcriptome that is giving issues, and one aligned against the genome that is working well, in case you can analyze them. https://flomics-public.s3.eu-west-1.amazonaws.com/lluc/umitools_issue/test_sample_5M.Aligned.toGenome.sorted.bam Thank you very much!
Dear Ian,
I think you are almost there. I think it is the large number of contigs. I ran the files shared with me here. The genome-aligned file took 10 minutes and used 1 GB of RAM; the transcriptome-aligned file took 5 hours and used 400 MB of RAM. I'll profile the run today, but my guess is that the extra time is due not to loading the contig into memory (which shouldn't happen), but rather to the act of going backwards in the file to find the start of the contig when we look for mates.

However, I'm not seeing it take more than 24 hours. I had originally misread your issue and thought that you said it was using 4 TB of memory, rather than reading 4 TB of data from disk.
Can I just check which version you are using?
Thanks Ian, looking forward to your profiling results!
Hi Julien,

The latest version on bioconda should be v1.1.2 (https://anaconda.org/bioconda/umi_tools). v1.1.0 included a change that might be relevant here. I wonder what is keeping conda from installing that version?
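One way to request that version explicitly and confirm what actually got installed (a minimal sketch; the exact channel setup on your system may differ):

```
# Ask for the specific bioconda build, then check which version is present
conda install -c conda-forge -c bioconda umi_tools=1.1.2
conda list umi_tools
```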
Hi Ian,

I also tried explicitly specifying the version and ran into a system incompatibility:

Using Ubuntu 20.04.4 LTS
Hi Julien,

There is no reason that UMI-tools should need a particular version of libgcc; this is imposed by bioconda, who build all their packages against a specific glibc version. I think there is a way around it, but I can't remember; you'll have to ask over on the bioconda support channels. As an aside, I think this means you won't be able to install anything from the latest build of bioconda. Alternatively, you should be able to download the code from GitHub and run the install from source.
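A minimal sketch of an install from source, assuming the standard UMI-tools GitHub repository (the exact command suggested here was cut off in this copy of the thread):

```
# Clone the UMI-tools source and install it into the current Python environment
git clone https://github.com/CGATOxford/UMI-tools.git
cd UMI-tools
pip install .
```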
Thanks Ian for your help.
Hi @IanSudbery,

Another, more quick-and-dirty option would be to:

What do you think of this approach? I'm pretty sure there's a flaw in there, especially when dealing with multi-mapped reads...
Okay, I have a couple of things to say.

My profiling revealed that it was indeed the moving backwards in the file to rescan each contig that was slowing the processing down. This was because, in order not to lose its place, pysam opens a new handle on the file every time it does this. I have managed to change the code to avoid this, and the run time for transcriptome-aligned data is now close to that for genome-aligned data. You can find/try the code in PR #543. However, since you are using umi_tools in production, it's going to take some time for this to find its way into a release that nf-core will use.

Looking at your plan above, I think it is the way to go, not only because of the speed, but also because of multiple mapping. When you align to the transcriptome you will get a read aligning to many different transcripts. Each alignment position will be treated independently, so it is possible that a given read will be chosen on one transcript but not another. However, this seems wrong: a read is either a duplicate or it is not. Thus deduplicating on the genome is definitely superior to deduplicating on the transcriptome, and your approach gives you the best of both worlds. I have a suggestion for doing it quicker though. If you name sort your
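Ian's exact suggestion is cut off above, but a minimal sketch of the general plan (deduplicate on the genome alignment, then keep only the surviving read names in the transcriptome alignment) might look like this, with illustrative file names and assuming a reasonably recent samtools:

```
# 1. Deduplicate the genome alignment as usual
umi_tools dedup -I genome.sorted.bam --paired -S genome.dedup.bam

# 2. Collect the read names that survived deduplication
samtools view genome.dedup.bam | cut -f1 | sort -u > kept_read_names.txt

# 3. Keep only those read names in the transcriptome alignment
#    (-N/--qname-file needs samtools >= 1.12)
samtools view -b -N kept_read_names.txt transcriptome.bam > transcriptome.dedup.bam
```

Name sorting both BAMs (samtools sort -n) would allow the same comparison as a single streaming pass, but for libraries of this size the simple name-list filter above should be manageable.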
Sounds great Ian, thank you so much for your remarkable responsiveness. We'll see if we can implement this workaround until nf-core includes your fix in a release.
Hello,
I'm running a pipeline that uses UMI dedup. However, when deduplicating the alignment against the transcriptome, it takes a huge amount of time (it can go up to more than 24 h). We did not see this problem when aligning against the genome. We have also seen that the program reads a huge amount of data (around 4 TB), whereas when deduplicating the genome alignment it read only around 30 GB.
In our case we have a 12-bp UMI at the beginning of the first read (we extract it with --bc-pattern=NNNNNNNNNNNN). The FASTQ files have 15M paired-end reads, we align with STAR, and the resulting BAM files weigh around 400 MB (around 50% of reads align).
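For context, a typical invocation matching this description might look like the following; the file names are illustrative and these are not necessarily the exact commands used here:

```
# Move the 12-bp UMI from the start of read 1 into the read names
umi_tools extract --bc-pattern=NNNNNNNNNNNN \
    -I sample_R1.fastq.gz --read2-in=sample_R2.fastq.gz \
    -S sample_R1.extracted.fastq.gz --read2-out=sample_R2.extracted.fastq.gz

# After STAR alignment, coordinate sorting and indexing, deduplicate by UMI
umi_tools dedup -I sample.Aligned.sortedByCoord.out.bam --paired -S sample.dedup.bam
```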
We prepared the libraries from very low concentrations of RNA and ran multiple PCR cycles, so we would expect the samples to be highly duplicated, which I suppose could increase the run time.
Which parameters could I change to improve the speed of the process?
Thank you very much,
Lluc