Problems with UMI dedup and time when aligned against transcriptome #539
That's very odd. Can you let me know the full command lines you used for both the genome and transcriptome based dedup?
Both use the same command lines, which are very standard:
I wonder if it's something to do with pairing. Try setting --chimeric-pairs=discard. There are three main things that take time/memory in UMI-tools:
Hi Ian, I have tried, but it doesn't seem to improve the performance much. I have added --chimeric-pairs=discard.
What happens if you run without --paired?
Okay, I will try that.
Okay, it worked flawlessly! I don't understand why this is working. I put two FASTQ files into STAR to do the alignment, so shouldn't I use --paired?
Yes, for a valid output you need to use --paired.

When UMI-tools encounters a read1 it wants to keep, it adds its details to a buffer. When it encounters a read2, it checks whether it is in the buffer and outputs it. If it gets to the end of the contig without finding the read2, it goes back to the beginning of the contig and scans it again to catch read2s that were 5' of the read1. All this buffer stores is the read name, contig and location, so it's not much information, but the buffer must be getting so full that it is using all your memory.

That shouldn't happen if you have "--chimeric-pairs=discard", as the buffer should be empty after it has finished dealing with a particular contig (it doesn't allow reads where the mate is on a different contig). But it seems that somehow it is. Perhaps you have one contig that is really, really densely covered, and pairs tend to be a long way apart? Is it possible you have a transcript sequence in your transcript reference that is masked out in the genome reference, like repeat sequences or rRNA sequences? (One quick way to check that is sketched below.)

It depends how deep you want to go on this. I can suggest several ways forward:
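A quick way to check the "one very dense contig" hypothesis, assuming samtools is available and the transcriptome BAM is coordinate-sorted (file names are illustrative):

```
# Index the BAM, then list the contigs carrying the most mapped reads
samtools index transcriptome.bam
samtools idxstats transcriptome.bam | sort -k3,3nr | head
```

If one transcript (for example an rRNA) carries a disproportionate share of the reads, that would fit the picture of the pairing buffer growing very large on a single contig.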
I really want to go deep into this, since all of the samples from my project are like that. I will try using the genome alignment, since it seems the best way to deal with this problem, and I will get back to you. If you want, I can send you the BAM files in case you want to inspect them and figure out what the problem is.
Hi Ian, I'm using mudskipper to convert the genome alignment into a transcriptome alignment. However, this is very difficult to implement, since we are using a public pipeline that is constantly being updated, and it is hard to add this program in the middle of it. I am trying to work out what the problem could be, and if I find it I will let you know. In the meantime I will leave here an example BAM file aligned against the transcriptome that is giving issues, and one aligned against the genome that is working well, in case you can analyze them. https://flomics-public.s3.eu-west-1.amazonaws.com/lluc/umitools_issue/test_sample_5M.Aligned.toGenome.sorted.bam Thank you very much!
Dear Ian,
I think you are almost there. I think it is the large number of contigs. I ran the files shared with me here. The genome-aligned file took 10 minutes and used 1 GB of RAM; the transcriptome-aligned file took 5 hours and used 400 MB of RAM. I'll profile the run today, but my guess is that the extra time is due not to loading the contig into memory (which shouldn't happen), but rather to the act of going backwards in the file to find the start of the contig when we look for mates.

However, I'm not seeing it take more than 24 hours. I had originally misread your issue and thought that you said it was using 4 TB of memory, rather than reading 4 TB of data from disk.
Can I just check which version you are using?
Thanks Ian, looking forward to your profiling results!
Hi Julien,

The latest version on bioconda should be v1.1.2 (https://anaconda.org/bioconda/umi_tools). v1.1.0 included a change that might be relevant here. I wonder what is keeping conda from installing that version?
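One way to request that version explicitly and confirm what actually got installed (a minimal sketch; the exact channel setup on your system may differ):

```
# Ask for the specific bioconda build, then check which version is present
conda install -c conda-forge -c bioconda umi_tools=1.1.2
conda list umi_tools
```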
Hi Ian,

I also tried explicitly specifying the version and ran into a system incompatibility:

Using Ubuntu 20.04.4 LTS
Hi Julien,

There is no reason that UMI-tools should need a particular version of libgcc; this is imposed by bioconda, who build all their packages against a specific glibc version. I think there is a way around it, but I can't remember; you'll have to ask over on the bioconda support channels. As an aside, I think this means you won't be able to install anything from the latest build of bioconda. Alternatively, you should be able to download the code from GitHub and run the install from source.
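A minimal sketch of an install from source, assuming the standard UMI-tools GitHub repository (the exact command suggested here was cut off in this copy of the thread):

```
# Clone the UMI-tools source and install it into the current Python environment
git clone https://github.com/CGATOxford/UMI-tools.git
cd UMI-tools
pip install .
```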
Thanks Ian for your help.
Hi @IanSudbery,

Another, more quick-and-dirty option would be to:

What do you think of this approach? I'm pretty sure there's a flaw in there, especially when dealing with multi-mapped reads...
Okay, I have a couple of things to say.

My profiling revealed that it was indeed the moving backwards in the file to rescan each contig that was slowing the processing down. This was because, in order not to lose its place, pysam opens a new handle on the file every time it does this. I have managed to change the code to avoid this, and the run time for transcriptome-aligned data is now close to that for genome-aligned data. You can find/try the code in PR #543. However, since you are using umi_tools in production, it's going to take some time for this to find its way into a release that nf-core will use.

Looking at your plan above, I think it is the way to go, not only because of the speed, but also because of multiple mapping. When you align to the transcriptome you will get a read aligning to many different transcripts. Each alignment position will be treated independently, so it is possible that a given read will be chosen on one transcript but not another. However, this seems wrong: a read is either a duplicate or it is not. Thus deduplicating on the genome is definitely superior to deduplicating on the transcriptome, and your approach gives you the best of both worlds. I have a suggestion for doing it quicker though. If you name sort your
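Ian's exact suggestion is cut off above, but a minimal sketch of the general plan (deduplicate on the genome alignment, then keep only the surviving read names in the transcriptome alignment) might look like this, with illustrative file names and assuming a reasonably recent samtools:

```
# 1. Deduplicate the genome alignment as usual
umi_tools dedup -I genome.sorted.bam --paired -S genome.dedup.bam

# 2. Collect the read names that survived deduplication
samtools view genome.dedup.bam | cut -f1 | sort -u > kept_read_names.txt

# 3. Keep only those read names in the transcriptome alignment
#    (-N/--qname-file needs samtools >= 1.12)
samtools view -b -N kept_read_names.txt transcriptome.bam > transcriptome.dedup.bam
```

Name sorting both BAMs (samtools sort -n) would allow the same comparison as a single streaming pass, but for libraries of this size the simple name-list filter above should be manageable.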
Sounds great Ian, thank you so much for your remarkable responsiveness. We'll see if we can implement this workaround until nf-core includes your fix in a release.
Hello,
I'm running a pipeline that uses UMI dedup. However, when deduplicating the alignment against the transcriptome, it takes a huge amount of time (it can go up to more than 24 h). We did not see this problem when aligning against the genome. We have also seen that the program reads a huge amount of data (around 4 TB), whereas when deduplicating the genome alignment it read only around 30 GB.
In our case we have a 12-bp UMI at the beginning of the first read (we extract it with --bc-pattern=NNNNNNNNNNNN). The FASTQ files have 15M paired-end reads, we align with STAR, and the resulting BAM files weigh around 400 MB (around 50% of reads align).
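For context, a typical invocation matching this description might look like the following; the file names are illustrative and these are not necessarily the exact commands used here:

```
# Move the 12-bp UMI from the start of read 1 into the read names
umi_tools extract --bc-pattern=NNNNNNNNNNNN \
    -I sample_R1.fastq.gz --read2-in=sample_R2.fastq.gz \
    -S sample_R1.extracted.fastq.gz --read2-out=sample_R2.extracted.fastq.gz

# After STAR alignment, coordinate sorting and indexing, deduplicate by UMI
umi_tools dedup -I sample.Aligned.sortedByCoord.out.bam --paired -S sample.dedup.bam
```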
We prepared the libraries from very low concentrations of RNA and ran multiple PCR cycles, so we would expect the samples to be highly duplicated, which I suppose could increase the run time.
Which parameters could I change to improve the speed of the process?
Thank you very much,
Lluc