Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up writing mates #543

Merged
merged 1 commit into from
Jul 11, 2022
Merged

Speed up writing mates #543

merged 1 commit into from
Jul 11, 2022

Conversation

IanSudbery
Copy link
Member

Fixes #539.

When TwoPassPairWriter reaches the end of a contig, it calls write_mates, which reopens that contig from the start and scans through for remaining mates. To do this, write_mates uses pysam.AlignmentFile.fetch(...., multiple_iterators=True). As the file uses is the same filehandle being used by bundle_iterator, then multiple_iterators=True ensures that the position in the file is not lost in this operation.

multiple_iterators=True imposes some overhead. With a reasonable number of contigs this is not a problem, as the overhead is small compared to the cost of the scan. However, when alignment is done to the transcriptome, write_mates is called 100s of thousands of times, and for some reason, fetch calls the __init__ of psyam.RowIteratorRegion 4 times for each call to fetch. This causes a serious slow down, such that adding --paired to the commandline slows the processing for an example file down from a couple of minutes to five hours.

This PR changes TwoPassPairWriter so that it's __init__ opens a second file handle to the input file. This allows it to drop the requirement to use multiple_iterators=True and returns the performance to near that of the performance without --paired.

As far as I can tell, this does not change the output (i.e. the two handles act independently). I have tested this both on the test files and on an example transcriptome alignment provided in #539.

Time to run is reduced from 5 hours to 200 seconds.

@TomSmithCGAT
Copy link
Member

Wow, that's a pathologically bad performance UMI-tools has had for transcriptome alignments 😱 Good catch!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Problems with UMI dedup and time when aligned against transcriptome
2 participants