Speed up writing mates #543

IanSudbery · 2022-07-07T09:53:51Z

Fixes #539.

When TwoPassPairWriter reaches the end of a contig, it calls write_mates, which reopens that contig from the start and scans through for remaining mates. To do this, write_mates uses pysam.AlignmentFile.fetch(...., multiple_iterators=True). Because write_mates uses the same file handle as bundle_iterator, multiple_iterators=True is needed to ensure that bundle_iterator's position in the file is not lost during this operation.

multiple_iterators=True imposes some overhead. With a reasonable number of contigs this is not a problem, as the overhead is small compared to the cost of the scan. However, when alignment is done to the transcriptome, write_mates is called hundreds of thousands of times and, for some reason, each call to fetch invokes the __init__ of pysam.IteratorRowRegion four times. This causes a serious slowdown: adding --paired to the command line increases the processing time for an example file from a couple of minutes to five hours.

This PR changes TwoPassPairWriter so that its __init__ opens a second file handle to the input file. This removes the need for multiple_iterators=True and brings performance back close to that of a run without --paired.
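The reasoning behind the second handle can be illustrated without pysam. The sketch below is an analogy using plain Python text files, not the UMI-tools code: `scan_from_start`, `demo`, and the tab-separated records are all hypothetical. It shows that a rescan on a shared handle loses the main reader's position, while a rescan on a second handle opened on the same path leaves it untouched.

```python
import os
import tempfile

# Hypothetical records: read name, position. Reads named "a" are mates.
RECORDS = ["a\t100\n", "b\t150\n", "a\t300\n", "c\t400\n"]


def scan_from_start(handle, wanted_names):
    """Rewind `handle` and collect every record whose name is wanted."""
    handle.seek(0)
    hits = []
    for line in iter(handle.readline, ""):
        name = line.split("\t")[0]
        if name in wanted_names:
            hits.append(line.rstrip("\n"))
    return hits


def demo(use_second_handle):
    """Read two records, rescan for mates, then read the next record.

    Returns (mates found by the rescan, next record seen by the main reader).
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "reads.tsv")
        with open(path, "w") as out:
            out.writelines(RECORDS)

        main = open(path)
        # Either an independent handle on the same file, or the shared one.
        mate_handle = open(path) if use_second_handle else main
        try:
            main.readline()  # main reader consumes "a 100"
            main.readline()  # main reader consumes "b 150"
            mates = scan_from_start(mate_handle, {"a"})
            next_record = main.readline().rstrip("\n")
        finally:
            main.close()
            if mate_handle is not main:
                mate_handle.close()
    return mates, next_record
```

With `demo(True)` the main reader continues at the third record after the rescan. With `demo(False)` the rescan has moved the shared handle to the end of the file, so the main reader gets an empty string. In this PR the same separation is achieved by giving TwoPassPairWriter its own handle on the input file, so fetch no longer needs multiple_iterators=True to protect bundle_iterator's position.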

As far as I can tell, this does not change the output (i.e. the two handles act independently). I have tested this both on the test files and on an example transcriptome alignment provided in #539.

Run time for the example file is reduced from 5 hours to 200 seconds, roughly a 90-fold speedup.

TomSmithCGAT · 2022-07-11T13:53:09Z

Wow, that's a pathologically bad performance UMI-tools has had for transcriptome alignments 😱 Good catch!

Open a seperate file so that mates can be fetched without multiple_iterators=True

59bacc8

IanSudbery requested a review from TomSmithCGAT July 7, 2022 09:53

IanSudbery mentioned this pull request Jul 7, 2022

Problems with UMI dedup and time when aligned against transcriptome #539

Closed

TomSmithCGAT approved these changes Jul 11, 2022

IanSudbery merged commit d66595e into master Jul 11, 2022

koefoeden mentioned this pull request Feb 20, 2023

Added resource parameters for UMITOOLS_DEDUP nf-core/rnaseq#947

Closed

MatthiasZepper mentioned this pull request Mar 6, 2023

Bump version of umi-tools to 1.14 nf-core/modules#2971

Merged

14 tasks

TomSmithCGAT deleted the {IS}_speed_up_write_mates branch October 3, 2024 09:27

IanSudbery mentioned this pull request Oct 11, 2024

v1.1.5 kicking out one of pair of discordant reads ; v1.1.1 not showing same behaviour #664

Open
