Fixes #539.
When `TwoPassPairWriter` reaches the end of a contig, it calls `write_mates`, which reopens that contig from the start and scans through for remaining mates. To do this, `write_mates` uses `pysam.AlignmentFile.fetch(..., multiple_iterators=True)`. As the file handle it uses is the same one being used by `bundle_iterator`, `multiple_iterators=True` ensures that the position in the file is not lost in this operation. `multiple_iterators=True` imposes some overhead. With a reasonable number of contigs this is not a problem, as the overhead is small compared to the cost of the scan. However, when alignment is done to the transcriptome, `write_mates` is called hundreds of thousands of times, and for some reason `fetch` calls the `__init__` of `pysam.RowIteratorRegion` 4 times for each call to `fetch`. This causes a serious slowdown, such that adding `--paired` to the command line slows the processing of an example file from a couple of minutes to five hours.

This PR changes `TwoPassPairWriter` so that its `__init__` opens a second file handle to the input file. This allows it to drop the requirement to use `multiple_iterators=True` and brings performance back close to that of running without `--paired`.

As far as I can tell, this does not change the output (i.e. the two handles act independently). I have tested this both on the test files and on an example transcriptome alignment provided in #539.
Time to run is reduced from 5 hours to 200 seconds.
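
To illustrate why a second handle removes the need to protect the file position, here is a minimal sketch using plain Python file objects instead of pysam. It is an analogy only, not the code in this PR: `TwoHandleScanner`, `find_from_start` and `demo` are hypothetical names made up for the example, and a small text file stands in for the BAM file.

```python
import os
import tempfile


class TwoHandleScanner:
    """Toy stand-in for the writer: owns a private handle for rescans."""

    def __init__(self, path):
        # Second, independent handle to the same file. The caller's
        # handle is never touched, so its read position is preserved.
        self._rescan = open(path)

    def find_from_start(self, wanted):
        # Rewind only the private handle and scan the whole file.
        self._rescan.seek(0)
        return [line.strip() for line in self._rescan if line.strip() in wanted]

    def close(self):
        self._rescan.close()


def demo():
    # Write a tiny file whose lines play the role of reads.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("r1\nr2\nr3\nr4\n")
        path = tmp.name
    try:
        with open(path) as main:
            scanner = TwoHandleScanner(path)
            first = main.readline().strip()               # main iteration starts
            mates = scanner.find_from_start({"r1", "r3"})  # rescan from the start
            second = main.readline().strip()              # main iteration resumes
            scanner.close()
    finally:
        os.remove(path)
    return first, mates, second


if __name__ == "__main__":
    print(demo())
```

The main handle continues with the second line after the rescan, because the rescan ran on a separate handle. In the pysam setting described above, the corresponding idea is a second `pysam.AlignmentFile` opened on the same input and used only by `write_mates`, so `fetch` can be called without `multiple_iterators=True`.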