Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed bottleneck: Avoid re-parsing bam header when reading many intervals #40

Open
RoboStephen opened this issue Oct 18, 2019 · 4 comments

Comments

@RoboStephen
Copy link

I've got come code which parses many intervals from a bam file, with calls to the ParseRange() method. The results are correct, but slower than expected.

Reading through the source code (and running the profiler), it looks like the slowdown came from time spent re-parsing the bam header and index in each each call to ParseRange(). Most of my time was spent in BamParser.GetHeader() or BamIndexStorage.Read(). For the particular use case I'm working on, the overhead of re-reading and re-parsing is high. (My bam file has >10000 contigs. I'm parsing several thousand intervals. There are ~100 aligned reads for a typical interval, though some have far higher coverage).

I'd like to speed this up, preferably in a non-hacky way. I've got a change to my copy of the code that does the trick for my use case. The parser caches its I added an optional "useCaching" flag to the ParseRange() method (false by default). If the flag is set to true, the parser re-use the existing SAMAlignmentHeader and BAMIndex objects (if they exist - if they're null we parse as usual). With the current patch it's up to the caller to avoid setting this flag to true when switching between bam files, though that could be an easy sanity-check to add. (And of course, this caching would not be appropriate when parsing a bam file that has been updated on disk in between one ParseRange() call and the next). I can polish that up for a pull request if that sounds like a sane way to proceed and not too much of a corner case to worry about.

@jamesmhogan
Copy link

jamesmhogan commented Oct 21, 2019 via email

@RoboStephen
Copy link
Author

Sounds good! What's the best way to share the code changes? I tried submitting a pull request, but I don't think I have the permissions to push my local branch (cacheBamHeader) to github. For now, let me upload the diffs for the two affected files - that's better than nothing!

Diff.BAMParser.txt
Diff.BAMParserExtensions.txt

@RoboStephen
Copy link
Author

Also....it's a nitpick, but there's one L too many in the file name DotNetBio-Fulll.sln (should be Full instead of Fulll). As long as we're contemplating changes we could fix that too.

@RoboStephen
Copy link
Author

Update: Here are diffs for a new-and-improved fix (all tests pass, and a fix for issue #41 is incorporated as well)
MyDiffs.txt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants