Index merging #1263
Conversation
Force-pushed 35e013e to f67022f
What is the state of this? Is it waiting for an hts-specs PR? Let's rebase this and move it forward, please!
@yfarjoun it's waiting for a review.
Force-pushed f67022f to 4ed2460
I've rebased this. I think the main point for discussion is how to structure these changes with regard to testing. There are no direct unit tests, as the testing is in Disq (e.g. https://github.com/disq-bio/disq/blob/master/src/test/java/org/disq_bio/disq/BaiMergingTest.java). They would be hard to move here without also exposing code to produce the headerless part files that are the files to be indexed. I'd be interested to hear people's thoughts on this.
Force-pushed 4ed2460 to f18aa28
Codecov Report

@@             Coverage Diff             @@
##             master     #1263    +/-   ##
===============================================
+ Coverage     68.046%   68.375%   +0.329%
- Complexity      8373      8485      +112
===============================================
  Files            573       583       +10
  Lines          33893     34372      +479
  Branches        5663      5731       +68
===============================================
+ Hits           23063     23502      +439
- Misses          8642      8649        +7
- Partials        2188      2221       +33
@tomwhite What would we need to push to htsjdk in order to run tests on this? Do we need the ability to generate headerless bam snippets? Could we push that to htsjdk first and then port this with some tests?
Yes, that's what we need. I will write something to do this so we can test the code in this PR. |
@lbergelson @cmnbroad I've now added tests for all the merge pathways. I've left the code for constructing partitioned BAM files (i.e. with header, part files, terminator) in the tests themselves, because it's not clear what the API for writing partitioned files should look like in general.
Force-pushed cefdb4b to 9bf8d90
Following discussion with @lbergelson I've documented the file format for partitioned BAM, CRAM, and VCF. This should be ready for review now. |
Co-Authored-By: Phil Shapiro <pshapiro@broadinstitute.org>
(Some notes from a code review session with @lbergelson and @droazen.)

The high-level idea is that BAM, CRAM, or VCF files can be written in partitioned format (which for BAM is defined in BAMMergerTest). Note that the partitioned file format is defined in a test, not a main class, since there are no public implementations. This is because there is no driving use case in htsjdk itself; if there were, we'd have to decide things like the concurrency model for writing partitions in parallel, which is out of scope for this PR.

There is a new abstract class, IndexMerger. Its subclasses know how to take a partitioned file and merge its per-part indexes into a single top-level index. There are implementations for two types of BAM index (bai, sbi), CRAM (crai), and Tabix (tbi).

There is a lot of detail in the implementations. Mainly it is about how to shift virtual offsets: if a part file begins at (non-virtual) offset x, what should the virtual offsets in the part file's index be changed to? Answer: BlockCompressedFilePointerUtil#shift. The changes for BAM include adding a shift method to BAMIndexMetaData, Bin, Chunk, and so on. SBI is simpler since it is just a file of offsets. TBI is very similar to BAI, with mainly slight API differences in htsjdk. CRAI is simple compared to BAI.

Complications
Impact
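For readers unfamiliar with BGZF virtual file pointers, here is a minimal, self-contained sketch of the offset-shifting idea described above. The class and constants below are illustrative only; in htsjdk the real logic lives in BlockCompressedFilePointerUtil#shift. A virtual pointer packs the compressed block address in the upper 48 bits and the offset within the uncompressed block in the lower 16 bits, so shifting a part's index entries means adding the part's start offset to the block address while leaving the within-block offset untouched.

```java
// Hypothetical sketch, not htsjdk's actual implementation.
public final class VirtualOffsetShiftDemo {
    static final int SHIFT = 16;
    static final long OFFSET_MASK = 0xFFFFL;

    /** Shift a virtual file pointer by a (non-virtual) byte offset. */
    static long shift(long virtualFilePointer, long byteOffset) {
        long blockAddress = virtualFilePointer >>> SHIFT;    // compressed (file) offset of the BGZF block
        long withinBlock = virtualFilePointer & OFFSET_MASK; // offset within the uncompressed block
        return ((blockAddress + byteOffset) << SHIFT) | withinBlock;
    }

    public static void main(String[] args) {
        long vfp = (100L << SHIFT) | 42L; // block at byte 100, offset 42 within the block
        long shifted = shift(vfp, 1000L); // this part starts at byte 1000 of the merged file
        System.out.println(shifted >>> SHIFT);     // 1100
        System.out.println(shifted & OFFSET_MASK); // 42
    }
}
```

The within-block offset never changes because concatenating whole BGZF blocks cannot alter positions inside any block.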
@tomwhite A bunch of minor requests, mostly documentation and file organization. I think it would be good to put these all together in one package if possible, to reduce the visibility of some of the auxiliary classes and make it clear how they relate to each other. We also need documentation somewhere in the main code explaining the point of this whole thing. That sort of exists in the tests, but it would be good to have it in the actual code as well.
The copy of the tabix indexer bugs me, but I poked at it a bit and it didn't seem like there was an easy way to reduce the code copying. If you can find a clever way to refactor it so it's cleaner, then 👍, but I didn't see a great way.
final Bin bin = new Bin(referenceSequence, binNumber);
for (Chunk newChunk : allChunks) {
    // logic is from BinningIndexBuilder#processFeature
    final long chunkStart = newChunk.getChunkStart();
I think this block could probably be extracted as a method Bin.addChunk() that replaces the duplicated code here, in CRAMBAIIndexer, and in BinningIndexBuilder.
Done
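For context, the chunk-accumulation logic being discussed here follows a common coalescing pattern. The sketch below is a simplified assumption (plain `long[]` pairs instead of htsjdk's Chunk/Bin types, and a raw-offset adjacency test instead of the real same-BGZF-block test): a new chunk either extends the last chunk when they overlap or touch, or is appended as a new entry.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the htsjdk Chunk/Bin API.
public final class ChunkListDemo {
    /** Add a [start, end) chunk, coalescing with the previous chunk if they touch. */
    static void addChunk(List<long[]> chunks, long start, long end) {
        if (!chunks.isEmpty()) {
            long[] last = chunks.get(chunks.size() - 1);
            if (last[1] >= start) { // overlapping or adjacent: extend in place
                last[1] = Math.max(last[1], end);
                return;
            }
        }
        chunks.add(new long[] {start, end});
    }

    public static void main(String[] args) {
        List<long[]> chunks = new ArrayList<>();
        addChunk(chunks, 0, 10);
        addChunk(chunks, 10, 20); // touches the previous chunk: coalesced
        addChunk(chunks, 30, 40); // gap before this one: new chunk
        System.out.println(chunks.size()); // 2
    }
}
```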
 * Merges BAM index files for (headerless) parts of a BAM file into a single
 * index file. The index files must have been produced using an uninitialized window (TODO).
 */
public class BAMIndexMerger {
make this final and give it a private no-args constructor
Made class final. We can't give it a private no-args constructor as it is referenced from the tests (and Disq), and there are two required final args.
}
final int numReferences = bais.get(0).getNumberOfReferences();
for (AbstractBAMFileIndex bai : bais) {
    if (bai.getNumberOfReferences() != numReferences) {
Should this be a stronger check of comparing sequence dictionaries?
Done
}
final SAMSequenceDictionary dict = header.getSequenceDictionary();
final List<AbstractBAMFileIndex> bais = new ArrayList<>();
for (SeekableStream baiStream : baiStreams) {
This might be one of the rare use cases for a parallel stream.
The index files are all read into memory (in the test), so each AbstractBAMFileIndex is a memory-backed index; there's probably not much call for a parallel stream. It would be easy to change later if it showed a performance improvement.
Arrays.parallelPrefix(offsets, (a, b) -> a + b); // cumulative offsets

try (BinaryBAMIndexWriter writer =
        new BinaryBAMIndexWriter(numReferences, baiOut)) {
Strange extra line break here, these fit on one line.
Done
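For context, the cumulative-offsets idiom quoted above turns an array of part lengths into the start offset of each part via an in-place prefix sum. A minimal runnable illustration (array values are invented for the example):

```java
import java.util.Arrays;

public class CumulativeOffsetsDemo {
    public static void main(String[] args) {
        // Entry i holds the length of the part preceding part i (first entry 0).
        // After the prefix sum, offsets[i] is the byte offset where part i starts.
        long[] offsets = {0L, 100L, 250L, 50L};
        Arrays.parallelPrefix(offsets, Long::sum); // in-place cumulative sums
        System.out.println(Arrays.toString(offsets)); // [0, 100, 350, 400]
    }
}
```

`Long::sum` as a method reference is equivalent to the `(a, b) -> a + b` lambda in the quoted line.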
// change is needed to support index merging.

/**
 * IndexCreator for Tabix.
The above comment should be part of the javadoc I think.
Done
import java.nio.file.Path;
import java.util.List;

public class StreamBasedTabixIndexCreator extends AllRefsTabixIndexCreator {
Javadoc
Done
@@ -484,11 +484,6 @@ else if (STREAM_TYPES.contains(this.outType))
    writer = createBCFWriter(outPath, outStreamFromFile);
    break;
case VCF_STREAM:
    if (options.contains(Options.INDEX_ON_THE_FLY)) {
why does this have to be removed?
So a tabix index can be written when writing a VCF to a stream.
private final static Path BAM_FILE = new File("src/test/resources/htsjdk/samtools/BAMFileIndexTest/index_test.bam").toPath();

/**
 * Writes a <i>partitioned BAM</i>.
I think a description of the expected file structure would be useful to add to the IndexMerger abstract class. If you move those classes into a new package, you could put it as package documentation.
Done. I haven't moved everything into one package for reasons explained above.
private BlockCompressedInputStream blockStream;

public static void assertEquals(Path vcfFile, Path tbiFile1, Path tbiFile2, boolean identical)
Could you add documentation explaining the case the identical parameter is for?
Done
- Make BAMIndexMerger final
- Check sequence dictionaries of BAI parts
- Use Long::sum method ref
- Line break
- Document fillInUninitializedValues
- Make CachingBAMFileIndexOptimized package private
- Include two numbers that differ in exception message
- Make new BAMIndexMetaData constructor private
- Add doc for BAMStreamWriter
- Fixed a comment referring to an out-of-date test
- Simplified CachingBAMFileIndexOptimized
- Add doc for IndexMerger
- Rename class to CachingBamFileIndexOptimizedForMerging
- Class doc for CRAIIndexMerger
- Class doc for AllRefsTabixIndexCreator
- Class doc for StreamBasedTabixIndexCreator
- Javadoc for TbiEqualityChecker
- Move partitioned file format documentation to main classes
- Assert number of parts is greater than one in tests
- Use new FileExtensions class
Thanks for the review @lbergelson. I've addressed your feedback in the latest update, as well as some other points that came up in the meeting.
I haven't moved the merger classes into their own package, since BAMIndexMerger depends on internals from Bin, BAMIndexMetadata, and BAMIndexContent that would be better not to make public.
Looks good. Thank you @tomwhite!
Awesome! Thanks @lbergelson
Description
This PR is to discuss how to introduce index merging into htsjdk. The idea is that BAI/SBI/TBI indexes can be constructed in parallel parts then merged in a final step. The code is already used successfully in Disq to write indexes in parallel on the fly, which is more efficient than a final, serial indexing step.
Disq uses Spark for parallel processing, but the code here does not have any Spark dependencies.
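As a rough illustration of the construct-in-parts-then-merge flow described above, here is a toy, SBI-like merger (an SBI index is essentially a list of file offsets). All class and method names below are invented for this sketch and are not the htsjdk API; the point is only that each part's index entries are relative to the start of that part, so merging shifts them by the cumulative start offset of the part within the merged file.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the htsjdk API.
public final class SbiLikeMergerDemo {
    private final List<Long> mergedOffsets = new ArrayList<>();
    private long partStart; // byte offset where the next part begins

    public SbiLikeMergerDemo(long headerLength) {
        // Parts are headerless; the merged file starts with the header.
        this.partStart = headerLength;
    }

    /** Merge one part's index, shifting its offsets, then advance past the part. */
    public void processIndex(long[] partOffsets, long partLength) {
        for (long offset : partOffsets) {
            mergedOffsets.add(partStart + offset);
        }
        partStart += partLength;
    }

    public List<Long> finish() {
        return mergedOffsets;
    }

    public static void main(String[] args) {
        SbiLikeMergerDemo merger = new SbiLikeMergerDemo(100); // 100-byte header
        merger.processIndex(new long[] {0, 40}, 80); // part 1 is 80 bytes long
        merger.processIndex(new long[] {0, 30}, 60); // part 2 is 60 bytes long
        System.out.println(merger.finish()); // [100, 140, 180, 210]
    }
}
```

BAI/TBI merging follows the same shape but must shift packed virtual file pointers rather than plain offsets.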
The main additions are:
The indexes being merged depend implicitly on the indexed files having been written in parts too (in headerless form). Do we need to add something to do this to htsjdk too?
Any thoughts on how this addition should be structured? E.g. should it be broken into pieces? Thoughts on testing, etc.?
cc @lbergelson
Checklist