Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Index merging #1263

Merged
merged 18 commits into from
Oct 25, 2019
Merged

Index merging #1263

merged 18 commits into from
Oct 25, 2019

Conversation

tomwhite
Copy link
Contributor

Description

This PR is to discuss how to introduce index merging into htsjdk. The idea is that BAI/SBI/TBI indexes can be constructed in parallel parts then merged in a final step. The code is already used successfully in Disq to write indexes in parallel on the fly, which is more efficient than a final, serial indexing step.

Disq uses Spark for parallel processing, but the code here does not have any Spark dependencies.

The main additions are:

  • Merger classes (BAMIndexMerger, SBIIndexMerger, TabixIndexMerger)
  • Utility methods for shifting virtual file pointers by an offset
  • Visibility changes

The indexes being merged depend implicitly on the indexed files having been written in parts too (in headerless form). Do we need to add something to do this to htsjdk too?

Any thoughts on how this addition should be structured? E.g. if it should be broken into pieces; thoughts on testing etc.

cc @lbergelson

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
  • Is not backward compatible (breaks binary or source compatibility)

@yfarjoun
Copy link
Contributor

What is the state of this? is it waiting for a htsspecs pr? let's rebase this and move it forward please!

@tomwhite
Copy link
Contributor Author

@yfarjoun it's waiting for a review.

@tomwhite tomwhite force-pushed the bai_sbi_tbi_merging branch from f67022f to 4ed2460 Compare June 18, 2019 09:03
@tomwhite
Copy link
Contributor Author

I've rebased this, and also added a CRAIIndexMerger as well as some bug fixes for the tabix merging path.

I think the main point for discussion is how to structure these changes with regard to testing. There are no direct unit tests as the testing is in Disq (e.g. https://github.com/disq-bio/disq/blob/master/src/test/java/org/disq_bio/disq/BaiMergingTest.java). They would be hard to move here without also exposing code to produce headerless part files that are the files to be indexed. I'd be interested to hear people's thoughts on this.

@tomwhite tomwhite force-pushed the bai_sbi_tbi_merging branch from 4ed2460 to f18aa28 Compare June 18, 2019 09:13
@codecov-io
Copy link

codecov-io commented Jun 18, 2019

Codecov Report

Merging #1263 into master will increase coverage by 0.329%.
The diff coverage is 83.894%.

@@               Coverage Diff               @@
##              master     #1263       +/-   ##
===============================================
+ Coverage     68.046%   68.375%   +0.329%     
- Complexity      8373      8485      +112     
===============================================
  Files            573       583       +10     
  Lines          33893     34372      +479     
  Branches        5663      5731       +68     
===============================================
+ Hits           23063     23502      +439     
- Misses          8642      8649        +7     
- Partials        2188      2221       +33
Impacted Files Coverage Δ Complexity Δ
...main/java/htsjdk/samtools/BinningIndexContent.java 66.197% <ø> (ø) 14 <0> (ø) ⬇️
...antcontext/writer/VariantContextWriterBuilder.java 83.439% <ø> (+2.814%) 60 <0> (+1) ⬆️
src/main/java/htsjdk/samtools/SAMRecord.java 68.323% <ø> (ø) 297 <0> (ø) ⬇️
...til/TerminatorlessBlockCompressedOutputStream.java 100% <100%> (ø) 2 <2> (?)
src/main/java/htsjdk/samtools/Chunk.java 85.714% <100%> (+0.714%) 29 <1> (+1) ⬆️
src/main/java/htsjdk/samtools/IndexMerger.java 100% <100%> (ø) 1 <1> (?)
...jdk/samtools/util/BlockCompressedOutputStream.java 79.71% <100%> (+0.451%) 32 <1> (+2) ⬆️
...in/java/htsjdk/tribble/index/tabix/TabixIndex.java 80% <100%> (+0.732%) 41 <1> (+2) ⬆️
...main/java/htsjdk/samtools/BinningIndexBuilder.java 90.566% <100%> (+4.601%) 16 <1> (+1) ⬆️
...bble/index/tabix/StreamBasedTabixIndexCreator.java 100% <100%> (ø) 2 <2> (?)
... and 44 more

@lbergelson
Copy link
Member

@tomwhite What would we need to push to htsjdk in order to run tests on this? Do we need the ability to generate headerless bam snippets? Could we push that to htsjdk first and then port this with some tests?

@tomwhite
Copy link
Contributor Author

tomwhite commented Jul 2, 2019

What would we need to push to htsjdk in order to run tests on this? Do we need the ability to generate headerless bam snippets?

Yes, that's what we need. I will write something to do this so we can test the code in this PR.

@tomwhite
Copy link
Contributor Author

@lbergelson @cmnbroad I've now added tests for all the merge pathways, see BAMMergerTest, CRAMMergerTest, VCFMergerTest.

I've left the code for constructing partitioned BAM files (i.e. with header, part files, terminator) in the tests themselves. This is because it's not clear what the API for writing partitioned files should look like in general. It's not a SAMFileWriter, since that is inherently sequential. Disq uses Hadoop files to write, and we don't want to bring in that dependency. I'm not sure we want to create an API for this in the absence of another driving use case, so perhaps it's fine to leave this code in tests, and what we need is a clear description of the partitioned format for BAM, CRAM and VCF?

@tomwhite tomwhite force-pushed the bai_sbi_tbi_merging branch from cefdb4b to 9bf8d90 Compare July 16, 2019 08:40
@tomwhite
Copy link
Contributor Author

Following discussion with @lbergelson I've documented the file format for partitioned BAM, CRAM, and VCF.

This should be ready for review now.

@yfarjoun yfarjoun added the Waiting for Review This PR is waiting for a reviewer to respond label Jul 29, 2019
Co-Authored-By: Phil Shapiro <pshapiro@broadinstitute.org>
@tomwhite
Copy link
Contributor Author

(Some notes from a code review session with @lbergelson and @droazen.)

The high-level idea is that BAM, CRAM, or VCF files can be written in partitioned format (which for BAM is defined in BAMMergerTest).

Note that the partitioned file format is defined in a test, not a main class since there are no public implementations. This is because there is no driving use case in htsjdk itself. If there was we’d have to decide things like: what is the concurrency model for writing partitions in parallel, which is out of scope for this PR.

There is a new abstract class, IndexMerger. Its subclasses know how to take a partitioned file and merge it into a single top-level file.

There are implementations for two types of BAM index (bai, sbi), CRAM (crai), and Tabix (tbi).

There is a lot of detail in the implementations. Mainly it is about how to shift virtual offsets - e.g. if a part file begins at (non-virtual) offset x, then what are the virtual offsets in the part file index changed to? Answer: BlockCompressedFilePointerUtil#shift. The changes for BAM include changes to add a shift method to BAMIndexMetaData, Bin, Chunk, and so on. SBI is simpler since it is just a file of offsets. TBI is very similar to BAI - mainly just slight API differences in htsjdk. CRAI is simple compared to BAI.

Complications

  • CachingBAMFileIndex doesn’t allow you to tell if there are no bins for a given reference. So we have CachingBAMFileIndexOptimized, which returns a null BAMIndexContent (see BAMIndexMerger#mergeBAMIndexContent). This is important for merging. (TODO: more detail?)
  • We need to make BAM indexes slightly different. We need to be able to “distinguish between windows that have no overlapping features, and those whose lowest offset is 0”, which possible since part files are headerless. (See BinningIndexBuilder fillInUninitializedValues, and BAMIndexMerger#mergeLinearIndexes where UNINITIALIZED_WINDOW is used)
  • AllRefsTabixIndexCreator. Sequence names are populated from the header, not from the ones that are seen. This is needed since a part file may only have a subset of the sequences in it.
  • We need a way of writing partitioned files. As mentioned earlier, this is in the tests, since we don’t want a Hadoop dependency (that’s in Disq), and we don’t want to make up a public API for this (yet). See BAMMergerTest#PartitionedBAMFileWriter. Also, PartitionedBAMFileMerger exercises BAMIndexMerger.

Impact

  • New classes (IndexMerger). Not for direct use by users.
  • Existing code paths unchanged.
  • Tests.
  • A few visibility changes (private -> public)

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tomwhite A bunch of minor requests, mostly documentation and file organization. I think it would be good to put these all together in 1 package if possible to reduce visibility to some of the auxiliary classes and make it clear how they relate to eachother. We also need documentation somewhere in the main code explaining the point of this whole thing. That sort of exists in the tests but it would be good to have it in the actual code as well.

The copy of the tabix indexer bugs me, but I poked at it a bit and it didn't seem like there was an easy way to reduce the code copying. If you can find a clever way to refactor it so it's cleaner then 👍 but I didn't see a great way.

final Bin bin = new Bin(referenceSequence, binNumber);
for (Chunk newChunk : allChunks) {
// logic is from BinningIndexBuilder#processFeature
final long chunkStart = newChunk.getChunkStart();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this block could probably be extracted as a method Bin.addChunk() and replace duplicate code here, CRAMBAIIndexer, and BinningIndexBuilder

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

* Merges BAM index files for (headerless) parts of a BAM file into a single
* index file. The index files must have been produced using an uninitialized window (TODO).
*/
public class BAMIndexMerger {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this final and give it a private no-args constructor

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made class final. We can't give it a private no-args constructor as it is referenced from the tests (and Disq), and there are two required final args.

}
final int numReferences = bais.get(0).getNumberOfReferences();
for (AbstractBAMFileIndex bai : bais) {
if (bai.getNumberOfReferences() != numReferences) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a stronger check of comparing sequence dictionaries?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}
final SAMSequenceDictionary dict = header.getSequenceDictionary();
final List<AbstractBAMFileIndex> bais = new ArrayList<>();
for (SeekableStream baiStream : baiStreams) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be one of the rare use cases for a parallel stream.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The index files are all read into memory (in the test) so each call to AbstractBAMFileIndex is a a memory-backed index, so there's probably not much call for a parallel stream. It would be easy to change later if it showed a performance improvement.

Arrays.parallelPrefix(offsets, (a, b) -> a + b); // cumulative offsets

try (BinaryBAMIndexWriter writer =
new BinaryBAMIndexWriter(numReferences, baiOut)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strange extra line break here, these fit on one line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// change is needed to support index merging.

/**
* IndexCreator for Tabix.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above comment should be part of the javadoc I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

import java.nio.file.Path;
import java.util.List;

public class StreamBasedTabixIndexCreator extends AllRefsTabixIndexCreator {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Javadoc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -484,11 +484,6 @@ else if (STREAM_TYPES.contains(this.outType))
writer = createBCFWriter(outPath, outStreamFromFile);
break;
case VCF_STREAM:
if (options.contains(Options.INDEX_ON_THE_FLY)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does this have to be removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So a tabix index can be written when writing a VCF to a stream.

private final static Path BAM_FILE = new File("src/test/resources/htsjdk/samtools/BAMFileIndexTest/index_test.bam").toPath();

/**
* Writes a <i>partitioned BAM</i>.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a description of the file structure expected, would be useful to add to the IndexMerger abstract class. If you move those classes into a new package you could put it as package documentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I haven't moved everything into one package for reasons explained above.


private BlockCompressedInputStream blockStream;

public static void assertEquals(Path vcfFile, Path tbiFile1, Path tbiFile2, boolean identical)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you add documentation explaining the case the identical parameter is for

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Make BAMIndexMerger final.

Check sequence dictionaries of BAI parts.

Use Long::sum method ref

Line break.

Document fillInUninitializedValues.

Make CachingBAMFileIndexOptimized package private.

Include two numbers that differ in exception message.

Make new BAMIndexMetaData constructor private.

Add doc for BAMStreamWriter

Fixed a comment referring to an out-of-date test.

Simplified CachingBAMFileIndexOptimized

Add doc for IndexMerger

Rename class to CachingBamFileIndexOptimizedForMerging

Class doc for CRAIIndexMerger

Class doc for AllRefsTabixIndexCreator

CLass doc for StreamBasedTabixIndexCreator

Javadoc for TbiEqualityChecker

Move partitioned file format documentation to main classes
Assert number of parts is greater than one in tests

Use new FileExtensions class
Copy link
Contributor Author

@tomwhite tomwhite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @lbergelson. I've addressed your feedback in the latest update, as well as some other points that came up in the meeting.

I haven't moved the merger classes into their own package since BAMIndexMerger depends on internals from Bin, BAMIndexMetadata, and BAMIndexContent that would be better not to make public.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thank you @tomwhite!

@lbergelson lbergelson merged commit 3f8ba35 into samtools:master Oct 25, 2019
@tomwhite
Copy link
Contributor Author

Awesome! Thanks @lbergelson

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Waiting for Review This PR is waiting for a reviewer to respond
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants