Index merging #1263

tomwhite · 2019-01-23T10:09:10Z

Description

This PR is to discuss how to introduce index merging into htsjdk. The idea is that BAI/SBI/TBI indexes can be constructed in parallel parts then merged in a final step. The code is already used successfully in Disq to write indexes in parallel on the fly, which is more efficient than a final, serial indexing step.

Disq uses Spark for parallel processing, but the code here does not have any Spark dependencies.

The main additions are:

Merger classes (BAMIndexMerger, SBIIndexMerger, TabixIndexMerger)
Utility methods for shifting virtual file pointers by an offset
Visibility changes

The indexes being merged depend implicitly on the indexed files having been written in parts too (in headerless form). Do we need to add something to do this to htsjdk too?

Any thoughts on how this addition should be structured? E.g. if it should be broken into pieces; thoughts on testing etc.

cc @lbergelson

Checklist

Code compiles correctly
New tests covering changes and new functionality
All tests passing
Extended the README / documentation, if necessary
Is not backward compatible (breaks binary or source compatibility)

yfarjoun · 2019-06-17T18:56:48Z

What is the state of this? is it waiting for a htsspecs pr? let's rebase this and move it forward please!

tomwhite · 2019-06-17T19:50:24Z

@yfarjoun it's waiting for a review.

tomwhite · 2019-06-18T09:10:59Z

I've rebased this, and also added a CRAIIndexMerger as well as some bug fixes for the tabix merging path.

I think the main point for discussion is how to structure these changes with regard to testing. There are no direct unit tests as the testing is in Disq (e.g. https://github.com/disq-bio/disq/blob/master/src/test/java/org/disq_bio/disq/BaiMergingTest.java). They would be hard to move here without also exposing code to produce headerless part files that are the files to be indexed. I'd be interested to hear people's thoughts on this.

codecov-io · 2019-06-18T09:15:55Z

Codecov Report

Merging #1263 into master will increase coverage by 0.329%.
The diff coverage is 83.894%.

@@               Coverage Diff               @@
##              master     #1263       +/-   ##
===============================================
+ Coverage     68.046%   68.375%   +0.329%     
- Complexity      8373      8485      +112     
===============================================
  Files            573       583       +10     
  Lines          33893     34372      +479     
  Branches        5663      5731       +68     
===============================================
+ Hits           23063     23502      +439     
- Misses          8642      8649        +7     
- Partials        2188      2221       +33

Impacted Files	Coverage Δ	Complexity Δ
...main/java/htsjdk/samtools/BinningIndexContent.java	`66.197% <ø> (ø)`	`14 <0> (ø)`	⬇️
...antcontext/writer/VariantContextWriterBuilder.java	`83.439% <ø> (+2.814%)`	`60 <0> (+1)`	⬆️
src/main/java/htsjdk/samtools/SAMRecord.java	`68.323% <ø> (ø)`	`297 <0> (ø)`	⬇️
...til/TerminatorlessBlockCompressedOutputStream.java	`100% <100%> (ø)`	`2 <2> (?)`
src/main/java/htsjdk/samtools/Chunk.java	`85.714% <100%> (+0.714%)`	`29 <1> (+1)`	⬆️
src/main/java/htsjdk/samtools/IndexMerger.java	`100% <100%> (ø)`	`1 <1> (?)`
...jdk/samtools/util/BlockCompressedOutputStream.java	`79.71% <100%> (+0.451%)`	`32 <1> (+2)`	⬆️
...in/java/htsjdk/tribble/index/tabix/TabixIndex.java	`80% <100%> (+0.732%)`	`41 <1> (+2)`	⬆️
...main/java/htsjdk/samtools/BinningIndexBuilder.java	`90.566% <100%> (+4.601%)`	`16 <1> (+1)`	⬆️
...bble/index/tabix/StreamBasedTabixIndexCreator.java	`100% <100%> (ø)`	`2 <2> (?)`
... and 44 more

lbergelson · 2019-07-01T18:21:24Z

@tomwhite What would we need to push to htsjdk in order to run tests on this? Do we need the ability to generate headerless bam snippets? Could we push that to htsjdk first and then port this with some tests?

tomwhite · 2019-07-02T11:09:37Z

What would we need to push to htsjdk in order to run tests on this? Do we need the ability to generate headerless bam snippets?

Yes, that's what we need. I will write something to do this so we can test the code in this PR.

tomwhite · 2019-07-15T13:55:45Z

@lbergelson @cmnbroad I've now added tests for all the merge pathways, see BAMMergerTest, CRAMMergerTest, VCFMergerTest.

I've left the code for constructing partitioned BAM files (i.e. with header, part files, terminator) in the tests themselves. This is because it's not clear what the API for writing partitioned files should look like in general. It's not a SAMFileWriter, since that is inherently sequential. Disq uses Hadoop files to write, and we don't want to bring in that dependency. I'm not sure we want to create an API for this in the absence of another driving use case, so perhaps it's fine to leave this code in tests, and what we need is a clear description of the partitioned format for BAM, CRAM and VCF?

tomwhite · 2019-07-16T09:55:27Z

Following discussion with @lbergelson I've documented the file format for partitioned BAM, CRAM, and VCF.

This should be ready for review now.

src/main/java/htsjdk/samtools/BAMIndexMerger.java

Co-Authored-By: Phil Shapiro <pshapiro@broadinstitute.org>

tomwhite · 2019-09-12T16:23:28Z

(Some notes from a code review session with @lbergelson and @droazen.)

The high-level idea is that BAM, CRAM, or VCF files can be written in partitioned format (which for BAM is defined in BAMMergerTest).

Note that the partitioned file format is defined in a test, not a main class since there are no public implementations. This is because there is no driving use case in htsjdk itself. If there was we’d have to decide things like: what is the concurrency model for writing partitions in parallel, which is out of scope for this PR.

There is a new abstract class, IndexMerger. Its subclasses know how to take a partitioned file and merge it into a single top-level file.

There are implementations for two types of BAM index (bai, sbi), CRAM (crai), and Tabix (tbi).

There is a lot of detail in the implementations. Mainly it is about how to shift virtual offsets - e.g. if a part file begins at (non-virtual) offset x, then what are the virtual offsets in the part file index changed to? Answer: BlockCompressedFilePointerUtil#shift. The changes for BAM include changes to add a shift method to BAMIndexMetaData, Bin, Chunk, and so on. SBI is simpler since it is just a file of offsets. TBI is very similar to BAI - mainly just slight API differences in htsjdk. CRAI is simple compared to BAI.

Complications

CachingBAMFileIndex doesn’t allow you to tell if there are no bins for a given reference. So we have CachingBAMFileIndexOptimized, which returns a null BAMIndexContent (see BAMIndexMerger#mergeBAMIndexContent). This is important for merging. (TODO: more detail?)
We need to make BAM indexes slightly different. We need to be able to “distinguish between windows that have no overlapping features, and those whose lowest offset is 0”, which possible since part files are headerless. (See BinningIndexBuilder fillInUninitializedValues, and BAMIndexMerger#mergeLinearIndexes where UNINITIALIZED_WINDOW is used)
AllRefsTabixIndexCreator. Sequence names are populated from the header, not from the ones that are seen. This is needed since a part file may only have a subset of the sequences in it.
We need a way of writing partitioned files. As mentioned earlier, this is in the tests, since we don’t want a Hadoop dependency (that’s in Disq), and we don’t want to make up a public API for this (yet). See BAMMergerTest#PartitionedBAMFileWriter. Also, PartitionedBAMFileMerger exercises BAMIndexMerger.

Impact

New classes (IndexMerger). Not for direct use by users.
Existing code paths unchanged.
Tests.
A few visibility changes (private -> public)

lbergelson

@tomwhite A bunch of minor requests, mostly documentation and file organization. I think it would be good to put these all together in 1 package if possible to reduce visibility to some of the auxiliary classes and make it clear how they relate to eachother. We also need documentation somewhere in the main code explaining the point of this whole thing. That sort of exists in the tests but it would be good to have it in the actual code as well.

The copy of the tabix indexer bugs me, but I poked at it a bit and it didn't seem like there was an easy way to reduce the code copying. If you can find a clever way to refactor it so it's cleaner then 👍 but I didn't see a great way.

lbergelson · 2019-02-19T21:48:57Z

src/main/java/htsjdk/samtools/BAMIndexMerger.java

+        final Bin bin = new Bin(referenceSequence, binNumber);
+        for (Chunk newChunk : allChunks) {
+            // logic is from BinningIndexBuilder#processFeature
+            final long chunkStart = newChunk.getChunkStart();


I think this block could probably be extracted as a method Bin.addChunk() and replace duplicate code here, CRAMBAIIndexer, and BinningIndexBuilder

lbergelson · 2019-02-19T21:53:54Z

src/main/java/htsjdk/samtools/BAMIndexMerger.java

+ * Merges BAM index files for (headerless) parts of a BAM file into a single
+ * index file. The index files must have been produced using an uninitialized window (TODO).
+ */
+public class BAMIndexMerger {


make this final and give it a private no-args constructor

Made class final. We can't give it a private no-args constructor as it is referenced from the tests (and Disq), and there are two required final args.

lbergelson · 2019-02-19T21:59:16Z

src/main/java/htsjdk/samtools/BAMIndexMerger.java

+        }
+        final int numReferences = bais.get(0).getNumberOfReferences();
+        for (AbstractBAMFileIndex bai : bais) {
+            if (bai.getNumberOfReferences() != numReferences) {


Should this be a stronger check of comparing sequence dictionaries?

lbergelson · 2019-02-19T21:59:37Z

src/main/java/htsjdk/samtools/BAMIndexMerger.java

+        }
+        final SAMSequenceDictionary dict = header.getSequenceDictionary();
+        final List<AbstractBAMFileIndex> bais = new ArrayList<>();
+        for (SeekableStream baiStream : baiStreams) {


This might be one of the rare use cases for a parallel stream.

The index files are all read into memory (in the test) so each call to AbstractBAMFileIndex is a a memory-backed index, so there's probably not much call for a parallel stream. It would be easy to change later if it showed a performance improvement.

lbergelson · 2019-02-19T22:02:33Z

src/main/java/htsjdk/samtools/BAMIndexMerger.java

+        Arrays.parallelPrefix(offsets, (a, b) -> a + b); // cumulative offsets
+
+        try (BinaryBAMIndexWriter writer =
+                     new BinaryBAMIndexWriter(numReferences, baiOut)) {


Strange extra line break here, these fit on one line.

lbergelson · 2019-09-17T18:41:50Z

src/main/java/htsjdk/tribble/index/tabix/AllRefsTabixIndexCreator.java

+// change is needed to support index merging.
+
+/**
+ * IndexCreator for Tabix.


The above comment should be part of the javadoc I think.

lbergelson · 2019-09-17T18:42:32Z

src/main/java/htsjdk/tribble/index/tabix/StreamBasedTabixIndexCreator.java

+import java.nio.file.Path;
+import java.util.List;
+
+public class StreamBasedTabixIndexCreator extends AllRefsTabixIndexCreator {


lbergelson · 2019-09-17T19:04:52Z

src/main/java/htsjdk/variant/variantcontext/writer/VariantContextWriterBuilder.java

@@ -484,11 +484,6 @@ else if (STREAM_TYPES.contains(this.outType))
                writer = createBCFWriter(outPath, outStreamFromFile);
                break;
            case VCF_STREAM:
-                if (options.contains(Options.INDEX_ON_THE_FLY)) {


why does this have to be removed?

So a tabix index can be written when writing a VCF to a stream.

lbergelson · 2019-09-17T19:06:28Z

src/test/java/htsjdk/samtools/BAMMergerTest.java

+    private final static Path BAM_FILE = new File("src/test/resources/htsjdk/samtools/BAMFileIndexTest/index_test.bam").toPath();
+
+    /**
+     * Writes a <i>partitioned BAM</i>.


I think a description of the file structure expected, would be useful to add to the IndexMerger abstract class. If you move those classes into a new package you could put it as package documentation.

Done. I haven't moved everything into one package for reasons explained above.

lbergelson · 2019-09-17T19:41:26Z

src/test/java/htsjdk/tribble/index/tabix/TbiEqualityChecker.java

+
+    private BlockCompressedInputStream blockStream;
+
+    public static void assertEquals(Path vcfFile, Path tbiFile1, Path tbiFile2, boolean identical)


could you add documentation explaining the case the identical parameter is for

Make BAMIndexMerger final. Check sequence dictionaries of BAI parts. Use Long::sum method ref Line break. Document fillInUninitializedValues. Make CachingBAMFileIndexOptimized package private. Include two numbers that differ in exception message. Make new BAMIndexMetaData constructor private. Add doc for BAMStreamWriter Fixed a comment referring to an out-of-date test. Simplified CachingBAMFileIndexOptimized Add doc for IndexMerger Rename class to CachingBamFileIndexOptimizedForMerging Class doc for CRAIIndexMerger Class doc for AllRefsTabixIndexCreator CLass doc for StreamBasedTabixIndexCreator Javadoc for TbiEqualityChecker Move partitioned file format documentation to main classes

Assert number of parts is greater than one in tests Use new FileExtensions class

tomwhite

Thanks for the review @lbergelson. I've addressed your feedback in the latest update, as well as some other points that came up in the meeting.

I haven't moved the merger classes into their own package since BAMIndexMerger depends on internals from Bin, BAMIndexMetadata, and BAMIndexContent that would be better not to make public.

lbergelson

Looks good. Thank you @tomwhite!

tomwhite · 2019-10-26T09:02:59Z

Awesome! Thanks @lbergelson

tomwhite mentioned this pull request Jan 23, 2019

Remove index merging code from src/htsjdk disq-bio/disq#81

Closed

lbergelson self-requested a review February 25, 2019 19:59

lbergelson self-assigned this Feb 25, 2019

tomwhite force-pushed the bai_sbi_tbi_merging branch from 35e013e to f67022f Compare March 11, 2019 10:03

tomwhite force-pushed the bai_sbi_tbi_merging branch from f67022f to 4ed2460 Compare June 18, 2019 09:03

tomwhite force-pushed the bai_sbi_tbi_merging branch from 4ed2460 to f18aa28 Compare June 18, 2019 09:13

tomwhite mentioned this pull request Jun 24, 2019

Cannot merge tabix files with different sequence names disq-bio/disq#108

Merged

tomwhite added 14 commits July 16, 2019 09:37

Improvements to support BAI file merging

11ffa68

Add code to merge SBI and BAI files

1023d90

Add fillInUninitializedValues arg to BAMIndexer

2620eac

Make some methods public for tabix merger

a00f64c

Expose BinningIndexContent array in TabixIndex

9eb00e5

TabixIndexMerger

173ea7a

Allow VCF index creation for stream output.

d1e1683

StreamBasedTabixIndexCreator

63eec73

Add stream-based merging

807e8e1

Add AllRefsTabixIndexCreator

3fc1e04

Add CRAIIndexMerger

c80d522

Add changes from disq-bio/disq#109

e43fbac

Add tests.

c4fec3c

Fix method visibility

9bf8d90

tomwhite force-pushed the bai_sbi_tbi_merging branch from cefdb4b to 9bf8d90 Compare July 16, 2019 08:40

Add documentation to define partitioned BAM, CRAM, VCF

9277912

yfarjoun added the Waiting for Review This PR is waiting for a reviewer to respond label Jul 29, 2019

pshapiro4broad reviewed Jul 29, 2019

View reviewed changes

src/main/java/htsjdk/samtools/BAMIndexMerger.java Outdated Show resolved Hide resolved

Update src/main/java/htsjdk/samtools/BAMIndexMerger.java

3432935

Co-Authored-By: Phil Shapiro <pshapiro@broadinstitute.org>

lbergelson requested changes Sep 17, 2019

View reviewed changes

tomwhite added 2 commits September 23, 2019 15:29

Clarify virtual file position comparison in VCFMergerTest

33a9150

Assert number of parts is greater than one in tests Use new FileExtensions class

tomwhite commented Sep 23, 2019

View reviewed changes

lbergelson approved these changes Oct 25, 2019

View reviewed changes

lbergelson merged commit 3f8ba35 into samtools:master Oct 25, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Index merging #1263

Index merging #1263

tomwhite commented Jan 23, 2019

yfarjoun commented Jun 17, 2019

tomwhite commented Jun 17, 2019

tomwhite commented Jun 18, 2019

codecov-io commented Jun 18, 2019 •

edited

Loading

lbergelson commented Jul 1, 2019

tomwhite commented Jul 2, 2019

tomwhite commented Jul 15, 2019

tomwhite commented Jul 16, 2019

tomwhite commented Sep 12, 2019

lbergelson left a comment

lbergelson Feb 19, 2019

tomwhite Sep 23, 2019

lbergelson Feb 19, 2019

tomwhite Sep 23, 2019

lbergelson Feb 19, 2019

tomwhite Sep 23, 2019

lbergelson Feb 19, 2019

tomwhite Sep 23, 2019

lbergelson Feb 19, 2019

tomwhite Sep 23, 2019

lbergelson Sep 17, 2019

tomwhite Sep 23, 2019

lbergelson Sep 17, 2019

tomwhite Sep 23, 2019

lbergelson Sep 17, 2019

tomwhite Sep 23, 2019

lbergelson Sep 17, 2019

tomwhite Sep 23, 2019

lbergelson Sep 17, 2019

tomwhite Sep 23, 2019

tomwhite left a comment

lbergelson left a comment

tomwhite commented Oct 26, 2019


		private BlockCompressedInputStream blockStream;

		public static void assertEquals(Path vcfFile, Path tbiFile1, Path tbiFile2, boolean identical)

Index merging #1263

Index merging #1263

Conversation

tomwhite commented Jan 23, 2019

Description

Checklist

yfarjoun commented Jun 17, 2019

tomwhite commented Jun 17, 2019

tomwhite commented Jun 18, 2019

codecov-io commented Jun 18, 2019 • edited Loading

Codecov Report

lbergelson commented Jul 1, 2019

tomwhite commented Jul 2, 2019

tomwhite commented Jul 15, 2019

tomwhite commented Jul 16, 2019

tomwhite commented Sep 12, 2019

lbergelson left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tomwhite left a comment

Choose a reason for hiding this comment

lbergelson left a comment

Choose a reason for hiding this comment

tomwhite commented Oct 26, 2019

codecov-io commented Jun 18, 2019 •

edited

Loading