Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CRAM indexing tests and associated fixes and updates #1276

Merged
merged 16 commits into from
Mar 12, 2019
Merged

Conversation

jmthibault79
Copy link
Contributor

@jmthibault79 jmthibault79 commented Feb 12, 2019

Description

Adding better test coverage for CRAM indexing, which inspired:

  • a CRAIQuery class with the necessary fields from CRAIEntry for querying
  • AlignmentSpan immutability and distinguishing between mapped and unmapped read counts
  • Recording mapped, unmapped, and unplaced read counts per slice
  • requiring Coordinate-Sorted for CRAM BAI generation

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
  • Is not backward compatible (breaks binary or source compatibility)

final ContainerParser parser = new ContainerParser(indexBuilder.bamHeader);
final Map<Integer, AlignmentSpan> refSet = parser.getReferences(container, validationStringency);
final ContainerParser parser = new ContainerParser(indexBuilder.getBamHeader());
final Map<Integer, AlignmentSpan> spanMap = parser.getSpans(container, validationStringency);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more accurate names

try {
final int reference = slice.sequenceId;
if (reference == SAMRecord.NO_ALIGNMENT_REFERENCE_INDEX) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now handles unmapped

@@ -213,6 +224,64 @@ private void advanceToReference(final int nextReference) {
}
}

/**
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved up from later in this file

*/
private void processSingleReferenceSlice(final Slice slice) {

// metadata
indexStats.recordMetaData(slice);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved out of this method: processSingleReferenceSlice() is not for unmapped slices but we do want to call recordMetaData() for those

indexStats.recordMetaData(slice);

final int alignmentStart = slice.alignmentStart;
if (alignmentStart == SAMRecord.NO_ALIGNMENT_START) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved this check and the one below one level up the stack

@codecov-io
Copy link

codecov-io commented Feb 12, 2019

Codecov Report

Merging #1276 into master will increase coverage by 0.246%.
The diff coverage is 80.365%.

@@               Coverage Diff               @@
##              master     #1276       +/-   ##
===============================================
+ Coverage     67.626%   67.873%   +0.246%     
- Complexity      8229      8265       +36     
===============================================
  Files            562       562               
  Lines          33617     33800      +183     
  Branches        5642      5684       +42     
===============================================
+ Hits           22734     22941      +207     
+ Misses          8708      8681       -27     
- Partials        2175      2178        +3
Impacted Files Coverage Δ Complexity Δ
src/main/java/htsjdk/samtools/CRAMIterator.java 81.752% <ø> (ø) 31 <0> (ø) ⬇️
...samtools/cram/structure/CramCompressionRecord.java 92.969% <ø> (ø) 113 <0> (ø) ⬇️
...a/htsjdk/samtools/cram/build/ContainerFactory.java 96.667% <100%> (-0.108%) 5 <0> (ø)
...oding/reader/MultiRefSliceAlignmentSpanReader.java 100% <100%> (ø) 6 <3> (+1) ⬆️
...va/htsjdk/samtools/cram/build/ContainerParser.java 96.774% <100%> (+2.435%) 10 <0> (-8) ⬇️
src/main/java/htsjdk/samtools/CRAMCRAIIndexer.java 79.167% <100%> (+0.443%) 9 <1> (ø) ⬇️
...jdk/samtools/cram/structure/CompressionHeader.java 90.303% <100%> (+0.059%) 29 <1> (+1) ⬆️
...va/htsjdk/samtools/cram/structure/ContainerIO.java 61.728% <100%> (ø) 13 <0> (ø) ⬇️
...rc/main/java/htsjdk/samtools/BAMIndexMetaData.java 77.67% <100%> (+3.838%) 22 <2> (ø) ⬇️
src/main/java/htsjdk/samtools/cram/CRAIIndex.java 88.406% <100%> (+5.69%) 26 <12> (-1) ⬇️
... and 10 more

* @param output File for output index file
* @param log optional {@link htsjdk.samtools.util.Log} to output progress
*/
public static void createIndex(final SeekableStream stream, final File output, final Log log, final ValidationStringency validationStringency) throws IOException {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved up in the file

@@ -55,11 +73,6 @@ public void processContainer(final Container container) {
craiIndex.processContainer(container);
}

// TODO this is only used by test code
public void addEntry(CRAIEntry entry) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests now use the new constructor

@@ -41,6 +41,28 @@ public ContainerParser(final SAMFileHeader samFileHeader) {
this.samFileHeader = samFileHeader;
}

public Map<Integer, AlignmentSpan> getSpans(final Container container, final ValidationStringency validationStringency) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the combination of the 4 methods deleted below.

final CramCompressionRecord cramRecord = new CramCompressionRecord();
super.read(cramRecord);

private void processRecordSpan(final CramCompressionRecord cramRecord) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this did two jobs (reading and processing) and was only named after the first. I moved reading outside this function and renamed it to processing.

return (int) (o1.containerStartByteOffset - o2.containerStartByteOffset);
public static final Comparator<CRAIEntry> byEnd = (final CRAIEntry o1, final CRAIEntry o2) -> {
if (o1.sequenceId != o2.sequenceId) {
return o1.sequenceId - o2.sequenceId;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was reversed: see above.

}
if (e0.sequenceId < 0) {
return false;
public static final Comparator<CRAIEntry> unmappedLast = (final CRAIEntry o1, final CRAIEntry o2) -> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

formerly named byStartDesc which did not describe it well

}
indexer.finish();

return new SeekableMemoryStream(baos.toByteArray(), "CRAI to BAI converter");
}

public static List<CRAIEntry> find(final List<CRAIEntry> list, final int seqId, final int start, final int span) {
final boolean whole = start < 1 || span < 1;
final CRAIEntry query = new CRAIEntry(seqId,
start < 1 ? 1 : start,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these ternary operators are useless, since whole (now matchEntireSequence) handles this case

CRAIEntry left = list.get(0);

for (final CRAIEntry e : list) {
if (e.getAlignmentStart() < left.getAlignmentStart()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This led to the behavior where chr2:1 is "more left" than chr1:2

final int b0 = a0 + this.alignmentSpan;
final int b1 = a1 + that.alignmentSpan;

return Math.abs(a0 + b0 - a1 - b1) < (this.alignmentSpan + that.alignmentSpan);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand this logic so I'm leaving it alone

/**
* Retrieve the list of CRAI Index entries corresponding to this Container.
*
* TODO: investigate why we sometimes split multi-ref Slices
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i.e. why do we need both this method and the one above it?

this replaces CRAIIndex.processContainer()

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getCRAIEntries is only called in tests, so it seem like it should be removed.

I suspect we HAVE to split (create multiple entires for) multi-ref slices, since otherwise the index wouldn't accurately resolve/represent the contents of that slice. I think we previously discussed the likelihood that index queries against such slices probably don't work properly (IIRC, we hardly ever produce such slices anyway ?). I know you added some metatdata tests for those cases in this PR; but are there any file-level test where we validate a query against a file with multi-ref (ie., a slice with 3 refs and a query for the middle one, etc.).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this TODO is done (specifically, I think we always want to split ?), so we can remove the method above that doesn't split (as well as Slice.getCRAIEntry, which is only used by this method), and then update the tests that use the one above to use the splitting one ?

* because this Slice's containerOffset is incorrect
*
* @param containerStartOffset the byte offset of this Slice's Container
* @return a new CRAI Index Entry
* TODO: investigate why we sometimes split multi-ref and sometimes do not
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i.e. why do we need both this method and the one above it?

this replaces CRAIIndex.processContainer()

@@ -72,7 +72,6 @@ public void createLocallyGeneratedCRAIFiles() throws IOException {
tempCRAIOut.deleteOnExit();
createLocalCRAMAndCRAI(
cramQueryWithCRAI,
cramQueryReference,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused

try (FileOutputStream bos = new FileOutputStream(outputCRAI)) {
CRAMCRAIIndexer craiIndexer = new CRAMCRAIIndexer(bos, samHeader);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't need to construct a CRAMCRAIIndexer instance, so we don't need the rest either

@@ -653,8 +643,7 @@ private void doQueryTest(
@Test(dataProvider="iteratorStateTests", expectedExceptions=SAMException.class, enabled=false)
public void testIteratorState(
final File cramFileName,
final File referenceFileName,
final int expectedCount) throws IOException
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused

@@ -1,168 +0,0 @@
package htsjdk.samtools.cram;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to index subpackage

@@ -1,191 +0,0 @@
package htsjdk.samtools.cram;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to index subpackage

final Map<Integer, AlignmentSpan> referenceSet = parser.getReferences(container, ValidationStringency.STRICT);
Assert.assertEquals(referenceSet.keySet(), expectedKeys);
final Map<Integer, AlignmentSpan> spanMap = parser.getSpans(container, ValidationStringency.STRICT);
Assert.assertEquals(spanMap, expectedSpans);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compare the AlignmentSpans, not just the keys (ref-seqs)

final Map<Integer, AlignmentSpan> expectedSpans = new HashMap<>();
for (int i = 0; i < 10; i++) {
if (i % 2 == 0) {
expectedSpans.put(i, new AlignmentSpan(i + 1, 3, 0, 1));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

half mapped, half unmapped (all placed)

@@ -0,0 +1,390 @@
package htsjdk.samtools.cram.index;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partly new, partly moved from htsjdk.samtools.cram

@@ -0,0 +1,252 @@
package htsjdk.samtools.cram.index;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partly new, partly moved from htsjdk.samtools.cram

@cmnbroad cmnbroad added the cram label Mar 1, 2019
@jmthibault79 jmthibault79 force-pushed the jt_indexing branch 3 times, most recently from 6808082 to cc34bf2 Compare March 4, 2019 17:09
Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite finished, but checkpointing what I have so far.

* Index a container, any of mapped, unmapped and multiple references are allowed.
* The only requirement is sort order by coordinate.
* For multiref containers the method reads the container through unpacking all reads.
* This is slower than single reference but should be faster than normal reading.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing it means "creating an index for a multi-ref container requires unpacking the records, but thats (somehow) still faster than fully reading the container". I'm speculating though.

* Index a container, any of mapped, unmapped and multiple references are allowed.
* The only requirement is sort order by coordinate.
* For multiref containers the method reads the container through unpacking all reads.
* This is slower than single reference but should be faster than normal reading.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way to normalize/consolidate the BAI/CRAI indexing code ? The crai indexer delegates to container to get crai entries via getCRAIEntriesSplittingMultiRef; the BAI indexer seems to extract the slices directly and process them manually. Isn't there a lot (or nearly complete) similarity to what they're doing ? Could the the BAM index be created using the crai entries returned from getCRAIEntriesSplittingMultiRef or container.getCraiEntries ?

/**
* Retrieve the list of CRAI Index entries corresponding to this Container.
*
* TODO: investigate why we sometimes split multi-ref Slices
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getCRAIEntries is only called in tests, so it seem like it should be removed.

I suspect we HAVE to split (create multiple entires for) multi-ref slices, since otherwise the index wouldn't accurately resolve/represent the contents of that slice. I think we previously discussed the likelihood that index queries against such slices probably don't work properly (IIRC, we hardly ever produce such slices anyway ?). I know you added some metatdata tests for those cases in this PR; but are there any file-level test where we validate a query against a file with multi-ref (ie., a slice with 3 refs and a query for the middle one, etc.).

src/main/java/htsjdk/samtools/cram/index/CRAIQuery.java Outdated Show resolved Hide resolved
Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mostly minor comments, in addition to a couple of other requests: removal of extra comparators if possible - I vaguely remember talking about this earlier in the project - but can't recall why we decided to keep them); request to limit CRAIEntry values on construction; and some test refactoring. As always, please let me know if you see issues with any of these suggestions.

/**
* Retrieve the list of CRAI Index entries corresponding to this Container.
*
* TODO: investigate why we sometimes split multi-ref Slices
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this TODO is done (specifically, I think we always want to split ?), so we can remove the method above that doesn't split (as well as Slice.getCRAIEntry, which is only used by this method), and then update the tests that use the one above to use the splitting one ?

src/test/java/htsjdk/samtools/cram/CRAIEntryTest.java Outdated Show resolved Hide resolved
src/test/java/htsjdk/samtools/cram/CRAIEntryTest.java Outdated Show resolved Hide resolved
src/test/java/htsjdk/samtools/cram/CRAIEntryTest.java Outdated Show resolved Hide resolved
better add

define fakeSlice more explicitly

move check outside try

createUnmappedPlacedRecord / createUnmappedUnplacedRecord

CRAMBAIIndexerTest refactor

shuffle around

better CRAIEntryTest

typo: sort

cleanup CRAIIndexTest

no

better CraiQuery intersection test

recordsPerSlice can be final

recordsPerSliceTest

rename Container header to compressionHeader

ContainerParserTest improvements

typo

move ContainerParser.getSpans() to Container
…ata(SAMRecord)

- rename half-unmapped to half-unplaced where appropriate
specify 1-based coordinate system for AlignmentSpan, Container, Slice, and CramCompressionRecord
Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One remaining question about unmapped counts that Im curious about (see inline), and a couple of minor javadoc comments. Then this should be good to go.


prevAlignmentStart = super.read(cramRecord, prevAlignmentStart);
if (! cramRecord.isPlaced()) {
spans.put(ReferenceContext.UNMAPPED_UNPLACED_CONTEXT, AlignmentSpan.UNPLACED_SPAN);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, do we not count (keep track of) the number of records in UNMAPPED_UNPLACED_CONTEXT ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Slice itself holds the mapped, unmapped, and unplaced counts. We need to distinguish between the different references for mapped and unmapped counts - and that is the true (sole??) purpose of both AlignmentSpan and MultiRefSliceAlignmentSpanReader.

We can get unplaced counts by querying the Slice directly, so AlignmentSpan can ignore this.

@@ -33,6 +32,10 @@ public CRAIEntry(final int sequenceId,
final long containerStartByteOffset,
final int sliceByteOffset,
final int sliceByteSize) {
if (sequenceId == ReferenceContext.MULTIPLE_REFERENCE_ID) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Someday (not this PR) we can clamp this down even further; other negative values can be rejected.


return (int) (o1.containerStartByteOffset - o2.containerStartByteOffset);
if (containerStartByteOffset != other.containerStartByteOffset) {
return Long.compare(containerStartByteOffset, other.containerStartByteOffset);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great - this all seems correct now.

src/main/java/htsjdk/samtools/CRAMBAIIndexer.java Outdated Show resolved Hide resolved
Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On to the next one....

@cmnbroad
Copy link
Collaborator

Thanks @jmthibault79 for all the iteration on this one.

@jmthibault79 jmthibault79 merged commit 744d87d into master Mar 12, 2019
@jmthibault79 jmthibault79 deleted the jt_indexing branch March 12, 2019 20:24
jmthibault79 added a commit that referenced this pull request Apr 1, 2019
…AIEntries() (#1329)

* Add bai and crai index init checks to Slice
* Remove multiple versions of Slice/Container getCRAIEntries()
* move Container.setByteOffset() call inside ContainerIO.readContainer()
* populateSlicesAndIndexingParameters -> distributeIndexingParametersToSlices
- called on Container write now as well
* Change setSlices signature to take a List<Slice>
* rm duplicate test code
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants