-
Notifications
You must be signed in to change notification settings - Fork 244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CRAM indexing tests and associated fixes and updates #1276
Conversation
final ContainerParser parser = new ContainerParser(indexBuilder.bamHeader); | ||
final Map<Integer, AlignmentSpan> refSet = parser.getReferences(container, validationStringency); | ||
final ContainerParser parser = new ContainerParser(indexBuilder.getBamHeader()); | ||
final Map<Integer, AlignmentSpan> spanMap = parser.getSpans(container, validationStringency); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more accurate names
try { | ||
final int reference = slice.sequenceId; | ||
if (reference == SAMRecord.NO_ALIGNMENT_REFERENCE_INDEX) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now handles unmapped
@@ -213,6 +224,64 @@ private void advanceToReference(final int nextReference) { | |||
} | |||
} | |||
|
|||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved up from later in this file
*/ | ||
private void processSingleReferenceSlice(final Slice slice) { | ||
|
||
// metadata | ||
indexStats.recordMetaData(slice); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved out of this method: processSingleReferenceSlice()
is not for unmapped slices but we do want to call recordMetaData()
for those
indexStats.recordMetaData(slice); | ||
|
||
final int alignmentStart = slice.alignmentStart; | ||
if (alignmentStart == SAMRecord.NO_ALIGNMENT_START) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved this check and the one below one level up the stack
Codecov Report
@@ Coverage Diff @@
## master #1276 +/- ##
===============================================
+ Coverage 67.626% 67.873% +0.246%
- Complexity 8229 8265 +36
===============================================
Files 562 562
Lines 33617 33800 +183
Branches 5642 5684 +42
===============================================
+ Hits 22734 22941 +207
+ Misses 8708 8681 -27
- Partials 2175 2178 +3
|
* @param output File for output index file | ||
* @param log optional {@link htsjdk.samtools.util.Log} to output progress | ||
*/ | ||
public static void createIndex(final SeekableStream stream, final File output, final Log log, final ValidationStringency validationStringency) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved up in the file
@@ -55,11 +73,6 @@ public void processContainer(final Container container) { | |||
craiIndex.processContainer(container); | |||
} | |||
|
|||
// TODO this is only used by test code | |||
public void addEntry(CRAIEntry entry) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tests now use the new constructor
@@ -41,6 +41,28 @@ public ContainerParser(final SAMFileHeader samFileHeader) { | |||
this.samFileHeader = samFileHeader; | |||
} | |||
|
|||
public Map<Integer, AlignmentSpan> getSpans(final Container container, final ValidationStringency validationStringency) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the combination of the 4 methods deleted below.
final CramCompressionRecord cramRecord = new CramCompressionRecord(); | ||
super.read(cramRecord); | ||
|
||
private void processRecordSpan(final CramCompressionRecord cramRecord) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this did two jobs (reading and processing) and was only named after the first. I moved reading outside this function and renamed it to processing.
return (int) (o1.containerStartByteOffset - o2.containerStartByteOffset); | ||
public static final Comparator<CRAIEntry> byEnd = (final CRAIEntry o1, final CRAIEntry o2) -> { | ||
if (o1.sequenceId != o2.sequenceId) { | ||
return o1.sequenceId - o2.sequenceId; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was reversed: see above.
} | ||
if (e0.sequenceId < 0) { | ||
return false; | ||
public static final Comparator<CRAIEntry> unmappedLast = (final CRAIEntry o1, final CRAIEntry o2) -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
formerly named byStartDesc
which did not describe it well
} | ||
indexer.finish(); | ||
|
||
return new SeekableMemoryStream(baos.toByteArray(), "CRAI to BAI converter"); | ||
} | ||
|
||
public static List<CRAIEntry> find(final List<CRAIEntry> list, final int seqId, final int start, final int span) { | ||
final boolean whole = start < 1 || span < 1; | ||
final CRAIEntry query = new CRAIEntry(seqId, | ||
start < 1 ? 1 : start, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these ternary operators are useless, since whole
(now matchEntireSequence
) handles this case
CRAIEntry left = list.get(0); | ||
|
||
for (final CRAIEntry e : list) { | ||
if (e.getAlignmentStart() < left.getAlignmentStart()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This led to the behavior where chr2:1
is "more left" than chr1:2
final int b0 = a0 + this.alignmentSpan; | ||
final int b1 = a1 + that.alignmentSpan; | ||
|
||
return Math.abs(a0 + b0 - a1 - b1) < (this.alignmentSpan + that.alignmentSpan); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really understand this logic so I'm leaving it alone
/** | ||
* Retrieve the list of CRAI Index entries corresponding to this Container. | ||
* | ||
* TODO: investigate why we sometimes split multi-ref Slices |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i.e. why do we need both this method and the one above it?
this replaces CRAIIndex.processContainer()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getCRAIEntries is only called in tests, so it seem like it should be removed.
I suspect we HAVE to split (create multiple entires for) multi-ref slices, since otherwise the index wouldn't accurately resolve/represent the contents of that slice. I think we previously discussed the likelihood that index queries against such slices probably don't work properly (IIRC, we hardly ever produce such slices anyway ?). I know you added some metatdata tests for those cases in this PR; but are there any file-level test where we validate a query against a file with multi-ref (ie., a slice with 3 refs and a query for the middle one, etc.).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like this TODO is done (specifically, I think we always want to split ?), so we can remove the method above that doesn't split (as well as Slice.getCRAIEntry
, which is only used by this method), and then update the tests that use the one above to use the splitting one ?
* because this Slice's containerOffset is incorrect | ||
* | ||
* @param containerStartOffset the byte offset of this Slice's Container | ||
* @return a new CRAI Index Entry | ||
* TODO: investigate why we sometimes split multi-ref and sometimes do not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i.e. why do we need both this method and the one above it?
this replaces CRAIIndex.processContainer()
@@ -72,7 +72,6 @@ public void createLocallyGeneratedCRAIFiles() throws IOException { | |||
tempCRAIOut.deleteOnExit(); | |||
createLocalCRAMAndCRAI( | |||
cramQueryWithCRAI, | |||
cramQueryReference, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unused
try (FileOutputStream bos = new FileOutputStream(outputCRAI)) { | ||
CRAMCRAIIndexer craiIndexer = new CRAMCRAIIndexer(bos, samHeader); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't need to construct a CRAMCRAIIndexer
instance, so we don't need the rest either
@@ -653,8 +643,7 @@ private void doQueryTest( | |||
@Test(dataProvider="iteratorStateTests", expectedExceptions=SAMException.class, enabled=false) | |||
public void testIteratorState( | |||
final File cramFileName, | |||
final File referenceFileName, | |||
final int expectedCount) throws IOException |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unused
@@ -1,168 +0,0 @@ | |||
package htsjdk.samtools.cram; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved to index
subpackage
@@ -1,191 +0,0 @@ | |||
package htsjdk.samtools.cram; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved to index
subpackage
final Map<Integer, AlignmentSpan> referenceSet = parser.getReferences(container, ValidationStringency.STRICT); | ||
Assert.assertEquals(referenceSet.keySet(), expectedKeys); | ||
final Map<Integer, AlignmentSpan> spanMap = parser.getSpans(container, ValidationStringency.STRICT); | ||
Assert.assertEquals(spanMap, expectedSpans); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
compare the AlignmentSpans, not just the keys (ref-seqs)
final Map<Integer, AlignmentSpan> expectedSpans = new HashMap<>(); | ||
for (int i = 0; i < 10; i++) { | ||
if (i % 2 == 0) { | ||
expectedSpans.put(i, new AlignmentSpan(i + 1, 3, 0, 1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
half mapped, half unmapped (all placed)
@@ -0,0 +1,390 @@ | |||
package htsjdk.samtools.cram.index; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
partly new, partly moved from htsjdk.samtools.cram
@@ -0,0 +1,252 @@ | |||
package htsjdk.samtools.cram.index; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
partly new, partly moved from htsjdk.samtools.cram
6808082
to
cc34bf2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite finished, but checkpointing what I have so far.
* Index a container, any of mapped, unmapped and multiple references are allowed. | ||
* The only requirement is sort order by coordinate. | ||
* For multiref containers the method reads the container through unpacking all reads. | ||
* This is slower than single reference but should be faster than normal reading. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing it means "creating an index for a multi-ref container requires unpacking the records, but thats (somehow) still faster than fully reading the container". I'm speculating though.
* Index a container, any of mapped, unmapped and multiple references are allowed. | ||
* The only requirement is sort order by coordinate. | ||
* For multiref containers the method reads the container through unpacking all reads. | ||
* This is slower than single reference but should be faster than normal reading. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any way to normalize/consolidate the BAI/CRAI indexing code ? The crai indexer delegates to container to get crai entries via getCRAIEntriesSplittingMultiRef; the BAI indexer seems to extract the slices directly and process them manually. Isn't there a lot (or nearly complete) similarity to what they're doing ? Could the the BAM index be created using the crai entries returned from getCRAIEntriesSplittingMultiRef
or container.getCraiEntries
?
/** | ||
* Retrieve the list of CRAI Index entries corresponding to this Container. | ||
* | ||
* TODO: investigate why we sometimes split multi-ref Slices |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getCRAIEntries is only called in tests, so it seem like it should be removed.
I suspect we HAVE to split (create multiple entires for) multi-ref slices, since otherwise the index wouldn't accurately resolve/represent the contents of that slice. I think we previously discussed the likelihood that index queries against such slices probably don't work properly (IIRC, we hardly ever produce such slices anyway ?). I know you added some metatdata tests for those cases in this PR; but are there any file-level test where we validate a query against a file with multi-ref (ie., a slice with 3 refs and a query for the middle one, etc.).
src/test/java/htsjdk/samtools/cram/build/ContainerParserTest.java
Outdated
Show resolved
Hide resolved
src/test/java/htsjdk/samtools/cram/build/ContainerParserTest.java
Outdated
Show resolved
Hide resolved
src/test/java/htsjdk/samtools/cram/build/ContainerParserTest.java
Outdated
Show resolved
Hide resolved
023aa2f
to
d58e57d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mostly minor comments, in addition to a couple of other requests: removal of extra comparators if possible - I vaguely remember talking about this earlier in the project - but can't recall why we decided to keep them); request to limit CRAIEntry values on construction; and some test refactoring. As always, please let me know if you see issues with any of these suggestions.
src/main/java/htsjdk/samtools/cram/encoding/reader/MultiRefSliceAlignmentSpanReader.java
Outdated
Show resolved
Hide resolved
/** | ||
* Retrieve the list of CRAI Index entries corresponding to this Container. | ||
* | ||
* TODO: investigate why we sometimes split multi-ref Slices |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like this TODO is done (specifically, I think we always want to split ?), so we can remove the method above that doesn't split (as well as Slice.getCRAIEntry
, which is only used by this method), and then update the tests that use the one above to use the splitting one ?
better add define fakeSlice more explicitly move check outside try createUnmappedPlacedRecord / createUnmappedUnplacedRecord CRAMBAIIndexerTest refactor shuffle around better CRAIEntryTest typo: sort cleanup CRAIIndexTest no better CraiQuery intersection test recordsPerSlice can be final recordsPerSliceTest rename Container header to compressionHeader ContainerParserTest improvements typo move ContainerParser.getSpans() to Container
…ata(SAMRecord) - rename half-unmapped to half-unplaced where appropriate
specify 1-based coordinate system for AlignmentSpan, Container, Slice, and CramCompressionRecord
e7e0c36
to
a4d0247
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One remaining question about unmapped counts that Im curious about (see inline), and a couple of minor javadoc comments. Then this should be good to go.
|
||
prevAlignmentStart = super.read(cramRecord, prevAlignmentStart); | ||
if (! cramRecord.isPlaced()) { | ||
spans.put(ReferenceContext.UNMAPPED_UNPLACED_CONTEXT, AlignmentSpan.UNPLACED_SPAN); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, do we not count (keep track of) the number of records in UNMAPPED_UNPLACED_CONTEXT ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Slice itself holds the mapped, unmapped, and unplaced counts. We need to distinguish between the different references for mapped and unmapped counts - and that is the true (sole??) purpose of both AlignmentSpan and MultiRefSliceAlignmentSpanReader.
We can get unplaced counts by querying the Slice directly, so AlignmentSpan can ignore this.
src/main/java/htsjdk/samtools/cram/structure/AlignmentSpan.java
Outdated
Show resolved
Hide resolved
@@ -33,6 +32,10 @@ public CRAIEntry(final int sequenceId, | |||
final long containerStartByteOffset, | |||
final int sliceByteOffset, | |||
final int sliceByteSize) { | |||
if (sequenceId == ReferenceContext.MULTIPLE_REFERENCE_ID) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Someday (not this PR) we can clamp this down even further; other negative values can be rejected.
|
||
return (int) (o1.containerStartByteOffset - o2.containerStartByteOffset); | ||
if (containerStartByteOffset != other.containerStartByteOffset) { | ||
return Long.compare(containerStartByteOffset, other.containerStartByteOffset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great - this all seems correct now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On to the next one....
Thanks @jmthibault79 for all the iteration on this one. |
…AIEntries() (#1329) * Add bai and crai index init checks to Slice * Remove multiple versions of Slice/Container getCRAIEntries() * move Container.setByteOffset() call inside ContainerIO.readContainer() * populateSlicesAndIndexingParameters -> distributeIndexingParametersToSlices - called on Container write now as well * Change setSlices signature to take a List<Slice> * rm duplicate test code
Description
Adding better test coverage for CRAM indexing, which inspired:
Checklist