Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CRAM: formalize Slice Type with an enum #1274

Merged
merged 10 commits into from
Mar 1, 2019
Merged

CRAM: formalize Slice Type with an enum #1274

merged 10 commits into from
Mar 1, 2019

Conversation

jmthibault79
Copy link
Contributor

@jmthibault79 jmthibault79 commented Feb 11, 2019

Description

Add a ReferenceContext member to Slice and Container, based on the discussion at samtools/hts-specs#370 and the related spec update at samtools/hts-specs#372.

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
  • Is not backward compatible (breaks binary or source compatibility)

@@ -135,7 +134,7 @@ public void processContainer(final Container container, final ValidationStringen
AlignmentSpan unmappedSpan = refSet.remove(SAMRecord.NO_ALIGNMENT_REFERENCE_INDEX);
for (final int refId : new TreeSet<>(refSet.keySet())) {
final AlignmentSpan span = refSet.get(refId);
fakeSlice.sequenceId = refId;
fakeSlice.setSequenceId(refId);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also sets the SliceType, to ensure they stay in sync

@codecov-io
Copy link

codecov-io commented Feb 11, 2019

Codecov Report

Merging #1274 into master will increase coverage by 0.027%.
The diff coverage is 84.528%.

@@               Coverage Diff               @@
##              master     #1274       +/-   ##
===============================================
+ Coverage     67.728%   67.755%   +0.027%     
- Complexity      8225      8236       +11     
===============================================
  Files            560       562        +2     
  Lines          33493     33540       +47     
  Branches        5637      5638        +1     
===============================================
+ Hits           22684     22725       +41     
- Misses          8632      8633        +1     
- Partials        2177      2182        +5
Impacted Files Coverage Δ Complexity Δ
src/main/java/htsjdk/samtools/cram/CRAIEntry.java 71.053% <100%> (ø) 17 <0> (ø) ⬇️
src/main/java/htsjdk/samtools/CRAMIterator.java 81.752% <100%> (+0.409%) 31 <0> (-1) ⬇️
...a/htsjdk/samtools/cram/build/ContainerFactory.java 96.774% <100%> (-0.196%) 5 <0> (-1)
src/main/java/htsjdk/samtools/cram/CRAIIndex.java 82.716% <100%> (-0.211%) 27 <1> (ø)
...jdk/samtools/cram/build/Cram2SamRecordFactory.java 93.162% <100%> (ø) 29 <0> (ø) ⬇️
...oding/reader/MultiRefSliceAlignmentSpanReader.java 100% <100%> (ø) 5 <2> (ø) ⬇️
...amtools/cram/encoding/writer/CramRecordWriter.java 89.231% <100%> (ø) 26 <0> (ø) ⬇️
...amtools/cram/encoding/reader/CramRecordReader.java 85.256% <100%> (ø) 26 <0> (ø) ⬇️
...c/main/java/htsjdk/samtools/cram/build/CramIO.java 75% <100%> (-0.51%) 20 <0> (ø)
...htsjdk/samtools/cram/ref/ReferenceContextType.java 100% <100%> (ø) 1 <1> (?)
... and 11 more

prevSeqId = Slice.MULTI_REFERENCE;
break;
default:
if (prevSeqId != container.getSequenceId()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the prevSeqId < 0 check: since we know that the slice type is single-ref, we also know that the current sequence ID here is 0+. So it will differ from any negative values.

// as defined in the specs:
public int sequenceId = -1;
private int sequenceId = -1;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't modify this without also modifying type

Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial comments/questions; haven't looked at the tests yet.

Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments - see what you think and maybe we can discuss when we meet.

switch (containerContext.getType()) {
case UNMAPPED:
referenceBases = new byte[]{};
prevSeqId = SAMRecord.NO_ALIGNMENT_REFERENCE_INDEX;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

future candidate for turning into a ReferenceContext

@@ -208,7 +201,7 @@ protected boolean shouldFlushContainer(final SAMRecord nextRecord) {
return true;
}

if (refSeqIndex == Slice.MULTI_REFERENCE) {
if (refSeqIndex == Slice.REF_ID_MULTIPLE) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refSeqIndex is a future candidate for ReferenceContext

@@ -59,60 +60,55 @@ public ContainerParser(final SAMFileHeader samFileHeader) {
return records;
}

public Map<Integer, AlignmentSpan> getReferences(final Container container, final ValidationStringency validationStringency) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

someday we will merge a PR that collapses all these stupid things. I may even have such a thing open?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes: that's over at #1276

cramRecord.sequenceId = refIdCodec.readData();
} else {
cramRecord.sequenceId = refId;
cramRecord.sequenceId = refContext.getSequenceIdOrSentinel();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

watch out: can be unmapped here

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps at some point CramRecord should have a ReferenceContext instead, or along with, a sequence ID.

container.blockCount = 1;
container.blocks = new Block[]{block};
container.landmarks = new int[0];
container.slices = new Slice[0];
container.alignmentSpan = Slice.NO_ALIGNMENT_SPAN;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these fields are now initialized to these values

final byte[] peek = new byte[4];
try {
int character = inputStream.read();
if (character == -1)
return false;
return readContainerHeader(new ByteArrayInputStream((major >= 3 ? CramIO.ZERO_F_EOF_MARKER : CramIO.ZERO_B_EOF_MARKER)));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was what the caller did when it received a false from this method.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mildly not crazy about the recursion, but this change would make it directly recursive instead of indirectly, allow the otherwise unused method above to be deleted:

Suggested change
return readContainerHeader(new ByteArrayInputStream((major >= 3 ? CramIO.ZERO_F_EOF_MARKER : CramIO.ZERO_B_EOF_MARKER)));
return readContainerHeader(major, new ByteArrayInputStream((major >= 3 ? CramIO.ZERO_F_EOF_MARKER : CramIO.ZERO_B_EOF_MARKER)));

@@ -56,13 +56,7 @@ private static Container readContainer(final int major, final InputStream inputS
* @return a new {@link Container} object with container header values filled out but empty body (no slices and blocks).
*/
public static Container readContainerHeader(final int major, final InputStream inputStream) {
final Container container = new Container();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic moved inside ContainerHeaderIO.readContainerHeader()

Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a step in the right direction. Some comments and suggestions.

final byte[] peek = new byte[4];
try {
int character = inputStream.read();
if (character == -1)
return false;
return readContainerHeader(new ByteArrayInputStream((major >= 3 ? CramIO.ZERO_F_EOF_MARKER : CramIO.ZERO_B_EOF_MARKER)));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment for another day: Its a bit confusing to have methods called readContainerHeader and writeContainerHeader that return/accept Containers (but only partially populated ?), while Container has a header field, but the type is CramCompressionHeader.

cramRecord.sequenceId = refIdCodec.readData();
} else {
cramRecord.sequenceId = refId;
cramRecord.sequenceId = refContext.getSequenceIdOrSentinel();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps at some point CramRecord should have a ReferenceContext instead, or along with, a sequence ID.

final byte[] peek = new byte[4];
try {
int character = inputStream.read();
if (character == -1)
return false;
return readContainerHeader(new ByteArrayInputStream((major >= 3 ? CramIO.ZERO_F_EOF_MARKER : CramIO.ZERO_B_EOF_MARKER)));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mildly not crazy about the recursion, but this change would make it directly recursive instead of indirectly, allow the otherwise unused method above to be deleted:

Suggested change
return readContainerHeader(new ByteArrayInputStream((major >= 3 ? CramIO.ZERO_F_EOF_MARKER : CramIO.ZERO_B_EOF_MARKER)));
return readContainerHeader(major, new ByteArrayInputStream((major >= 3 ? CramIO.ZERO_F_EOF_MARKER : CramIO.ZERO_B_EOF_MARKER)));

- replace Slice.REF_ID_MULTIPLE/UNMAPPED with ReferenceContext constants
- refs -> referenceBases
- initializeFromRecords()
- init Slice Aln Start and Span
- don't return null
- endPlusOne
- rename getSequenceIdOrSentinel() to getSerializableId() and add toString()
- rm unused readContainerHeader
- test addiitonal Slice + Container fields
Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from some naming comments, the code changes themselves look pretty good. Mostly requests to update some of the tests.

Also, as it stands, the ReferenceContextType enum isn't really used anywhere except tests. I'm not sure if thats because we haven't yet changed all the code to use ReferenceContexts, or if its because its redundant with ReferenceContext. Seems harmless to leave it for now, just wondering if we still need it, or if ReferenceContext makes it redundant.

Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good, thx for the test changes. Still a couple of comments on additional code consolidation in the tests and test utils. If we don't resolve those as part of this PR, I'd request that we do so in the next PR, or even a separate PR specifically for that purpose.

Assert.assertTrue(slice.isMappedSingleRef());
}
Assert.assertEquals(slice.nofRecords, expectedRecordCount);
Assert.assertEquals(slice.bases, expectedBaseCount);
}

private CramCompressionRecord createRecord(final int index,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This createRecord method is nearly identical to createMappedRecord in CramCompressionRecordUtil. getUnplacedRecords is (completely?) identical to 'getUnmappedRecordsinCramCompressionRecordUtil`. etc.

Assert.assertEquals(container.sequenceId, slice.sequenceId);
Assert.assertTrue(container.getReferenceContext().isMappedSingleRef());
Assert.assertEquals(container.getReferenceContext().getType(), ReferenceContextType.SINGLE_REFERENCE_TYPE);
Assert.assertEquals(container.getReferenceContext(), slice.getReferenceContext());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many of the blocks of asserts in these tests could be replaced with one of the ContainerFactoryTest.assertContainerState methods. Not sure if SliceTests.assertSliceState could be reuseable as well.

@cmnbroad cmnbroad added the cram label Mar 1, 2019
@jmthibault79 jmthibault79 merged commit 37b2e87 into master Mar 1, 2019
@jmthibault79 jmthibault79 deleted the jt_states_enum branch March 1, 2019 16:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants