-
Notifications
You must be signed in to change notification settings - Fork 173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clarify container header fields. Resolves #396 #398
Conversation
(As a reminder: please create PR branches in a fork of the hts-specs repository — see MAINTAINERS.md. But this time it was a useful test to highlight #399 😄) |
Oops, sorry about that. |
CRAMv3.tex
Outdated
blocks byte array. Landmarks are used for random access indexing.\tabularnewline | ||
itf8[ ] & landmarks & the locations of slices in this container as byte offsets from the end of | ||
this container header, used for random access indexing. | ||
The landmark count must equal the slice count.\tabularnewline |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't this definition imply that the first landmark entry would always be 0, since the first slice header would always be immediately after the container header ? That doesn't seem to match the cram_dump output that @jkbonfield put in the comment in #396, nor what I see in htsjdk code:
final int containerHeaderSize = landmarks[0];
final int containerTotalByteSize = containerHeaderSize + containerByteSize;
Not sure which is incorrect, but the code seems to count the header size in the landmark offset, as if it were an offset from the beginning of the header/container, not the end of the header ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
landmarks[0] = compression header length. We could add a note here saying so.
I can make a quick PR to name this value more precisely in htsjdk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, these states are equivalent for containers:
- no landmarks = no slices = no compression header
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see - I wasn't properly distinguishing compression header from container header. I was thinking of them as a single header, but that doesn't match the spec treatment. I think your suggestion of noting that here is a good one, and would make that explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. This is a case of "if only!". It doesn't make sense as if we have slices we have to decode container structure and the compression header, so the seek should be relative to that location, not relative to between the structure end and the compression header start.
It's a case of "I wouldn't start from here"...
I can try to make it more explicit. (Edit: although isn't that what the picture is stating?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the new picture. Yes, it has helped to clarify this confusion.
I updated the text here in a new commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got muddled with which this PR was (too many related ones), sorry, but yes the picture elsewhere clarifies things there and your explicit comment regarding landmark[0] being size of the compression header helps a lot.
- added comments clarifying the real situation - see samtools/hts-specs#396 - see samtools/hts-specs#398
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very minor quibble only on block length wording that I'll accept either way. A good improvement especially the landmark clarification. Approved.
CRAMv3.tex
Outdated
@@ -440,7 +440,8 @@ \section{\textbf{Container structure}} | |||
\textbf{Data type} & \textbf{Name} & \textbf{Value} | |||
\tabularnewline | |||
\hline | |||
int32 & length & byte size of the container data (blocks)\tabularnewline | |||
int32 & length & the sum of the byte lengths of all blocks in this container; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"byte lengths of all blocks" may be ambiguous. Are we talking the size of all data[] parts of blocks, or also including the block structure itself? We know it's the latter, but it may not be clear from this wording given "byte lengths" and the fact that blocks have a "byte[]" array in their structure.
To be totally unambiguous, maybe use "the sum of the lengths of all blocks (both structures and data) in this container".
Yep, it's the wrong one page. Yossi's been looking into that — cf #400 and the issue linked from there. |
* Revert #1326 because it was correct before - added comments clarifying the real situation - see samtools/hts-specs#396 - see samtools/hts-specs#398 - comments and rename CRAIEntry.sliceByteOffset - blockCount comments
One additional issue, though perhaps for another PR. Otherwise this looks good to me. |
Co-Authored-By: jmthibault79 <thibault@broadinstitute.org>
This was added as clarification in samtools#398 after discussion in samtools#396, but this was in error. In our attempts to clarify and nail down these corner cases, we failed to recall that the SAM header is permitted to be padded out by non-block allocated space. History on this decision dates back to 2013 and is show in Samtools issue samtools/samtools#1852. There are good reasons for changing away from the decision of padding via a second block, as changing block sizes can also change block structure size (if we're using a generic shared piece of code, due to ITF8 being a variable length integer), and this in turn makes it cumbersome to handle every possible change in SAM header size. It is far easier and simpler to just have unallocated space after the block and before the end of the container. This is how htslib works since CRAM 3.0 and I believe how CRAMtools.jar works. Fixes samtools/samtools#1852.
This was added as clarification in samtools#398 after discussion in samtools#396, but this was in error. In our attempts to clarify and nail down these corner cases, we failed to recall that the SAM header is permitted to be padded out by non-block allocated space. History on this decision dates back to 2013 and is show in Samtools issue samtools/samtools#1852. There are good reasons for changing away from the decision of padding via a second block, as changing block sizes can also change block structure size (if we're using a generic shared piece of code, due to ITF8 being a variable length integer), and this in turn makes it cumbersome to handle every possible change in SAM header size. It is far easier and simpler to just have unallocated space after the block and before the end of the container. This is how htslib works since CRAM 3.0 and I believe how CRAMtools.jar works. Fixes samtools/samtools#1852.
This was added as clarification in #398 after discussion in #396, but this was in error. In our attempts to clarify and nail down these corner cases, we failed to recall that the SAM header is permitted to be padded out by non-block allocated space. History on this decision dates back to 2013 and is show in Samtools issue samtools/samtools#1852. There are good reasons for changing away from the decision of padding via a second block, as changing block sizes can also change block structure size (if we're using a generic shared piece of code, due to ITF8 being a variable length integer), and this in turn makes it cumbersome to handle every possible change in SAM header size. It is far easier and simpler to just have unallocated space after the block and before the end of the container. This is how htslib works since CRAM 3.0 and I believe how CRAMtools.jar works. Fixes samtools/samtools#1852.
No description provided.