CRAM AP_delta query #431

jkbonfield · 2019-07-16T10:28:05Z

The CRAM spec states in the compression header "preservation map" that the AP key indicates whether the AP values are delta encoded: "true if AP data series is delta, false otherwise".

The main data series description for AP is: "if APDelta = true: 0-based alignment start delta
from the previous record. When the record is the first in the slice, its alignment start will be equal to that of the slice, so its alignment delta is 0. if APDelta = false: encodes the alignment start position directly". Thus the description is couched entirely in terms of the APdelta field in the preservation map and doesn't make reference to multi-ref slices.

Elsewhere we state "The AP data series is delta encoded for reads mapped to a position-sorted slice containing data from a single reference, and as a normal integer value in all other cases."

I assume the logic here is it could involve a large negative jump when we get to the end of one reference and the start of the next.

However, in actuality, the AP series is a signed integer. It works perfectly fine with negatives in there and can be decoded correctly by both Java and C implementations (I haven't checked JavaScript, but it's likely the same given it's just a decode of itf8). Thus I wonder if this is a pointless requirement. It harms compression considerably in certain scenarios, although possibly that's just my poor implementation. :-) It doesn't really follow that we cannot have a delta for a saw-tooth pattern of positions and the limitation seems arbitrary.

Comments? Would removing this actually be a main spec change? it doesn't feel it given this is essentially freeing up a limitation and is completely backwards compatible.

cmnbroad · 2019-07-19T14:06:35Z

@jkbonfield Just to make sure I understand; the proposal is to embrace negative AP values when APDelta is true ?

jkbonfield · 2019-07-19T15:25:19Z

Correct. As far as I can see, we already work just fine with negative AP values when applying deltas, so the restriction seems unnecessary.

I haven't checked all implementations though (notably not the JavaScript one).

cmnbroad · 2019-07-22T14:29:51Z

Seems reasonable. In theory it constrains the choice of encoding for the AP series to ones that can handle negative values, but I can't think of a case where thats a practical issue.

jkbonfield · 2019-08-20T08:51:54Z

AP uses ITF8 data type so is implicitly signed already. All the codecs that store ITF8 need to be able to handle negative values, and do, so I don't think this is a constraint. The possible problem here is on implementations that assume the delta is always positive as this has been implied in the way the current spec is written.

The circumstances where it matters is if we are using multi-ref slices. Eg maybe we have large slices but permit switching reference during the slice using the RI data series. In these cases we are currently forced to disable the delta encoding on AP, which seems both arbitrary and is also a bad decision as far as data compression goes. However it's not a common scenario (or shouldn't be, but I had a coding bug that got stuck in a rut on multi-ref mode) so I could accept not changing it if we get a good reason to keep the status quo.

cmnbroad · 2019-09-25T15:15:39Z

@jkbonfield If we allowed negative deltas, and enabled APDelta=true for multi-ref slices, how would the alignment for first record in such a slice be established on read ? Multi-ref slices are defined to always have 0 for start/span.

jkbonfield · 2019-09-25T16:52:05Z

That's a good question. The specification states for slices with data mapped to multiple references that the container structure reference sequence id is -2. For containers it doesn't elaborate on what the starting position and span are for this situation - only that they are 0 for unmapped reads. For slices we do however define it to be 0 for unmapped or multi-ref. That's an oversight in the container structure that I'll correct.

Elsewhere the specification states that AP delta starts as the difference against the alignment start position and against the previous position from then on, thus if we were to make this change then for the first sequence it is already clear that we delta against zero as that is what the start value is stated to be for multi-ref. What's unspecified is what AP delta should do when we change reference, so maybe the 1000th sequence in the slice, but the 1st sequence on a new reference.

If we'd thought of this up front we would probably just delta against zero again because we've started a new sequence, but looking at my code it appears I only reset the last position field on the start of each new slice, so actually it'd delta against the last position in the previous slice. I think the Java code is much the same. Therefore we should define it to initially delta against slice start (already stated) and subsequently delta to the position of the previous record (also stated), even when that record was in a previous chromosome (not yet explicitly stated).

I.e. (start=0) chr1/100, chr1/200, chr1/250, chr2/50, chr2/200 would yield AP +100, +100, +40, -200, +150.

If I recall when I removed the explicit check from preventing AP_delta from being defined to 0 in my code for multi-ref slices both htslib and htsjdk were quite happy to correctly decode the positions and apply a potentially large negative value to the delta to rewind back to the start of the next sequence. I think we should just be lead by the existing implementations here.

cmnbroad · 2019-09-26T18:54:30Z

Makes sense. So the summary proposed of changes then would be:

that AP data series is signed, and that for APDelta = true, AP data series values can be negative
that for MULTI_REF slices, when APDelta = true, the alignment start for the first record is delta-encoded from the slice alignment start (which must be 0), and all subsequent records are delta encoded from the previous record, including when the reference id changes
update this section 10.2 (CRAM Positional Data) to reflect the expanded use of delta encoding for multi-ref:

The AP data series is delta encoded for reads mapped to a position-sorted slice containing data from a single reference, and as a normal integer value in all other cases.

For completeness, add:

for multi-ref containers, alignment start/span should be set to 0.

Two remaining questions it might be worth being explicit about:

What about the case where APDelta = true, and a MULTI_REF slice contains a series of mapped records, followed by some unmapped records. That is, the reference sequence goes from nn to -1, and alignment start goes from some non-zero value to 0. Continue delta encoding alignment start ?
Its a bit of an edge case, but is it the case that for UNMAPPED containers/slices, APDelta must be false. Or is true permitted ? The values are 0 either way, but it might be worth resolving the ambiguity in the spec by just saying APDelta.

jkbonfield · 2019-09-27T08:15:45Z

Agreed. Good catch on those last points.

I hadn't really thought about multi-ref with mapped followed by unmapped data. I don't know if we ever generate data like that. I think to be consistent the logic would be if a series of positions would just become a series of delta irrespective of reference ID and thus irrespective of unmapped status. Ie (chr1/100, chr1/150, chr1/210, */0, */0, */0) becomes pos 100 +50, +60, -210, 0, 0.

The second question of entirely unmapped containers, I don't see that it makes any difference either way. We could ban AP delta in this case, but it doesn't really feel worth it as it's not a check I'm likely to add given with or without it the result is identical. I don't see what the ambiguity is really. AP delta is solely about delta encoding of the AP data series - nothing more and nothing less. The proposal actually strengthens this by removing some of the statements that tie it to contents of non-AP data series.

(For clarity, we may wish to be explicit that it is permitted albeit of nul effect.)

jkbonfield · 2019-09-27T11:26:31Z

I just looked at the CRAM files we produce and had a surprise.

I've been setting AP-delta to 1 for unmapped slices in position sorted files since day 1. I can get the same behaviour out of htsjdk too. If we codified that as illegal then all our crams are broken. The fact no one has ever noticed this indicates that it's an irrelevance.

I also forced multi-ref mode on a small test set (10 mapped and 10 unmapped/unplaced) and it stored all 20 together in the same container, although disabled AP in that scenario. Hacking the code to enable it does retest my above assertion; AP becomes negative but both picard and scramble can decode it fine.

* Update CRAM spec section on substitution matrix and codes. * Respond to review comments. * CRAM Slice and Container ref seq IDs must match (samtools#401) * Code review part 2. * Fix minor typo in the predecessors of BCF2 (samtools#427) * layed -> laid * Update MAINTAINERS.md (samtools#432) Proposal to add Rasko Leinonen as refget maintainer. * add jmmut to MAINTAINERS.md and move Cristina to "Past Members" * change order of maintainers * Clarify that INFO/END is used to form a CHROM:POS-END region (PR samtools#436) (samtools#436) INFO/END (when present) provides the size of the interval that the variant is located in, along with the CHROM and POS fields. This is also used when indexing VCF/BCF files, as can be gleaned from §6.3.1's description of BCF's rlen field. The implications of INFO/END have not previously been clear. In the absence of clear documentation, some SV tools have been using INFO/END fields for their own semi-related purposes (using INFO/CHR2:INFO/END as the other side's position in an interchromosomal rearrangement), leading to broken .csi indexes and region queries that don't work. Fixes samtools#425. * Allow C and Java native text spellings of NaN and infinities (samtools#409) C's printf doesn't output mixed case, while Java's Double.valueOf and Double.toString parse/output only `Infinity`, not `Inf`. Rather than requiring special-case code for both input and output in both languages, relax the VCF specification to allow NAN/INF/INFINITY case-insensitively. (Add "IEEE-754" to be specific and to improve the line breaks.) * Adding a note about <NON_REF> (samtools#380) * Update PDFs (CRAM and VCF additions; others cosmetic) CRAMv3 PRs samtools#401 and samtools#412. VCFv4.3 PRs samtools#380 (<NON_REF>), samtools#409 (infinity/NaN), and samtools#436 (INFO/END). All: Minor typo and whitespace formatting fixes. * Add htsget 1.2.0 OpenAPI v3.0.2 spec (PR samtools#385) Includes barebones authorizationCode Oauth2 flow, which should aid/inform code generation. Uses int64 with minimum 0, unsure if that is really an uint64 though. * Codify the existing policy of generally squashing PRs (PR samtools#444) * Permit AP_Delta in multi-ref slices. This means AP_delta can become negative. I have validated this decodes fine in both htsjdk and htslib. This is because AP is ITF8 and hence signed, like all other integers, so it would need explicit code to forbid this (which obviously isn't in the implementations). Hence the limitation is primarily one of an over-zealous specification. The impact of this is for position-sorted multi-ref slices AP can legally be stored efficiently. Also clarified fields in the container compression header when in multi-ref mode. Fixes samtools#431

jkbonfield added the cram label Jul 16, 2019

jkbonfield mentioned this issue Sep 27, 2019

Permit AP_Delta in multi-ref slices #446

Merged

jkbonfield closed this as completed in 552c3a8 Oct 3, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CRAM AP_delta query #431

CRAM AP_delta query #431

jkbonfield commented Jul 16, 2019 •

edited

Loading

cmnbroad commented Jul 19, 2019

jkbonfield commented Jul 19, 2019

cmnbroad commented Jul 22, 2019

jkbonfield commented Aug 20, 2019

cmnbroad commented Sep 25, 2019

jkbonfield commented Sep 25, 2019 •

edited

Loading

cmnbroad commented Sep 26, 2019 •

edited

Loading

jkbonfield commented Sep 27, 2019

jkbonfield commented Sep 27, 2019

CRAM AP_delta query #431

CRAM AP_delta query #431

Comments

jkbonfield commented Jul 16, 2019 • edited Loading

cmnbroad commented Jul 19, 2019

jkbonfield commented Jul 19, 2019

cmnbroad commented Jul 22, 2019

jkbonfield commented Aug 20, 2019

cmnbroad commented Sep 25, 2019

jkbonfield commented Sep 25, 2019 • edited Loading

cmnbroad commented Sep 26, 2019 • edited Loading

jkbonfield commented Sep 27, 2019

jkbonfield commented Sep 27, 2019

jkbonfield commented Jul 16, 2019 •

edited

Loading

jkbonfield commented Sep 25, 2019 •

edited

Loading

cmnbroad commented Sep 26, 2019 •

edited

Loading