Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow C and Java native text spellings of NaN and infinities #409

Merged
merged 2 commits into from
Aug 22, 2019

Conversation

jmarshall
Copy link
Member

@jmarshall jmarshall commented May 13, 2019

The VCF v4.3 spec currently says of Float fields that they are:

32-bit, formatted to match the regular expression ^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$, NaN, or +/-Inf.

It is not explicit whether NaN and Inf are intended to be case sensitive, but as they are adjacent to a regexp saying …[eE]… the reader's best guess is probably that they are indeed intended to specify that exact capitalisation.

C's printf outputs +/-/no sign as appropriate, followed by nan and inf/infinity (with %f etc) or NAN and INF/INFINITY (with %F etc); there is no programmer control over whether infinity is 3 or 8 letters, but current glibc and BSD libc output it as 3 letters. Its scanf and strtod accept all of those completely case-insensitively.

Java's Double.toString outputs NaN, Infinity, or -Infinity signed, capitalised, and spelt exactly thus. Its Double.valueOf accepts an optional +/- sign followed by NaN or Infinity capitalised and spelt exactly thus.

Thus the VCF specification's mixed capitalisation is impossible to output using C's builtin functions, and its +/-Inf abbreviation is impossible to parse or output using Java's builtin functions.

In practice, htslib/bcftools outputs nan (samtools/bcftools#755) and htsjdk/picard outputs Infinity — regardless of what the spec says.

This PR fixes that by changing the VCF spec to allow the union of C and Java representations, i.e., allowing what scanf accepts, namely {+,-,}{NAN,INF,INFINITY} case-insensitively.

With this change, both input and output can be done with C's native functions and similarly Java can output Floats with Double.toString. Java input will still need special case code to parse otherwise-cased and abbreviated NaNs and infinities, but that's inevitable given Double.valueOf's inflexibility.

See also previous discussions in samtools/bcftools#755 (comment) and #89 (comment).

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmarshall 👍 This makes sense. Actually fixing the issue in the java implementation is a nasty can of worms though, since it makes use of libraries that unsurprisingly, use Double.valueOf and it's complicated to work around that.

@jmarshall
Copy link
Member Author

jmarshall commented May 20, 2019

Incidentally, C also accepts NAN([0-9a-zA-Z_]…) to encode specific NaN values in some implementation-defined way. The current dominant library implementations do not output that, and this PR does not propose to add such notation to VCF.

@hts-specs-bot
Copy link

Changed PDFs as of 0cf3ecf: VCFv4.3 (diff).

@samtools samtools deleted a comment from hts-specs-bot May 29, 2019
@pd3
Copy link
Member

pd3 commented Jun 4, 2019

+1

@cyenyxe
Copy link
Member

cyenyxe commented Jun 24, 2019

@jmarshall This makes sense. Actually fixing the issue in the java implementation is a nasty can of worms though, since it makes use of libraries that unsurprisingly, use Double.valueOf and it's complicated to work around that.

Given this comment, it should be briefly mentioned in the spec that native support for these values varies among programming languages.

lbergelson pushed a commit to samtools/htsjdk that referenced this pull request Jun 24, 2019
* Different languages convert floating point numbers to String representations in different ways.  We now accept NAN, INF, or INFINITY in any case instead of only NaN and Inf when reading VCFs.
*  See samtools/hts-specs#409 for more discussion.
@hts-specs-bot
Copy link

Changed PDFs as of 38bd134: VCFv4.3 (diff).

@hts-specs-bot
Copy link

Changed PDFs as of dc46f70: VCFv4.3 (diff).

@cyenyxe
Copy link
Member

cyenyxe commented Jul 24, 2019

@pd3 @lbergelson could you please review this again after the latest changes?

@lbergelson
Copy link
Member

The formatting in the pdf is a bit strange now. The regex runs off into the margin and almost off the page. I'm not sure how to fix it though, my latex is weak. @yfarjoun you're a latex master aren't you?

image

@lbergelson
Copy link
Member

The text seems fine to me though.

VCFv4.3.tex Outdated

\begin{itemize}
\item Integer (32-bit, signed): Values from $-2^{31}$ to $-2^{31}+7$ cannot be stored in the binary version and therefore are disallowed in both VCF and BCF, see \ref{BcfTypeEncoding}.
\item Float (32-bit IEEE-754): Formatted to match one of the regular expressions \verb|^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$| or \verb"^[-+]?(INF|INFINITY|NAN)$" case insensitively.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
\item Float (32-bit IEEE-754): Formatted to match one of the regular expressions \verb|^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$| or \verb"^[-+]?(INF|INFINITY|NAN)$" case insensitively.
\item Float (32-bit IEEE-754): Formatted to match one of the regular expressions \verb"^[-+]?(INF|INFINITY|NAN)$" or \verb|^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$| case insensitively.

(so that the text doesn't fall out of the margin.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Numeric is the common case and ∞/NaN is the corner case, so swapping them around would not be ideal exposition. (And “case insensitively” pertains to the ∞/NaN regex in particular.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could join them into one regular expression and put that on it's own line.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried that, but considered that it was more legible as two separate regexps. YMMV.

The real problem here is that a monster regexp is far from an ideal way to say “floating point number text goes here” — see also #89 (comment).

@jmarshall
Copy link
Member Author

I'm not sure that programming language foibles need to be mentioned; but it can be useful and there is precedent in the SAM spec (which talks about Java GZIPInputStream's (historical?) problems).

I see you've now pushed to this branch, so I've pushed such a footnote to nan-inf-footnote.

C's printf doesn't output mixed case, while Java's Double.valueOf and
Double.toString parse/output only `Infinity`, not `Inf`. Rather than
requiring special-case code for both input and output in both languages,
relax the VCF specification to allow NAN/INF/INFINITY case-insensitively.

(Add "IEEE-754" to be specific and to improve the line breaks.)
@hts-specs-bot
Copy link

Changed PDFs as of 87ac084: VCFv4.3 (diff).

@jmarshall
Copy link
Member Author

As no-one else has done anything about fixing up @cyenyxe's reformatting of this paragraph, I have rebased everything onto current master and replaced the PR branch with my commit adding the requested footnote (as mentioned in #409 (comment)).

This does everything that has been requested (ie adds a footnote about the Java API function) while maintaining the existing single-paragraph formatting, and has no formatting issues. Can this be re-reviewed and merged, please?

Afterwards, if someone wants to revisit @cyenyxe's conversion of the paragraph to an itemised list of data types, it's available in this jmarshall/nan-inf-cyenyxe branch.

@cyenyxe cyenyxe merged commit 2e0f38c into samtools:master Aug 22, 2019
@jmarshall jmarshall deleted the nan-inf branch August 22, 2019 10:23
jmarshall added a commit that referenced this pull request Aug 22, 2019
CRAMv3 PRs #401 and #412.
VCFv4.3 PRs #380 (<NON_REF>), #409 (infinity/NaN), and #436 (INFO/END).
All: Minor typo and whitespace formatting fixes.
thefferon added a commit to thefferon/hts-specs that referenced this pull request Oct 9, 2019
* Update CRAM spec section on substitution matrix and codes.

* Respond to review comments.

* CRAM Slice and Container ref seq IDs must match (samtools#401)

* Code review part 2.

* Fix minor typo in the predecessors of BCF2 (samtools#427)

* layed -> laid

* Update MAINTAINERS.md (samtools#432)

Proposal to add Rasko Leinonen as refget maintainer.

* add jmmut to MAINTAINERS.md and move Cristina to "Past Members"

* change order of maintainers

* Clarify that INFO/END is used to form a CHROM:POS-END region (PR samtools#436) (samtools#436)

INFO/END (when present) provides the size of the interval that the
variant is located in, along with the CHROM and POS fields. This is
also used when indexing VCF/BCF files, as can be gleaned from §6.3.1's
description of BCF's rlen field.

The implications of INFO/END have not previously been clear. In the
absence of clear documentation, some SV tools have been using INFO/END
fields for their own semi-related purposes (using INFO/CHR2:INFO/END
as the other side's position in an interchromosomal rearrangement),
leading to broken .csi indexes and region queries that don't work.
Fixes samtools#425.

* Allow C and Java native text spellings of NaN and infinities (samtools#409)

C's printf doesn't output mixed case, while Java's Double.valueOf and
Double.toString parse/output only `Infinity`, not `Inf`. Rather than
requiring special-case code for both input and output in both languages,
relax the VCF specification to allow NAN/INF/INFINITY case-insensitively.

(Add "IEEE-754" to be specific and to improve the line breaks.)

* Adding a note about <NON_REF> (samtools#380)

* Update PDFs (CRAM and VCF additions; others cosmetic)

CRAMv3 PRs samtools#401 and samtools#412.
VCFv4.3 PRs samtools#380 (<NON_REF>), samtools#409 (infinity/NaN), and samtools#436 (INFO/END).
All: Minor typo and whitespace formatting fixes.

* Add htsget 1.2.0 OpenAPI v3.0.2 spec (PR samtools#385)

Includes barebones authorizationCode Oauth2 flow, which should aid/inform
code generation.

Uses int64 with minimum 0, unsure if that is really an uint64 though.

* Codify the existing policy of generally squashing PRs (PR samtools#444)

* Permit AP_Delta in multi-ref slices.

This means AP_delta can become negative.  I have validated this
decodes fine in both htsjdk and htslib.  This is because AP is ITF8
and hence signed, like all other integers, so it would need explicit
code to forbid this (which obviously isn't in the implementations).
Hence the limitation is primarily one of an over-zealous specification.

The impact of this is for position-sorted multi-ref slices AP can
legally be stored efficiently.

Also clarified fields in the container compression header when in
multi-ref mode.

Fixes samtools#431
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants