Allow C and Java native text spellings of NaN and infinities #409

jmarshall · 2019-05-13T22:11:48Z

The VCF v4.3 spec currently says of Float fields that they are:

32-bit, formatted to match the regular expression ^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$, NaN, or +/-Inf.

It is not explicit whether NaN and Inf are intended to be case sensitive, but as they are adjacent to a regexp saying …[eE]… the reader's best guess is probably that they are indeed intended to specify that exact capitalisation.

C's printf outputs +/-/no sign as appropriate, followed by nan and inf/infinity (with %f etc) or NAN and INF/INFINITY (with %F etc); there is no programmer control over whether infinity is 3 or 8 letters, but current glibc and BSD libc output it as 3 letters. Its scanf and strtod accept all of those completely case-insensitively.

Java's Double.toString outputs NaN, Infinity, or -Infinity signed, capitalised, and spelt exactly thus. Its Double.valueOf accepts an optional +/- sign followed by NaN or Infinity capitalised and spelt exactly thus.

Thus the VCF specification's mixed capitalisation is impossible to output using C's builtin functions, and its +/-Inf abbreviation is impossible to parse or output using Java's builtin functions.
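The Java half of this can be checked with a few lines. The following is a minimal sketch, not code from htsjdk; the class name `JavaFloatSpellings` and the helper `accepts` are hypothetical names introduced only for illustration.

```java
public class JavaFloatSpellings {
    /** Returns true if Double.valueOf parses the text, false if it throws. */
    public static boolean accepts(String text) {
        try {
            Double.valueOf(text);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Output side: Double.toString only ever produces these three spellings.
        System.out.println(Double.toString(Double.NaN));               // NaN
        System.out.println(Double.toString(Double.POSITIVE_INFINITY)); // Infinity
        System.out.println(Double.toString(Double.NEGATIVE_INFINITY)); // -Infinity

        // Input side: only the exact spellings NaN and Infinity (optionally signed) parse.
        String[] samples = {"NaN", "+Infinity", "-Infinity", "Inf", "-Inf", "nan", "inf", "INFINITY"};
        for (String s : samples) {
            System.out.println(s + " accepted: " + accepts(s));
        }
    }
}
```

Under the current spec wording, `Inf` and `-Inf` are the spellings a VCF reader has to handle, and those are among the ones `Double.valueOf` rejects.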

In practice, htslib/bcftools outputs nan (samtools/bcftools#755) and htsjdk/picard outputs Infinity — regardless of what the spec says.

This PR fixes that by changing the VCF spec to allow the union of C and Java representations, i.e., allowing what scanf accepts, namely {+,-,}{NAN,INF,INFINITY} case-insensitively.

With this change, both input and output can be done with C's native functions and similarly Java can output Floats with Double.toString. Java input will still need special case code to parse otherwise-cased and abbreviated NaNs and infinities, but that's inevitable given Double.valueOf's inflexibility.
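As a rough illustration of the special-case code a Java reader would need, here is a minimal sketch. It is not htsjdk's implementation; `VcfFloatParser` and `parse` are hypothetical names. It assumes that anything not matching the NaN/infinity spellings can be handed to `Float.parseFloat`, which is more permissive than the spec's numeric regular expression (it also accepts hex floats and a trailing `f` or `d`, for example).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class VcfFloatParser {
    // Without UNICODE_CASE, CASE_INSENSITIVE matching is limited to US-ASCII letters.
    private static final Pattern SPECIAL =
            Pattern.compile("^([-+]?)(INF|INFINITY|NAN)$", Pattern.CASE_INSENSITIVE);

    private VcfFloatParser() {}

    /** Parses a Float field, accepting {+,-,}{NAN,INF,INFINITY} in any case. */
    public static float parse(String text) {
        Matcher m = SPECIAL.matcher(text);
        if (m.matches()) {
            if (m.group(2).equalsIgnoreCase("NAN")) {
                return Float.NaN; // any sign on a NaN is ignored in this sketch
            }
            return m.group(1).equals("-") ? Float.NEGATIVE_INFINITY : Float.POSITIVE_INFINITY;
        }
        // Assumption: ordinary numbers are delegated to the JDK parser.
        return Float.parseFloat(text);
    }
}
```

Output needs no counterpart: `NaN`, `Infinity`, and `-Infinity` as produced by `Double.toString` already fall within the relaxed syntax.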

See also previous discussions in samtools/bcftools#755 (comment) and #89 (comment).

lbergelson

@jmarshall 👍 This makes sense. Actually fixing the issue in the Java implementation is a nasty can of worms though, since it makes use of libraries that, unsurprisingly, use Double.valueOf, and it's complicated to work around that.

jmarshall · 2019-05-20T15:35:29Z

Incidentally, C also accepts NAN([0-9a-zA-Z_]…) to encode specific NaN values in some implementation-defined way. The current dominant library implementations do not output that, and this PR does not propose to add such notation to VCF.

hts-specs-bot · 2019-05-29T09:20:40Z

Changed PDFs as of 0cf3ecf: VCFv4.3 (diff).

pd3 · 2019-06-04T09:23:41Z

+1

cyenyxe · 2019-06-24T13:15:03Z

> @jmarshall This makes sense. Actually fixing the issue in the Java implementation is a nasty can of worms though, since it makes use of libraries that, unsurprisingly, use Double.valueOf, and it's complicated to work around that.

Given this comment, it should be briefly mentioned in the spec that native support for these values varies among programming languages.

* Different languages convert floating point numbers to String representations in different ways. We now accept NAN, INF, or INFINITY in any case instead of only NaN and Inf when reading VCFs.
* See samtools/hts-specs#409 for more discussion.

hts-specs-bot · 2019-07-24T14:08:42Z

Changed PDFs as of 38bd134: VCFv4.3 (diff).

hts-specs-bot · 2019-07-24T14:11:24Z

Changed PDFs as of dc46f70: VCFv4.3 (diff).

cyenyxe · 2019-07-24T14:13:07Z

@pd3 @lbergelson could you please review this again after the latest changes?

lbergelson · 2019-07-24T19:14:46Z

The formatting in the PDF is a bit strange now. The regex runs off into the margin and almost off the page. I'm not sure how to fix it though; my LaTeX is weak. @yfarjoun, you're a LaTeX master, aren't you?

lbergelson · 2019-07-24T19:15:24Z

The text seems fine to me though.

yfarjoun · 2019-07-24T21:03:30Z

VCFv4.3.tex

+
+\begin{itemize}
+  \item Integer (32-bit, signed): Values from $-2^{31}$ to $-2^{31}+7$ cannot be stored in the binary version and therefore are disallowed in both VCF and BCF, see \ref{BcfTypeEncoding}.
+  \item Float (32-bit IEEE-754): Formatted to match one of the regular expressions \verb|^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$| or \verb"^[-+]?(INF|INFINITY|NAN)$" case insensitively.


Suggested change

-\item Float (32-bit IEEE-754): Formatted to match one of the regular expressions \verb|^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$| or \verb"^[-+]?(INF|INFINITY|NAN)$" case insensitively.
+\item Float (32-bit IEEE-754): Formatted to match one of the regular expressions \verb"^[-+]?(INF|INFINITY|NAN)$" or \verb|^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$| case insensitively.

(so that the text doesn't fall out of the margin.)

jmarshall · 2019-07-25

Numeric is the common case and ∞/NaN is the corner case, so swapping them around would not be ideal exposition. (And “case insensitively” pertains to the ∞/NaN regex in particular.)

lbergelson · 2019-07-25

We could join them into one regular expression and put that on its own line.

jmarshall · 2019-07-25

I tried that, but considered that it was more legible as two separate regexps. YMMV.

The real problem here is that a monster regexp is far from an ideal way to say “floating point number text goes here” — see also #89 (comment).
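For readers who want to see the two-regexp form in use, here is a minimal validation sketch built from the two expressions in the proposed text. The class name `VcfFloatSyntax` and the method `isValid` are hypothetical. Keeping the patterns separate lets the case-insensitive flag apply to the ∞/NaN pattern alone, which is the point made above about what “case insensitively” pertains to.

```java
import java.util.regex.Pattern;

public final class VcfFloatSyntax {
    // Numeric pattern as written in the spec text; [eE] already covers both cases.
    private static final Pattern NUMERIC =
            Pattern.compile("^[-+]?[0-9]*\\.?[0-9]+([eE][-+]?[0-9]+)?$");

    // Infinity/NaN pattern; only this one is matched case-insensitively.
    private static final Pattern SPECIAL =
            Pattern.compile("^[-+]?(INF|INFINITY|NAN)$", Pattern.CASE_INSENSITIVE);

    private VcfFloatSyntax() {}

    /** Returns true if the text matches either regular expression. */
    public static boolean isValid(String text) {
        return NUMERIC.matcher(text).matches() || SPECIAL.matcher(text).matches();
    }
}
```

Note that this only checks syntax; it does not check that the value fits in a 32-bit float.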

jmarshall · 2019-07-24T22:23:02Z

I'm not sure that programming language foibles need to be mentioned; but it can be useful and there is precedent in the SAM spec (which talks about Java GZIPInputStream's (historical?) problems).

I see you've now pushed to this branch, so I've pushed such a footnote to nan-inf-footnote.

C's printf doesn't output mixed case, while Java's Double.valueOf and Double.toString parse/output only `Infinity`, not `Inf`. Rather than requiring special-case code for both input and output in both languages, relax the VCF specification to allow NAN/INF/INFINITY case-insensitively. (Add "IEEE-754" to be specific and to improve the line breaks.)

hts-specs-bot · 2019-08-20T10:25:05Z

Changed PDFs as of 87ac084: VCFv4.3 (diff).

jmarshall · 2019-08-20T10:42:20Z

As no-one else has done anything about fixing up @cyenyxe's reformatting of this paragraph, I have rebased everything onto current master and replaced the PR branch with my commit adding the requested footnote (as mentioned in #409 (comment)).

This does everything that has been requested (i.e., adds a footnote about the Java API function) while maintaining the existing single-paragraph formatting, and has no formatting issues. Can this be re-reviewed and merged, please?

Afterwards, if someone wants to revisit @cyenyxe's conversion of the paragraph to an itemised list of data types, it's available in this jmarshall/nan-inf-cyenyxe branch.

* Update CRAM spec section on substitution matrix and codes.
* Respond to review comments.
* CRAM Slice and Container ref seq IDs must match (samtools#401)
* Code review part 2.
* Fix minor typo in the predecessors of BCF2 (samtools#427)
* layed -> laid
* Update MAINTAINERS.md (samtools#432)
  Proposal to add Rasko Leinonen as refget maintainer.
* add jmmut to MAINTAINERS.md and move Cristina to "Past Members"
* change order of maintainers
* Clarify that INFO/END is used to form a CHROM:POS-END region (PR samtools#436) (samtools#436)
  INFO/END (when present) provides the size of the interval that the variant is located in, along with the CHROM and POS fields. This is also used when indexing VCF/BCF files, as can be gleaned from §6.3.1's description of BCF's rlen field. The implications of INFO/END have not previously been clear. In the absence of clear documentation, some SV tools have been using INFO/END fields for their own semi-related purposes (using INFO/CHR2:INFO/END as the other side's position in an interchromosomal rearrangement), leading to broken .csi indexes and region queries that don't work. Fixes samtools#425.
* Allow C and Java native text spellings of NaN and infinities (samtools#409)
  C's printf doesn't output mixed case, while Java's Double.valueOf and Double.toString parse/output only `Infinity`, not `Inf`. Rather than requiring special-case code for both input and output in both languages, relax the VCF specification to allow NAN/INF/INFINITY case-insensitively. (Add "IEEE-754" to be specific and to improve the line breaks.)
* Adding a note about <NON_REF> (samtools#380)
* Update PDFs (CRAM and VCF additions; others cosmetic)
  CRAMv3 PRs samtools#401 and samtools#412. VCFv4.3 PRs samtools#380 (<NON_REF>), samtools#409 (infinity/NaN), and samtools#436 (INFO/END). All: Minor typo and whitespace formatting fixes.
* Add htsget 1.2.0 OpenAPI v3.0.2 spec (PR samtools#385)
  Includes barebones authorizationCode Oauth2 flow, which should aid/inform code generation. Uses int64 with minimum 0, unsure if that is really an uint64 though.
* Codify the existing policy of generally squashing PRs (PR samtools#444)
* Permit AP_Delta in multi-ref slices.
  This means AP_delta can become negative. I have validated this decodes fine in both htsjdk and htslib. This is because AP is ITF8 and hence signed, like all other integers, so it would need explicit code to forbid this (which obviously isn't in the implementations). Hence the limitation is primarily one of an over-zealous specification. The impact of this is for position-sorted multi-ref slices AP can legally be stored efficiently. Also clarified fields in the container compression header when in multi-ref mode. Fixes samtools#431

jmarshall added the vcf label May 13, 2019

jmarshall requested review from cyenyxe and lbergelson May 13, 2019 22:14

jmarshall mentioned this pull request May 13, 2019

Tolerate lower-case nans in QUAL samtools/htsjdk#1364

Merged

5 tasks

jmarshall force-pushed the nan-inf branch from fb5e462 to 9f1a158 Compare May 14, 2019 09:58

samtools deleted a comment from hts-specs-bot May 14, 2019

lbergelson approved these changes May 15, 2019

jmarshall force-pushed the nan-inf branch from 9f1a158 to 0cf3ecf Compare May 29, 2019 09:18

samtools deleted a comment from hts-specs-bot May 29, 2019

yfarjoun approved these changes May 30, 2019

cyenyxe approved these changes Jul 24, 2019

yfarjoun reviewed Jul 24, 2019

jmarshall added 2 commits August 20, 2019 11:14

Add footnote noting Java Double.valueOf Infinity/NaN capitalisation

87ac084

jmarshall force-pushed the nan-inf branch from dc46f70 to 87ac084 Compare August 20, 2019 10:23

cyenyxe merged commit 2e0f38c into samtools:master Aug 22, 2019

jmarshall deleted the nan-inf branch August 22, 2019 10:23

jmarshall added a commit that referenced this pull request Aug 22, 2019

Update PDFs (CRAM and VCF additions; others cosmetic)

c6f6d93

CRAMv3 PRs #401 and #412. VCFv4.3 PRs #380 (<NON_REF>), #409 (infinity/NaN), and #436 (INFO/END). All: Minor typo and whitespace formatting fixes.

jmarshall mentioned this pull request Aug 19, 2021

Float field Null/NA/NaN Values samtools/bcftools#1558

Closed

jmarshall mentioned this pull request Mar 18, 2022

BED spec doesn't handle give an opinion on NaN in numeric fields #634

Open
