-
Notifications
You must be signed in to change notification settings - Fork 175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow C and Java native text spellings of NaN and infinities #409
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmarshall 👍 This makes sense. Actually fixing the issue in the java implementation is a nasty can of worms though, since it makes use of libraries that unsurprisingly, use Double.valueOf
and it's complicated to work around that.
Incidentally, C also accepts |
+1 |
Given this comment, it should be briefly mentioned in the spec that native support for these values varies among programming languages. |
* Different languages convert floating point numbers to String representations in different ways. We now accept NAN, INF, or INFINITY in any case instead of only NaN and Inf when reading VCFs. * See samtools/hts-specs#409 for more discussion.
@pd3 @lbergelson could you please review this again after the latest changes? |
The formatting in the pdf is a bit strange now. The regex runs off into the margin and almost off the page. I'm not sure how to fix it though, my latex is weak. @yfarjoun you're a latex master aren't you? |
The text seems fine to me though. |
VCFv4.3.tex
Outdated
|
||
\begin{itemize} | ||
\item Integer (32-bit, signed): Values from $-2^{31}$ to $-2^{31}+7$ cannot be stored in the binary version and therefore are disallowed in both VCF and BCF, see \ref{BcfTypeEncoding}. | ||
\item Float (32-bit IEEE-754): Formatted to match one of the regular expressions \verb|^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$| or \verb"^[-+]?(INF|INFINITY|NAN)$" case insensitively. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
\item Float (32-bit IEEE-754): Formatted to match one of the regular expressions \verb|^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$| or \verb"^[-+]?(INF|INFINITY|NAN)$" case insensitively. | |
\item Float (32-bit IEEE-754): Formatted to match one of the regular expressions \verb"^[-+]?(INF|INFINITY|NAN)$" or \verb|^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$| case insensitively. |
(so that the text doesn't fall out of the margin.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Numeric is the common case and ∞/NaN is the corner case, so swapping them around would not be ideal exposition. (And “case insensitively” pertains to the ∞/NaN regex in particular.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could join them into one regular expression and put that on it's own line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried that, but considered that it was more legible as two separate regexps. YMMV.
The real problem here is that a monster regexp is far from an ideal way to say “floating point number text goes here” — see also #89 (comment).
I'm not sure that programming language foibles need to be mentioned; but it can be useful and there is precedent in the SAM spec (which talks about Java GZIPInputStream's (historical?) problems). I see you've now pushed to this branch, so I've pushed such a footnote to nan-inf-footnote. |
C's printf doesn't output mixed case, while Java's Double.valueOf and Double.toString parse/output only `Infinity`, not `Inf`. Rather than requiring special-case code for both input and output in both languages, relax the VCF specification to allow NAN/INF/INFINITY case-insensitively. (Add "IEEE-754" to be specific and to improve the line breaks.)
As no-one else has done anything about fixing up @cyenyxe's reformatting of this paragraph, I have rebased everything onto current master and replaced the PR branch with my commit adding the requested footnote (as mentioned in #409 (comment)). This does everything that has been requested (ie adds a footnote about the Java API function) while maintaining the existing single-paragraph formatting, and has no formatting issues. Can this be re-reviewed and merged, please? Afterwards, if someone wants to revisit @cyenyxe's conversion of the paragraph to an itemised list of data types, it's available in this jmarshall/nan-inf-cyenyxe branch. |
* Update CRAM spec section on substitution matrix and codes. * Respond to review comments. * CRAM Slice and Container ref seq IDs must match (samtools#401) * Code review part 2. * Fix minor typo in the predecessors of BCF2 (samtools#427) * layed -> laid * Update MAINTAINERS.md (samtools#432) Proposal to add Rasko Leinonen as refget maintainer. * add jmmut to MAINTAINERS.md and move Cristina to "Past Members" * change order of maintainers * Clarify that INFO/END is used to form a CHROM:POS-END region (PR samtools#436) (samtools#436) INFO/END (when present) provides the size of the interval that the variant is located in, along with the CHROM and POS fields. This is also used when indexing VCF/BCF files, as can be gleaned from §6.3.1's description of BCF's rlen field. The implications of INFO/END have not previously been clear. In the absence of clear documentation, some SV tools have been using INFO/END fields for their own semi-related purposes (using INFO/CHR2:INFO/END as the other side's position in an interchromosomal rearrangement), leading to broken .csi indexes and region queries that don't work. Fixes samtools#425. * Allow C and Java native text spellings of NaN and infinities (samtools#409) C's printf doesn't output mixed case, while Java's Double.valueOf and Double.toString parse/output only `Infinity`, not `Inf`. Rather than requiring special-case code for both input and output in both languages, relax the VCF specification to allow NAN/INF/INFINITY case-insensitively. (Add "IEEE-754" to be specific and to improve the line breaks.) * Adding a note about <NON_REF> (samtools#380) * Update PDFs (CRAM and VCF additions; others cosmetic) CRAMv3 PRs samtools#401 and samtools#412. VCFv4.3 PRs samtools#380 (<NON_REF>), samtools#409 (infinity/NaN), and samtools#436 (INFO/END). All: Minor typo and whitespace formatting fixes. * Add htsget 1.2.0 OpenAPI v3.0.2 spec (PR samtools#385) Includes barebones authorizationCode Oauth2 flow, which should aid/inform code generation. Uses int64 with minimum 0, unsure if that is really an uint64 though. * Codify the existing policy of generally squashing PRs (PR samtools#444) * Permit AP_Delta in multi-ref slices. This means AP_delta can become negative. I have validated this decodes fine in both htsjdk and htslib. This is because AP is ITF8 and hence signed, like all other integers, so it would need explicit code to forbid this (which obviously isn't in the implementations). Hence the limitation is primarily one of an over-zealous specification. The impact of this is for position-sorted multi-ref slices AP can legally be stored efficiently. Also clarified fields in the container compression header when in multi-ref mode. Fixes samtools#431
The VCF v4.3 spec currently says of Float fields that they are:
It is not explicit whether
NaN
andInf
are intended to be case sensitive, but as they are adjacent to a regexp saying…[eE]…
the reader's best guess is probably that they are indeed intended to specify that exact capitalisation.C's
printf
outputs+
/-
/no sign as appropriate, followed bynan
andinf
/infinity
(with%f
etc) orNAN
andINF
/INFINITY
(with%F
etc); there is no programmer control over whether infinity is 3 or 8 letters, but current glibc and BSD libc output it as 3 letters. Itsscanf
andstrtod
accept all of those completely case-insensitively.Java's
Double.toString
outputsNaN
,Infinity
, or-Infinity
signed, capitalised, and spelt exactly thus. ItsDouble.valueOf
accepts an optional+
/-
sign followed byNaN
orInfinity
capitalised and spelt exactly thus.Thus the VCF specification's mixed capitalisation is impossible to output using C's builtin functions, and its
+/-Inf
abbreviation is impossible to parse or output using Java's builtin functions.In practice, htslib/bcftools outputs
nan
(samtools/bcftools#755) and htsjdk/picard outputsInfinity
— regardless of what the spec says.This PR fixes that by changing the VCF spec to allow the union of C and Java representations, i.e., allowing what
scanf
accepts, namely{+,-,}{NAN,INF,INFINITY}
case-insensitively.With this change, both input and output can be done with C's native functions and similarly Java can output Floats with
Double.toString
. Java input will still need special case code to parse otherwise-cased and abbreviated NaNs and infinities, but that's inevitable givenDouble.valueOf
's inflexibility.See also previous discussions in samtools/bcftools#755 (comment) and #89 (comment).