bcftools norm NaN > nan #755

ben-sanders · 2018-03-07T11:44:38Z

bcftools norm seems to be converting NaN values (specifically MQ=NaN) to nan. This is causing an error during downstream processing of the normalised VCF with GATK.

I first hit the problem with version 1.2.1 (the version installed on our HPC system), but I've confirmed it still happens in the latest release, 1.7.

pd3 · 2018-03-07T12:30:08Z

This could probably be fixed by using %F instead of %f in the formatting expressions.

I am not sure the fix shouldn't be on the GATK side, though. There was another case where GATK refused to parse 0 as a float and required 0.0.

pd3 · 2018-05-15T15:58:28Z

We discussed this internally and the consensus is that this should be fixed on the GATK side.

I am happy to accept pull requests that work around this problem, similar to https://github.com/samtools/bcftools/blob/develop/misc/fix-broken-GATK-Double-vs-Integer

lbergelson · 2019-01-17T16:38:14Z

@pd3 While the 0 vs 0.0 is admittedly a dumb GATK issue, NaN vs nan is actually a decision in the Java standard library... It's not clear to me that it was a good decision, but it makes it more of a pain to work around. We could potentially screen all double fields for alternate capitalizations and normalize them to NaN, but it's an extra, expensive check that would be better avoided if possible.

    @Test
    public void nanTest(){
        Assert.assertEquals(Double.NaN, Double.valueOf("NaN"));
        Assert.assertThrows(NumberFormatException.class, () -> Double.valueOf("nan"));
    }

jmarshall · 2019-01-18T11:11:40Z

C's printf can print either nan or NAN for NaNs, and for infinities either inf/infinity or INF/INFINITY (and you can't much control whether that comes out as 3 or 8 letters). Producing the mixed case would require explicit code to check whether the value being output was NaN or infinite and output the appropriate literal text rather than using %f or the like.

The VCF spec actually says NaN but is unclear whether it is intended to enforce that capitalisation. The spec also says infinities are written as +/-Inf which can't be parsed directly by Java's Double.valueOf (or printed by Java's Double.toString).
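The "explicit code" idea above applies to any writer that wants the mixed-case spellings. Here is a minimal sketch in Java rather than C, assuming the target spellings are NaN and +/-Inf as described in this comment. The class and method names are hypothetical and are not part of bcftools, HTSJDK or GATK.

```java
// Hypothetical helper, for illustration only.
final class VcfFloatText {
    private VcfFloatText() {}

    /** Renders a double, emitting literal text for NaN and infinities. */
    static String format(double value) {
        if (Double.isNaN(value)) {
            return "NaN";                       // literal, not a format specifier
        }
        if (Double.isInfinite(value)) {
            return value > 0 ? "+Inf" : "-Inf"; // assumed spelling, per the comment above
        }
        return Double.toString(value);          // ordinary finite values
    }
}
```

The special cases are two cheap checks per value; finite values take the normal formatting path unchanged.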

It seems to me that the sensible way forward is for the spec to be relaxed to allow any mixture of case and either inf or infinity — see samtools/hts-specs#89 (comment). Then C input and output and Java output will need no special case code for NaNs or infinities.

Java input code will need special case code to input otherwise-cased NaNs and infinities — but that's inevitable given Double.valueOf's inflexibility! This need not be expensive in the usual case, as you can use a wrapper function that catches the exception and does no extra work in the usual numeric case:

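    static Double readADouble(String str) {
        try {
            // Fast path: ordinary numbers and Java's own "NaN" spelling.
            return Double.valueOf(str);
        } catch (NumberFormatException e) {
            // Slow path: only reached for text Java cannot parse directly.
            String lower = str.toLowerCase(java.util.Locale.ROOT);
            if (lower.equals("nan")) return Double.NaN;
            // … etc. for "inf" and "infinity"
            throw e;
        }
    }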
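A fuller version of the wrapper might look like the following. It is a sketch under assumptions: the accepted spellings (nan, inf, infinity, any case, optional sign) follow the relaxation proposed in this comment, and the class and method names are made up for illustration rather than taken from HTSJDK or GATK.

```java
import java.util.Locale;

// Hypothetical helper, for illustration only.
final class LenientDoubles {
    private LenientDoubles() {}

    /** Parses a double, also accepting C-style spellings of NaN and infinity. */
    static double parse(String text) {
        try {
            // Fast path: no extra work in the usual numeric case.
            return Double.parseDouble(text);
        } catch (NumberFormatException e) {
            // Slow path: normalise case and peel off one optional sign.
            String s = text.trim().toLowerCase(Locale.ROOT);
            boolean negative = s.startsWith("-");
            if (negative || s.startsWith("+")) {
                s = s.substring(1);
            }
            switch (s) {
                case "nan":
                    return Double.NaN;
                case "inf":
                case "infinity":
                    return negative ? Double.NEGATIVE_INFINITY
                                    : Double.POSITIVE_INFINITY;
                default:
                    throw e; // genuinely malformed input: rethrow the original error
            }
        }
    }
}
```

For example, `LenientDoubles.parse("nan")` and `LenientDoubles.parse("NAN")` both yield NaN, `LenientDoubles.parse("-inf")` yields negative infinity, and `LenientDoubles.parse("abc")` still throws NumberFormatException.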

lbergelson · 2019-01-18T16:02:25Z

@jmarshall I take your point. It seems like this should be nailed down in some floating-point RFC, but if the C stdlib produces a mix of case and inf/infinity then I guess there isn't much hope for a standardized naming scheme...

The reason this is an issue at all is that these failures are happening in library code where there isn't an easy way to change the parsing function. We can figure something out though... It always feels to me that using JEXL for specifying vcf filter expressions causes more problems than it's solved for us...

jmarshall · 2019-01-18T16:14:44Z

Having now looked at the stack trace in the linked issue, I see this is being parsed in some other non-bioinformatics library that's not HTSJDK. Oh… bad luck 😢

pd3 added the wontfix label May 15, 2018

pd3 closed this as completed May 15, 2018

jemunro mentioned this issue Jan 16, 2019

Treat "NaN" and "nan" as equivalents (VariantFiltration) broadinstitute/gatk#5582

Closed

pd3 reopened this Jan 17, 2019

jmarshall mentioned this issue Jan 18, 2019

VCF ambiguities samtools/hts-specs#89

Open

lbergelson closed this as completed Jan 18, 2019

This was referenced May 8, 2019

Tolerate lower-case nans in QUAL samtools/htsjdk#1364

Merged

Allow C and Java native text spellings of NaN and infinities samtools/hts-specs#409

Merged
