Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bcftools norm NaN > nan #755

Closed
ben-sanders opened this issue Mar 7, 2018 · 6 comments
Closed

bcftools norm NaN > nan #755

ben-sanders opened this issue Mar 7, 2018 · 6 comments
Labels

Comments

@ben-sanders
Copy link

bcftools norm seems to be converting NaN values (specfically MQ=NaN) to nan. This is causing an error during downstream processing of the normalised VCF with GATK.

Initial problem was with version 1.2.1 (version installed on our HPC system), but I've confirmed it still happens in the latest 1.7

@pd3
Copy link
Member

pd3 commented Mar 7, 2018

This could be probably fixed by using %F instead of %f in formatting expressions.

I am not sure if the fix shouldn't be on GATK side though. There was another case when GATK was refusing to parse 0 as a float and required 0.0.

@pd3 pd3 added the wontfix label May 15, 2018
@pd3
Copy link
Member

pd3 commented May 15, 2018

We discussed this internally and the consensus is that this should be fixed on GATK side.

I am happy to accept pull requests enabling to work around this problem similar to this https://github.com/samtools/bcftools/blob/develop/misc/fix-broken-GATK-Double-vs-Integer

@lbergelson
Copy link
Member

@pd3 While the 0 vs 0.0 is admittedly a dumb gatk issue, NaN vs nan is actually a decision in the java standard library... It's not clear to me that it was a good decision, but it makes it more of a pain to work around. We could potentially screen all double fields for alternate capitalizations and normalize them to NaN, but it's an extra expensive check that would be better avoided if possible.

    @Test
    public void nanTest(){
        Assert.assertEquals(Double.NaN, Double.valueOf("NaN"));
        Assert.assertThrows(NumberFormatException.class, () -> Double.valueOf("nan"));
    }

@jmarshall
Copy link
Member

jmarshall commented Jan 18, 2019

C's printf can print either nan or NAN for NaNs, and for infinities either inf/infinity or INF/INFINITY (and you can't much control whether that comes out as 3 or 8 letters). Producing the mixed case would require explicit code to check whether the value being output was NaN or infinite and output the appropriate literal text rather than using %f or the like.

The VCF spec actually says NaN but is unclear whether it is intended to enforce that capitalisation. The spec also says infinities are written as +/-Inf which can't be parsed directly by Java's Double.valueOf (or printed by Java's Double.toString).

It seems to me that the sensible way forward is for the spec to be relaxed to allow any mixture of case and either inf or infinity — see samtools/hts-specs#89 (comment). Then C input and output and Java output will need no special case code for NaNs or infinities.

Java input code will need special case code to input otherwise-cased NaNs and infinities — but that's inevitable given Double.valueOf's inflexibility! This need not be expensive in the usual case, as you can use a wrapper function that catches the exception and does no extra work in the usual numeric case:

Double read_a_double(str) {
  try {
    return Double.valueOf(str);
  }
  catch (NumberFormatException) {
    str = str.toLower();
    if (str == "nan") return Double.NaN;
    else if// etc for "inf" and "infinity"
    else rethrow;
  }
}

@lbergelson
Copy link
Member

@jmarshall I take your point. It seems like this should be nailed down in some floating point RFC, but if C stdlib produces a mix of case and inf/infinity than I guess there isn't much hope for a standardized naming scheme...

The reason this is an issue at all is that these failures are happening in library code where there isn't an easy way to change the parsing function. We can figure something out though... It always feels to me that using JEXL for specifying vcf filter expressions causes more problems than it's solved for us...

@jmarshall
Copy link
Member

Having now looked at the stack trace in the linked issue, I see this is being parsed in some other non-bioinformatics library that's not HTSJDK. Oh… bad luck 😢

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants