Float field Null/NA/NaN Values #1558

Closed
eugenegardner opened this issue Aug 18, 2021 · 5 comments

@eugenegardner

Hello,

I am trying to use bcftools annotate to add new information to the INFO field. My situation is slightly more complex, but a simple example is I have a bgzipped TSV file (annotations.tsv.gz) similar to the following:

chr1 40 A T 0.1 30
chr1 50 C G 0.2 .

Where columns are: CHROM, POS, REF, ALT, MAF, CADD

I also have the following header information (header.txt):

##INFO=<ID=MAF,Number=1,Type=Float,Description="Minor Allele Frequency">
##INFO=<ID=CADD,Number=1,Type=Float,Description="CADD Phred Score">

When I run

bcftools annotate -a annotations.tsv.gz -c CHROM,POS,REF,ALT,MAF,CADD -h header.txt -Oz -o variants.annotated.vcf.gz variants.vcf.gz

I get allele records like:

#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 40 var1 A T . . MAF=0.1;CADD=30
chr1 50 var2 C G . . MAF=0.2

Note the missing CADD field for variant 2.
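
For anyone reading such records downstream, an absent tag and an explicit `.` both have to be treated as missing. A minimal Python sketch, assuming INFO strings like the ones shown above; `parse_info` and `info_float` are hypothetical helper names for illustration, not part of bcftools or any library:

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO string into a dict; flag-style keys map to True."""
    fields = {}
    if info in ("", "."):
        return fields  # an empty INFO column is written as "."
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def info_float(info: str, tag: str):
    """Return the tag as a float, or None when it is absent, a flag, or '.'."""
    raw = parse_info(info).get(tag)
    if raw is None or raw is True or raw == ".":
        return None
    return float(raw)


print(info_float("MAF=0.1;CADD=30", "CADD"))  # 30.0
print(info_float("MAF=0.2", "CADD"))          # None
```

With this approach, variant 2 yields `None` whether the CADD tag is dropped entirely or written as `CADD=.`.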

tl;dr, my question is:

Is there a "NULL" value (NA/NaN/Null) that is acceptable in VCF spec for Float fields or is there some other solution I should be using here?

Thanks!

@pd3
Member

pd3 commented Aug 19, 2021

The documentation claims that it should be possible to do this just by using . in the annotation file. However, as I just tested, this functionality got lost at some point or never was there. This will be fixed. Will need to think about various modes of operation though: what happens if there is a CADD tag already present in the VCF and how will it fit with the existing -c modifiers (i.e. =,+,-)?

pd3 added the bug label Aug 19, 2021
@eugenegardner
Author

eugenegardner commented Aug 19, 2021

Thanks @pd3,

I will note that if I provide a value of 'NaN' instead of the bcftools-standard ., bcftools writes nan into the INFO field. Note the difference in case: I provide 'NaN' and bcftools outputs nan, so there may be some disconnect between input and output. I did not, for example, try putting 'foo' in the annotation file to check what bcftools places in the INFO field; maybe it is foo, maybe it is nan. I have NOT tested whether filtering expressions work as expected. Some simple tests I can think of:

  1. Will -i 'CADD=nan' actually pull all nan variants?
  2. Will -i 'CADD>=0' include ALL variants other than those with nan?
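
Both tests hinge on how NaN behaves in comparisons. As a rough illustration only, here is how Python's IEEE 754 floats treat such values. This is an assumption-laden analogy, not a statement about bcftools' filtering engine, which remains untested here; the variable names are made up for the example:

```python
import math

# Parsing is case-insensitive, and the value prints back in lower case.
cadd_values = [float(s) for s in ["30", "NaN", "nan", "0"]]
print(str(float("NaN")))  # nan

# An equality test against NaN never matches, because NaN != NaN.
equal_to_nan = [v for v in cadd_values if v == float("nan")]
print(equal_to_nan)       # []

# Ordered comparisons with NaN are False, so NaN entries are excluded.
at_least_zero = [v for v in cadd_values if v >= 0]
print(at_least_zero)      # [30.0, 0.0]

# An explicit isnan check is needed to select the NaN entries.
nan_by_isnan = [v for v in cadd_values if math.isnan(v)]
print(len(nan_by_isnan))  # 2
```

If a filtering engine follows the same floating-point rules, test 1 would select nothing and test 2 would exclude the nan variants, but that would need to be confirmed against bcftools itself.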

I also note that this functionality is not documented in the current VCF spec for "float" INFO fields, which would be a useful addition if this functionality is actually supposed to exist.

As for -c, I suppose that would mean adding a flag for "what do I do for missing values" to annotate?

@jmarshall
Member

I am not sure what you are saying is not documented in the VCF spec here. In VCF, NaN is nan case insensitively, as documented in VCFv4.3 §1.3 (see samtools/hts-specs#409).

@eugenegardner
Author

Apologies @jmarshall – I didn't see it at the bottom of the spec PDF. Ignore that, then.

pd3 added a commit that referenced this issue Aug 21, 2021
Previously missing values in tab delimited files (".") were not
transferred, which is correct, but sometimes not desired (#1558).

It is now possible to fine-tune the behavior by adding or
not adding a new '.' modifier. For example:

-c TAG .. adds TAG if the source value is not missing. If TAG exists
          in the target file, it will be overwritten

-c .TAG ..  adds TAG even if the source value is missing. This can
            overwrite non-missing values with a missing value
            and can create empty VCF fields (`TAG=.`)
@pd3
Member

pd3 commented Aug 21, 2021

This is now possible (3647090) by adding a new . modifier, for example use -c CHROM,POS,REF,ALT,MAF,.CADD to carry over missing CADD values (".")
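
The difference between the two modes can be sketched as a toy model in Python. This is only an illustration of the behavior described in the commit message, not the bcftools implementation; `annotate` and `format_info` are hypothetical names, and the assumption that `-c TAG` leaves an existing value untouched when the source is missing is inferred from "missing values ... were not transferred":

```python
def annotate(info: dict, tag: str, source_value: str, carry_missing: bool) -> dict:
    """Toy model: carry_missing=False mimics `-c TAG`, True mimics `-c .TAG`."""
    updated = dict(info)
    if source_value != ".":
        # A non-missing source value is written in both modes,
        # overwriting any existing value.
        updated[tag] = source_value
    elif carry_missing:
        # `.TAG`: the missing value is carried over and may replace
        # a non-missing value already present in the target.
        updated[tag] = "."
    return updated


def format_info(info: dict) -> str:
    """Join key=value pairs into an INFO string; empty INFO is '.'."""
    return ";".join(f"{key}={value}" for key, value in info.items()) or "."


variant2 = {"MAF": "0.2"}
print(format_info(annotate(variant2, "CADD", ".", carry_missing=False)))  # MAF=0.2
print(format_info(annotate(variant2, "CADD", ".", carry_missing=True)))   # MAF=0.2;CADD=.
```

Under this model, the second call reproduces the `CADD=.` field that the `.CADD` modifier is meant to produce for variant 2.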

Thanks for reporting the problem!

pd3 closed this as completed Aug 21, 2021