Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bcftools view silently ignores long lines #1614

Closed
hannespetur opened this issue Nov 17, 2021 · 3 comments
Closed

bcftools view silently ignores long lines #1614

hannespetur opened this issue Nov 17, 2021 · 3 comments
Labels
cannot-fix htslib-dependent Cannot be fixed until htslib is fixed

Comments

@hannespetur
Copy link

hannespetur commented Nov 17, 2021

Hello, thank you for your work on htslib and bcftools.

I am working with VCF files containing more than a hundred thousands of samples and I have noticed that "bcftools view" is silently ignoring any VCF records that are very long, I believe the problem starts occurring when the uncompressed line length exceeds 2^31 bytes (2GB). I have a script that can reproduce the problem by creating a dummy VCF file with a long line: https://gist.github.com/hannespetur/b8d44ba90deb3f4510753262a9db94ee

First argument is the number of samples the file has, second argument is the number of alternative alleles the record has. The VCF line has a PL value for each possible diploid genotype, so with more alternative alleles the line size grows very fast.

$ bash reproduce.sh 1000000 10 | bcftools view -H - | wc -l # 100,000 samples, 10 alts
1 # Expected
$ bash reproduce.sh 1000000 120 | bcftools view -H - | wc -l # 100,000 samples, 120 alts
0 # Not expected

I have noticed that bgzip can handle lines of this length

$ bash reproduce.sh 100000 120 | bgzip -c > test.vcf.gz
$ zgrep -v "^#" | wc -l
1 # Expected

however bcftools index/tabix cannot index the file.

$ bcftools index test.vcf.gz
index: failed to create index for "test.vcf.gz"
$ echo $?
255

Do you have any workarounds for working with large VCF lines like this?

Tested on

Red Hat Enterprice Linux 7 (64-bit)
Intel Xeon E5-2690 v4
128 GB RAM
bcftools 1.14 and htslib 1.14. Problem was also reproduced on 1.10.2.

@pd3
Copy link
Member

pd3 commented Nov 18, 2021

This comes from a limitation of the BCF format which cannot represent lines longer than that. BCFtools / HTSlib use BCF as the internal format, even when on input is a VCF, it is first converted into BCF structures.

The big lines are usually caused by Number=G tags, such as genotype likelihoods (FORMAT/PL), therefore in practice we store their localized version as described here:

bcftools/doc/bcftools.txt

Lines 1744 to 1753 in 756e636

*-L, --local-alleles* 'INT'::
Sites with many alternate alleles can require extremely large storage space which
can exceed the 2GB size limit representable by BCF. This is caused
by Number=G tags (such as FORMAT/PL) which store a value for each combination of reference
and alternate alleles. The *-L, --local-alleles* option allows to replace such tags
with a localized tag (FORMAT/LPL) which only includes a subset of alternate alleles relevant
for that sample. A new FORMAT/LAA tag is added which lists 1-based indices of the
alternate alleles relevant (local) for the current sample. The number 'INT' gives the
maximum number of alternate alleles that can be included in the PL tag. The default value
is 0 which disables the feature and outputs values for all alternate alleles.

If your large VCF is a product of bcftools merge, you can avoid the problem by adding the -L option to the command.

@hannespetur
Copy link
Author

Ok thank you for the explanation. I will try out using the FORMAT/LPL and FORMAT/LAA fields.

@pd3 pd3 added htslib-dependent Cannot be fixed until htslib is fixed cannot-fix labels Dec 16, 2021
@pd3
Copy link
Member

pd3 commented Dec 16, 2021

I am closing it as this 1) cannot be fixed due to BCF spec limitation and 2) has a functional workaround

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cannot-fix htslib-dependent Cannot be fixed until htslib is fixed
Projects
None yet
Development

No branches or pull requests

2 participants