Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support 0 coordinate in BCF #1476

Merged
merged 1 commit into from
Jul 20, 2022
Merged

Support 0 coordinate in BCF #1476

merged 1 commit into from
Jul 20, 2022

Conversation

pd3
Copy link
Member

@pd3 pd3 commented Jul 20, 2022

The 0 coordinate is valid in VCF specification, but the round-trip
VCF -> BCF -> VCF turns MT:0 into MT:4294967296. Add a check to
detect this overflow.

See #1475 and samtools/bcftools#1753

@daviesrob
Copy link
Member

This need to be rebased on the current develop branch to get the tests to work properly.

The BCF specification says that POS should be an int32_t, so should we actually switch to le_to_i32() for this value? It would put a rather bigger restriction on the longest length supported, but going beyond INT_MAX is likely to cause problems in other areas (notably END tags) so it might not be a problem in practice.

@jkbonfield
Copy link
Contributor

jkbonfield commented Jul 20, 2022

Out of interest, is the BCF spec just pandering to Java here or is there a valid reason for POS being signed? Specifically does BCF permit negative positions? It doesn't particularly make any sense to me and the VCF part of the spec indicates 0 and above only.

I see CHROM is also signed, which again makes no sense. I suspect this is just the same issue we had with BAM where some things were lazily defined.

As such I don't see the issue with our code treating pos as unsigned, as I'd argue it's the spec that is invalid and not the implementation.

Edit: although having said that I see we already use le_to_i32() for rid (chromosome). Can these be negative? AFAIK we don't have a notion of "*" chr as we doing in unmapped SAM, so the offset into the dictionary should always be positive. Putting -1 in it gave me an error, so that seems probable.

The 0 coordinate is valid in VCF specification, but the round-trip
VCF -> BCF -> VCF turns MT:0 into MT:4294967296. Add a check to
detect this overflow.

See samtools#1475 and samtools/bcftools#1753
@pd3
Copy link
Member Author

pd3 commented Jul 20, 2022

I also think it's a result of lazy definition, there is no reason for negative coordinates, even on circular chromosomes. Forced pushed a code that is in sync with the latest develop.

@daviesrob
Copy link
Member

Precisely one negative coordinate is allowed in BCF, specifically -1, as already noted. Hence it's signed in the specification.

bcf_record_check() currently checks for negative contig IDs, but not negative positions (< -1). vcf_parse() uses hts_str2uint() for pos so doesn't allow negative values (remember vcf pos is bcf pos + 1). It also sets BCF_IS_64BIT if pos >= INT32_MAX which will cause bcf_write() to reject the record and advise using VCF instead. So it's quite difficult to get a pos value in a BCF file that big - I suspect you'd have to hand craft it.

So this fix will work, but pos > INT32_MAX in BCF isn't really supported.

@jkbonfield
Copy link
Contributor

Ah of course, I'm forgetting bcf subtracts 1 from VCF, so the 0 goes negative.

Definitely in "wouldn't start from here" territory. :-)

@daviesrob
Copy link
Member

I've now checked that this fixed the problem reported on bcftools sort, so:
Fixes #1475
Fixes samtools/bcftools#1753

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants