BCFtools "--write-index=tbi" and bcftools index --tbi generate different index files #2267

Closed
freeseek opened this issue Aug 21, 2024 · 3 comments · Fixed by samtools/htslib#1837
Labels
htslib-dependent Cannot be fixed until htslib is fixed

Comments

@freeseek
Contributor

While I was trying to run the GATK HaplotypeCaller with option --alleles input.vcf.gz on a file that I had generated with BCFtools with option --write-index=tbi, I got the error:

htsjdk.samtools.util.RuntimeIOException: java.io.IOException: Invalid file pointer: 13729595107 for input.vcf.gz

I read that this is usually caused by an out-of-date or corrupt index file. I then regenerated the index file with bcftools index --tbi and the GATK HaplotypeCaller worked without issues. Indeed, the two index files were different, with different md5sums.
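For anyone trying to read the number in that message: assuming htsjdk is reporting a BGZF virtual file offset (an assumption on my part, not something confirmed in this report), the upper 48 bits are the byte offset of a compressed block in the .vcf.gz and the lower 16 bits are the position inside that block after decompression. A minimal sketch of the split, where `decode_voffset` is a made-up helper name:

```shell
# Hypothetical helper: split a BGZF virtual offset into
# "<compressed block offset> <offset within the uncompressed block>".
decode_voffset() {
  voffset=$1
  # Upper 48 bits: where the block starts in the compressed file.
  # Lower 16 bits: position inside the decompressed block.
  echo "$(( voffset >> 16 )) $(( voffset & 65535 ))"
}

decode_voffset 13729595107   # prints: 209496 65251
```

A within-block offset that close to the 64 KiB maximum block size would fit a pointer sitting right at a block boundary, though that is only a guess from the numbers.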

I could not replicate an error with BCFtools but I did notice that --write-index=tbi and bcftools index --tbi don't always create the same index files if applied to files large enough. This is reproducible with BCFtools 1.20 (using htslib 1.20):

(echo "##fileformat=VCFv4.2"
echo "##contig=<ID=chr1>"
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
seq -f "chr1\t%.0f\t.\tA\tG\t.\t.\t." 1 6914049 | sed 's/\\t/\t/g') | \
  bcftools view --no-version --output-type z --write-index=tbi --output output.vcf.gz && \
  cat output.vcf.gz | bcftools index --force --tbi --output output.vcf.gz.tbi2 && \
  md5sum output.vcf.gz.tbi output.vcf.gz.tbi2
284fdb2a372efb91e7e692707f5ff1c6  output.vcf.gz.tbi
4cf5f01b5210ce5193c38339161027fb  output.vcf.gz.tbi2

I could not understand why they are different:

diff <(xxd output.vcf.gz.tbi) <(xxd output.vcf.gz.tbi2)
2c2
< 00000010: 4311 659a 0b78 ce75 ffc7 a7c3 50e6 b41c  C.e..x.u....P...
---
> 00000010: 4211 659a 0b78 ce75 ffc7 a7c3 50e6 b41c  B.e..x.u....P...
240,278c240,278
< 00000ef0: f905 bbfc c1c3 f293 c042 76b9 fb84 1f0d  .........Bv.....
< 00000f00: 764f d022 df52 aef1 3dea c566 4df0 f8ef  vO.".R..=..fM...
< 00000f10: 1f5b 8afd ef2e 9fdd d5b3 bb78 76d7 ceee  .[.........xv...
< 00000f20: d2d9 5d39 bb0b 6777 ddec 2e9b dd55 b3bb  ..]9..gw.....U..
< 00000f30: 6876 d7cc ee92 d95d 31bb 0b66 77bd ec2e  hv.....]1..fw...
< 00000f40: 97dd d5b2 bb58 76d7 caee 52d9 5d29 bb0b  .....Xv...R.])..
< 00000f50: 6577 9dec 2e93 dd55 b2bb 4876 d7c8 ee12  ew.....U..Hv....
< 00000f60: d95d 21bb 0b64 777d ec2e 8fdd d5b1 bb38  .]!..dw}.......8
< 00000f70: 76d7 c6ee d2fc cfae 8cd9 85b1 bb2e 7697  v.............v.
< 00000f80: c5ee aad8 5d14 bb6b 6277 49ec ae88 dd05  ....]..kbwI.....
< 00000f90: b1bb 1e76 97c3 ee6a d85d 0cbb 6b61 7729  ...v...j.]..kaw)
< 00000fa0: ecae 84dd 85b0 bb0e 7697 c1ee 2ad8 5d04  ........v...*.].
< 00000fb0: bb6b 6077 09ec ae80 dd05 b077 fdf6 2edf  .k`w.......w....
< 00000fc0: ded5 dbbb 787b d76e efd2 ed5d b9bd 0bb7  ....x{.n...]....
< 00000fd0: 77dd f62e dbde 55db bb68 7bd7 6cef 92ed  w.....U..h{.l...
< 00000fe0: 5db1 bd0b b677 bdf6 2ed7 ded5 dabb 587b  ]....w........X{
< 00000ff0: d76a ef52 ed5d a9bd 0bb5 779d f62e d3de  .j.R.]....w.....
< 00001000: 55da bb48 7bd7 68ef 12ed 5da1 bd0b b477  U..H{.h...]....w
< 00001010: 7df6 2ecf ded5 d9bb 387b d766 efd2 ec5d  }.......8{.f...]
< 00001020: 99bd 0bb3 775d f62e cbde 55d9 bb28 7bd7  ....w]....U..({.
< 00001030: 64ef 92ec 5d91 bd0b b277 3df6 2ec7 ded5  d...]....w=.....
< 00001040: d8bb 187b d762 ef52 ec5d 89bd 0bb1 771d  ...{.b.R.]....w.
< 00001050: f62e c3de 55d8 bb08 7bd7 60ef 12ec 5d81  ....U...{.`...].
< 00001060: bd0b b0b3 7e3b cbb7 b37a 3b8b b7b3 763b  ....~;...z;...v;
< 00001070: 4bb7 b372 3b0b b7b3 6e3b cbb6 b36a 3b8b  K..r;...n;...j;.
< 00001080: b6b3 663b 4bb6 b362 3b0b b6b3 5e3b cbb5  ..f;K..b;...^;..
< 00001090: b35a 3b8b b5b3 563b 4bb5 b352 3b0b b5b3  .Z;...V;K..R;...
< 000010a0: 4e3b cbb4 b34a 3b8b b4b3 463b 4bb4 b342  N;...J;...F;K..B
< 000010b0: 3b0b b4b3 3e3b cbb3 b33a 3b8b b3b3 363b  ;...>;...:;...6;
< 000010c0: 4bb3 b332 3b0b b3b3 2e3b cbb2 b32a 3b8b  K..2;....;...*;.
< 000010d0: b2b3 263b 4bb2 b322 3b0b b2b3 1e3b cbb1  ..&;K..";....;..
< 000010e0: b31a 3b8b b1b3 163b 4bb1 b312 3b0b b1b3  ..;....;K...;...
< 000010f0: 0e3b cbb0 b30a 3b8b b0b3 063b 4bb0 b302  .;....;....;K...
< 00001100: 3b0b b067 7d7b 96b7 6775 7b16 b767 6d7b  ;..g}{..gu{..gm{
< 00001110: 96b6 6765 7b16 b667 5d7b 96b5 6755 7b16  ..ge{..g]{..gU{.
< 00001120: b567 4d7b 96b4 6745 7b16 b467 3d7b 96b3  .gM{..gE{..g={..
< 00001130: 6735 7b16 fbcf 59eb d18f ff07 1f98 b6a6  g5{...Y.........
< 00001140: 4135 0000 1f8b 0804 0000 0000 00ff 0600  A5..............
< 00001150: 4243 0200 1b00 0300 0000 0000 0000 0000  BC..............
---
> 00000ef0: f905 bbdc 7de2 8906 0bd9 e5ee 13fe 3fec  ....}.........?.
> 00000f00: 9ea0 45be a55c e37b d48b cd9a e0f1 df3f  ..E..\.{.......?
> 00000f10: b614 fbdf 5d3e bbab 6777 f1ec ae9d dda5  ....]>..gw......
> 00000f20: b3bb 7276 17ce eeba d95d 36bb ab66 77d1  ..rv.....]6..fw.
> 00000f30: ecae 99dd 25b3 bb62 7617 ccee 7ad9 5d2e  ....%..bv...z.].
> 00000f40: bbab 6577 b1ec ae95 dda5 b2bb 5276 17ca  ..ew........Rv..
> 00000f50: ee3a d95d 26bb ab64 7791 ecae 91dd 25b2  .:.]&..dw.....%.
> 00000f60: bb42 7617 c8ee fad8 5d1e bbab 6377 71ec  .Bv.....]...cwq.
> 00000f70: ae8d dda5 f99f 5d19 b30b 6377 5dec 2e8b  ......]...cw]...
> 00000f80: dd55 b1bb 2876 d7c4 ee92 d85d 11bb 0b62  .U..(v.....]...b
> 00000f90: 773d ec2e 87dd d5b0 bb18 76d7 c2ee 52d8  w=........v...R.
> 00000fa0: 5d09 bb0b 6177 1dec 2e83 dd55 b0bb 0876  ]...aw.....U...v
> 00000fb0: d7c0 ee12 d85d 01bb 0b60 effa ed5d bebd  .....]...`...]..
> 00000fc0: abb7 77f1 f6ae ddde a5db bb72 7b17 6eef  ..w........r{.n.
> 00000fd0: baed 5db6 bdab b677 d1f6 aed9 de25 dbbb  ..]....w.....%..
> 00000fe0: 627b 176c ef7a ed5d aebd abb5 77b1 f6ae  b{.l.z.]....w...
> 00000ff0: d5de a5da bb52 7b17 6aef 3aed 5da6 bdab  .....R{.j.:.]...
> 00001000: b477 91f6 aed1 de25 dabb 427b 1768 effa  .w.....%..B{.h..
> 00001010: ec5d 9ebd abb3 7771 f6ae cdde a5d9 bb32  .]....wq.......2
> 00001020: 7b17 66ef baec 5d96 bdab b277 51f6 aec9  {.f...]....wQ...
> 00001030: de25 d9bb 227b 1764 ef7a ec5d 8ebd abb1  .%.."{.d.z.]....
> 00001040: 7731 f6ae c5de a5d8 bb12 7b17 62ef 3aec  w1........{.b.:.
> 00001050: 5d86 bdab b077 11f6 aec1 de25 d8bb 027b  ]....w.....%...{
> 00001060: 1760 67fd 7696 6f67 f576 166f 67ed 7696  .`g.v.og.v.og.v.
> 00001070: 6e67 e576 166e 67dd 7696 6d67 d576 166d  ng.v.ng.v.mg.v.m
> 00001080: 67cd 7696 6c67 c576 166c 67bd 7696 6b67  g.v.lg.v.lg.v.kg
> 00001090: b576 166b 67ad 7696 6a67 a576 166a 679d  .v.kg.v.jg.v.jg.
> 000010a0: 7696 6967 9576 1669 678d 7696 6867 8576  v.ig.v.ig.v.hg.v
> 000010b0: 1668 677d 7696 6767 7576 1667 676d 7696  .hg}v.gguv.ggmv.
> 000010c0: 6667 6576 1666 675d 7696 6567 5576 1665  fgev.fg]v.egUv.e
> 000010d0: 674d 7696 6467 4576 1664 673d 7696 6367  gMv.dgEv.dg=v.cg
> 000010e0: 3576 1663 672d 7696 6267 2576 1662 671d  5v.cg-v.bg%v.bg.
> 000010f0: 7696 6167 1576 1661 670d 7696 6067 0576  v.ag.v.ag.v.`g.v
> 00001100: 1660 cffa f62c 6fcf eaf6 2c6e cfda f62c  .`...,o...,n...,
> 00001110: 6dcf caf6 2c6c cfba f62c 6bcf aaf6 2c6a  m...,l...,k...,j
> 00001120: cf9a f62c 69cf 8af6 2c68 cf7a f62c 67cf  ...,i...,h.z.,g.
> 00001130: 6af6 2cf6 9fb3 d6a3 1fff 0f09 a01d e041  j.,............A
> 00001140: 3500 001f 8b08 0400 0000 0000 ff06 0042  5..............B
> 00001150: 4302 001b 0003 0000 0000 0000 0000 00    C..............

I was not able to reproduce this discrepancy when creating .csi index files. I am not sure whether this is related to my issue with GATK HaplotypeCaller, and I could not trigger an error in BCFtools itself when using an index generated with --write-index=tbi, but hopefully understanding the source of this discrepancy will be of use.

@graphenn

graphenn commented Aug 28, 2024

Hello, I am seeing the same issue.

I also use GATK, specifically GenomicsDBImport, and I split the workflow with a BED file, running one workflow per BED interval.
The faulty index file breaks the run for some BED regions, while all other regions succeed.
For example, this file:
tbi.zip

causes GATK GenomicsDBImport to fail at these 4 regions:

chrY    7169372 7189364
chrY    7189364 7209356
chrY    7309316 7329308
chrY    7329308 7349300

All other regions succeed (regions at chrY 240000~710000).

The failure log is the same in each case:

htsjdk.samtools.util.RuntimeIOException: java.io.IOException: Invalid file pointer: 864308690651 for input.g.vcf.gz

I am sure that the same command worked with bcftools 1.18: across about 9000+ samples there were no errors.

With bcftools 1.20, however, this error occurs.
The command is:

bcftools concat "${base_name}"_A.g.vcf.gz "${base_name}"_B.g.vcf.gz |
            bcftools sort -Oz  --write-index -o "${base_name}".g.vcf.gz##idx##"${base_name}".g.vcf.gz.tbi

If I replace this tbi file with one generated by the tabix command, GATK runs without errors.
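A sketch of that replacement, assuming the same input naming as the command above: the index is no longer written during the sort but built afterwards with tabix. `concat_sort_then_tabix` is a hypothetical wrapper name, and its argument stands in for `${base_name}`.

```shell
# Hypothetical wrapper: $1 is the base name used in the command above.
# Writes "$1.g.vcf.gz" without --write-index, then indexes it with tabix.
concat_sort_then_tabix() {
  base_name=$1
  bcftools concat "${base_name}"_A.g.vcf.gz "${base_name}"_B.g.vcf.gz |
    bcftools sort -Oz -o "${base_name}".g.vcf.gz &&
    tabix -f -p vcf "${base_name}".g.vcf.gz   # -f overwrites an existing .tbi
}

# Usage (placeholder base name):
#   concat_sort_then_tabix sample01
```

This costs an extra pass over the output file, which is the trade-off for not relying on the index written during compression.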

@pd3 added the htslib-dependent (Cannot be fixed until htslib is fixed) label Sep 9, 2024
@jkbonfield
Contributor

We're looking at it, but note that tbi indices are compressed so you need to zcat them before hex-dumping to evaluate any differences.
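Following that advice, a rough way to compare two .tbi files on their decompressed contents instead of their raw bytes. `index_payload_md5` is a made-up helper name, and `gzip -dc` is used here as a portable stand-in for zcat:

```shell
# Hypothetical helper: checksum of a .tbi file's decompressed contents.
# A .tbi file is BGZF (gzip-compatible), so gzip -dc can unpack it.
index_payload_md5() {
  gzip -dc "$1" | md5sum | cut -d ' ' -f 1
}

# Usage with the files from the report above:
#   index_payload_md5 output.vcf.gz.tbi
#   index_payload_md5 output.vcf.gz.tbi2
# Hex-level view of the decompressed data (bash process substitution):
#   diff <(gzip -dc output.vcf.gz.tbi | xxd) <(gzip -dc output.vcf.gz.tbi2 | xxd)
```

In a diff of the compressed bytes, a small change in the index data can show up as a long run of differing lines, so the decompressed view is the one that shows which entries actually changed.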

@jkbonfield
Contributor

jkbonfield commented Sep 12, 2024

We think we found where this problem crept in (ironically fixing a related issue with multi-threading).

For now, you can work around the problem by using --threads 1 to do asynchronous bgzf compression, which means it has to track the index pointers a little differently and should then produce the same index files as a standalone bcftools index command. A PR has been made to htslib which will fix it for the upcoming release (imminent).
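As a sketch of the workaround, here is the write step from the original report with --threads 1 added. The wrapper name `write_vcf_with_tbi` and the file names in the usage lines are placeholders, not anything from bcftools itself.

```shell
# Hypothetical wrapper: $1 is the input VCF/BCF, $2 the output .vcf.gz.
# --threads 1 is the workaround described above for the affected releases;
# the index is written next to the output as "$2.tbi".
write_vcf_with_tbi() {
  bcftools view --no-version --threads 1 \
    --output-type z --write-index=tbi \
    --output "$2" "$1"
}

# Usage (placeholder file names):
#   write_vcf_with_tbi input.vcf.gz output.vcf.gz
#   bcftools index --force --tbi --output output.vcf.gz.tbi2 output.vcf.gz
#   then compare output.vcf.gz.tbi and output.vcf.gz.tbi2 after decompressing them
```

If the workaround applies, the two index files should decompress to the same contents.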

Edit: note this chimes with your observations too with 1.18 working. The threading index fix (which broke the non-threading indices) was merged between 1.18 and 1.19. Thanks for the bug report.

(Note that in our opinion the index is actually valid, but it triggers a bug in htsjdk, and in older htslib versions too. So the fix is still definitely a good one.)
