Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix multi-threaded on-the-fly indexing problems #1672

Merged
merged 2 commits into from
Sep 22, 2023

Conversation

daviesrob
Copy link
Member

When indexing on-the-fly, the virtual offset of the end of
each read gets passed to hts_idx_push().  As HTSlib starts
a new BGZF block if the next read doesn't fit, the stored
offset needs to be adjusted to point to the start of the
next block instead of the end of the old one.  (While the
two locations effectively map to the same uncompressed
offset, pointing to the end of a block causes problems for
HTSlib before 1.11, and HTSJDK.)

When multi-threading, this adjustment was done by calling
bgzf_idx_amend_last() from the main thread.  However, if the
data was written out too promptly, the function could be called
too late preventing the adjustment from being applied.  To fix
this, the adjustment is now moved to bgzf_idx_flush() which is
called by the writer thread, ensuring that it is always applied
on time.

The internal function bgzf_idx_amend_last() is removed as it's no
longer needed.

Fixes samtools/samtools#1861 (samtools merge --write-index
  results in Invalid file pointer [from HTSJDK] in rare cases
  whereas samtools index fixes the problem)
* Switch from hts_idx_push() to bgzf_idx_push() for on-the-fly
  indexing of BCF and VCF.bgz files.  The latter function is
  needed to record the correct offsets when using multi-threaded
  BGZF compression.

  Fixes samtools/bcftools#1985

* Only allow indexing of BGZF-compressed files.  It's necessary to
  enforce this as on-the-fly indexing assumes that the file
  pointer is in htsFile::fp.bgzf, but uncompressed VCF uses
  htsFile::fp.hfile.
@whitwham whitwham merged commit 7994291 into samtools:develop Sep 22, 2023
9 checks passed
@daviesrob daviesrob deleted the idx_amend branch September 26, 2023 13:24
jmarshall added a commit to jmarshall/htslib that referenced this pull request Jul 16, 2024
daviesrob pushed a commit that referenced this pull request Jul 17, 2024
* Ignore and clean test/*/FAIL* for six subdirectories

These files can appear in base_mods, fastq, mpileup, and sam_filter
as well as faidx and tabix.

* Fix comment header to use `@CO\t` as per other comment headers

* Remove extraneous inclusion and add missing dependency

* Remove last traces of previously deleted bgzf_idx_amend_last()

As noted in #1722, this function was removed in PR #1672.

* Use isspace_c() et al in annot-tsv.c

* Minor corrections to system headers

Plain getopt() is declared in <unistd.h>; strcasecmp() et al are only
portably declared in <strings.h>.
jkbonfield added a commit to jkbonfield/htslib that referenced this pull request Sep 12, 2024
When using bcftools view --write-index -o out.vcf.gz the virtual file
offsets can differ depending on whether we do a bgzf_tell before or
after a flush.  Specifically it could point to the last byte in the
current BGZF block or the first byte in the next BGZF block.
Ultimately both of these resolve to the same physical location, but in
some situations the former may mean attempting to read zero bytes (the
remainder of the bgzf block).  This has been known in the past to be
misinterpreted as an EOF.  (See samtools/samtools#1861)

It also means the contents of the index produced by --write-index and
a separate bcftools index command can yield different results, albeit
both representing the same data.

The fix for the samtools / bcftools issue above (samtools#1672)
when multi-threading inadvertently recreated the bug when not
multi-threading.
jkbonfield added a commit to jkbonfield/htslib that referenced this pull request Sep 12, 2024
When using bcftools view --write-index -o out.vcf.gz the virtual file
offsets can differ depending on whether we do a bgzf_tell before or
after a flush.  Specifically it could point to the last byte in the
current BGZF block or the first byte in the next BGZF block.
Ultimately both of these resolve to the same physical location, but in
some situations the former may mean attempting to read zero bytes (the
remainder of the bgzf block).  This has been known in the past to be
misinterpreted as an EOF.  (See samtools/samtools#1861)

It also means the contents of the index produced by --write-index and
a separate bcftools index command can yield different results, albeit
both representing the same data.

The fix for the samtools / bcftools issue above (samtools#1672)
when multi-threading inadvertently recreated the bug when not
multi-threading.

Fixes samtools/bcftools#2267
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants