Consider filename extensions (where necessary) in format detection #1350

Merged (1 commit) on Nov 3, 2021

jmarshall
Member

@jmarshall jmarshall commented Oct 31, 2021

Fix the issue of some BED files being misrecognised as FASTA/Q indexes, by — as suggested in #1085 (comment) and #1085 (comment) — adding another API function that also considers the filename extension.

I have not implemented “hts_detect_format() should only recognise files as fqi_format or fai_format if they are uncompressed” — as suggested in #1347 (comment) — as it would mean that an affected BED file would be recognised as BED if compressed but as a FASTA/Q index if not compressed, and that seems like an unfortunate inconsistency. (The inconsistency of misrecognising an actual FASTA-IDX as BED when it's read from standard input (uncommon for index files!) is comparatively minor.)

  • FASTA/Q indexes (uncommon for hts_open()) are a particular case of 5/6-column BED files (comparatively common). We don't want to misrecognise any actual BED files as FASTA/Q indexes, so require a .fai/.fqi extension for the latter — which are unlikely to appear on standard input anyway, so filenames will usually be available.

We now have the machinery to recognise GZI indexes as well:

  • GZI indexes have not previously been recognised, as they have no magic numbers. They can now be recognised by their .gzi extension.

Fixes #1085, fixes #1165, and fixes #1347.
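
For readers who want the decision rule spelled out, here is a minimal sketch of the extension check described above. It is not the code merged in this pull request: the enum `sketch_format` and the helpers `has_extension()` and `refine_by_extension()` are hypothetical names invented for illustration, and the sketch assumes that content-only detection has already produced a guess.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical categories for this sketch only; these are not htslib types. */
enum sketch_format {
    SKETCH_UNKNOWN, SKETCH_BED, SKETCH_FAI, SKETCH_FQI, SKETCH_GZI
};

/* Return 1 if fname ends with ext (e.g. ".fai"), otherwise 0.
   A NULL fname means no filename is available (e.g. standard input). */
static int has_extension(const char *fname, const char *ext)
{
    if (fname == NULL) return 0;
    size_t flen = strlen(fname), elen = strlen(ext);
    return flen >= elen && strcmp(fname + flen - elen, ext) == 0;
}

/* Refine a content-based guess using the filename extension.
   Assumed rules, taken from the description above:
   - FAI/FQI-shaped content is only accepted as an index when the
     filename carries the matching .fai/.fqi extension; otherwise it
     is treated as BED.
   - Content with no recognisable signature is treated as GZI only
     when the filename ends in .gzi.
   - Any other guess is returned unchanged. */
static enum sketch_format refine_by_extension(enum sketch_format content_guess,
                                              const char *fname)
{
    if (content_guess == SKETCH_FAI)
        return has_extension(fname, ".fai") ? SKETCH_FAI : SKETCH_BED;
    if (content_guess == SKETCH_FQI)
        return has_extension(fname, ".fqi") ? SKETCH_FQI : SKETCH_BED;
    if (content_guess == SKETCH_UNKNOWN && has_extension(fname, ".gzi"))
        return SKETCH_GZI;
    return content_guess;
}
```

Under these assumptions, `refine_by_extension(SKETCH_FAI, "regions.bed")` and `refine_by_extension(SKETCH_FAI, NULL)` both yield `SKETCH_BED`, while `refine_by_extension(SKETCH_FAI, "ref.fa.fai")` keeps `SKETCH_FAI`. The stdin case is the minor inconsistency accepted above.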

Format detection to date uses only the stream contents, as filenames
are not always available (e.g., when reading from standard input) or
may be inaccurate or unexpected. However, there are a very few cases
where the filename extension is important:

* FASTA/Q indexes (uncommon for hts_open()) are a particular case
  of 5/6-column BED files (comparatively common). We don't want to
  misrecognise any actual BED files as FASTA/Q indexes, so require a
  .fai/.fqi extension for the latter -- which are unlikely to appear
  on standard input anyway, so filenames will usually be available.

* GZI indexes have not previously been recognised, as they have no
  magic numbers. They can now be recognised by their .gzi extension.

Fixes samtools#1085, fixes samtools#1165, and fixes samtools#1347.
@whitwham whitwham merged commit 9045785 into samtools:develop Nov 3, 2021
@jmarshall jmarshall deleted the bed-vs-fai branch November 3, 2021 15:12
jamespeapen added a commit to huishenlab/iscream that referenced this pull request Oct 18, 2024
1.13 doesn't correctly recognize bed indexes
<samtools/htslib#1350>