tabix failure with a tab-separated file #1165

Closed
p69180 opened this issue Oct 25, 2020 · 1 comment · Fixed by #1350

p69180 commented Oct 25, 2020

Hi,

I tried to index a tab-separated file, but it did not work. I cannot work out what is wrong from the manual.
(tabix & bgzip version: 1.10.2)

What I did:

$ cat segments_edit.txt
1 12807 1363539 2 1 1
1 1375071 2390715 2 1 1
1 2390840 2391074 13 13 0
1 2391081 2606687 3 3 0
1 2606722 2607162 10 8 2
1 2608327 2769359 3 3 0
1 2769933 4692525 2 1 1
1 4692613 4693465 5 4 1
1 4693471 5727630 2 1 1
1 5727636 5729995 9 9 0
$ bgzip -c segments_edit.txt > segments_edit.txt.gz
$ tabix -s 1 -b 2 -e 3 segments_edit.txt.gz
[E::hts_hopen] Failed to open file segments_edit.txt.gz
[E::hts_open_format] Failed to open file "segments_edit.txt.gz" : Exec format error
Couldn't understand format of "segments_edit.txt.gz"

Curiously, a similar tab-separated file with more columns did not produce the error.

$ cat segments_edit2.txt
1 12807 1363539 0.46048791990606 606 0.130356089752489 1.01605651970555 34814 0.270825944513021 2 1 1 -6.79066486181859
1 1375071 2390715 0.498852332976964 1100 0.10633779244189 1.02051517964992 37882 0.268514575387998 2 1 1 -6.73678537439279
1 2390840 2391074 0.149660750644182 9 0.136341483048997 1.74919776595488 18 0.29637515971167 13 13 0 -6.73342046642509
1 2391081 2606687 0.312823414314777 352 0.105477395509564 1.02408066029078 8861 0.248499967924125 3 3 0 -6.76450678203741
1 2606722 2607162 0.310212098970814 23 0.130361901245888 0.814279114141102 61 0.296649338537346 10 8 2 -6.7356238928564
1 2608327 2769359 0.313026985360921 625 0.106897888180431 1.02576493748744 7895 0.223811722067425 3 3 0 -6.76170437072213
1 2769933 4692525 0.494485945916869 2558 0.0913973159663615 1.02104811250325 67219 0.265606106221456 2 1 1 -6.74269965061846
1 4692613 4693465 0.353173170862764 45 0.126153578040915 0.815532434283755 89 0.223744293072089 5 4 1 -6.73753695859768
1 4693471 5727630 0.498894084261619 1228 0.0763217597526189 1.01600832273797 39203 0.255378732663932 2 1 1 -6.73573321327665
1 5727636 5729995 0.191394836620607 94 0.108008545041011 0.855271544591458 220 0.277547127775007 9 9 0 -6.74512739178415
$ bgzip -c segments_edit2.txt > segments_edit2.txt.gz
$ tabix -s 1 -b 2 -e 3 segments_edit2.txt.gz

What am I doing wrong?
Thank you in advance.

@jmarshall
Member

This is the same problem as #1085 — your file matches the pattern for a FASTQ index file. If this were to be fixed, the practical way to do it would probably be to recognise these obscure format types only if the filename has .fai/.fqi in it, and otherwise let them fall through to match the more generic BED (as discussed on the previous issue).
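
To make the cause concrete, here is a minimal sketch of a content-only row check. It is not htslib's code: the function name `looks_like_fqi_row` and the rule "a name column followed by exactly five integer columns" are simplifying assumptions made for illustration. The two rows are the first lines of the files quoted above.

```python
def looks_like_fqi_row(line: str) -> bool:
    """Hypothetical check: one name column followed by exactly five integer columns."""
    cols = line.split()
    return len(cols) == 6 and all(c.isdigit() for c in cols[1:])


# First row of segments_edit.txt (6 columns, all integers).
short_row = "1 12807 1363539 2 1 1"

# First row of segments_edit2.txt (13 columns, several of them floats).
long_row = (
    "1 12807 1363539 0.46048791990606 606 0.130356089752489 "
    "1.01605651970555 34814 0.270825944513021 2 1 1 -6.79066486181859"
)

print(looks_like_fqi_row(short_row))  # True: shaped like a FASTQ index row
print(looks_like_fqi_row(long_row))   # False: falls through to a generic type
```

Under this assumption the six-column file is indistinguishable from a FASTQ index by content alone, while the thirteen-column file is not, which would be consistent with only the first file failing.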

p69180 closed this as completed Oct 25, 2020
jmarshall added a commit to jmarshall/htslib that referenced this issue Oct 31, 2021
Format detection to date uses only the stream contents, as filenames
are not always available (e.g., when reading from standard input) or
may be inaccurate or unexpected. However there are a very few cases
where the filename extension is important:

* FASTA/Q indexes (uncommon for hts_open()) are a particular case
  of 5/6-column BED files (comparatively common). We don't want to
  misrecognise any actual BED files as FASTA/Q indexes, so require a
  .fai/.fqi extension for the latter -- which are unlikely to appear
  on standard input anyway, so filenames will usually be available.

* GZI indexes have not previously been recognised, as they have no
  magic numbers. They can now be recognised by their .gzi extension.

Fixes samtools#1085, fixes samtools#1165, and fixes samtools#1347.
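
The first bullet of the commit message can be sketched roughly as follows. This is an illustrative model only, not the htslib implementation: `guess_format`, its return strings, the `endswith` test, and the sample filename `ref.fa.fai` with its index row are all assumptions introduced here.

```python
from typing import Optional


def guess_format(first_line: str, filename: Optional[str] = None) -> str:
    """Hypothetical detector: index formats need a matching filename extension."""
    cols = first_line.split()
    ints = [c.isdigit() for c in cols[1:]]
    name = filename or ""

    # FASTQ/FASTA index rows are only accepted when the extension agrees.
    if name.endswith(".fqi") and len(cols) == 6 and all(ints):
        return "fqi"
    if name.endswith(".fai") and len(cols) == 5 and all(ints):
        return "fai"

    # Otherwise fall through to the generic BED-like shape:
    # a name column followed by integer start and end columns.
    if len(cols) >= 3 and ints[0] and ints[1]:
        return "bed"
    return "unknown"


row = "1 12807 1363539 2 1 1"  # first row of segments_edit.txt
print(guess_format(row, "segments_edit.txt.gz"))  # bed
print(guess_format(row))                          # bed (no filename, e.g. stdin)
print(guess_format("chr1 248956422 6 60 61", "ref.fa.fai"))  # fai
```

With the extension gate in place, the six-column row from this issue is treated as generic BED-like text whether or not a filename is available.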
whitwham pushed a commit that referenced this issue Nov 3, 2021