Make tabix support CSI indices with large positions.
This already worked for SAM and VCF where the SQ and Contig lines
indicate the maximum length of a reference sequence.  However for BED
files this was left as zero, which had the effect of fighting against
the user by decreasing n_lvls as we increase min_shift.

When unknown, max_ref_len is now an arbitrary large size (100G), but
this may produce more levels than are strictly necessary, although
this doesn't appear to have negative consequences.
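
To illustrate the arithmetic behind the two paragraphs above, here is a minimal C sketch. It is not the htslib source. The `sketch_*` names are hypothetical, the starting-depth formula `(31 - min_shift + 2) / 3` is an assumption about how tbx_index picks its initial n_lvls for CSI, and the growth loop is an assumed simplification of what `adjust_n_lvls` does: raise n_lvls until 2^(min_shift + 3*n_lvls) covers max_ref_len.

```c
#include <assert.h>
#include <stdint.h>

/* Same value as the new default in this commit: 100 * 2^30. */
static const int64_t SKETCH_DEFAULT_MAX_REF_LEN =
    (int64_t)100 * 1024 * 1024 * 1024;

/* Largest position a binning scheme is assumed to cover:
 * 2^(min_shift + 3 * n_lvls). */
static inline int64_t sketch_maxpos(int min_shift, int n_lvls)
{
    assert(min_shift > 0 && n_lvls >= 0);
    return (int64_t)1 << (min_shift + 3 * n_lvls);
}

/* ASSUMPTION: initial depth chosen before any adjustment.
 * Note that it shrinks as min_shift grows. */
static inline int sketch_initial_n_lvls(int min_shift)
{
    return (31 - min_shift + 2) / 3;
}

/* ASSUMPTION: simplified stand-in for adjust_n_lvls().
 * Adds levels (each multiplies coverage by 8) until max_len fits.
 * With max_len == 0 the loop never runs, so n_lvls is unchanged. */
static inline int sketch_adjust_n_lvls(int min_shift, int n_lvls,
                                       int64_t max_len)
{
    int64_t s = sketch_maxpos(min_shift, n_lvls);
    for (; max_len > s; ++n_lvls, s <<= 3) {}
    return n_lvls;
}
```

Under these assumptions, a max_ref_len of zero leaves min_shift=14 at 6 levels and min_shift=20 at 4 levels, so both top out at 2^32 and a larger min_shift buys no extra range. With the 100G default they grow to 8 and 6 levels respectively, and both reach 2^38.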

Also fixed the misleading error message about CSI being unable to
index data.  This was perhaps intended to be for mis-specified VCF
data where a contig was listed as small but the records were at larger
offsets; however, it simply led me up the garden path by categorically
stating CSI cannot store such large values.
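
The condition behind this message can be sketched as follows. This is an assumed simplification, not the body of hts_idx_check_range: `sketch_region_fits` is a hypothetical helper, and the limit 2^(min_shift + 3*n_lvls) is taken as the largest position the index parameters can hold.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if the region [beg, end] fits within the assumed limit
 * 2^(min_shift + 3 * n_lvls), and 0 otherwise.
 * ASSUMPTION: hypothetical helper for illustration only. */
static inline int sketch_region_fits(int min_shift, int n_lvls,
                                     int64_t beg, int64_t end)
{
    assert(min_shift > 0 && n_lvls >= 0);
    int64_t maxpos = (int64_t)1 << (min_shift + 3 * n_lvls);
    return beg <= maxpos && end <= maxpos;
}
```

Seen this way, a failure is a property of the chosen parameters rather than of CSI itself: raising either min_shift or the depth raises the limit, which is what the reworded message suggests.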
jkbonfield committed Sep 9, 2022
1 parent 76d4618 commit 6366029
Showing 3 changed files with 8 additions and 5 deletions.
6 changes: 3 additions & 3 deletions hts.c
@@ -2354,9 +2354,9 @@ int hts_idx_check_range(hts_idx_t *idx, int tid, hts_pos_t beg, hts_pos_t end)
         return 0;
 
     if (idx->fmt == HTS_FMT_CSI) {
-        hts_log_error("Region %"PRIhts_pos"..%"PRIhts_pos
-                      " cannot be stored in a csi index. "
-                      "Please check headers match the data",
+        hts_log_error("Region %"PRIhts_pos"..%"PRIhts_pos" "
+                      "cannot be stored in a csi index with these parameters. "
+                      "Please use a larger min_shift or depth",
                       beg, end);
     } else {
         hts_log_error("Region %"PRIhts_pos"..%"PRIhts_pos
2 changes: 1 addition & 1 deletion tabix.1
@@ -101,7 +101,7 @@ start column. [5]
 Force to overwrite the index file if it is present.
 .TP
 .BI "-m, --min-shift " INT
-set minimal interval size for CSI indices to 2^INT [14]
+Set minimal interval size for CSI indices to 2^INT [14]
 .TP
 .BI "-p, --preset " STR
 Input format for indexing. Valid values are: gff, bed, sam, vcf.
5 changes: 4 additions & 1 deletion tbx.c
@@ -321,8 +321,11 @@ tbx_t *tbx_index(BGZF *fp, int min_shift, const tbx_conf_t *conf)
             continue;
         }
         if (first == 0) {
-            if (fmt == HTS_FMT_CSI)
+            if (fmt == HTS_FMT_CSI) {
+                if (!max_ref_len)
+                    max_ref_len = (int64_t)100*1024*1024*1024; // 100G default
                 n_lvls = adjust_n_lvls(min_shift, n_lvls, max_ref_len);
+            }
             tbx->idx = hts_idx_init(0, fmt, last_off, min_shift, n_lvls);
             if (!tbx->idx) goto fail;
             first = 1;