CSI file is BGZF compressed but this is not mentioned in the CSIv1 spec #765

Open
bguo068 opened this issue Apr 13, 2024 · 2 comments

bguo068 commented Apr 13, 2024

I used bcftools 1.19 to index a BCF file and tried to parse the resulting CSI index file according to the spec https://github.com/samtools/hts-specs/blob/26347448cadff3cf40982d60fe2a97f20d2543ea/CSIv1.tex#L20C28-L20C33. It did not work as expected. After running hexdump -C on the csi file, I realized it is not a plain binary file as described in the CSIv1 spec:

00000000  1f 8b 08 04 00 00 00 00  00 ff 06 00 42 43 02 00  |............BC..|
00000010  46 00 73 0e f6 64 e4 63  60 60 60 66 80 00 01 20  |F.s..d.c```f... |
00000020  66 02 62 4f 20 e6 11 86  88 31 22 b1 19 18 0a 0d  |f.bO ....1".....|
...

But it seems consistent with the spec after decompressing it with bgzip -cd test.bcf.csi | hexdump -C:

00000000  43 53 49 01 0e 00 00 00  03 00 00 00 00 00 00 00  |CSI.............|
00000010  10 00 00 00 02 00 00 00  49 00 00 00 09 18 00 00  |........I.......|

Could we add a sentence to the spec to point this out for future readers? Or is it not part of the spec?
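For anyone hitting the same thing, here is a minimal Python sketch of what the two hexdumps above show: detect the BGZF wrapper, decompress, then read the leading CSI fields. The function names (is_bgzf, parse_csi_header) are hypothetical, and the field order (magic, min_shift, depth, l_aux, aux, n_ref) is my reading of the CSIv1 layout, so treat it as an assumption rather than a reference parser. It relies on BGZF being a series of ordinary gzip members, and for simplicity it assumes the BC subfield is the first gzip extra subfield, as in the dump above.

```python
import gzip
import struct

CSI_MAGIC = b"CSI\x01"


def is_bgzf(raw: bytes) -> bool:
    """Heuristic check for a BGZF block header at the start of `raw`.

    Simplification: only looks at the first gzip extra subfield,
    which is where the 'BC' subfield sits in the hexdump above.
    """
    return (
        len(raw) >= 16
        and raw[:4] == b"\x1f\x8b\x08\x04"   # gzip magic, deflate, FLG.FEXTRA
        and raw[12:16] == b"BC\x02\x00"      # SI1='B', SI2='C', SLEN=2
    )


def parse_csi_header(raw: bytes) -> dict:
    """Decompress a .csi payload and parse its leading fields.

    Assumed layout (little-endian): magic[4], min_shift, depth, l_aux,
    aux[l_aux], n_ref.
    """
    # gzip.decompress handles concatenated gzip members, i.e. BGZF blocks.
    data = gzip.decompress(raw)
    if data[:4] != CSI_MAGIC:
        raise ValueError("not a CSI index: bad magic %r" % data[:4])
    min_shift, depth, l_aux = struct.unpack_from("<iii", data, 4)
    aux = data[16:16 + l_aux]
    (n_ref,) = struct.unpack_from("<i", data, 16 + l_aux)
    return {
        "min_shift": min_shift,
        "depth": depth,
        "l_aux": l_aux,
        "aux": aux,
        "n_ref": n_ref,
    }


# Usage (file name taken from the commands above):
#   with open("test.bcf.csi", "rb") as fh:
#       raw = fh.read()
#   print(is_bgzf(raw), parse_csi_header(raw))
```

The aux bytes are returned uninterpreted on purpose, since their meaning depends on what is being indexed.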

@jkbonfield
Contributor

While I agree adding this would be beneficial, it's the least problematic bit about the spec!

It certainly would be good if the original authors could add more about it. One thing that confused me a lot is the "Auxiliary data", which changes format depending on the thing being indexed. (IIRC it's tabix data for VCF and some BAI-related format for BCF). I assume it's meant to be generic, but it also makes it largely unparseable without custom knowledge.

Ping @lh3 @pd3: is there any more information on CSI somewhere else? It looks like it arrived with this commit and subsequent commits. This appears to be where the original minimal spec documentation came from too.


zaeleus commented Apr 15, 2024

See also #70, a long-standing issue noting this:

It is clear from examination of .csi files that they are stored as BGZF (why?), although this is not mentioned and is at odds with the current behaviour of BAI.
