-
Notifications
You must be signed in to change notification settings - Fork 442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expose a bit more of the CRAM API. #1429
Conversation
12c6cc0
to
cc43900
Compare
I've rebased and undrafted this. It's a good start, but I think more functionality will probably come a bit later so no point in waiting for now. My plan was to port over io_lib's cram_size and cram_dump tools, but they're in a bit of a mess and need a complete rewrite so it won't be soon. The main goal of this PR was to enable "samtools reference" as a way of extracting the reference directly from the embedded reference (or optionally from sequence + MD tags instead if present). I also have an alternative tack for embedded references, visible at https://github.com/jkbonfield/htslib/tree/cram_embed_ref2, but it needs a bit more work still. This allows |
htslib/cram.h
Outdated
*/ | ||
HTSLIB_EXPORT | ||
void cram_slice_hdr_get_coords(cram_block_slice_hdr *h, | ||
int *refid, int *start, int *span); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For future-proofing, would it be better to make start
and span
type int64_t
? Even though they can't get bigger than an int
at the moment, it would be useful to not have to change the interface should a future version of CRAM support longer references (and CRAM already uses int64_t
internally for these).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are they semantically hts_pos_t
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, and I did think about suggesting that. But the CRAM internals, which this exposes, used int64_t
so I thought that might be the better choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch.
I think hts_pos_t
feels like the right type. The internals may change at some point, and it already does support 64-bit (so same thing basically between the two choices) for the CRAM 4 proto-spec. I think I may have copied the int
s though from another API func, which is a bit more worrying. Will check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Corrected to hts_pos_t
and force pushed the new commit. The samtools side will need a minor tweak too no doubt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forgot to add, I was wrong in assuming I copied the API from something else. It looks likely just not thinking it through. The only other ints in the public CRAM API are to do with counts of smaller things like numbers of read-groups or slices within a container, or sizing of blocks/slices/containers.
Data sizes being 32-bit seems OK as realistically unless we've got 1Gbp long reads kicking around then restricting the unit of random access capability to a maximum of 2GB chunks seems reasonable (and it'd need a format change to fix that).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we'll have bigger problems than just this if we start getting 1Gbp reads...
cc43900
to
f3fd29b
Compare
This is to enable samtools cram2ref.
f3fd29b
to
f46597e
Compare
This is to enable samtools cram2ref PR.
Draft as although this works as-is, I'm thinking of adding io_lib's cram_size and cram_dump utilities too which will need further API additions.
If you wish each tool as its own PR to keep the size of this one day, then say so and I can take this out of draft, but if so then it needs merging promptly as I don't want to be making PR branches off another PR branch as that way lies merge hell.