Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose a bit more of the CRAM API. #1429

Merged
merged 1 commit into from
Jun 8, 2022

Conversation

jkbonfield
Copy link
Contributor

This is to enable samtools cram2ref PR.

Draft as although this works as-is, I'm thinking of adding io_lib's cram_size and cram_dump utilities too which will need further API additions.

If you wish each tool as its own PR to keep the size of this one day, then say so and I can take this out of draft, but if so then it needs merging promptly as I don't want to be making PR branches off another PR branch as that way lies merge hell.

@jkbonfield jkbonfield marked this pull request as ready for review May 27, 2022 11:39
@jkbonfield
Copy link
Contributor Author

I've rebased and undrafted this.

It's a good start, but I think more functionality will probably come a bit later so no point in waiting for now. My plan was to port over io_lib's cram_size and cram_dump tools, but they're in a bit of a mess and need a complete rewrite so it won't be soon. The main goal of this PR was to enable "samtools reference" as a way of extracting the reference directly from the embedded reference (or optionally from sequence + MD tags instead if present).

I also have an alternative tack for embedded references, visible at https://github.com/jkbonfield/htslib/tree/cram_embed_ref2, but it needs a bit more work still. This allows embed_ref=2 option level (perhaps to become the default) which generates the reference on-the-fly. This is better than using "samtools reference" as it's a one-pass solution, although the former still has merit. Unfortunately this other branch still needs some work, specifically when no MD tags are present it just generates it based on whatever sequence it saw first (not even a consensus), which works just fine for compression purposes, but in CRAM 3 there is no way to request we don't auto-generate an (incorrect) MD tag. The format just isn't capable of noting that, so it'd need some side-channel or additional aux tags, which is a bit htslib specific perhaps unless we also add them to the CRAM spec.

@jkbonfield jkbonfield mentioned this pull request May 30, 2022
@daviesrob daviesrob self-assigned this May 31, 2022
htslib/cram.h Outdated
*/
HTSLIB_EXPORT
void cram_slice_hdr_get_coords(cram_block_slice_hdr *h,
int *refid, int *start, int *span);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For future-proofing, would it be better to make start and span type int64_t? Even though they can't get bigger than an int at the moment, it would be useful to not have to change the interface should a future version of CRAM support longer references (and CRAM already uses int64_t internally for these).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are they semantically hts_pos_t?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and I did think about suggesting that. But the CRAM internals, which this exposes, used int64_t so I thought that might be the better choice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

I think hts_pos_t feels like the right type. The internals may change at some point, and it already does support 64-bit (so same thing basically between the two choices) for the CRAM 4 proto-spec. I think I may have copied the ints though from another API func, which is a bit more worrying. Will check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected to hts_pos_t and force pushed the new commit. The samtools side will need a minor tweak too no doubt.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to add, I was wrong in assuming I copied the API from something else. It looks likely just not thinking it through. The only other ints in the public CRAM API are to do with counts of smaller things like numbers of read-groups or slices within a container, or sizing of blocks/slices/containers.

Data sizes being 32-bit seems OK as realistically unless we've got 1Gbp long reads kicking around then restricting the unit of random access capability to a maximum of 2GB chunks seems reasonable (and it'd need a format change to fix that).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll have bigger problems than just this if we start getting 1Gbp reads...

This is to enable samtools cram2ref.
@daviesrob daviesrob merged commit f46597e into samtools:develop Jun 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants