Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SAM/BAM sorting should follow @SQ order #823

Closed
laserson opened this issue Sep 17, 2015 · 11 comments
Closed

SAM/BAM sorting should follow @SQ order #823

laserson opened this issue Sep 17, 2015 · 11 comments

Comments

@laserson
Copy link
Contributor

Following up on discussion in #760 here:

#760 (comment)

Not clear that we cannot correctly implement the spec.

@ryan-williams
Copy link
Member

tl;dr: possible way to actually fix this issue: add an optional referenceIndex to SequenceRecord and Contig classes, populated only when alignments are read from a SAM file whose @SQs are not lex-sorted; when serializing a SequenceDictionary, sort by referenceIndex if present, otherwise lex-sort.

Many comments flying around various PRs and things atm but this seems like the best place to centralize discussion.

Here are three salient points that I see in this discussion:

  1. The SAM spec seems to explicitly provide for the possibility that @SQs will be deliberately sorted in a non-lexicographical order.
  2. We have no way today to keep track of the ancestral SAM-@SQ-line ordering once we've transformed into parquet format, so we've rallied around enforcing lexicographical ordering. Make transform round-trip emit the same BAM file #794 reached an impasse due to this and we basically gave up.
  3. [ADAM-822][ADAM-823] SAM follows @SQ order #824 is somewhat of a work-around to this loss of information, but it doesn't seem to me like it really fixes things; in particular, it leaves us unable to transform SAM -> ADAM -> SAM and have a non-lex-header-sorting be preserved, afaict (please correct me if I'm wrong).

However, I think there's a way we can actually fix this: add an (optional) referenceIndex to SequenceRecord and avro Contigs that is only populated when reading from a SAM whose @SQs are not lex-sorted (and maybe then only if a flag is passed, if we're worried about space cost).

Then, whenever we build a SequenceDictionary, we can order SequenceRecords by their referenceIndex fields (if present), and lex-sort them otherwise.

Curious to hear thoughts, lmk if this isn't clear.

@fnothaft
Copy link
Member

In full agreement with you @ryan-williams. Amusingly enough, back in the old days we did have referenceIndex as a field. We dropped it because of the difficulty of merging datasets* but I do agree that it would make sense to add those back as that is the most straightforward way to resolve the sort order problems (IMO, I've been having a similar mental conversation). I wouldn't necessarily make them optional; I might just give the option to renumber the indices if you want the full lexicographical sort.

* You had to remap indices if two referenceNames had different referenceIndices across the datasets you were merging, which was not the worst fate in the world, but was occasionally difficult to reason about.

@laserson
Copy link
Contributor Author

@fnothaft comment from #824:

I am -0 on this change, because it breaks #784

I wouldn't say "break" so much as "revert".

@fnothaft comment from #824:

this change only fixes #823 if you have the original SamFileHeader. You'll have this if you started from a SAM file, but you won't if you started from a Parquet file.

IMO that sounds fine. I don't think we should be shackled to SAM/BAM. In my mind, supporting export of BAM is merely good manners. I would be fine with letting go of a perfect BAM -> ADAM -> BAM roundtrip, as that's not what ADAM is really designed for.

@ryan-williams If we want to ensure that transform can read and write BAM in one shot, maybe we can provide a way to supply a header from the CLI (e.g., by pointing at some user-supplied file).

@fnothaft
Copy link
Member

I wouldn't say "break" so much as "revert".

Yeah, I'm was thinking about this more, and I'm a strong -1 on a revert of #784 as I am depending on that for a pipeline I am running with the UCSC folks.

IMO that sounds fine. I don't think we should be shackled to SAM/BAM. In my mind, supporting export of BAM is merely good manners. I would be fine with letting go of a perfect BAM -> ADAM -> BAM roundtrip, as that's not what ADAM is really designed for.

I think that it makes sense to be able to define an expected sequence index order; I'm not against that. My problem with this PR is that reverting #784 leaves you worse off RE: exporting ADAM to SAM/BAM. It only makes the BAM -> ADAM -> BAM single shot case better.

@ryan-williams If we want to ensure that transform can read and write BAM in one shot, maybe we can provide a way to supply a header from the CLI (e.g., by pointing at some user-supplied file).

IMO, this is a significantly worse solution than #784.

@ryan-williams
Copy link
Member

I think we're talking past each other a little, but I'm pretty sure we can have the best of all worlds here:

So you want to write a BAM from an RDD[AlignmentRecord]

Case 1: we have a defined @SQ ordering, via referenceIndexs on all Contigs

This is the easy case! We can preserve the original ordering, or impose a lex-sort if we want!

Choose one of these as default, and provide a flag enabling the other.

My vote: preserve original ordering as default; allow a flag to forcibly sort the @SQs.

Case 2: we don't have a defined @SQ ordering; referenceIndexs are not set on any Contigs.

This case is also easy; we should lex-sort them. #784 survives!

Case 3: some Contigs have a referenceIndex, some don't.

This case feels least important and more like a sub-case of case 2, i.e. the decision tree may just be: "do all Contigs have a referenceIndex or not?"

Many transformations to RDD[AlignmentRecord]s may result in the referenceIndex ordering becoming lost/meaningless, e.g. merging with other RDD[Alignment]s. In these cases I think it makes sense to fall back to lex-sort.

Addendum: to set Contig.referenceIndex when the Contigs are lex-sorted, or not?

This will (hopefully) be the most common case: BAMs come in already lex-sorted.

I think we should just rely on the default lex-sorting (case 2 / #784) and not waste space writing an extra, unnecessary field on every contig; why bother?

@laserson
Copy link
Contributor Author

+1 @ryan-williams. Let people impose the order, and if not, make it deterministic via sort. It would be great to avoid the aggregation required to build the header if you know your data has a well-defined @SQ ordering.

@fnothaft "break" and "revert" are both too strong :) We can definitely support both.

@fnothaft
Copy link
Member

@fnothaft "break" and "revert" are both too strong :) We can definitely support both.

Supporting both is my preference.

@ryan-williams
Copy link
Member

Supporting both is my preference.

I feel like we're all agreed then, modulo the question of which to default to in the presence of an existing non-lex sort order; as I said in my last comment, I vote that we use the existing sort-order when it is present and complete, and support a "sort" flag to proactively lex-sort in this case (should be easy given the existing -sort_reads plumbing in transform.

Here are the things that need to happen, then:

  • add referenceIndex to Contig
  • release bdg-formats version with the above
  • upgrade ADAM to this bdg-formats
  • build support into ADAM for persisting and using Contig.referenceIndex, where possible.

These should all be pretty straightforward steps, and I'm up for getting started on them if this plan sounds good to people. Please holler if you'd like to do work on it instead / in parallel.

One aside is that I don't foresee us wanting the header-injection approach from #824 in this world; lmk if you still feel like it's useful in a world where we embark on the above plan @laserson.

@fnothaft
Copy link
Member

@ryan-williams that SGTM

@ryan-williams
Copy link
Member

I've just sent fnothaft#7 based on #815 which fixes this and #822 (I think; haven't specifically tested the latter).

@fnothaft
Copy link
Member

Thanks @ryan-williams ! I just merged fnothaft#7

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants