-
Notifications
You must be signed in to change notification settings - Fork 245
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use allele info in VariantContext comparisons for stable sorts #1593
Conversation
c6f2029
to
a79b421
Compare
Codecov Report
@@ Coverage Diff @@
## master #1593 +/- ##
===============================================
- Coverage 69.839% 69.775% -0.064%
+ Complexity 9647 9641 -6
===============================================
Files 702 702
Lines 37638 37535 -103
Branches 6113 6080 -33
===============================================
- Hits 26286 26190 -96
+ Misses 8903 8888 -15
- Partials 2449 2457 +8
|
Any interest from the maintainers in this PR? Anything I can do to help move it closer to merge? |
@cmnbroad I hit another instance where this would have helped me. Do you want me to invest the time to implement some unit tests, or is there no interest in pursuing this PR? |
@clintval Stable sounds good to me, but @lbergelson who is out on vacation really needs to weigh in (I'm not actually a maintainer). |
@clintval I'm sorry I haven't been responsive. This is a good idea and having some tests would be great. |
@lbergelson I don't normally write in Java so I hope my last commits are OK! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@clintval looks good to me! Thank you. I have a few minor things but I'll take them over and push my own patch since I've left you sitting for so long.
contigCompare = contigCompare == 0 ? firstVariantContext.getStart() - secondVariantContext.getStart() : contigCompare; | ||
if (contigCompare == 0) { | ||
// Compare variants that have the same genomic span (chr:start-end) lexicographically by all alleles (ref and alts). | ||
for (int i = 0; i < firstVariantContext.getAlleles().size(); i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is silly but's probably better to use getNAlleles() instead of getAlleles().size().
contigCompare = contigCompare == 0 ? firstVariantContext.getStart() - secondVariantContext.getStart() : contigCompare; | ||
if (contigCompare == 0) { | ||
// Compare variants that have the same genomic span (chr:start-end) lexicographically by all alleles (ref and alts). | ||
for (int i = 0; i < firstVariantContext.getAlleles().size(); i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there's any guarantee that alleles are sorted in any order. I think it would be ideal to treat them as logically the same if the have the same SET of alleles. It's going to be pretty rare that we get to this tie breaker so we could probably afford to sort them before we compare.
// Compare variants that have the same genomic span (chr:start-end) lexicographically by all alleles (ref and alts). | ||
for (int i = 0; i < firstVariantContext.getAlleles().size(); i++) { | ||
if (i > secondVariantContext.getAlleles().size()) { return 1; } | ||
contigCompare = firstVariantContext.getAlleles().get(i).compareTo(secondVariantContext.getAlleles().get(i)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of reusing the variable contig compare it might make more sense to name it "alleleCompare"
* doing some refactoring to use the Comparator.comparing API * adding a few more tests
* Responding to my own pr comments on #1593 * doing some refactoring to use the Comparator.comparing API * adding a few more tests
@lbergelson nice work and thank you kindly for the review! |
TLDR: This PR will make coordinate VCF sorting in HTSJDK/Picard similar to bcftools.
I tried to use Picard's SortVcf for a VCF comparison task, but I had difficulty because it's output was non-stable based on the full primary information of a variant context (contig, start, and alleles). I finally chose the
bcftools
implementation because it has a slightly more-stable sorting method. However, I want to ensure future users of Picard don't hit the same issue I did, hence this PR.Before I write unit tests, is this something you are willing to accept?