Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use allele info in VariantContext comparisons for stable sorts #1593

Merged
merged 3 commits into from
Oct 28, 2022

Conversation

clintval
Copy link
Contributor

@clintval clintval commented Feb 14, 2022

TLDR: This PR will make coordinate VCF sorting in HTSJDK/Picard similar to bcftools.

I tried to use Picard's SortVcf for a VCF comparison task, but I had difficulty because it's output was non-stable based on the full primary information of a variant context (contig, start, and alleles). I finally chose the bcftools implementation because it has a slightly more-stable sorting method. However, I want to ensure future users of Picard don't hit the same issue I did, hence this PR.

Before I write unit tests, is this something you are willing to accept?

@clintval clintval changed the title Use END and allele info in VariantContext comparisons for stable sorts Use allele info in VariantContext comparisons for stable sorts Feb 14, 2022
@codecov-commenter
Copy link

codecov-commenter commented Feb 14, 2022

Codecov Report

Merging #1593 (2272c16) into master (b5af659) will decrease coverage by 0.064%.
The diff coverage is 0.000%.

❗ Current head 2272c16 differs from pull request most recent head 5089e22. Consider uploading reports for the commit 5089e22 to get more accurate results

@@               Coverage Diff               @@
##              master     #1593       +/-   ##
===============================================
- Coverage     69.839%   69.775%   -0.064%     
+ Complexity      9647      9641        -6     
===============================================
  Files            702       702               
  Lines          37638     37535      -103     
  Branches        6113      6080       -33     
===============================================
- Hits           26286     26190       -96     
+ Misses          8903      8888       -15     
- Partials        2449      2457        +8     
Impacted Files Coverage Δ
...riant/variantcontext/VariantContextComparator.java 0.000% <0.000%> (ø)
...htsjdk/samtools/util/nio/DeleteOnExitPathHook.java 80.952% <0.000%> (-9.524%) ⬇️
...sjdk/samtools/util/htsget/HtsgetErrorResponse.java 73.333% <0.000%> (-6.667%) ⬇️
...tools/cram/encoding/core/GammaIntegerEncoding.java 86.667% <0.000%> (-6.667%) ⬇️
.../samtools/cram/compression/ExternalCompressor.java 69.565% <0.000%> (-4.509%) ⬇️
...mtools/cram/encoding/core/BetaIntegerEncoding.java 91.304% <0.000%> (-4.348%) ⬇️
...am/encoding/core/CanonicalHuffmanByteEncoding.java 52.941% <0.000%> (-2.941%) ⬇️
...ing/core/huffmanUtils/HuffmanParamsCalculator.java 80.000% <0.000%> (-1.818%) ⬇️
...va/htsjdk/samtools/util/htsget/HtsgetResponse.java 89.394% <0.000%> (-1.783%) ⬇️
...java/htsjdk/beta/codecs/variants/vcf/VCFCodec.java 81.818% <0.000%> (-1.515%) ⬇️
... and 39 more

@clintval
Copy link
Contributor Author

Any interest from the maintainers in this PR?

Anything I can do to help move it closer to merge?

@clintval
Copy link
Contributor Author

@cmnbroad I hit another instance where this would have helped me.

Do you want me to invest the time to implement some unit tests, or is there no interest in pursuing this PR?

@cmnbroad
Copy link
Collaborator

cmnbroad commented Sep 15, 2022

@clintval Stable sounds good to me, but @lbergelson who is out on vacation really needs to weigh in (I'm not actually a maintainer).

@lbergelson
Copy link
Member

@clintval I'm sorry I haven't been responsive. This is a good idea and having some tests would be great.

@clintval
Copy link
Contributor Author

@lbergelson I don't normally write in Java so I hope my last commits are OK!

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clintval looks good to me! Thank you. I have a few minor things but I'll take them over and push my own patch since I've left you sitting for so long.

contigCompare = contigCompare == 0 ? firstVariantContext.getStart() - secondVariantContext.getStart() : contigCompare;
if (contigCompare == 0) {
// Compare variants that have the same genomic span (chr:start-end) lexicographically by all alleles (ref and alts).
for (int i = 0; i < firstVariantContext.getAlleles().size(); i++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is silly but's probably better to use getNAlleles() instead of getAlleles().size().

contigCompare = contigCompare == 0 ? firstVariantContext.getStart() - secondVariantContext.getStart() : contigCompare;
if (contigCompare == 0) {
// Compare variants that have the same genomic span (chr:start-end) lexicographically by all alleles (ref and alts).
for (int i = 0; i < firstVariantContext.getAlleles().size(); i++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's any guarantee that alleles are sorted in any order. I think it would be ideal to treat them as logically the same if the have the same SET of alleles. It's going to be pretty rare that we get to this tie breaker so we could probably afford to sort them before we compare.

// Compare variants that have the same genomic span (chr:start-end) lexicographically by all alleles (ref and alts).
for (int i = 0; i < firstVariantContext.getAlleles().size(); i++) {
if (i > secondVariantContext.getAlleles().size()) { return 1; }
contigCompare = firstVariantContext.getAlleles().get(i).compareTo(secondVariantContext.getAlleles().get(i));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of reusing the variable contig compare it might make more sense to name it "alleleCompare"

@lbergelson lbergelson merged commit 14a8cc0 into samtools:master Oct 28, 2022
lbergelson added a commit that referenced this pull request Oct 28, 2022
* doing some refactoring to use the Comparator.comparing API
* adding a few more tests
lbergelson added a commit that referenced this pull request Oct 28, 2022
* Responding to my own pr comments on #1593
* doing some refactoring to use the Comparator.comparing API
* adding a few more tests
@clintval
Copy link
Contributor Author

@lbergelson nice work and thank you kindly for the review!

@clintval clintval deleted the cv_stable_vcf_sort branch October 31, 2022 01:44
lbergelson added a commit that referenced this pull request Nov 23, 2022
#1593)"

This reverts commit 14a8cc0.

This introduced a serious issue that causes existing valid VCFs to fail as unsorted.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants