Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Properly treat blank ancestral alleles #963

Merged
merged 1 commit into from
Sep 10, 2024

Conversation

hyanwong
Copy link
Member

@hyanwong hyanwong commented Sep 4, 2024

There was a bug in the previous code: since "" is the fill-value for alleles, if an ancestral allele was "", it wasn't treated as missing.

This PR fixes that bug, but also fixes #962: the string "" "N" is treated as meaning "deliberately set the ancestral allele as unknown". This is OK because "" is not a valid VCZ allele string. Therefore if the only ancestral alleles that don't match are "", then it's confusing to issue a warning, because setting unknown alleles is a deliberate user choice.

Copy link

codecov bot commented Sep 4, 2024

Codecov Report

Attention: Patch coverage is 97.50000% with 1 line in your changes missing coverage. Please review.

Project coverage is 93.15%. Comparing base (1d04fb8) to head (15705f9).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
tsinfer/formats.py 97.43% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #963      +/-   ##
==========================================
+ Coverage   93.13%   93.15%   +0.01%     
==========================================
  Files          18       18              
  Lines        6354     6368      +14     
  Branches     1136     1142       +6     
==========================================
+ Hits         5918     5932      +14     
  Misses        296      296              
  Partials      140      140              
Flag Coverage Δ
C 93.15% <97.50%> (+0.01%) ⬆️
python 95.50% <97.50%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@benjeffery benjeffery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @hyanwong

I've thought about it more and think that unknown should be explicitly N or another character as that is usually the case in FASTA etc. "" would probably only get into the ancestral allele as a mistake right?

@jeromekelleher
Copy link
Member

I don't think we should assign any meaning to the empty string - as @benjeffery "N" is a much better signal for unknown from the FASTA.

@hyanwong
Copy link
Member Author

hyanwong commented Sep 5, 2024

The problem with "N" is surely that it could match "N" in the actual VCF? It would require extra hardcoded logic not to match "N" against existing alleles. It seemed a bit neater to me to pick something that explicitly isn't allowed as an allele in a VCF file.

I guess it's fine hardcoding a specific string like N as long as it's part of the VCF spec and unlikely to change? I'm idly wondering why N rather than e.g. . or n or -, or whatever all the other conventions are? I'm no VCF expert, so happy to defer to people who are.

Note that inference will find a different ancestral allele from any what we set here.

@hyanwong
Copy link
Member Author

hyanwong commented Sep 5, 2024

Just looking at how to do this. What alleles strings do we want to treat as "unknown ancestral state"? I assume "N", "n" and "". The VCF spec says nothing about "." in the REF and ALT fields (it's only in the GT field).

If we want "N" and "n" to be the "deliberately unknown" state, we would skip the warning if these were the only non-matching alleles.

@hyanwong
Copy link
Member Author

hyanwong commented Sep 5, 2024

I changed the behaviour to using (and documenting) N as the standard unknown ancestral allele. I also wrote the docstring from VariantData, so that this behaviour is described. It would be good to merge this, so that I can start adding documentation for the other changes such as contig_id that I've been making.

@hyanwong hyanwong force-pushed the no-warn-blank-aa branch 5 times, most recently from 121b11d to 76a155f Compare September 5, 2024 20:59
@hyanwong
Copy link
Member Author

hyanwong commented Sep 6, 2024

I think this is as you wanted now @jeromekelleher (N or n are the "unknown" ancestral state, with warnings for anything else): it's a pretty simple change. Do you want to quickly review?

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just needs a bit of simplification

except IndexError:
unknown_alleles[allele] += 1
converted[i] = allele_index
if not (allele in {"", "N", "n"}): # All these must represent unknown
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't treat empty string as legal input here as it has a defined meaning in VCZ

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now done. See comment below about changing the name to "ancestral_state", to match what we used previously.

@hyanwong hyanwong force-pushed the no-warn-blank-aa branch 2 times, most recently from d4c8f49 to fc16222 Compare September 9, 2024 21:57
@hyanwong
Copy link
Member Author

hyanwong commented Sep 9, 2024

As discussed with @jeromekelleher: we warn on unknown ancestral states, unless they are all "N" or "n", in which case we merely output info.

Note that after discussion with @benjeffery, this also changes the parameter name to "ancestral_state" (to match tskit, and so that it is clearer that it differs from the index (which is known as the "ancestral allele"). It also fixes #927. I have tested it on a 200Mb 1000 genomes VCF, and the check does not noticeably increase the construction time of the VariantData object (which in this case was ~0.7 secs)

@hyanwong
Copy link
Member Author

hyanwong commented Sep 9, 2024

Also note that I duplicated the sites() constructor so that in the VCF Zarr case, it trims the allele list so that it doesn't include the padding values. This can then be used when checking the ancestral allele states (and is fast)

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@mergify mergify bot merged commit e542f7c into tskit-dev:main Sep 10, 2024
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Do not warn about unknown states if ancestral allele is "N"
3 participants