is it possible to collapse SVs within known groups #228
Hello,
What you're describing is possible, but not 'out of the box'. The first idea you could try would be a custom script that extracts these multi-allelic records into their own individual VCFs, runs bcftools to split them into bi-allelics, and runs truvari collapse on each. The second idea would take much more work but would be 'cleaner': use the truvari API's objects/methods, which do the matching of variants. A rough outline of what that custom tool would look like is in the next comment. For your example VCF line, the script writes:
Meaning half of the alleles are over 90% similar to another. The output VCF is:
Which changes the genotypes and drops the collapsed ALT alleles.
As you can see, most of the work for variant comparison isn't actually comparing the variants. Instead, it's tracking the network of matches, figuring out how they should be resolved, and correctly altering the VCF. An example of this difficulty is that this script just keeps the first allele of each collapse set, so the kept allele may not be the best representative of that set. If I were to put this functionality into truvari, we'd need to track all of this information within AND between multi-allelic sites, which becomes a much more difficult problem. So, to keep the tool maintainable, truvari officially supports only bi-allelic VCF entries.
Update: that script didn't work. The example VCF entry has a special variant type, which I call REPL, that doesn't work automatically with Truvari. I've updated the script below so that it works with the example VCF entry, and I went ahead and made it choose the most common representation as well. The log output reports how many alt alleles each record collapses down to. The updated script:
import sys
import logging
from collections import Counter

import numpy as np
import pysam

import truvari


def get_ac(entry):
    """
    Count how many times each allele number appears across all samples' GTs
    """
    ret = Counter()
    for s in entry.samples.values():
        for i in s['GT']:
            ret[i] += 1
    return ret


def expand_entries(entry, n_header):
    """
    Creates 'new' vcf entries for each allele
    """
    ret = []
    for alt in entry.alts:
        n = entry.copy()
        n.translate(n_header)
        n.alts = (alt,)
        # We'll consider all variants to be REPlacement types
        n.info['SVTYPE'] = 'REP'
        # And we'll consider their length to be the absolute span
        # instead of the default len(REF) - len(ALT)
        n.info['SVLEN'] = len(alt)
        ret.append(n)
    return ret


def build_match_matrix(entries, matcher):
    """
    Compare all of the alt alleles to one another.
    """
    n_entries = len(entries)
    match_matrix = np.zeros((n_entries, n_entries), dtype=bool)
    for i in range(n_entries - 1):
        for j in range(i + 1, n_entries):
            result = matcher.build_match(
                entries[i], entries[j], skip_gt=True, short_circuit=True)
            # comparing A->B is the same as B->A
            match_matrix[i, j] = result.state
            match_matrix[j, i] = result.state
    return match_matrix


def find_matching_sets(matrix, acs):
    """
    Creates a lookup of which alleles match.
    Returns the new allele number lookup as a dict and a list of the
    original allele numbers to keep
    """
    n = len(matrix)
    visited = [False] * n
    matched_sets = []

    def dfs(item, current_set):
        """
        Depth first search to find chains of matches
        """
        visited[item] = True
        current_set.append(item + 1)  # alt alleles start at number 1
        for other in range(n):
            if matrix[item][other] and not visited[other]:
                dfs(other, current_set)

    for i in range(n):
        if not visited[i]:
            current_set = []
            dfs(i, current_set)
            matched_sets.append(current_set)
    logging.info("matched_sets %s", matched_sets)
    # Keep the most common allele in each set
    to_keep = [max(idx, key=lambda i: acs[i]) for idx in matched_sets]
    # Create a lookup of old allele number to new allele number
    to_rename = {old_num: new_num + 1
                 for new_num, entry_set in enumerate(matched_sets)
                 for old_num in entry_set}
    # The REF allele (0) keeps its number and missing './.' alleles stay missing
    to_rename[0] = 0
    to_rename[None] = None
    return to_rename, to_keep


def update_entry(entry, to_rename, to_keep):
    """
    Update the original entry by renaming and removing allele numbers.
    Changes are made in-place
    """
    # Update the entry's samples
    for sample in entry.samples:
        # Have to track phasing
        is_phased = entry.samples[sample].phased
        entry.samples[sample]['GT'] = tuple(
            map(to_rename.get, entry.samples[sample]['GT']))
        # And put phasing back in
        entry.samples[sample].phased = is_phased
    # And remove the collapsed alleles, materialized as a concrete tuple
    # (-1 because we're tracking allele numbers, not alt indexes)
    entry.alts = tuple(entry.alts[_ - 1] for _ in to_keep)


if __name__ == '__main__':
    # Input/output VCFs
    vcf = pysam.VariantFile(sys.argv[1])
    # Need this so we can put information into the expanded vcf entries
    n_header = vcf.header.copy()
    n_header.add_line('##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="SV length">')
    n_header.add_line('##INFO=<ID=SVTYPE,Number=.,Type=String,Description="SV Type">')
    out_vcf = pysam.VariantFile("output.vcf", 'w', header=vcf.header)

    truvari.setup_logging()
    matcher = truvari.Matcher()
    matcher.params.pctseq = 0.90
    matcher.params.pctsize = 0.90

    for entry in vcf:
        acs = get_ac(entry)
        entries = expand_entries(entry, n_header)
        match_matrix = build_match_matrix(entries, matcher)
        to_rename, to_keep = find_matching_sets(match_matrix, acs)
        logging.info("%d alt alleles became %d", len(entries), len(to_keep))
        update_entry(entry, to_rename, to_keep)
        out_vcf.write(entry)
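To try it: python the_script.py input.vcf (the script name is up to you). Collapsed records land in output.vcf, and the log shows the matched sets plus how many alt alleles each record kept.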
Thank you so much for your explanation and code! This is super helpful! I will carefully look into the code and truvari's API.
Hi @ACEnglish , I read the API and some of truvari's matching code. I am modifying your script to fit it into my pipeline, but I have some additional questions regarding the SV matching strategy for different SVs. For most INS and DEL, if the lengths of REF and ALT differ a lot, would it be better to still use the default size definition of len(REF) - len(ALT)? And for SVs with similar lengths of REF and ALT alleles (e.g., 90*A -> 100*T vs 110*A -> 100*T at the same start pos), the default definition makes the sizes tiny even though the alleles span ~100 bp, so should SVLEN be set to the absolute span instead?
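To make that concrete, here is my understanding in code (assuming truvari's default sizemin of 50 and size similarity computed as min/max of the two sizes):

# The two size definitions for 90*A -> 100*T vs 110*A -> 100*T
# (assumes truvari's default sizemin of 50 and
#  size similarity = min(size_a, size_b) / max(size_a, size_b))
ref_a, alt_a = 90, 100
ref_b, alt_b = 110, 100

# Default definition: len(ALT) - len(REF) -> +10 and -10.
# abs() of both is 10, far below sizemin=50, so both records get skipped
default_a, default_b = alt_a - ref_a, alt_b - ref_b
print(default_a, default_b)  # 10 -10

# Span definition: len(ALT) -> 100 and 100.
# Passes sizemin and gives a perfect size similarity of 1.0
span_a, span_b = alt_a, alt_b
print(min(span_a, span_b) / max(span_a, span_b))  # 1.0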
Besides, for sequence similarity comparison, which method (unroll or reference context) is more accurate? I only need to match SVs once, so running time is not very important. I read this gist where you explained how unroll works for tandem repeats, but I still don't quite understand how it works for general insertions.
Easy question first:
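In short, unroll lines two insertions up without pulling in reference context: the sequence of one variant is rotated ('unrolled') by the distance between the two variants' start positions, and then the sequences are compared directly. For a tandem repeat, an insertion reported a few bases away is just a rotation of the same motif, and the same rotation compensates for ambiguous anchoring of general insertions. A toy sketch of the idea (difflib stands in for the real edit-distance similarity; this is not Truvari's exact implementation):

# Toy sketch of the "unroll" idea; not Truvari's exact implementation.
# Two reports of the same insertion can be anchored at different positions
# within a repeat; rotating one sequence by the position difference lines
# them up so a direct comparison works without fetching reference context.
from difflib import SequenceMatcher

def unroll(seq, dist):
    """Rotate seq by dist bases (distance between the two variant starts)."""
    dist %= len(seq)
    return seq[dist:] + seq[:dist]

def unroll_similarity(seq_a, pos_a, seq_b, pos_b):
    """Similarity of two insertion sequences after unrolling one of them."""
    rolled = unroll(seq_a, abs(pos_a - pos_b))
    return SequenceMatcher(None, rolled, seq_b).ratio()

# The same ACG-repeat insertion reported 2bp apart scores 1.0:
print(unroll_similarity("CGACGA", 12, "ACGACG", 10))

Reference-context comparison instead pastes each allele into the surrounding reference sequence and compares the resulting haplotypes, which costs a reference fetch but can be more forgiving for messy representations.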
For the type/length considerations, I'm not familiar enough with your outputs to be sure how to advise. In the original script, I assumed the input was multi-allelic such that every variant starting at the same position was one of the ALTs; therefore, the REF would be expanded to span the longest REF length. In your example with the 90/100/110 runs of As and Ts:
But since you've asked the question, I assume this isn't what's already happening, so I'm unsure. Similarly, I assumed these were normalized such that there are no overlapping variants. It sounds like you're dealing with a data structure I don't have experience with, so you'll need to explore different ideas for how to handle the cases. Since you're making your own script, you can identify the possibilities and create branches handling each, e.g., slightly overlapping: 'normalize' before sending to the main collapse; highly different types/sizes: send to a different collapse.
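A rough shape for that kind of triage, with made-up thresholds and bucket names (illustrative only, not Truvari defaults):

# Rough triage of split records before collapsing.
# Thresholds and bucket names are illustrative, not Truvari defaults.
def route(ref, alt, sv_size_min=50):
    """Pick a handling strategy for one bi-allelic REF/ALT pair."""
    if min(len(ref), len(alt)) > 1:
        # REF and ALT both span sequence (REPL-like, possibly overlapping
        # a neighbor): 'normalize' (trim shared bases) before collapsing
        return "normalize_first"
    if max(len(ref), len(alt)) >= sv_size_min:
        # plain INS/DEL of SV size: send to the main collapse
        return "main_collapse"
    # nested SNPs / small indels: handle with a different strategy
    return "small_variant"

print(route("A", "A" + "ACG" * 20))  # main_collapse
print(route("A" * 90, "T" * 100))    # normalize_first
print(route("A", "AT"))              # small_variant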
Thank you so much for your explanation of the unroll method. For SV matching, I am sorry that I didn't describe it clearly in the first post. I showed that example to illustrate that, in a pangenome VCF, there are many multi-allelic records with similar alleles. Therefore, I want to collapse the SVs within the same multi-allelic record. But many of the alleles differ only by nested SNPs or small INDELs, so we usually first decompose the large multi-allelic records and split them into many bi-allelic records. Actually, that example was not good, because after decomposition and normalization there are no SVs needing collapse. Multi-allelic VCF:
Decomposed and normalized VCF:
For a better example, you can refer to this input test.vcf.gz.txt and collapse output test_collapse.vcf.gz.txt (I removed the genotypes to make it smaller). Using default parameters, truvari works well for most SVs:
However, it cannot handle some complex SVs. For example, there are 36 SVs whose REF/ALT are > 50 bp but whose SVLEN is < 50 bp. Those SVs are skipped by truvari due to the small SVLEN. For such SVs, do you think it would be better to let SVLEN reflect the full REF/ALT span?
If you've split multi-allelics, then your SVLEN should reflect the length of the single, bi-allelic variant. Perhaps splitting caused some problems, and therefore you should re-normalize the split records (e.g., with bcftools norm) before collapsing.
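For reference, a minimal way to run that re-normalization from Python (paths and filenames are placeholders; requires bcftools on $PATH):

# Split any remaining multi-allelics and left-align/trim against the
# reference so REF/ALT (and thus SVLEN) are minimal representations.
import subprocess

subprocess.run(
    ["bcftools", "norm",
     "-m", "-any",          # split multi-allelic sites into bi-allelics
     "-f", "reference.fa",  # needed for left-alignment and trimming
     "-O", "z",
     "-o", "split.norm.vcf.gz",
     "input.vcf.gz"],
    check=True)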
Hi,
I would like to use Truvari to merge SVs deconstructed from a pangenome graph. The VCF usually has many records with a lot of similar alleles (e.g., only 1 bp difference). An example from the HPRC VCF is given below:
Because all the records in the VCF are non-overlapping, I expect that most of the redundant SVs are in the same multi-allelic record. Therefore, I would like to collapse alleles within a multi-allelic record.
I understand that Truvari doesn't process all alleles in a multi-allelic record. But if I assign a unique group ID to each multi-allelic record (e.g., INFO/GROUP) and split it into bi-allelic records, would it be possible for Truvari to only compare SVs within the same group and then collapse them as usual? The splitting step I have in mind is sketched below.
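Roughly like this (a sketch; GROUP is an INFO tag I would add myself, and genotype re-mapping to 0/1 is omitted for brevity):

# Sketch: split multi-allelics into bi-allelic records tagged with a
# per-record GROUP id (hypothetical INFO field added here, not standard).
import sys
import pysam

vcf = pysam.VariantFile(sys.argv[1])
header = vcf.header.copy()
header.add_line('##INFO=<ID=GROUP,Number=1,Type=Integer,Description="Source multi-allelic record">')
out = pysam.VariantFile("split.vcf", "w", header=header)

for group_id, record in enumerate(vcf):
    for alt in record.alts:
        new = record.copy()
        new.translate(header)
        new.alts = (alt,)
        new.info["GROUP"] = group_id
        # NOTE: sample genotypes would still need re-mapping to 0/1 here
        out.write(new)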
Thanks a lot!