
Is it possible to collapse SVs within known groups? #228

Closed
Han-Cao opened this issue Aug 22, 2024 · 7 comments

Han-Cao commented Aug 22, 2024

Hi,

I would like to use Truvari to merge SVs deconstructed from a pangenome graph. The VCF usually has many records with a lot of similar alleles (e.g., differing by only 1 bp). An example from the HPRC VCF is given below:

chr1	591437	>26051>26006	TAGAAGGAATAAGACGGGCCGGGTGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGTGGGCGGATCACGAGGTCAGAAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCTGGGCATGGTGGTGGGCGCCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCCCGCCACTGCACTCCAGCCTGAGCGACAGAGTGAGACTCTGTCTCAAAAAAAAAAAAAAAAA	T,TAGAAGGAATAAGACCGGCCGGGTGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGTGGGCGGATCACGAGGTCAGAAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCTGGGCATGGTGGTGGGCGCCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCCCGCCACTGCACTCCAGCCTGAGCGACAGAGTGAGACTCTGTCTCAAAAAAAAAAAAAAAAAAA,TAGAAGGAATAAGACGGGCCGGGCGCGGTGGCTCACGCCTGGAATCCCAGCACTTTGGGAGGCCGAGGTGGGCGGATCACGAGGTCAGAAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCTGGGCATGGTGGTGGGCGCCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCCCGCCACTGCACTCCAGCCTGAGCGACAGAGTGAGACTCTGTCTCAAAAAAAAAAAAAAAAA,TAGAAGGAATAAGACGGGCCGGGTGCAGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGTGGGCGGATCACGAGGTCAGAAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCCGGGCATGGTGGTGGGCGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCGCGCCACCGCACTCCAGCCTGGGCGACAGAGTGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,TAGAAGGAATAAGACGGGCCGGGTGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGATCACGAGGTCAGAAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCTGGGCATGGTGGTGGGCGCCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCCCGCCACTGCACTCCAGCCTGAGCGACAGAGTGAGACTCTGTCTCAAAAAAAAAAAAAAAAAAA,TAGAAGGAATAAGACGGGCCGGGTGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGTGGGCGGATCACGAGGTCAGAAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCTGGGCATGGTGGTGGGCGCCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCAGGAGGCGGAGCTTGCAGTGAGCCGAGATCCCGCCACTGCACTCCAGCCTGAGCGACAGAGTGAGACTCTGTCTCAAAAAAAAAAAAAAAAA,TAGAAGGAATAAGACGGGCCGGGTGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGTGGGCGGATCACGAGGTCAGAAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCTGGGCATGGTGGTGGGCGCCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCCCGCCACTGCACTCCAGCCTGAGCGACAGAGTGAGACTCTGTCTCAAAAAAAAAAAAAAAAAAA,TAGAAGGAATAAGACGGGCCGGGTGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGTGGGCGGATCACGAGGTCAGAAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCTGGGCATGGTGGTGGGCGCCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCCCGCCACTGCACTCCAGCCTGAGCGACAGAGTGAGACTCTGTCTCAAAAAAAAAAAAAAAAAA	60	.	
AC=36,4,1,1,1,1,25,9;AF=0.461538,0.0512821,0.0128205,0.0128205,0.0128205,0.0128205,0.320513,0.115385;AN=78;AT=>26051>26050>26048>26047>26045>26044>26042>26041>26039>26038>26036>26035>26033>26032>26030>26029>26027>26026>26024>26023>26021>26020>26018>26017>26015>26014>26012>26011>26007>26006,>26051>26006,>26051>26050<26049>26047>26045>26044>26042>26041>26039>26038>26036>26035>26033>26032>26030>26029>26027>26026>26024>26023>26021>26020>26018>26017>26015>26014>26012>26011<26009<26008>26007>26006,>26051>26050>26048>26047>26046>26044>26042>26041>26040>26038>26036>26035>26033>26032>26030>26029>26028>26026>26024>26023>26021>26020>26018>26017>26015>26014>26012>26011>26007>26006,>26051>26050>26048>26047>26045>26044>26043>26041>26039>26038>26036>26035>26034>26032>26031>26029>26027>26026>26024>26023>26022>26020>26019>26017>26016>26014>26013>26011>26010<26009<26008>26007>26006,>26051>26050>26048>26047>26045>26044>26042>26041>26039>26038>26037>26035>26033>26032>26030>26029>26027>26026>26024>26023>26021>26020>26018>26017>26015>26014>26012>26011<26009<26008>26007>26006,>26051>26050>26048>26047>26045>26044>26042>26041>26039>26038>26036>26035>26033>26032>26030>26029>26027>26026>26025>26023>26021>26020>26018>26017>26015>26014>26012>26011>26007>26006,>26051>26050>26048>26047>26045>26044>26042>26041>26039>26038>26036>26035>26033>26032>26030>26029>26027>26026>26024>26023>26021>26020>26018>26017>26015>26014>26012>26011<26009<26008>26007>26006,>26051>26050>26048>26047>26045>26044>26042>26041>26039>26038>26036>26035>26033>26032>26030>26029>26027>26026>26024>26023>26021>26020>26018>26017>26015>26014>26012>26011<26008>26007>26006;NS=43;LV=0	GT	1	.|.	8|8	1|7	7|7	2|7	2|2	7|7	7|7	.|1	.|1	7|.	3|1	7|.	7|1	1|2	1|1	7|7	7|7	7|7	1|7	5|8	1|7	1|8	8|7	.|7	8|8	1|.	1|1	7|1	1|1	1|1	7|7	4|1	1|1	.|8	1|8	1|1	.|.	1|1	1|1	1|1	1|1	1|7	6|1

Because all the records in the VCF are non-overlapping, I expect that most of the redundant SVs are in the same multi-allelic record. Therefore, I would like to collapse alleles within a multi-allelic record.

I understand that Truvari doesn't process all alleles in a multi-allelic record. But if I assign a unique group ID to each multi-allelic record (e.g., INFO/GROUP) and split it into bi-allelic records, would it be possible for Truvari to only compare SVs within the same group and then collapse them as usual?

Thanks a lot!


ACEnglish commented Aug 24, 2024

Hello,

What you're describing is possible, but not 'out of the box'.

The first idea you could try would be to create a custom script that will extract these multi-allelic records into their own individual VCFs, run bcftools to split them into bi-allelics, and run truvari collapse on each individual VCF before recombining the output.
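For illustration only, here is a rough sketch of that first idea as a Python helper, assuming a bgzipped and indexed input VCF and that bcftools, tabix, and truvari are on PATH. The filenames and region handling are made up, exact flags may differ between tool versions, and recombining the per-record outputs is left out:

import subprocess

def collapse_group(region, in_vcf="pangenome.vcf.gz"):
    """Split one multi-allelic record into bi-allelics and collapse them."""
    prefix = region.replace(":", "_").replace("-", "_")
    split_vcf = f"{prefix}.split.vcf.gz"
    out_vcf = f"{prefix}.collapsed.vcf"
    # Pull out the record and split the multi-allelic site into bi-allelic rows
    subprocess.run(f"bcftools view -r {region} {in_vcf} "
                   f"| bcftools norm -m-any -Oz -o {split_vcf}",
                   shell=True, check=True)
    subprocess.run(["tabix", "-p", "vcf", split_vcf], check=True)
    # Collapse the bi-allelic records; matching parameters left at truvari defaults
    subprocess.run(["truvari", "collapse", "-i", split_vcf, "-o", out_vcf],
                   check=True)
    return out_vcf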

The second idea you could try would take much more work, but would be 'cleaner': use the truvari API's objects/methods which do the matching of variants. A rough outline of what that custom tool would look like:

Code Removed To Next Comment

For your example VCF line, the script writes:

[[None, False, False, False, False, False, False, False],
 [False, None, False, False, True, False, True, False],
 [False, False, None, False, False, True, False, False],
 [False, False, False, None, False, False, False, False],
 [False, True, False, False, None, False, True, False],
 [False, False, True, False, False, None, False, False],
 [False, True, False, False, True, False, None, False],
 [False, False, False, False, False, False, False, None]]
matching_sets [{1}, {2, 5, 7}, {3, 6}, {4}, {8}]

Meaning half of the alleles are over 90% similar to another allele. The output VCF is:

chr1	591437	>26051>26006	TAGAAGGAATAAGACGGGCCGGGTGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGTGGGCGGATCACGAGGTCAGAAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCTGGGCATGGTGGTGGGCGCCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCCCGCCACTGCACTCCAGCCTGAGCGACAGAGTGAGACTCTGTCTCAAAAAAAAAAAAAAAAA	T,TAGAAGGAATAAGACCGGCCGGGTGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGTGGGCGGATCACGAGGTCAGAAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCTGGGCATGGTGGTGGGCGCCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCCCGCCACTGCACTCCAGCCTGAGCGACAGAGTGAGACTCTGTCTCAAAAAAAAAAAAAAAAAAA,TAGAAGGAATAAGACGGGCCGGGCGCGGTGGCTCACGCCTGGAATCCCAGCACTTTGGGAGGCCGAGGTGGGCGGATCACGAGGTCAGAAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCTGGGCATGGTGGTGGGCGCCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCCCGCCACTGCACTCCAGCCTGAGCGACAGAGTGAGACTCTGTCTCAAAAAAAAAAAAAAAAA,TAGAAGGAATAAGACGGGCCGGGTGCAGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGTGGGCGGATCACGAGGTCAGAAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCCGGGCATGGTGGTGGGCGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCGCGCCACCGCACTCCAGCCTGGGCGACAGAGTGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,TAGAAGGAATAAGACGGGCCGGGTGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGATCACGAGGTCAGAAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCTGGGCATGGTGGTGGGCGCCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCCCGCCACTGCACTCCAGCCTGAGCGACAGAGTGAGACTCTGTCTCAAAAAAAAAAAAAAAAAAA	60	.	.	GT	1	.	5|5	1|2	2|2	2|2	2|2	2|2	2|2	1	1	2	3|1	2	2|1	1|2	1|1	2|2	2|2	2|2	1|2	2|5	1|2	1|5	5|2	2	5|5	1	1|1	2|1	1|1	1|1	2|2	4|1	1|1	5	1|5	1|1	.	1|1	1|1	1|1	1|1	1|2	3|1

Which changes the INFO/AF:

Original: 0.461538,0.0512821,0.0128205,0.0128205,0.0128205,0.0128205,0.320513,0.115385
Collapsed: 0.461538,0.384615,0.025641,0.0128205,0.115385

As you can see, most of the work for variant comparison isn't actually comparing the variants. Instead, it's tracking the network of matches, figuring out how they should be resolved, and correctly altering the VCF.

An example of this difficulty is that this script just keeps the first allele of each collapse set. So for the {2, 5, 7} set, it keeps the 2nd alt allele. However, a better behavior would be to keep the more common allele, i.e. 7 (AF=0.32), since that one is a more confident representation of the allele. Another example is that this will only work on VCFs without INFO/SVLEN or INFO/SVTYPE, because those fields can also correspond to allele indices, which aren't tracked inside the Matcher. So there would have to be more work done by expand_entries to correct that.

If I were to put this functionality into truvari, we'd need to track all of this information within AND between multi-allelic sites, which becomes a much more difficult problem. So truvari only supports bi-allelic vcf entries officially to keep the tool maintainable.


ACEnglish commented Aug 24, 2024

Update: That script didn't work. The example VCF entry has a special variant type, which I call REPL, that doesn't work automatically with Truvari. I've updated the script below so that it works with the example VCF entry, and I went ahead and made it choose the most common representation as well.

Log output:

2024-08-24 16:53:20,808 [INFO] matched_sets [[1], [2, 3, 4, 5, 6, 7, 8]]
2024-08-24 16:53:20,808 [INFO] 8 alt alleles became 2
The updated script:

import sys
import logging
import itertools
import pysam
import truvari
import numpy as np
from collections import Counter


def get_ac(entry):
    """
    Allele count
    """
    ret = Counter()
    for s in entry.samples.values():
        for i in s['GT']:
            ret[i] += 1
    return ret


def expand_entries(entry, n_header):
    """
    Creates 'new' vcf entries for each allele
    """
    ret = []
    for pos, alt in enumerate(entry.alts):
        n = entry.copy()
        n.translate(n_header)
        n.alts = (alt,)
        # We'll consider all variants to be REPlacement types 
        n.info['SVTYPE'] = 'REP'
        # And we'll consider their length to be the absolute span
        # instead of the default len(REF) - len(ALT)
        n.info['SVLEN'] = len(alt)
        ret.append(n)
    return ret


def build_match_matrix(entries, matcher):
    """
    Compare all of the alt alleles to one-another.
    """
    n_entries = len(entries)
    match_matrix = np.zeros((n_entries, n_entries), dtype=bool)
    for i in range(n_entries - 1):
        for j in range(i + 1, n_entries):
            result = matcher.build_match(
                entries[i], entries[j], skip_gt=True, short_circuit=True)
            # comparing A->B is the same as B->A
            match_matrix[i, j] = result.state
            match_matrix[j, i] = result.state
    return match_matrix


def find_matching_sets(matrix, acs):
    """
    Creates a lookup of which alleles match
    returns the new allele numbers lookup as dict and a list of original
    allele numbers to keep
    """
    n = len(matrix)
    visited = [False] * n
    matched_sets = []

    def dfs(item, current_set):
        """
        Depth first search to find chain of matches
        """
        visited[item] = True
        current_set.append(item + 1)  # alt alleles start at number 1
        for other in range(n):
            if matrix[item][other] and not visited[other]:
                dfs(other, current_set)

    for i in range(n):
        if not visited[i]:
            current_set = []
            dfs(i, current_set)
            matched_sets.append(current_set)
    logging.info("matched_sets %s", matched_sets)
    # Keep the most common allele in each set
    to_keep = [max(idx, key=lambda i: acs[i]) for idx in matched_sets]
    # Create a lookup of old allele number to the new allele number
    to_rename = {old_num: new_num + 1
                 for new_num, entry_set in enumerate(matched_sets)
                 for old_num in entry_set}
    # Reference (0) and missing './.' alleles map to themselves
    to_rename[0] = 0
    to_rename[None] = None

    return to_rename, to_keep


def update_entry(entry, to_rename, to_keep):
    """
    Update the original entry by renaming and removing allele numbers
    changes made in-place
    """
    # Update the entry's samples
    for sample in entry.samples:
        # Have to track phasing
        is_phased = entry.samples[sample].phased
        entry.samples[sample]['GT'] = tuple(
            map(to_rename.get, entry.samples[sample]['GT']))
        # And put phasing back in
        entry.samples[sample].phased = is_phased

    # And remove the collapsed alleles
    # -1 because we're tracking allele number, not alt index
    entry.alts = tuple(entry.alts[i - 1] for i in to_keep)


if __name__ == '__main__':
    # Input/output VCFs
    vcf = pysam.VariantFile(sys.argv[1])
    # Need this so we can put information into the expanded vcf entries
    n_header = vcf.header.copy()
    n_header.add_line('##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="SV length">')
    n_header.add_line('##INFO=<ID=SVTYPE,Number=.,Type=String,Description="SV Type">')

    out_vcf = pysam.VariantFile("output.vcf", 'w', header=vcf.header)

    truvari.setup_logging()
    matcher = truvari.Matcher()
    matcher.params.pctseq = 0.90
    matcher.params.pctsize = 0.90

    for entry in vcf:
        acs = get_ac(entry)
        entries = expand_entries(entry, n_header)
        match_matrix = build_match_matrix(entries, matcher)
        to_rename, to_keep = find_matching_sets(match_matrix, acs)
        logging.info("%d alt alleles became %d", len(entries),  len(to_keep))
        update_entry(entry, to_rename, to_keep)
        out_vcf.write(entry)
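As written, the script takes the input VCF as its only argument and writes the collapsed result to output.vcf in the working directory, e.g. python collapse_alts.py input.vcf.gz (the script name here is just a placeholder).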


Han-Cao commented Aug 26, 2024

Thank you so much for your explanation and code! This is super helpful! I will carefully look into the code and truvari's API.


Han-Cao commented Sep 3, 2024

Hi @ACEnglish ,

I read the API and some of truvari's matching code. I am modifying your script to fit into my pipeline, but I have some additional questions regarding the SV matching strategy for different kinds of SVs.

For most INS and DEL, where the lengths of REF and ALT differ a lot, would it be better to still use the default size definition, SVLEN = len(alt) - len(ref)?

For SVs with similar lengths of REF and ALT alleles (e.g., 90*A -> 100*T vs 110*A -> 100*T at the same start pos), if SVLEN = len(ALT), both the size and sequence similarity are 100%. I am wondering if it is better to set SVLEN = len(REF) and SVTYPE = "REP", then use --pctovl/--pctsize to limit REF similarity and --pctseq to limit ALT similarity?

variant.alleles = (90 * 'A', 100*'T')
variant.info['SVTYPE'] = 'REP'
variant2 = variant.copy()
variant2.alleles =  (110 * 'A', 100*'T')

variant.info['SVLEN'] = len(variant.alts[0])
variant2.info['SVLEN'] = len(variant2.alts[0])
ret1 = matcher.build_match(variant, variant2, skip_gt=True)

variant.info['SVLEN'] = len(variant.ref)
variant2.info['SVLEN'] = len(variant2.ref)
ret2 = matcher.build_match(variant, variant2, skip_gt=True)

# output
ret1: <truvari.MatchResult (True 93.939)>
ret2: <truvari.MatchResult (False 87.879)>

Besides, for sequence similarity comparison, which method (unroll or reference context) is more accurate? I only need to match SVs once, so running time is not very important. I read this gist where you explained how unroll works for tandem repeats, but I still don't quite understand how it works for general insertions.

ACEnglish commented

Easy question first:
Use unroll. It is more accurate and faster. While TRs account for ~70% of SVs, unroll can be useful outside of TRs, for example when a shift in boundaries happens due to direct repeats at the breakends.

DEL1    NNACAC-------NN
DEL2    NN-------ACACNN
REF     NNACACTTTACACNN

Both unroll to ACACTTT or TTTACAC depending on rolling up/down
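To make that concrete, here is a toy sketch of the unroll idea (not Truvari's actual code): roll one variant's sequence by the offset between the two breakpoints, modulo the SV length, and then compare the sequences.

def unroll(seq, shift):
    """Rotate seq left by shift bases (taken modulo the sequence length)."""
    shift %= len(seq)
    return seq[shift:] + seq[:shift]

# DEL1 removes TTTACAC starting at reference position 7 (1-based)
# DEL2 removes ACACTTT starting at reference position 3
del1_pos, del1_seq = 7, "TTTACAC"
del2_pos, del2_seq = 3, "ACACTTT"

# Roll DEL2's sequence by the distance between the two breakpoints
rolled = unroll(del2_seq, del1_pos - del2_pos)
print(rolled)              # TTTACAC
print(rolled == del1_seq)  # True; Truvari would score sequence similarity
                           # here rather than test exact equality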

For the type/length considerations, I'm not familiar enough with your outputs to be sure how to advise. In the original script, I assumed that the record was multi-allelic such that every variant which starts at the same position was one of the ALTs. Therefore, the REF would be expanded to span the longest REF length. In your example with 90, 100, and 110 As and Ts:

ref_seq = 110 * 'A'
alt_1 = (100 * 'T') + (20 * 'A')   # or 20A:100T in some cases?
alt_2 = 100 * 'T'

But since you've asked the question, I assume this isn't what's already happening, so I'm unsure. Similarly, I assumed that these were normalized such that there are no overlapping variants, so pctovl wouldn't be of any use.

So it sounds like you're dealing with a data structure I don't have experience with, and you'll need to explore different ideas for how to handle the cases. Since you're making your own script, you can potentially identify the possibilities and create branches handling each, e.g. slightly overlapping variants get 'normalized' before being sent to the main collapse, while variants with highly different types/sizes are sent to a different collapse.
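As a purely hypothetical illustration of that branching (the thresholds and category names below are made up, not anything Truvari defines), a pre-collapse router might look like:

def route_allele(ref, alt, min_sv=50):
    """Decide which collapse strategy a REF/ALT pair should go to (hypothetical)."""
    ref_len, alt_len = len(ref), len(alt)
    if min(ref_len, alt_len) >= min_sv and abs(ref_len - alt_len) < min_sv:
        return "replacement"   # large REF and ALT of similar span -> REP-style matching
    if max(ref_len, alt_len) >= min_sv:
        return "standard"      # plain INS/DEL -> default SVLEN/SVTYPE handling
    return "small"             # below SV size, e.g. leave for exact matching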


Han-Cao commented Sep 4, 2024

Thank you so much for your explanation of the unroll method.

For SV matching, I am sorry that I didn't describe it clearly in the first post. I showed that example to illustrate that, in a pangenome VCF, there are many multi-allelic records with similar alleles, which is why I want to collapse the SVs within the same multi-allelic record. But many of the alleles differ only by nested SNPs or small INDELs, so we usually first decompose the large multi-allelic records and split them into many bi-allelic records. Actually, that example was not a good one, because after decomposition and normalization there are no SVs that need collapsing.

Multi-allelic VCF:

chr1	591437	>26051>26006	TAGAAGGAATAAGACGGGCCGGGTGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGTGGGCGGATCACGAGGTCAGAAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCTGGGCATGGTGGTGGGCGCCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCCCGCCACTGCACTCCAGCCTGAGCGACAGAGTGAGACTCTGTCTCAAAAAAAAAAAAAAAAA	T,TAGAAGGAATAAGACCGGCCGGGTGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGTGGGCGGATCACGAGGTCAGAAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCTGGGCATGGTGGTGGGCGCCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCCCGCCACTGCACTCCAGCCTGAGCGACAGAGTGAGACTCTGTCTCAAAAAAAAAAAAAAAAAAA,TAGAAGGAATAAGACGGGCCGGGCGCGGTGGCTCACGCCTGGAATCCCAGCACTTTGGGAGGCCGAGGTGGGCGGATCACGAGGTCAGAAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCTGGGCATGGTGGTGGGCGCCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATGGTGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCCCGCCACTGCACTCCAGCCTGAGCGACAGAGTGAGACTCTGTCTCAAAAAAAAAAAAAAAAA,TAGAAGGAATAAGACGGGCCGGGTGCAGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGTGGGCGGATCACGAGGTCAGAAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCCGGGCATGGTGGTGGGCGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCGCGCCACCGCACTCCAGCCTGGGCGACAGAGTGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,TAGAAGGAATAAGACGGGCCGGGTGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGATCACGAGGTCAGAAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCTGGGCATGGTGGTGGGCGCCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCCCGCCACTGCACTCCAGCCTGAGCGACAGAGTGAGACTCTGTCTCAAAAAAAAAAAAAAAAAAA,TAGAAGGAATAAGACGGGCCGGGTGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGTGGGCGGATCACGAGGTCAGAAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCTGGGCATGGTGGTGGGCGCCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCAGGAGGCGGAGCTTGCAGTGAGCCGAGATCCCGCCACTGCACTCCAGCCTGAGCGACAGAGTGAGACTCTGTCTCAAAAAAAAAAAAAAAAA,TAGAAGGAATAAGACGGGCCGGGTGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGTGGGCGGATCACGAGGTCAGAAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCTGGGCATGGTGGTGGGCGCCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCCCGCCACTGCACTCCAGCCTGAGCGACAGAGTGAGACTCTGTCTCAAAAAAAAAAAAAAAAAAA,TAGAAGGAATAAGACGGGCCGGGTGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGTGGGCGGATCACGAGGTCAGAAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCTGGGCATGGTGGTGGGCGCCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCCCGCCACTGCACTCCAGCCTGAGCGACAGAGTGAGACTCTGTCTCAAAAAAAAAAAAAAAAAA	60	.	
AC=36,4,1,1,1,1,25,9;AF=0.461538,0.0512821,0.0128205,0.0128205,0.0128205,0.0128205,0.320513,0.115385;AN=78;AT=>26051>26050>26048>26047>26045>26044>26042>26041>26039>26038>26036>26035>26033>26032>26030>26029>26027>26026>26024>26023>26021>26020>26018>26017>26015>26014>26012>26011>26007>26006,>26051>26006,>26051>26050<26049>26047>26045>26044>26042>26041>26039>26038>26036>26035>26033>26032>26030>26029>26027>26026>26024>26023>26021>26020>26018>26017>26015>26014>26012>26011<26009<26008>26007>26006,>26051>26050>26048>26047>26046>26044>26042>26041>26040>26038>26036>26035>26033>26032>26030>26029>26028>26026>26024>26023>26021>26020>26018>26017>26015>26014>26012>26011>26007>26006,>26051>26050>26048>26047>26045>26044>26043>26041>26039>26038>26036>26035>26034>26032>26031>26029>26027>26026>26024>26023>26022>26020>26019>26017>26016>26014>26013>26011>26010<26009<26008>26007>26006,>26051>26050>26048>26047>26045>26044>26042>26041>26039>26038>26037>26035>26033>26032>26030>26029>26027>26026>26024>26023>26021>26020>26018>26017>26015>26014>26012>26011<26009<26008>26007>26006,>26051>26050>26048>26047>26045>26044>26042>26041>26039>26038>26036>26035>26033>26032>26030>26029>26027>26026>26025>26023>26021>26020>26018>26017>26015>26014>26012>26011>26007>26006,>26051>26050>26048>26047>26045>26044>26042>26041>26039>26038>26036>26035>26033>26032>26030>26029>26027>26026>26024>26023>26021>26020>26018>26017>26015>26014>26012>26011<26009<26008>26007>26006,>26051>26050>26048>26047>26045>26044>26042>26041>26039>26038>26036>26035>26033>26032>26030>26029>26027>26026>26024>26023>26021>26020>26018>26017>26015>26014>26012>26011<26008>26007>26006;NS=43;LV=0

Decomposed and normalized VCF:

chr1	591437	>26051>26006	TAGAAGGAATAAGACGGGCCGGGTGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGTGGGCGGATCACGAGGTCAGAAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCTGGGCATGGTGGTGGGCGCCTGTAGTCCCAGCTACTTGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCCCGCCACTGCACTCCAGCCTGAGCGACAGAGTGAGACTCTGTCTCAAAAAAAAAAAAAAAAA	T	.	PASS	AT=>26051>26050>26048>26047>26045>26044>26042>26041>26039>26038>26036>26035>26033>26032>26030>26029>26027>26026>26024>26023>26021>26020>26018>26017>26015>26014>26012>26011>26007>26006,>26051>26006;ID=chr1-591441-DEL->26051>26006-313
chr1	591452	>26051>26006	G	C	.	PASS	AT=>26050>26048>26047,>26050<26049>26047;ID=chr1-591452-SNV->26050<26049>26047-1
chr1	591460	>26051>26006	T	C	.	PASS	AT=>26047>26045>26044,>26047>26046>26044;ID=chr1-591460-SNV->26047>26046>26044-1
chr1	591463	>26051>26006	G	A	.	PASS	AT=>26044>26042>26041,>26044>26043>26041;ID=chr1-591463-SNV->26044>26043>26041-1
chr1	591478	>26051>26006	T	G	.	PASS	AT=>26041>26039>26038,>26041>26040>26038;ID=chr1-591478-SNV->26041>26040>26038-1
chr1	591505	>26051>26006	T	C	.	PASS	AT=>26038>26036>26035,>26038>26037>26035;ID=chr1-591505-SNV->26038>26037>26035-1
chr1	591590	>26051>26006	T	C	.	PASS	AT=>26035>26033>26032,>26035>26034>26032;ID=chr1-591590-SNV->26035>26034>26032-1
chr1	591626	>26051>26006	T	C	.	PASS	AT=>26032>26030>26029,>26032>26031>26029;ID=chr1-591626-SNV->26032>26031>26029-1
chr1	591650	>26051>26006	C	T	.	PASS	AT=>26029>26027>26026,>26029>26028>26026;ID=chr1-591650-SNV->26029>26028>26026-1
chr1	591659	>26051>26006	G	A	.	PASS	AT=>26026>26024>26023,>26026>26025>26023;ID=chr1-591659-SNV->26026>26025>26023-1
chr1	591689	>26051>26006	C	G	.	PASS	AT=>26023>26021>26020,>26023>26022>26020;ID=chr1-591689-SNV->26023>26022>26020-1
chr1	591696	>26051>26006	T	C	.	PASS	AT=>26020>26018>26017,>26020>26019>26017;ID=chr1-591696-SNV->26020>26019>26017-1
chr1	591710	>26051>26006	A	G	.	PASS	AT=>26017>26015>26014,>26017>26016>26014;ID=chr1-591710-SNV->26017>26016>26014-1
chr1	591728	>26051>26006	T	C	.	PASS	AT=>26014>26012>26011,>26014>26013>26011;ID=chr1-591728-SNV->26014>26013>26011-1
chr1	591733	>26051>26006	C	CA	.	PASS	AT=>26011>26007,>26011<26008>26007;ID=chr1-591733-INS->26011<26008>26007-1
chr1	591733	>26051>26006	C	CAA	.	PASS	AT=>26011>26007,>26011<26009<26008>26007;ID=chr1-591733-INS->26011<26009<26008>26007-2
chr1	591733	>26051>26006	C	CAAAAAAAAAAAAAAAAAAAA	.	PASS	AT=>26011>26007,>26011>26010<26009<26008>26007;ID=chr1-591733-INS->26011>26010<26009<26008>26007-20

For a better example, you can refer to this input test.vcf.gz.txt and collapse output test_collapse.vcf.gz.txt (I removed the genotypes to make it smaller). Using default parameters, truvari works well for most SVs:

2024-09-04 10:49:40,002 [INFO] Zipped 1982 variants Counter({'base': 1982})
2024-09-04 10:49:40,002 [INFO] 1 chunks of 1982 variants Counter({'base': 998, '__filtered': 984})
2024-09-04 10:49:44,732 [INFO] Wrote 1212 Variants
2024-09-04 10:49:44,732 [INFO] 770 variants collapsed into 104 variants
2024-09-04 10:49:44,732 [INFO] 807 samples' FORMAT fields consolidated
2024-09-04 10:49:44,732 [INFO] Finished collapse

However, it cannot handle some complex SVs. For example, there are 36 SVs whose REF/ALT are > 50 bp but whose SVLEN is < 50 bp. Those SVs are skipped by truvari due to the small SVLEN. For such SVs, do you think it would be better to set SVLEN = len(REF)?

ACEnglish commented

If you've split multi-allelics, then your SVLEN should reflect the length of the single, bi-allelic variant. Perhaps splitting caused some problems and therefore you should run truvari anno svinfo to update the SVLEN.
