Han-Cao/collapse-bubble

collapse-bubble

SV merging for pangenome VCFs generated by the minigraph-cactus pipeline. This tool merges similar SVs within the same bubble, such as highly polymorphic insertions (INS) at the same position.

Note: I will consolidate the entire pipeline into a single script once I have time. For now, please follow the step-by-step instructions.

TODO: A new SV merging pipeline that concatenates overlapping variants before SV merging will be developed. The new pipeline will also merge SVs within the same tandem repeats across different bubbles.

Overview

This pipeline uses Truvari's API to merge SVs and genotypes. Compared with truvari collapse, it is optimized for pangenome VCFs:

  1. Only SVs within the same bubble are compared for merging.
  2. When merging genotypes, the phase of genotypes is checked and retained:
    • 0|0 and 1|0: consistent, merge into 1|0
    • 0|1 and 1|0: consistent, merge into 1|1
    • 1|0 and 1|0: inconsistent, don't merge if found in any sample
  3. For SVs with large REF and ALT alleles, the size is determined by REFLEN (length of REF) instead of SVLEN (length difference between REF and ALT). This approach works better for large inversions or complex SVs. For example, a 100bp inversion and a 1000bp inversion have the same SVLEN but different REFLEN.
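
The phase check in point 2 can be illustrated with a minimal sketch. It is not the code used by collapse_bubble.py; the function names are hypothetical, and it assumes fully phased diploid genotypes without missing alleles.

```python
from typing import Optional, Sequence, Tuple

Genotype = Tuple[int, int]  # (haplotype 1, haplotype 2); 0 = REF, 1 = ALT


def parse_gt(gt: str) -> Genotype:
    """Parse a phased diploid genotype string such as "1|0"."""
    left, right = gt.split("|")
    return int(left), int(right)


def merge_phased_gt(a: str, b: str) -> Optional[str]:
    """Merge two phased genotypes haplotype by haplotype.

    Returns None when both variants carry ALT on the same haplotype,
    which is the "inconsistent" case in the list above.
    """
    merged = []
    for allele_a, allele_b in zip(parse_gt(a), parse_gt(b)):
        if allele_a == 1 and allele_b == 1:
            return None  # one haplotype cannot carry both variants
        merged.append(max(allele_a, allele_b))
    return "|".join(str(x) for x in merged)


def can_merge(gts_a: Sequence[str], gts_b: Sequence[str]) -> bool:
    """Two variants are mergeable only if no sample is inconsistent."""
    return all(merge_phased_gt(a, b) is not None for a, b in zip(gts_a, gts_b))
```

With this sketch, `merge_phased_gt("0|1", "1|0")` gives `"1|1"`, while `merge_phased_gt("1|0", "1|0")` gives `None`, so a single such sample blocks the merge.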

Dependency

All scripts have been tested with Python 3.10. Please install the following python modules to run the scripts.

(Note: the script was developed with truvari 4.2.2. Currently, it does not work with the new API introduced in truvari 5.)

truvari==4.2.2
pysam
pandas
numpy

We also use the following tools to process VCFs:

  • bcftools: The +fill-tags plugin is used. Please export BCFTOOLS_PLUGINS to configure it.
  • vcfwave: Please use vcfwave v1.0.12 or later, as earlier versions may output incorrect genotypes for some variants.

Input

  • Multiallelic graph VCF: Generated by the minigraph-cactus pipeline and processed by vcfbub. The ID field of this VCF should contain the unique bubble ID, e.g., <73488<73517. This is the default output VCF of the minigraph-cactus pipeline.
  • Reference genome FASTA file: Used for normalization.

Analysis pipeline

1. Preprocessing:

The preprocessing step generates a VCF file that meets the following requirements:

  • Decomposed by vcfwave.
  • Biallelic and sorted.
  • All variants have unique IDs.
  • Only variants from the same bubble share identical INFO/BUBBLE_ID.
  • AC, AN, and AF have been updated based on the genotypes.
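
Some of these requirements can be checked quickly before running the pipeline. The following is a rough sketch with a hypothetical helper, `check_preprocessed`, that works on plain-text VCF lines. It only covers the biallelic, unique-ID, and sort-order requirements; it does not verify vcfwave decomposition, INFO/BUBBLE_ID, or AC/AN/AF.

```python
from typing import Iterable, List


def check_preprocessed(lines: Iterable[str]) -> List[str]:
    """Return a list of problems found in VCF lines (empty if none)."""
    problems = []
    seen_ids = set()
    finished_chroms = set()
    prev_chrom, prev_pos = None, 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and empty lines
        chrom, pos, var_id, _ref, alt = line.split("\t")[:5]
        pos = int(pos)
        if "," in alt:
            problems.append(f"{var_id}: multiallelic ALT")
        if var_id == "." or var_id in seen_ids:
            problems.append(f"{var_id}: missing or duplicated ID")
        seen_ids.add(var_id)
        if chrom != prev_chrom:
            if chrom in finished_chroms:
                problems.append(f"{var_id}: chromosome {chrom} is not contiguous")
            if prev_chrom is not None:
                finished_chroms.add(prev_chrom)
            prev_chrom, prev_pos = chrom, 0
        if pos < prev_pos:
            problems.append(f"{var_id}: position out of order")
        prev_pos = pos
    return problems
```

An empty return value means none of the three checked requirements is violated.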

If you already have such a VCF, it can be used directly without preprocessing. To start with the multiallelic graph VCF, perform the following steps:

  1. Split multiallelic records into biallelic records using bcftools.
  2. Annotate variants' bubble ID and generate unique variant IDs using annotate_var_id.py.
  3. Decompose the VCF using vcfwave.
  4. Left-align, update AC, AN, and AF, and sort using bcftools.

# assume the input VCF is named "mc.vcf.gz"

# split into biallelic
bcftools norm -m -any mc.vcf.gz -Oz -o mc.biallele.vcf.gz

# annotate VCF and assign unique variant ID
python scripts/annotate_var_id.py \
-i mc.biallele.vcf.gz \
-o mc.biallele.uniq_id.vcf.gz

# decompose with vcfwave; optionally drop INFO/AT first (useless after decomposition, suggested by cactus)
# -Ov writes uncompressed text VCF to the pipe
bcftools annotate -x INFO/AT -Ov mc.biallele.uniq_id.vcf.gz | \
vcfwave -I 1000 | bgzip -c > mc.biallele.uniq_id.vcfwave.vcf.gz

# normalization and sort, update AC/AN/AF
bcftools norm -f ref.fa mc.biallele.uniq_id.vcfwave.vcf.gz | \
bcftools +fill-tags -- -t AC,AN,AF | \
bcftools sort --max-mem 4G -Oz -o mc.biallele.uniq_id.vcfwave.sort.vcf.gz

annotate_var_id.py:

This script assigns a unique ID in the format [BUBBLE_ID].[TYPE].[No.] to each variant. The original variant ID (i.e., bubble ID) is stored in INFO/BUBBLE_ID.

usage: annotate_var_id.py [-h] -i VCF -o VCF

options:
  -i VCF, --input VCF   Input VCF
  -o VCF, --output VCF  Output VCF
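
The ID format can be illustrated with a small sketch. It is not taken from annotate_var_id.py: the function name is hypothetical, the variant type is passed in rather than inferred, and the counter is assumed to run per bubble and type starting from 1.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def assign_unique_ids(records: Iterable[Tuple[str, str]]) -> List[str]:
    """Build [BUBBLE_ID].[TYPE].[No.] IDs from (bubble_id, var_type) pairs."""
    counters = defaultdict(int)  # assumption: one counter per (bubble, type)
    ids = []
    for bubble_id, var_type in records:
        counters[(bubble_id, var_type)] += 1
        ids.append(f"{bubble_id}.{var_type}.{counters[(bubble_id, var_type)]}")
    return ids
```

Under these assumptions, two insertions in bubble `>123>456` become `>123>456.INS.1` and `>123>456.INS.2`.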

2. SV merging

To perform SV merging, run collapse_bubble.py:

python collapse_bubble.py \
-i mc.biallele.uniq_id.vcfwave.sort.vcf.gz \
-o merged.vcf.gz \
--map merged.mapping.txt

This script uses the Truvari API to merge SVs that have the same INFO/BUBBLE_ID. It outputs two files:

1. VCF:

The output VCF is unsorted and contains the merged variants and genotypes. Similar SVs are merged into the one with the highest MAF. The following fields of the merged SV are added or modified:

  • INFO/ID_LIST: comma-separated list of variants merged into this variant
  • INFO/TYPE: type of variant (SNP, MNP, INS, DEL, INV, COMPLEX)
  • INFO/REFLEN: len(ref)
  • INFO/SVLEN: len(alt) - len(ref)
  • FORMAT/GT: merged genotypes

For example:

#input:
chr1  10039  >123>456.INS.1  A  ATTTTTT  AC=2;AN=6;AF=0.333;BUBBLE_ID=>123>456  0|1  1|0  0|0
chr1  10039  >123>456.INS.2  A  ATTTTTG  AC=3;AN=6;AF=0.500;BUBBLE_ID=>123>456  1|0  0|1  1|0

#output
chr1  10039  >123>456.INS.2  A  ATTTTTG  AC=3;AN=6;AF=0.500;BUBBLE_ID=>123>456;ID_LIST=>123>456.INS.1;TYPE=INS;REFLEN=1;SVLEN=6  1|1  1|1  1|0
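
The choice of the retained record in the example above can be sketched as follows. This is only an illustration of "highest MAF", with hypothetical function names; how collapse_bubble.py breaks ties is not specified here, and this sketch simply keeps the first candidate.

```python
from typing import List, Tuple


def maf(af: float) -> float:
    """Minor allele frequency of a biallelic variant with ALT frequency af."""
    return min(af, 1.0 - af)


def pick_representative(variants: List[Tuple[str, float]]) -> str:
    """Return the ID of the variant with the highest MAF.

    variants: list of (variant_id, AF) pairs; ties keep the first entry.
    """
    return max(variants, key=lambda v: maf(v[1]))[0]
```

For the two input records above (AF=0.333 and AF=0.500), the sketch selects `>123>456.INS.2`, matching the output record.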

2. SV merging table:

A TSV file mapping original SVs (Variant_ID) to merged SVs (Collapse_ID). For example:

CHROM   POS      Bubble_ID   Variant_ID       Collapse_ID
chr1    10039    >123>456    >123>456.INS.1   >123>456.INS.2
chr1    10039    >123>456    >123>456.INS.2   >123>456.INS.2

Additionally, VCF INFO fields can be included as columns by specifying --info. If --info SVLEN is used, the SVLEN column in the TSV file contains REFLEN for COMPLEX and INV variants, to indicate the value used for comparison. In the output VCF, however, INFO/SVLEN is always set to len(alt) - len(ref) for all SVs.
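
The mapping table is convenient for downstream lookups. Below is a minimal sketch that groups original variants by the SV they were merged into. It assumes the file is tab-delimited with the header row shown above; `load_collapse_groups` is a hypothetical helper, not part of this repository.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, TextIO


def load_collapse_groups(handle: TextIO) -> Dict[str, List[str]]:
    """Map each Collapse_ID to the list of Variant_IDs merged into it."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        groups[row["Collapse_ID"]].append(row["Variant_ID"])
    return dict(groups)


# The example table from above, held in memory instead of a file
example = io.StringIO(
    "CHROM\tPOS\tBubble_ID\tVariant_ID\tCollapse_ID\n"
    "chr1\t10039\t>123>456\t>123>456.INS.1\t>123>456.INS.2\n"
    "chr1\t10039\t>123>456\t>123>456.INS.2\t>123>456.INS.2\n"
)
groups = load_collapse_groups(example)
```

For a real run, pass an open file handle for the file written by --map (merged.mapping.txt in the command above).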

Arguments:

usage: collapse_bubble.py [-h] -i VCF -o VCF -m TSV [--chr CHR] [--info TAG] [-l 50] [-r 100] [-p 0.9] [-P 0.9] [-O 0.9]

Input / Output arguments:
  -i VCF, --invcf VCF   Input VCF
  -o VCF, --outvcf VCF  Output VCF
  -m TSV, --map TSV     Write SV mapping table to this file. Default: None
  --chr CHR             chromosome to work on. Default: all
  --info TAG            Comma-separated INFO/TAG list to include in the output map. Default: None

Collapse arguments:
  -l 50, --min-len 50   Minimum allele length of variants to be included, 
                        defined as max(len(alt), len(ref)). Default: 50
  -r 100, --refdist 100
                        Max reference location distance. Default: 100
  -p 0.9, --pctseq 0.9  Min percent sequence similarity (REF for DEL, ALT for other SVs). Default: 0.9
  -P 0.9, --pctsize 0.9
                        Min percent size similarity (SVLEN for INS, DEL; REFLEN for INV, COMPLEX). Default: 0.9
  -O 0.9, --pctovl 0.9  Min pct reciprocal overlap. Default: 0.9
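
How --min-len and --pctsize interact with REFLEN and SVLEN can be sketched as below. This is an illustration rather than the script's implementation: the function names are hypothetical, and size similarity is assumed to be the ratio of the smaller to the larger size.

```python
def allele_length(ref: str, alt: str) -> int:
    """Length used for --min-len: max(len(alt), len(ref))."""
    return max(len(alt), len(ref))


def comparison_size(var_type: str, ref: str, alt: str) -> int:
    """Size used for --pctsize: REFLEN for INV/COMPLEX, |SVLEN| otherwise."""
    if var_type in ("INV", "COMPLEX"):
        return len(ref)
    return abs(len(alt) - len(ref))


def size_similarity(size_a: int, size_b: int) -> float:
    """Assumed definition: smaller size divided by larger size."""
    if max(size_a, size_b) == 0:
        return 1.0
    return min(size_a, size_b) / max(size_a, size_b)


# A 100 bp and a 1000 bp inversion: SVLEN is 0 for both, REFLEN differs
small = comparison_size("INV", "A" * 100, "T" * 100)
large = comparison_size("INV", "A" * 1000, "T" * 1000)
similar_enough = size_similarity(small, large) >= 0.9
```

Here `similar_enough` is `False`, so the two inversions would not pass the default --pctsize threshold, whereas comparing their SVLEN values would not distinguish them.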

3. Merge overlapping variants

After variant decomposition and left alignment, the VCF contains overlapping variants at the same position (mostly SNPs and INDELs). For example:

chr1   100   var1   C   G     1|0
chr1   100   var2   C   G     0|1
chr1   100   var3   C   CAA   1|0
chr1   100   var4   C   CAA   1|0

In this example:

  • var1 and var2 are duplicates, as they share the same POS, REF, and ALT. This is due to variant decomposition.
  • var1 and var3/var4 overlap on the first haplotype, as there are three alternative alleles at the same POS. This is caused by left alignment.

merge_duplicates.py can clean up duplicated and overlapping variants:

python scripts/merge_duplicates.py \
-i merged.vcf.gz \
-o merged.dedup.vcf.gz \
-c repeat

  • It first concatenates overlapping tandem repeats using the algorithm described in the documentation. For example, var3 and var4 are concatenated into C CAAAA.
  • After concatenating all overlapping variants, it merges duplicated variants into a single record and also updates the phased genotypes.

Output:

chr1   100   var1   C   G     1|1
chr1   100   chr1:100_0   C   CAAAA 1|0
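
The two steps can be mimicked on the example with a simplified sketch. It is not the algorithm used by merge_duplicates.py: the concatenation below naively joins two insertions sharing the same anchor base and does not check repeat motifs, and both function names are hypothetical.

```python
from typing import List, Tuple

Record = Tuple[int, str, str, Tuple[int, int]]  # (POS, REF, ALT, phased GT)


def concat_insertions(ref: str, alt_a: str, alt_b: str) -> str:
    """Naively join two insertions anchored on the same REF base(s)."""
    if not (alt_a.startswith(ref) and alt_b.startswith(ref)):
        raise ValueError("both ALT alleles must start with REF")
    return alt_a + alt_b[len(ref):]


def merge_duplicates(records: List[Record]) -> List[Record]:
    """Merge records with identical POS, REF, and ALT, haplotype by haplotype."""
    merged = {}
    for pos, ref, alt, gt in records:
        key = (pos, ref, alt)
        if key not in merged:
            merged[key] = gt
            continue
        old = merged[key]
        if any(a == 1 and b == 1 for a, b in zip(old, gt)):
            raise ValueError(f"{key}: ALT on the same haplotype in both records")
        merged[key] = tuple(max(a, b) for a, b in zip(old, gt))
    return [(pos, ref, alt, gt) for (pos, ref, alt), gt in merged.items()]


# var3 + var4 are concatenated first, then var1 and var2 are merged
concatenated = concat_insertions("C", "CAA", "CAA")
result = merge_duplicates([
    (100, "C", "G", (1, 0)),           # var1
    (100, "C", "G", (0, 1)),           # var2
    (100, "C", concatenated, (1, 0)),  # var3 + var4
])
```

In this sketch, `result` holds two records, C>G with genotype (1, 1) and C>CAAAA with genotype (1, 0), which corresponds to the output shown above.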

Warning: merge_duplicates.py can increase the polymorphism of tandem repeats when there are many samples. Using --max-repeat 50 can prevent SVs from being concatenated. For accurate quantification of tandem repeat lengths, it is better to run merge_duplicates.py -c repeat without any SV merging.

Note: merge_duplicates.py can concatenate any overlapping variants when -c position is used. This method reconstructs the local haplotype, which significantly increases polymorphism and is therefore not suitable for SV merging. It also has stricter requirements for the input VCF, so please read the documentation carefully before using it. Note that merge_duplicates.py -c position is included in the latest minigraph-cactus pipeline. If you are interested in generating a VCF that reconstructs all overlapping variants, please try the latest MC pipeline instead.

Arguments:

usage: merge_duplicates.py [-h] -i VCF -o VCF [-c {position,repeat,none}] [-m MAX_REPEAT] [-t {ID,AT}] [--merge-mis-as-ref] [--keep-order] [--debug]

Merge duplicated variants in phased VCF

options:
  -h, --help            show this help message and exit
  -i VCF, --invcf VCF   Input VCF, sorted and phased
  -o VCF, --outvcf VCF  Output VCF
  -c {position,repeat,none}, --concat {position,repeat,none}
                        Concatenate variants when they have identical "position" (default) or "repeat" motif, "none" to skip
  -m MAX_REPEAT, --max-repeat MAX_REPEAT
                        Maximum size a variant to search for repeat motif (default: None)
  -t {ID,AT}, --track {ID,AT}
                        Track how variants are merged by "ID", "AT", or disable (default)
  --merge-mis-as-ref    Convert missing to ref when merging missing genotypes with non-missing genotypes
  --keep-order          keep the order of variants in the input VCF (default: sort by chr, pos, alleles)
  --debug               Debug mode

Acknowledgement

  • We thank Adam English for Truvari and for sharing the insightful script that inspired collapse_bubble.py.
  • We thank Glenn Hickey for valuable suggestions on merging overlapping variants.
