SV merging for pangenome VCF generated by the minigraph-cactus pipeline. This tool merges similar SVs within the same bubble, such as highly polymorphic insertions (INS) at the same position.
Note: I will consolidate the entire pipeline into a single script once I have time. For now, please follow the step-by-step instructions.
TODO: A new SV merging pipeline that concatenates overlapping variants before SV merging will be developed. The new pipeline will also merge SVs within the same tandem repeats across different bubbles.
This pipeline uses Truvari's API to merge SVs and genotypes. Compared to truvari collapse, it is optimized for pangenome VCF:
- Only SVs within the same bubble are compared for merging.
- When merging genotypes, the phase of genotypes is checked and retained:
  - 0|0 and 1|0: consistent, merge into 1|0
  - 0|1 and 1|0: consistent, merge into 1|1
  - 1|0 and 1|0: inconsistent, don't merge if found in any sample
- For SVs with large REF and ALT alleles, the size is determined by REFLEN (length of REF) instead of SVLEN (length difference between REF and ALT). This approach works better for large inversions or complex SVs. For example, a 100bp inversion and a 1000bp inversion have the same SVLEN but different REFLEN.
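The genotype rules above can be illustrated with a minimal sketch. It is not the code used by collapse_bubble.py; the function names are hypothetical, and it assumes phased diploid genotypes without missing alleles.

```python
from typing import List, Optional


def merge_phased_gt(gt_a: str, gt_b: str) -> Optional[str]:
    """Merge two phased genotypes, or return None on a phase conflict.

    Assumption: genotypes look like "0|1" and contain no missing alleles.
    """
    merged = []
    for allele_a, allele_b in zip(gt_a.split("|"), gt_b.split("|")):
        if allele_a == "1" and allele_b == "1":
            # Both variants sit on the same haplotype: inconsistent.
            return None
        merged.append("1" if "1" in (allele_a, allele_b) else "0")
    return "|".join(merged)


def merge_samples(gts_a: List[str], gts_b: List[str]) -> Optional[List[str]]:
    """Merge genotypes across samples; a conflict in any sample blocks the merge."""
    merged = [merge_phased_gt(a, b) for a, b in zip(gts_a, gts_b)]
    if any(gt is None for gt in merged):
        return None
    return merged
```

With this sketch, merging 0|1 with 1|0 gives 1|1, while two variants that are both 1|0 in any sample are left unmerged.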
All scripts have been tested with Python 3.10. Please install the following python modules to run the scripts.
(Note: the script was developed with truvari 4.2.2. Currently, it does not work with the new API introduced in truvari 5.)
truvari==4.2.2
pysam
pandas
numpy
We also use the following tools to process VCFs:
- bcftools: The +fill-tags plugin is used. Please export BCFTOOLS_PLUGINS to configure it.
- vcfwave: Please use vcfwave v1.0.12 or later, as earlier versions may output incorrect genotypes for some variants.
- Multiallelic graph VCF: Generated by the minigraph-cactus pipeline and processed by vcfbub. The ID field of this VCF should contain the unique bubble ID, e.g., <73488<73517. This is the default output VCF of the minigraph-cactus pipeline.
- Reference genome FASTA file: Used for normalization.
The preprocessing step generates a VCF file that meets the following requirements:
- Decomposed by vcfwave.
- Biallelic and sorted.
- All variants have unique IDs.
- Only variants from the same bubble share identical INFO/BUBBLE_ID.
- AC, AN, and AF have been updated based on the genotypes.
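Some of the requirements above can be checked mechanically before running the merge. The following is a rough sketch, not part of this repository: check_preprocessed is a hypothetical helper that reads plain-text VCF lines and reports records that are multiallelic, lack a unique ID, lack the expected INFO tags, or are out of order within a chromosome. It cannot tell whether the VCF was decomposed by vcfwave or whether the AC/AN/AF values are up to date.

```python
from typing import Iterable, List

# INFO tags the preprocessed VCF is expected to carry.
REQUIRED_INFO_TAGS = ("BUBBLE_ID", "AC", "AN", "AF")


def check_preprocessed(lines: Iterable[str]) -> List[str]:
    """Return a list of human-readable problems found in tab-separated VCF lines."""
    problems: List[str] = []
    seen_ids = set()
    last_chrom, last_pos = None, 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and empty lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, var_id = fields[0], int(fields[1]), fields[2]
        alt, info = fields[4], fields[7]
        if "," in alt:
            problems.append(f"{var_id}: multiallelic ALT")
        if var_id == "." or var_id in seen_ids:
            problems.append(f"{var_id}: missing or duplicated ID")
        seen_ids.add(var_id)
        tags = {entry.split("=", 1)[0] for entry in info.split(";")}
        for tag in REQUIRED_INFO_TAGS:
            if tag not in tags:
                problems.append(f"{var_id}: INFO/{tag} is missing")
        if chrom == last_chrom and pos < last_pos:
            problems.append(f"{var_id}: not sorted by position")
        last_chrom, last_pos = chrom, pos
    return problems
```

For a bgzipped file, the lines could come from gzip.open(path, "rt"); an empty result only means that none of these specific checks failed.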
If you already have such a VCF, it can be used directly without preprocessing. To start with the multiallelic graph VCF, perform the following steps:
- Split multiallelic records into biallelic and update AC, AN, AF using bcftools.
- Annotate variants' bubble ID and generate unique variant IDs using annotate_var_id.py.
- Decompose the VCF using vcfwave.
- Left-align and sort using bcftools.
# suppose the name of input VCF is "mc.vcf.gz"
# split into biallelic
bcftools norm -m -any mc.vcf.gz -Oz -o mc.biallele.vcf.gz
# annotate VCF and assign unique variant ID
python scripts/annotate_var_id.py \
-i mc.biallele.vcf.gz \
-o mc.biallele.uniq_id.vcf.gz
# optional: first drop INFO/AT (useless after decomposition, suggested by cactus)
bcftools annotate -x INFO/AT -Ov mc.biallele.uniq_id.vcf.gz | \
vcfwave -I 1000 | bgzip -c > mc.biallele.uniq_id.vcfwave.vcf.gz
# normalization and sort, update AC/AN/AF
bcftools norm -f ref.fa mc.biallele.uniq_id.vcfwave.vcf.gz | \
bcftools +fill-tags -- -t AC,AN,AF | \
bcftools sort --max-mem 4G -Oz -o mc.biallele.uniq_id.vcfwave.sort.vcf.gz
annotate_var_id.py:
This script assigns a unique ID in the format [BUBBLE_ID].[TYPE].[No.] to each variant. The original variant ID (i.e., bubble ID) is stored in INFO/BUBBLE_ID.
usage: annotate_var_id.py [-h] -i VCF -o VCF
options:
-i VCF, --input VCF Input VCF
-o VCF, --output VCF Output VCF
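To make the ID format concrete, here is a small sketch of how IDs of the form [BUBBLE_ID].[TYPE].[No.] could be generated. It is not the logic of annotate_var_id.py: the classify rule is a simplified placeholder that does not detect inversions, and numbering variants per bubble and per type is an assumption based on the example IDs in this document.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def classify(ref: str, alt: str) -> str:
    """Placeholder variant typing (assumption, no INV detection)."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) == len(alt):
        return "MNP"
    if len(ref) == 1 and alt.startswith(ref):
        return "INS"
    if len(alt) == 1 and ref.startswith(alt):
        return "DEL"
    return "COMPLEX"


def assign_ids(records: List[Tuple[str, str, str]]) -> List[str]:
    """Build one unique ID per (bubble_id, ref, alt) record, in input order."""
    counters: Dict[Tuple[str, str], int] = defaultdict(int)
    new_ids = []
    for bubble_id, ref, alt in records:
        var_type = classify(ref, alt)
        counters[(bubble_id, var_type)] += 1
        new_ids.append(f"{bubble_id}.{var_type}.{counters[(bubble_id, var_type)]}")
    return new_ids
```

Two insertions in bubble >123>456 would receive >123>456.INS.1 and >123>456.INS.2 under these assumptions.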
To perform SV merging, run collapse_bubble.py:
python collapse_bubble.py \
-i mc.biallele.uniq_id.vcfwave.sort.vcf.gz \
-o merged.vcf.gz \
--map merged.mapping.txt
This script uses the Truvari API to merge SVs that have the same INFO/BUBBLE_ID. It outputs two files:
1. VCF:
The output VCF is not sorted and contains the merged variants and genotypes. Similar SVs are merged into the one with the highest MAF. The following fields of the merged SV are added or modified:
- INFO/ID_LIST: comma-separated list of variants merged into this variant
- INFO/TYPE: type of variant (SNP, MNP, INS, DEL, INV, COMPLEX)
- INFO/REFLEN: len(ref)
- INFO/SVLEN: len(alt) - len(ref)
- FORMAT/GT: merged genotypes
For example:
#input:
chr1 10039 >123>456.INS.1 A ATTTTTT AC=2;AN=6;AF=0.333;BUBBLE_ID=>123>456 0|1 1|0 0|0
chr1 10039 >123>456.INS.2 A ATTTTTG AC=3;AN=6;AF=0.500;BUBBLE_ID=>123>456 1|0 0|1 1|0
#output
chr1 10039 >123>456.INS.2 A ATTTTTG AC=3;AN=6;AF=0.500;BUBBLE_ID=>123>456;ID_LIST=>123>456.INS.1;TYPE=INS;REFLEN=1;SVLEN=6 1|1 1|1 1|0
2. SV merging table:
A TSV file mapping original SVs (Variant_ID) to merged SVs (Collapse_ID). For example:
CHROM POS Bubble_ID Variant_ID Collapse_ID
chr1 10039 >123>456 >123>456.INS.1 >123>456.INS.2
chr1 10039 >123>456 >123>456.INS.2 >123>456.INS.2
Additionally, VCF INFO fields can be included as columns by specifying --info. If --info SVLEN is used, the SVLEN column in the TSV file reports REFLEN for COMPLEX and INV variants, to indicate the value used for comparison. In the output VCF, however, INFO/SVLEN is always set to len(alt) - len(ref) for all SVs.
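The relationship between the two outputs can be sketched as follows: within a group of similar SVs, the variant with the highest MAF becomes the representative, and every member of the group is mapped to it. This is an illustration only, not the script's code. The helper names are hypothetical, MAF is assumed to be min(AF, 1 - AF), and ties are resolved by input order.

```python
import csv
import io
from typing import Dict, List

COLUMNS = ["CHROM", "POS", "Bubble_ID", "Variant_ID", "Collapse_ID"]


def minor_allele_freq(af: float) -> float:
    """MAF of a biallelic variant (assumed definition)."""
    return min(af, 1.0 - af)


def build_mapping(cluster: List[Dict]) -> List[Dict]:
    """Map every variant in a cluster of similar SVs to the highest-MAF one."""
    representative = max(cluster, key=lambda v: minor_allele_freq(v["AF"]))
    return [
        {
            "CHROM": v["CHROM"],
            "POS": v["POS"],
            "Bubble_ID": v["Bubble_ID"],
            "Variant_ID": v["Variant_ID"],
            "Collapse_ID": representative["Variant_ID"],
        }
        for v in cluster
    ]


def to_tsv(rows: List[Dict]) -> str:
    """Serialize mapping rows as a tab-separated table with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Applied to the two insertions from the example (AF 0.333 and 0.500), both rows point to >123>456.INS.2 as Collapse_ID.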
Arguments:
usage: collapse_bubble.py [-h] -i VCF -o VCF -m TSV [--chr CHR] [--info TAG] [-l 50] [-r 100] [-p 0.9] [-P 0.9] [-O 0.9]
Input / Output arguments:
-i VCF, --invcf VCF Input VCF
-o VCF, --outvcf VCF Output VCF
-m TSV, --map TSV Write SV mapping table to this file. Default: None
--chr CHR chromosome to work on. Default: all
--info TAG Comma-separated INFO/TAG list to include in the output map. Default: None
Collapse arguments:
-l 50, --min-len 50 Minimum allele length of variants to be included,
defined as max(len(alt), len(ref)). Default: 50
-r 100, --refdist 100
Max reference location distance. Default: 100
-p 0.9, --pctseq 0.9 Min percent sequence similarity (REF for DEL, ALT for other SVs). Default: 0.9
-P 0.9, --pctsize 0.9
Min percent size similarity (SVLEN for INS, DEL; REFLEN for INV, COMPLEX). Default: 0.9
-O 0.9, --pctovl 0.9 Min pct reciprocal overlap. Default: 0.9
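The size-related thresholds can be illustrated with a short sketch. It is not taken from collapse_bubble.py: the function names are hypothetical, and size similarity is assumed here to be the smaller size divided by the larger one. It shows why REFLEN separates a 100bp inversion from a 1000bp inversion, whereas SVLEN is 0 for both.

```python
def passes_min_len(ref: str, alt: str, min_len: int = 50) -> bool:
    """--min-len filter: allele length is max(len(alt), len(ref))."""
    return max(len(alt), len(ref)) >= min_len


def comparison_size(var_type: str, ref: str, alt: str) -> int:
    """Size used for comparison: REFLEN for INV/COMPLEX, |SVLEN| otherwise."""
    if var_type in ("INV", "COMPLEX"):
        return len(ref)
    return abs(len(alt) - len(ref))


def size_similarity(size_a: int, size_b: int) -> float:
    """Assumed definition: smaller size divided by larger size."""
    if max(size_a, size_b) == 0:
        return 1.0
    return min(size_a, size_b) / max(size_a, size_b)


# Two inversions: REF and ALT have equal length, so SVLEN is 0 for both.
# The sequences are placeholders; only their lengths matter here.
small_inv = ("A" * 100, "T" * 100)
large_inv = ("A" * 1000, "T" * 1000)
similarity = size_similarity(
    comparison_size("INV", *small_inv),
    comparison_size("INV", *large_inv),
)
```

Here similarity falls well below the default --pctsize of 0.9, so the two inversions would not be treated as the same SV under these assumptions.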
After variant decomposition and left alignment, the VCF contains overlapping variants at the same position (mostly SNPs and INDELs). For example:
chr1 100 var1 C G 1|0
chr1 100 var2 C G 0|1
chr1 100 var3 C CAA 1|0
chr1 100 var4 C CAA 1|0
In this example:
- var1 and var2 are duplicates, as they share the same POS, REF, and ALT. This is due to variant decomposition.
- var1 and var3/var4 overlap on the first haplotype, as there are three alternative alleles at the same POS. This is caused by left alignment.
merge_duplicates.py can clean up duplicated and overlapping variants:
python scripts/merge_duplicates.py \
-i merged.vcf.gz \
-o merged.dedup.vcf.gz \
-c repeat
- It first concatenates overlapping tandem repeats using the algorithm described in the documentation. For example, var3 and var4 are concatenated into C CAAAA.
- After concatenating all overlapping variants, it merges duplicated variants into a single record and also updates the phased genotypes.
Output:
chr1 100 var1 C G 1|1
chr1 100 chr1:100_0 C CAAAA 1|0
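The two steps can be sketched in simplified form. This is not the algorithm of merge_duplicates.py, which is described in its own documentation. The helper names are hypothetical; concat_insertion_alts only handles two insertions that share the same anchor REF, and merge_duplicates assumes that concatenation has already been applied, so that records with identical POS, REF, and ALT never carry the allele on the same haplotype.

```python
from typing import Dict, List, Tuple

# chrom, pos, variant ID, REF, ALT, phased genotype per sample
Record = Tuple[str, int, str, str, str, List[str]]


def concat_insertion_alts(ref: str, alt_a: str, alt_b: str) -> str:
    """Join two insertions anchored on the same REF into a single ALT."""
    return alt_a + alt_b[len(ref):]


def union_gt(gt_a: str, gt_b: str) -> str:
    """Per-haplotype union of two phased genotypes."""
    pairs = zip(gt_a.split("|"), gt_b.split("|"))
    return "|".join("1" if "1" in pair else "0" for pair in pairs)


def merge_duplicates(records: List[Record]) -> List[Record]:
    """Collapse records sharing chrom, pos, REF, and ALT; keep the first ID."""
    merged: Dict[Tuple[str, int, str, str], list] = {}
    for chrom, pos, var_id, ref, alt, gts in records:
        key = (chrom, pos, ref, alt)
        if key not in merged:
            merged[key] = [var_id, list(gts)]
        else:
            kept_gts = merged[key][1]
            merged[key][1] = [union_gt(a, b) for a, b in zip(kept_gts, gts)]
    return [
        (chrom, pos, var_id, ref, alt, gts)
        for (chrom, pos, ref, alt), (var_id, gts) in merged.items()
    ]
```

In the example, concatenating the ALT alleles of var3 and var4 gives CAAAA, and merging var1 (1|0) with var2 (0|1) gives a single record with genotype 1|1.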
Warning: merge_duplicates.py can increase the polymorphism of tandem repeats when there are many samples. Using --max-repeat 50 can prevent SVs from being concatenated. For accurate quantification of tandem repeat lengths, it is better to run merge_duplicates.py -c repeat without any SV merging.
Note: merge_duplicates.py can concatenate any overlapping variants when -c position is used. This method reconstructs the local haplotype, significantly increasing polymorphism, which is not suitable for SV merging. Additionally, it has stricter requirements for the input VCF. Please carefully read the documentation before using it. It is worth noting that merge_duplicates.py -c position is included in the latest minigraph-cactus pipeline. If you are interested in generating a VCF that reconstructs all overlapping variants, please try the latest MC pipeline instead.
Arguments:
usage: merge_duplicates.py [-h] -i VCF -o VCF [-c {position,repeat,none}] [-m MAX_REPEAT] [-t {ID,AT}] [--merge-mis-as-ref] [--keep-order] [--debug]
Merge duplicated variants in phased VCF
options:
-h, --help show this help message and exit
-i VCF, --invcf VCF Input VCF, sorted and phased
-o VCF, --outvcf VCF Output VCF
-c {position,repeat,none}, --concat {position,repeat,none}
Concatenate variants when they have identical "position" (default) or "repeat" motif, "none" to skip
-m MAX_REPEAT, --max-repeat MAX_REPEAT
Maximum size a variant to search for repeat motif (default: None)
-t {ID,AT}, --track {ID,AT}
Track how variants are merged by "ID", "AT", or disable (default)
--merge-mis-as-ref Convert missing to ref when merging missing genotypes with non-missing genotypes
--keep-order keep the order of variants in the input VCF (default: sort by chr, pos, alleles)
--debug Debug mode