Pipeline to create phylogenetic trees for UK and global SARS-CoV-2 sequences and metadata, and publish matched subsets of annotated trees, FASTA sequences and metadata for groups with different access to sensitive data.
Builds trees weekly, with daily updates.
git clone --recurse-submodules https://github.com/virus-evolution/phylopipe.git
cd phylopipe
conda env create -f environment.yml
conda activate phylopipe
- Applies mask to problematic sites in the alignment
mask.txt
- Filters UK sequences, excluding those that have been labelled as duplicate
source_id
, i.e. same patient - Hash non-unique sequences, storing a FASTA with a representative for each unique sequence, and a hashmap from the representative to the now excluded IDs with identical sequence. When an outgroups file is provided, protects the outgroups, and otherwise choses a representative from the most recent
epi-week
to prevent filtering by date downstream - Filter on sample date, keeping all sequences from the last 120 days, keeping specified outgroups, and downsampling the remainder by excluding sequences within 3 mutations from an included sequence
- Split FASTA based on known distinct clades specified in
lineage_splits.csv
- Build tree for each sub-FASTA using
FastTreeMP
and reroot on the clade-specific outgrip - Graft together the subtrees to make a complete tree
- Expand the hashmap, inserting polytomies for each non-unique sequence
- Using
usher
andfaToVcf
, take the filtered aligned FASTA from preprocessing step 2 and construct a mutation annotated tree based on the grafted tree, adding the missing samples in the process where possible
- Sort and collapse short branches < 0.000005
- Annotate tree tips with
country
,lineage
anduk_lineage
- Infer deltrans with ancestral reconstruction and annotate
- Merge and create new
uk_lineages
and annotate - Infer
phylotypes
for UK lineages and annotate
- Publishes subsets of FASTA and metadata CSV with the NEWICK or NEXUS tree as specified in
publish_recipes.json
- Adds new sequences found in preprocessing step 2 to the
usher
mutation annotated tree daily, with the full tree pipeline run weekly.
grapevine
(https://github.com/COG-UK/grapevine) was the name of the original pipeline which preprocessed, aligned and variant called sequences, made phylogenetic trees and more. As the number of sequences has grown the tree building steps take increasingly long to complete. Datapipe
(https://github.com/COG-UK/grapevine_nextflow
) was created to provided daily alignment and metadata processing. This pipeline takes the output of datapipe, constructs trees, annotates and publishes them.