GitHub - Yasas1994/vcat: A tool for adding taxonomic annotations to virus contigs, mapping reads to virus genomes and much more.

            XXXX       XXXX
          XXXXXXXXXXXXXXXXXXX
         X    XXXXXXXXXXX    X
       XX      XXXXXXXXX      XX
     XX         XXXXXXX          X                           _
    XXXXXXX     XXXXXXX         XXX          __   _____ __ _| |_
    XXXXXXXXXX   XXXXX     XXXXXXXX          \ \ / / __/ _` | __|
     XXXXXXXXXXXXXXXXXXXXXXXXXXXXX            \ V / (_| (_| | |_
     XXXXXXXXXX76636174XXXXXXXXXXX             \_/ \___\__,_|\__|
     XXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    XXXXXXXX     XXXXX   XXXXXXXXX  
    XXX         XXXXXXX     XXXXXXX
     X          XXXXXXX         XX
       XX      XXXXXXXXX     X X
         X    XXXXXXXXXXX    X
          XXXXXXXXXXXXXXXXXXX
            XXXX       XXXX

The virus contig annotation Tool (vcat) is a straightforward, homology-based application designed to provide taxonomic annotations for viral contigs, mapping reads to virus contigs and much more.

Database Generation Pipeline

The pipeline automates the creation of comprehensive viral nucleotide, protein, and profile databases. The database generation process includes the following steps:

ICTV Metadata Parsing and Genome Download: VCAT parses the ICTV Metadata Resource and downloads all viral genomes and proteins from NCBI GenBank.
Custom Taxonomy Creation: A custom taxdump for the ICTV taxonomy is generated using taxonkit.
Nucleotide and Protein Database Construction: Custom nucleotide and protein databases are created using mmseqs2, incorporating the ICTV-specific taxdump.
Protein Clustering and Profile Database Creation: Viral proteins are clustered, and these clusters are converted into a protein profile database.

Contig Annotation Pipeline

vcat annotates viral contigs by sequentially comparing query sequences against the nucleotide, protein, and profile databases uisng mmseqs2. The annotation process follows these steps:

Nucleotide Database Comparison: Query sequences are first compared to the nucleotide database. If matches with >95% identity and >85% query coverage are found, the lineage of the target sequence is assigned to the query sequence.

Here, pseudo ANI is defined as follows. Let, $C$ be a unknown virus contig that needs to be taxonomically annotated. Let, $G_{db}$ be a database of known virus genomes, in our case: ICTV genomes. We search $C$ agaist $G_{db}$ to retrieve all matches for $C$ in $G_{db}$. Let, $M_{db}$ be the list of genomic regions $C$ map to. $M_{db}$ may contain hits from several taxonoically related genomes and $M_{{db}_t}$ represents all regions matching to a taxonomic rank. We calculate the total length of alignmnets to each taxonomic rank and the similarity of these alignments to $C$ using,

$$\text{pseudoANI} = \frac{\sum M_{\text{db},t} \times \text{identity}}{\sum M_{\text{db},t} \times \text{alignment length}}$$

Protein Database Comparison: For query sequences with no significant matches at the nucleotide level, open reading frames (ORFs) within the query sequence are extracted and compared to the viral protein database.

Here, pseudo AAI is defined as follows. extending the definitions from the above section, Let $P_{c}$ be a list of genes on $C$. $P_{db}$ is the protein database of ICTV viruses. $P_{db}$ is searched for matches for ORFs found on $C$ i.e. $C_{ORF}$. $P_m \in P_c$ represents all ORFs with a match. $I_m$ is the aminnno acid identity of proteins in $P_m$.

$$ pseudo\text{ }AAI = \frac{1}{n}\sum_{i}^{n}I_{m_{i}}$$

Profile Database Comparison: Sequences that remain unannotated at the protein level are then compared to the protein profile database to provide taxonomic annotations.

Here, pseudo API calculation is simlar to the AAI calculation above. The main difference is the target database is a protein profile database instead of protein database.

how to install?

git clone https://github.com/Yasas1994/vcat.git
cd vcat

# create a conda env
 mamba env create -f environment.yml

# install vcat pipeline
pip install .

# test the installation
vcat --help

Usage: vcat [OPTIONS] COMMAND [ARGS]...

  vcat: a command-line tool-kit for adding ICTV taxonomy annotations to virus
  contigs, mapping reads to virus genomes and much more.
  (https://github.com/Yasas1994/vcat)

Options:
  --version   Show the version and exit.
  -h, --help  Show this message and exit.

Commands:
  contigs    run contig annotation workflow
  preparedb  download and build reference databases
  reads      run read annotation workflow
  utils      tool chain for calculating ani, aai and visualizations

downloading and preparing the databases

vcat preparedb -d databases

running contig annotation pipeline

vcat contigs -i <input>.fasta -o outdir

results can be found in the results directory within the ouput directory

.
├── logs
├── nuc
│   ├── input_genome_ani.tsv
│   └── input_genome.m8
├── prof
│   ├── input_fasta_prof_api.tsv
│   └── input_fasta_prof.m8
├── prot
│   ├── input_fasta.faa
│   ├── input_fasta.gff
│   ├── input_fasta_prot_aai.tsv
│   └── input_fasta_prot.m8
├── results
│   ├── *input_fasta_ictv.csv
│   └── input_fasta.tsv
└── tmp

Expected runtime ?

It takes ~4hrs to run vcat on the ICTV Taxonomy challenge dataset on a laptop computer.

running other workflows

# run vcat read annotation pipeline (comming soon)
vcat reads -i1 <reads1>.fastq [-i2 <reads2.fastq>] -o outdir

# calculate aai of query contigs to ICTV genomes
vcat utils aai  [OPTIONS] -i contigs.m8 -g configs.gff -d [DBDIR]

# calculate ani of query contigs to ICTV genomes
vcat utils ani [OPTIONS] -i contigs.m8

# creating genome comparision plots. i.e query sequence to highly similar ICTV genomes (comming soon)
vcat utils visualize --ani --taxa [taxname] -i contigs.m8 -o outdir

# create phage contig annotation plots (coming soon)
vcat utils visualize --phrogs -i contigs.fasta -o outdir 

# identify provirus (coming soon)
vcat utils provirus -i contigs.fasta -o outdir

Some additional stuff

# to view the contig length distribution of your contigs
seqkit fx2tab -lg {input.fasta} | awk -F "\t" '{print $4}' | tail -n +2 | hist -b 100 -s 10

# to view the length distribution of contigs in the ani calculation output (after apply a filter)
cat testing/gut_jaeger/nuc/gut_jaeger_virus_seqs_fasta_genome_ani.tsv | awk -F "\t" '$8 > 0.1' | awk -F "\t" '{print $3}' | tail -n +2 | hist -b 100 -x

If you use vcat please cite,


[MMSEQS2] MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets.
          Steinegger M & Söding J. 2017. Nat. Biotech. 35, 1026–1028. https://doi.org/10.1038/nbt.3988

[PRODIGAL] Prodigal: prokaryotic gene recognition and translation initiation site identification.
           Hyatt et al. 2010. BMC Bioinformatics 11, 119. https://doi.org/10.1186/1471-2105-11-119.

[TAXONKIT] TaxonKit: A practical and efficient NCBI taxonomy toolkit.
           Shen, W. & Ren, H. J. 2021. Genet. Genomics https://doi:10.1016/j.jgg.2021.03.006.

Name		Name	Last commit message	Last commit date
Latest commit History 65 Commits
results		results
scripts		scripts
vcat		vcat
.gitignore		.gitignore
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
environment.yml		environment.yml
pyproject.toml		pyproject.toml
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Database Generation Pipeline

Contig Annotation Pipeline

how to install?

downloading and preparing the databases

running contig annotation pipeline

Expected runtime ?

running other workflows

Some additional stuff

About

Releases

Packages

Languages

License

Yasas1994/vcat

Folders and files

Latest commit

History

Repository files navigation

Database Generation Pipeline

Contig Annotation Pipeline

how to install?

downloading and preparing the databases

running contig annotation pipeline

Expected runtime ?

running other workflows

Some additional stuff

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages