gi-bielefeld/sans


SANS ambages

Symmetric Alignment-free phylogeNomic Splits
--- phylogenomics with Abundance-filter, Multi-threading and Bootstrapping on Amino-acid or GEnomic Sequences

  • Reference-free
  • Alignment-free
  • Input: assembled genomes / reads, or coding sequences / amino acid sequences
  • Output: phylogenetic splits or tree
  • NEW: Abundance-filter
  • NEW: Bootstrapping
  • NEW: Multi-threading
  • NEW: Labeled PDF and Nexus output
  • NEW: Better performance by handling singleton k-mers separately
  • NEW: Output core k-mers

Dos and Don'ts

  • The genomes should not be too diverged. SANS works well on species level.
  • Be careful with outliers and outgroups (for the reason above).
  • The sequences should not be too short. Provide whole-genome data or as many coding sequences as possible.
  • Be careful with viruses (for the reasons above).
  • Have a look at the network (weakly compatible or 2-tree). It does not make much sense to extract a tree, if the split network is a hairball.
  • Reconstructed phylogenies are unrooted, even though a Newick file (-N) suggests a root.
  • In case of problems, contact us (see below).

Publications

Rempel, A., Wittler, R.: SANS serif: alignment-free, whole-genome based phylogenetic reconstruction. Bioinformatics. (2021).

Wittler, R.: Alignment- and reference-free phylogenomics with colored de Bruijn graphs. Algorithms for Molecular Biology. 15: 4 (2020).

Wittler, R.: Alignment- and reference-free phylogenomics with colored de Bruijn graphs. In: Huber, K. and Gusfield, D. (eds.) Proceedings of WABI 2019. LIPIcs. 143, Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany (2019).

Requirements

For the main program, there are no strict dependencies other than C++ version 14.
To read in compressed fasta/fastq files, it could be necessary to install zlib:

sudo apt install libz-dev

Optional:

  • To read in a colored de Bruijn graph, SANS uses the API of Bifrost.
  • To convert the output into NEXUS format, the provided script requires Python 3.
  • To visualize the splits, we recommend the tool SplitsTree (version 4).

On Windows, the full basic functionality is currently supported, but only a limited set of additional features. See the branch "windows".

Compilation

git clone https://gitlab.ub.uni-bielefeld.de/gi/sans.git
cd sans
make

By default, the installation creates:

  • a binary (SANS)

You may want to make the binary (SANS) accessible via your PATH variable.

Optional: If Bifrost should be used, change the SANS makefile accordingly (easy to see how). Please note the installation instructions regarding the default maximum k-mer size of Bifrost in its README. If the Bifrost library files are not found during compilation, make sure that the corresponding folder is on the include path of the C++ compiler. You may have to add -I/usr/local/include (with the corresponding folder) to the compiler flags in the makefile. We also recommend having a look at the FAQs of Bifrost.

Usage

Use SANS --help to obtain a detailed list of options.

Input files

Specify your input by -i <list> where <list> is either a file-of-files or in kmtricks format. Each file can be in fasta, multiple fasta or fastq format.

  • File-of-files:
    genome_a.fa
    genome_b.fa
    ...
    
    Files can be in subfolders and/or compressed:
    dataset_1/genome_a.fa.gz
    dataset_1/genome_b.fa.gz
    ...
    
    One genome can also be composed of several files (the first one will be used as identifier in the output):
    reads_a_forward.fa reads_a_reverse.fa
    genome_b_chr_1.fa genome_b_chr_2.fa
    ...
    
  • kmtricks format: In this format, you can specify individual identifiers and, optionally, abundance thresholds (see "read data as input"):
    genome_A : reads_a_forward.fa ; reads_a_reverse.fa ! 2
    genome_B : genome_b_chr_1.fa ; genome_b_chr_2.fa ! 1
    ...
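
The two list formats above can be told apart line by line. The following Python sketch is not part of SANS; it only illustrates how the example lines shown above could be interpreted. The names `GenomeEntry` and `parse_list_line` are hypothetical, and the sketch assumes that file paths contain neither whitespace nor the characters `:`, `;` and `!`.

```python
from typing import NamedTuple, Optional


class GenomeEntry(NamedTuple):
    identifier: str               # name used for the genome in the output
    files: list[str]              # one or more sequence files
    min_abundance: Optional[int]  # per-genome threshold, if given


def parse_list_line(line: str) -> GenomeEntry:
    """Parse one line of a file-of-files or of a kmtricks-style list."""
    line = line.strip()
    if ":" in line:
        # kmtricks style: <id> : <file> ; <file> ! <threshold>
        identifier, rest = (part.strip() for part in line.split(":", 1))
        threshold = None
        if "!" in rest:
            rest, raw = rest.rsplit("!", 1)
            threshold = int(raw)
        files = [f.strip() for f in rest.split(";") if f.strip()]
    else:
        # plain file-of-files: the first file serves as identifier
        files = line.split()
        identifier, threshold = files[0], None
    return GenomeEntry(identifier, files, threshold)
```

For example, `parse_list_line("genome_A : reads_a_forward.fa ; reads_a_reverse.fa ! 2")` yields the identifier `genome_A`, two files and the threshold 2.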
    

Input parameters

  • genomes/assemblies as input: just use -i <list>
  • read data as input: to filter out k-mers of low abundance, either use -q 2 (or higher thresholds) to specify a global threshold for all input files, or use the kmtricks file-of-files format to specify (individual) thresholds.
  • mix of assemblies and read data as input: use the kmtricks file-of-files format to specify individual thresholds.
  • coding sequences as input: add -a if input is provided as translated sequences, or add -c if translation is required. See usage information (SANS --help) for further details.
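
To illustrate the idea behind the abundance threshold for read data, here is a minimal Python sketch. It is not the SANS implementation: it assumes that abundance means the number of occurrences of a k-mer within the files of one genome, and the function names are hypothetical.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count how often each k-mer occurs in a collection of reads."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def filter_by_abundance(counts, min_abundance):
    """Keep only k-mers seen at least `min_abundance` times (cf. -q 2)."""
    return {kmer for kmer, n in counts.items() if n >= min_abundance}


reads = ["ACGTA", "ACGTC", "GGGTA"]
solid = filter_by_abundance(count_kmers(reads, k=4), min_abundance=2)
```

K-mers that occur only once are likely to stem from sequencing errors, which is why a threshold of 2 or higher is suggested for reads, whereas assemblies need no such filter.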

Output

  • The output file, specified by -o <split-file>, lists all splits line by line, sorted by their weight in a tab-separated format where the first column is the split weight and the remaining entries are the identifiers of genomes split from the others.
  • For large data sets, the list of splits can become very long. For n input genomes, we recommend restricting the output to the 10n strongest splits using -t 10n.
  • We recommend filtering the splits using -f <filter>. The sorted list of splits is then filtered greedily, i.e., splits are iterated from strongest to weakest, and a split is kept if and only if the filter criterion is met.
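
The split file described above is easy to post-process. The following Python sketch reads the tab-separated lines and keeps the strongest splits; the function names are hypothetical, and the sketch assumes that the weight in the first column parses as a floating-point number.

```python
def read_splits(lines):
    """Yield (weight, identifiers) from tab-separated split lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip empty or malformed lines
        yield float(fields[0]), frozenset(fields[1:])


def strongest(splits, num_genomes, factor=10):
    """Keep the factor * n strongest splits, mimicking the effect of -t 10n."""
    ranked = sorted(splits, key=lambda s: s[0], reverse=True)
    return ranked[:factor * num_genomes]
```

With a file on disk, `strongest(read_splits(open("sans.splits")), num_genomes=20)` would return at most 200 splits.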

If you want a tree, use -f strict. In this case, -N <newick-file> can be used to write the resulting tree into a Newick file, instead of or in addition to -o <split-file>. If you want a network, use one of the following filters:

  • weakly: a split is kept if it is weakly compatible to all previously filtered splits (see publication for definition of "weak compatibility").
  • 2-tree: two sets of compatible splits (=trees) are maintained. A split is added to the first if possible (compatible); if not to the second if possible.
  • 3-tree: three sets of compatible splits (=trees) are maintained. A split is added to the first if possible (compatible); if not to the second if possible; if not to the third if possible.
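
The greedy tree filters can be sketched in a few lines of Python. This is an illustration under assumptions, not the SANS code: a split is represented by one of its two sides as a frozenset, two splits are treated as compatible if at least one of the four intersections of their sides is empty (the standard definition), and the function names are hypothetical. The weakly filter is not covered here.

```python
def compatible(a, b, taxa):
    """Two splits are compatible if one of the four side intersections is empty."""
    a_rest, b_rest = taxa - a, taxa - b
    return (not (a & b) or not (a & b_rest)
            or not (a_rest & b) or not (a_rest & b_rest))


def greedy_n_tree(splits, taxa, n=1):
    """Distribute splits, strongest first, over up to n sets of compatible splits.

    n=1 corresponds to a strict (tree) filter, n=2 to 2-tree, n=3 to 3-tree.
    """
    trees = [[] for _ in range(n)]
    for weight, side in sorted(splits, key=lambda s: s[0], reverse=True):
        for tree in trees:
            if all(compatible(side, kept, taxa) for _, kept in tree):
                tree.append((weight, side))
                break  # a split is placed in the first set that accepts it
    return trees
```

A split that fits into none of the n sets is discarded, so a larger n keeps more of the conflicting signal.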

To visualize the splits, we recommend the tool SplitsTree (version 4). Make it accessible via your PATH variable to enable the following options.

To generate SANS output readable for SplitsTree, use option -X <nexus-file>. In SplitsTree, after opening the <nexus-file>, select "Draw" > "EqualAngle" > "Apply". To produce a PDF using SplitsTree, use option -p <pdf-file>.

To depict the phylogeny on a higher level, taxa can be assigned to groups. Each group is then represented by a color and individual text labels of taxa are replaced by colored circles accordingly. Use option -l <groups.tsv> to provide a mapping of some or all genome identifiers to arbitrary group names. Optionally, provide a second file <colors.tsv> to provide a mapping of group names to custom RGB values, e.g. 255 0 0 for red.

Further parameters

  • To observe the progress of SANS during computation, use -v to switch to verbose mode.
  • You may want to try different values for the k-mer length using -k <integer>. On shorter sequences, e.g. virus data, use a smaller k, e.g., -k 11.
  • If your input contains 'N's or other ambiguous IUPAC characters, affected k-mers are skipped by default. Option -x <small_integer> can be used to replace these with the corresponding DNA or AA bases, considering all possibilities.
  • By default, all available threads are used for parallel processing. The number of threads can be limited by -T <integer>.
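
The default handling of ambiguous characters can be pictured as follows. The sketch below is a simplified Python illustration, not the SANS implementation; it only models the default behavior (skipping affected k-mers), not the replacement performed by -x, and the function name `kmers` is hypothetical.

```python
def kmers(sequence, k, alphabet="ACGT"):
    """Yield all k-mers of `sequence` that contain only unambiguous characters."""
    allowed = set(alphabet)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= allowed:
            yield kmer  # k-mers containing 'N' etc. are silently skipped


print(list(kmers("ACGNTACG", 3)))  # the three k-mers overlapping 'N' are dropped
```

A single ambiguous character removes up to k consecutive k-mers, which matters more for short sequences and is one reason to consider a smaller k there.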

Bootstrapping

To assess the robustness of reconstructed splits with respect to noise in the input data, bootstrap replicates can be constructed by randomly varying the observed k-mer content. To compare the originally determined splits to, e.g., 1000 bootstrap replicates, use -b 1000. An additional output file <split-file>.bootstrap containing the bootstrap support values will be created. To include them in the nexus file for visualization, use scripts/sans2conf_nexus.py <split-file> <split-file>.bootstrap <list> > <nexus-file>.

To generate a consensus tree from bootstrapped trees, use -f tree -b 1000 -C. To generate a consensus network from bootstrapped trees, use -f tree -b 1000 -C weakly. To filter out low support splits, add a threshold using -b 1000 0.75.
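
As a rough illustration of what a support value expresses, the following Python sketch computes, for each original split, the fraction of replicates in which it re-appears, and then applies a threshold such as 0.75. How SANS generates the replicates is not modeled here; the function names are hypothetical, and each split is assumed to be stored in one fixed representation (e.g., always the same side as a frozenset).

```python
def bootstrap_support(original_splits, replicates):
    """Fraction of replicates in which each original split re-appears."""
    return {
        split: sum(split in rep for rep in replicates) / len(replicates)
        for split in original_splits
    }


def keep_supported(support, threshold):
    """Drop splits whose support is below the threshold (cf. -b 1000 0.75)."""
    return {split for split, value in support.items() if value >= threshold}
```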

See usage information (SANS --help) for further options.

Examples

  1. Split network from assemblies

    SANS -i list.txt -o sans.splits -X sans.nexus -t 10n -f weakly
    

    Tree in newick format from assemblies

    SANS -i list.txt -N sans.new -f strict
    

    Split network (and tree) from read data

    SANS -i list.txt -o sans.splits -X sans.nexus -t 10n -f weakly -q 2
    (SANS -i list.txt -s sans.splits -f strict -N sans.new)
    
  2. Drosophila example data

    # go to example directory
    cd example_data/drosophila
    
    # download data: whole genome (or coding sequences)
    ./download_WG.sh
    (./download_CDS.sh)
    
    # compute splits
    ../../SANS -i WG_list.kmt -o WG_weakly.splits -f weakly -v
    (../../SANS -i CDS_list.kmt -o CDS_weakly.splits -f weakly -v -c)
    
    # generate PDF (if SplitsTree installed)
    ../../SANS -i WG_list.kmt -s WG_weakly.splits -p WG_weakly.pdf
    (../../SANS -i CDS_list.kmt -s CDS_weakly.splits -p CDS_weakly.pdf)
    
    # generate labeled PDF (if SplitsTree installed)
    ../../SANS -i WG_list.kmt -s WG_weakly.splits -l groups.tsv -p WG_weakly_groups.pdf
    (../../SANS -i CDS_list.kmt -s CDS_weakly.splits -l groups.tsv -p CDS_weakly_groups.pdf)
    
    # filter for tree
    ../../SANS -i WG_list.kmt -s WG_weakly.splits -N WG.new -f strict
    (../../SANS -i CDS_list.kmt -s CDS_weakly.splits -N CDS.new -f strict)
    
    # generate consensus network from bootstrapped trees
    ../../SANS -i WG_list.kmt -f strict -b 1000 -C weakly -p WG_weakly.pdf -v
    (../../SANS -i CDS_list.kmt -f strict -b 1000 -C weakly -p CDS_weakly.pdf -v)
    
    
    Example network
  3. Virus example data

    # go to example directory
    cd example_data/prasinoviruses
    
    # download data
    ./download.sh
    
    # compute splits
    ../../SANS -i list.txt -o weakly.splits -f weakly -k 11 -v 
    
    # compute splits and generate PDF (if SplitsTree installed)
    ../../SANS -i list.txt -p weakly.pdf -f weakly -k 11 -v 
    
    Example network

Performance evaluation on predicted open reading frames

SANS-serif can predict phylogenies based on amino acid sequences as input. Processing coding sequences is faster than processing whole genome data. Experiments show that the reconstruction quality is comparable.

If you want to use selected marker genes, the number of genes should be as high as possible to provide sufficient sequence information for extracting phylogenetic signals.

If the genomes at hand are not annotated, you can use a tool to predict open reading frames. The following experiment shows that the reconstruction quality does not suffer from such a simple pre-processing or even improves, while saving total running time, especially because genomes can easily be pre-processed in parallel.

The following tools have been used with the parameters shown in the table.

| Tool | Parameters | Reference |
|---|---|---|
| SANS | -k (see below) -m geom2 -t 10n -filter strict [-a (if preprocessed)] | |
| Getorf (EMBOSS) | -find (see below) -t 11 | Gary Williams, 2000 |
| ORFfinder | -n true -g 11 | NCBI |
| Prodigal | -q | V2.6.3, Doug Hyatt, 2016 |

For estimating the reconstruction accuracy, the (weighted) F1 score has been determined as follows.

| Measure | Definition |
|---|---|
| F1-score | harmonic mean of precision and recall |
| precision | (number of called splits that are also in reference) / (total number of called splits) |
| recall | (number of reference splits that are also called) / (total number of reference splits) |
| weighted precision | (total weight of called splits that are also in reference) / (total weight of all called splits) |
| weighted recall | (total weight of reference splits that are also called) / (total weight of all reference splits) |
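
The measures above translate directly into code. The following Python sketch is not the evaluation script used for the experiments; it assumes that called and reference splits are given as dictionaries mapping a split to its weight, and that the weighted F1 score is the harmonic mean of weighted precision and weighted recall. The function names are hypothetical.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def split_scores(called, reference):
    """called / reference: dict mapping a split (frozenset) to its weight."""
    shared = [s for s in called if s in reference]
    precision = len(shared) / len(called)
    recall = len(shared) / len(reference)
    # weighted variants use the weights of the respective split list
    w_precision = sum(called[s] for s in shared) / sum(called.values())
    w_recall = sum(reference[s] for s in shared) / sum(reference.values())
    return {"F1": f1(precision, recall),
            "weighted F1": f1(w_precision, w_recall)}
```

Because strong splits dominate the weighted variant, a method that misses only weak splits can reach a weighted F1 close to 1 even when its unweighted F1 is clearly lower.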

Further information on the datasets can be found in the initial publication of SANS (Wittler, 2019), see above.

  • Salmonella enterica Para C, 220 genomes, k=31
  • Salmonella enterica subspecies enterica, 2964 genomes, k=21

| Preprocessing | Para C F1-Score | Para C weighted F1 | Enterica F1-Score | Enterica weighted F1 |
|---|---|---|---|---|
| none (whole genome) | 0.878 | 0.999 | 0.587 | 0.792 |
| Getorf (-find 0) | 0.881 | 0.999 | 0.624 | 0.807 |
| Getorf (-find 1) | 0.868 | 0.999 | 0.620 | 0.799 |
| ORFfinder | 0.858 | 0.997 | 0.594 | 0.766 |
| Prodigal | 0.853 | 0.998 | 0.587 | 0.792 |

Clustering / dereplication of metagenome assembled genomes (MAGs)

For clustering of highly similar sequences, a tree can be constructed which is then chopped into many small subtrees such that the taxa in each subtree correspond to one cluster. This procedure has been successfully applied for dereplication of metagenome assembled genomes (MAGs). Here, the input are MAGs, and the goal is to cluster these such that the MAGs in each cluster belong to the same strain.

The general procedure is:

# reconstruct a tree
SANS --input <list_of_files> --newick <tree_to_cluster.new> --filter strict --kmer 15 (--verbose) --window W --top T (see below)

# determine clusters from tree
scripts/newick2clusters.py <tree_to_cluster.new> > <clusters.tsv>

Due to the usually very high number of input sequences, we recommend using the parameters --window (-w) and --top (-t) in order to save time and memory. (The experimental parameter --window is not mentioned in the usage, because it can lower the accuracy of reconstructed phylogenies considerably. In this case, however, the reconstructed tree does not need to be an accurate phylogeny, and the parameter has only a reasonable effect on the clustering.)

| Setting | Parameters |
|---|---|
| quick | --window 25 --top 50n |
| thoroughly | --window 10 --top 100n |

The tree is chopped into clusters as follows:

  • Re-root tree to maximum degree node
  • In post order traversal:
    • ignore non-branching node (merge edges)
    • get clusters from sub-trees (recursively)
    • if edge longer than parent edge:
      • remove found clusters from current leaf set
      • remaining leaf set =: new cluster
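
The chopping procedure can be read in more than one way. The following Python sketch shows one possible interpretation and is not the code of scripts/newick2clusters.py. Its assumptions: the tree is already rooted at the maximum degree node, a node is a tuple (name, edge length, children), the root's own edge length is 0, merged edges add up their lengths, and leaves that are still unassigned at the root form a final cluster. All names are hypothetical.

```python
def chop(node, parent_length=0.0):
    """Return (open_leaves, clusters) for the subtree below `node`."""
    name, length, children = node
    # ignore non-branching nodes: merge the two edges into one
    while len(children) == 1:
        name, child_length, children = children[0]
        length += child_length
    if not children:
        open_leaves, clusters = [name], []
    else:
        open_leaves, clusters = [], []
        for child in children:  # post-order: children first
            leaves, found = chop(child, length)
            open_leaves += leaves
            clusters += found
    # edge longer than parent edge: unassigned leaves form a new cluster
    if length > parent_length and open_leaves:
        clusters.append(open_leaves)
        open_leaves = []
    return open_leaves, clusters


def clusters_from_tree(root):
    open_leaves, clusters = chop(root)
    return clusters + ([open_leaves] if open_leaves else [])


tree = ("root", 0.0, [
    ("x", 5.0, [("a", 1.0, []), ("b", 1.0, [])]),
    ("y", 4.0, [("w", 0.2, [("c", 0.3, [])]),
                ("z", 6.0, [("d", 1.0, []), ("e", 1.0, [])])]),
])
print(clusters_from_tree(tree))
```

In this toy tree, the long edges leading to x and z each close a cluster below them, and the remaining leaf c ends up in a cluster of its own.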

Dereplication efficiency and accuracy

This dereplication approach has been evaluated on a data set from the CAMI challenge [Meyer et al. Critical Assessment of Metagenome Interpretation - the second round of challenges, bioRxiv, 2021, doi: https://doi.org/10.1101/2021.07.12.451567], a simulated mouse gut metagenome representing 64 metagenome samples. Alexander Sczyrba and Peter Belmann filtered the MAGs, provided them for our clustering and compared the clustering to the gold standard.

| Filter | # MAGs | # Strains |
|---|---|---|
| MIMAG medium | 5,786 | 686 |
| MIMAG high | 2,510 | 349 |
| no filter | 11,602 | 791 |

For the comparison, each called cluster is mapped to a gold standard cluster with maximum intersection, i.e., maximum agreement of contained MAGs. Then, the purity and completeness of the clusters are determined by investigating the number of correct (TP), false (FP) and missing (FN) MAGs in each cluster:

purity := TP / (TP + FP)

completeness := TP / (TP + FN)
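
The two measures defined above can be computed per cluster as in the following Python sketch. It is an illustration, not the evaluation code used for the CAMI comparison; clusters are assumed to be non-empty sets of MAG identifiers, ties in the mapping are broken by list order, and weighting the average by cluster size is an assumption. The function names are hypothetical.

```python
def evaluate_clusters(called, gold):
    """Per-cluster purity and completeness against a gold standard.

    called, gold: lists of non-empty sets of MAG identifiers.
    """
    results = []
    for cluster in called:
        # map to the gold standard cluster with maximum intersection
        best = max(gold, key=lambda g: len(cluster & g))
        tp = len(cluster & best)   # correct MAGs
        fp = len(cluster - best)   # false MAGs
        fn = len(best - cluster)   # missing MAGs
        results.append({"size": len(cluster),
                        "purity": tp / (tp + fp),
                        "completeness": tp / (tp + fn)})
    return results


def average(results, measure, weighted=False):
    """Plain or size-weighted mean of one measure over all clusters."""
    weights = [r["size"] if weighted else 1 for r in results]
    return sum(w * r[measure] for w, r in zip(weights, results)) / sum(weights)
```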

These per-cluster measures were then averaged (weighted and unweighted). The following table shows the results for the different input and clustering settings. (The running times are outdated!)

| Input | Setting | Running time | Memory | average purity (weighted) | average completeness (weighted) |
|---|---|---|---|---|---|
| MIMAG medium | quick | 11h | 56G | 0.973 (0.956) | 0.884 (0.998) |
| MIMAG medium | thoroughly | 50h | 130G | 0.972 (0.952) | 0.881 (0.998) |
| MIMAG high | quick | 2h | 16G | 0.983 (0.978) | 0.890 (0.991) |
| MIMAG high | thoroughly | 6h | 36G | 0.983 (0.979) | 0.912 (0.993) |
| no filter | quick | 59h | 127G | 0.996 (0.983) | 0.173 (0.668) |
| no filter | thoroughly | 185h | 290G | 0.995 (0.979) | 0.190 (0.700) |

Contact

For any question, feedback, or problem, please feel free to file an issue on this Git repository or write an email and we will get back to you as soon as possible.

pangenomics-service@cebitec.uni-bielefeld.de

SANS is provided as a service of the German Network for Bioinformatics Infrastructure (de.NBI). We would appreciate it if you would participate in the evaluation of SANS by completing this very short survey.

License

Privacy

We use the open source software Matomo for web analysis in order to collect anonymized usage statistics for this repository. Please refer to our Privacy Notice for details.