Symmetric Alignment-free phylogeNomic Splits
--- phylogenomics with Abundance-filter, Multi-threading and Bootstrapping on Amino-acid or GEnomic Sequences
- Reference-free
- Alignment-free
- Input: assembled genomes / reads, or coding sequences / amino acid sequences
- Output: phylogenetic splits or tree
- NEW: Abundance-filter
- NEW: Bootstrapping
- NEW: Multi-threading
- NEW: Labeled PDF and Nexus output
- NEW: Better performance by handling singleton k-mers separately
- NEW: Output core k-mers
- The genomes should not be too diverged. SANS works well on species level.
- Be careful with outliers and outgroups (for the reason above).
- The sequences should not be too short. Provide whole-genome data or as many coding sequences as possible.
- Be careful with viruses (for the reasons above).
- Have a look at the network (weakly compatible or 2-tree). It does not make much sense to extract a tree, if the split network is a hairball.
- Reconstructed phylogenies are unrooted, even though a Newick file (-N) suggests a root.
- In case of problems, contact us (see below).
Rempel, A., Wittler, R.: SANS serif: alignment-free, whole-genome based phylogenetic reconstruction. Bioinformatics. (2021).
Wittler, R.: Alignment- and reference-free phylogenomics with colored de Bruijn graphs. Algorithms for Molecular Biology. 15: 4 (2020).
Wittler, R.: Alignment- and reference-free phylogenomics with colored de Bruijn graphs. In: Huber, K. and Gusfield, D. (eds.) Proceedings of WABI 2019. LIPIcs. 143, Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany (2019).
- Requirements
- Compilation
- Usage
- Examples
- Performance evaluation on predicted open reading frames
- Clustering / dereplication of metagenome assembled genomes (MAGs)
- Contact
- License
- Privacy
For the main program, there are no strict dependencies other than C++ version 14.
To read in compressed fasta/fastq files, it could be necessary to install zlib:
sudo apt install libz-dev
Optional:
- To read in a colored de Bruijn graph, SANS uses the API of Bifrost.
- To convert the output into NEXUS format, the provided script requires Python 3.
- To visualize the splits, we recommend the tool SplitsTree (version 4).
Windows is currently supported with full basic functionality but limited features only. Confer to branch "windows".
git clone https://gitlab.ub.uni-bielefeld.de/gi/sans.git
cd sans
make
By default, the installation creates:
- a binary (SANS)
You may want to make the binary (SANS) accessible via your PATH variable.
Optional: If Bifrost should be used, change the SANS makefile accordingly (easy to see how). Please note the installation instructions regarding the default maximum k-mer size of Bifrost in its README. If during the compilation, the Bifrost library files are not found, make sure that the corresponding folder is found as include path by the C++ compiler. You may have to add -I/usr/local/include
(with the corresponding folder) to the compiler flags in the makefile. We also recommend to have a look at the FAQs of Bifrost.
Use SANS --help
to obtain a detailed list of options.
Input files
Specify your input by -i <list>
where <list>
is either a file-of-files or in kmtricks format. Each file can be in fasta, multiple fasta or fastq format.
- File-of-files:
Files can be in subfolders and/or compressed:
genome_a.fa genome_b.fa ...
One genome can also be composed of several files (the first one will be used as identifier in the output):dataset_1/genome_a.fa.gz dataset_1/genome_b.fa.gz ...
reads_a_forward.fa reads_a_reverse.fa genome_b_chr_1.fa genome_b_chr_2.fa ...
- kmtricks format:
In this format, you can specify individual identifiers and, optionally, abundance thresholds (see "read data as input"):
genome_A : reads_a_forward.fa ; reads_a_reverse.fa ! 2 genome_B : genome_b_chr_1.fa ; genome_b_chr_2.fa ! 1 ...
Input paramters
- genomes/assemblies as input: just use
-i <list>
- read data as input: to filter out k-mers of low abundance, either use
-q 2
(or higher thresholds) to specify a global threshold for all input files, or use the kmtricks file-of-files format to specify (individual) thresholds. - mix of assemblies and read data as input: use the kmtricks file-of-files format to specify individual thresholds.
- coding sequences as input: add
-a
if input is provided as translated sequences, or add-c
if translation is required. See usage information (SANS --help
) for further details.
Output
- The output file, specified by
-o <split-file>
, lists all splits line by line, sorted by their weight in a tab-separated format where the first column is the split weight and the remaining entries are the identifiers of genomes split from the others. - For large data sets, the list of splits can become very long. We recommend to restrict the output for n genomes as input to the 10n strongest splits in the output using
-t 10n
. - We recommend to filter the splits using
-f <filter>
. Then, the sorted list of splits is greedily filtered, i.e., splits are iterated from strongest to weakest and a split is kept if and only if the filter criterion is met.
If you want a tree, use -f strict
. In this case, -N <newick-file>
can be used to write the resulting tree into a newick file; instead or additionally to -o <split-file>
.
If you want a network, use one of the following filters:
weakly
: a split is kept if it is weakly compatible to all previously filtered splits (see publication for definition of "weak compatibility").2-tree
: two sets of compatible splits (=trees) are maintained. A split is added to the first if possible (compatible); if not to the second if possible.3-tree
: three sets of compatible splits (=trees) are maintained. A split is added to the first if possible (compatible); if not to the second if possible; if not to the third if possible.
To visualize the splits, we recommend the tool SplitsTree (version 4). Make it accessible via your PATH variabble to enable the following options.
To generate SANS output readable for SplitsTree, use option -X <nexus-file>
. In SplitsTree, after opening the <nexus-file>
, select "Draw" > "EqualAngle" > "Apply". To produce a PDF using SplitsTree, use option -p <pdf-file>
.
To depict the phylogeny on a higher level, taxa can be assigned to groups. Each group is then represented by a color and individual text labels of taxa are replaced by colored circles accordingly. Use option -l <groups.tsv>
to provide a mapping of some or all genome identifiers to arbitrary group names. Optionally, provide a second file <colors.tsv>
to provide a mapping of group names to custom RGB values, e.g. 255 0 0
for red.
Further parameters
- To observe the progress of SANS during computation, use
-v
to switch to verbose mode. - You may want to try different values for the k-mer length using
-k <integer>
. On shorter sequences, e.g. virus data, use a smaller k, e.g.,-k 11
. - If your input contains 'N's or other ambiguous IUPAC characters, affected k-mers are skipped by default. Option
-x <small_integer>
can be used to replace these with the corresponding DNA or AA bases, considering all possibilities. - By default, all available threads are used for parallel processing. The number of threads can be limited by
-T <integer>
.
Bootstrapping
To assess the robustness of reconstructed splits with respect to noise in the input data, bootstrap replicates can be constructed by randomly varying the observed k-mer content. To compare the originally determined splits to, e.g., 1000 bootstrap replicates, use -b 1000
. An additional output file <split-file>.bootstrap
containing the bootstrap support values will be created. To include them in the nexus file for visualization, use scripts/sans2conf_nexus.py <split-file> <split-file>.bootstrap <list> > <nexus-file>
.
To generate a consensus tree from bootstrapped trees, use -f tree -b 1000 -C
.
To generate a consensus network from bootstrapped trees, use -f tree -b 1000 -C weakly
.
To filter out low support splits, add a threshold using -b 1000 0.75
.
See usage information (SANS --help
) for further options.
-
Split network from assemblies
SANS -i list.txt -o sans.splits - X sans.nexus -t 10n -f weakly
Tree in newick format from assemblies
SANS -i list.txt -N sans.new -f strict
Split network (and tree) from read data
SANS -i list.txt -o sans.splits -X sans.nexus -t 10n -f weakly -q 2 (SANS -i list.txt -s sans.splits -f strict -N sans.new)
-
Drosophila example data
# go to example directory cd example_data/drosophila # download data: whole genome (or coding sequences) ./download_WG.sh (./download_CDS.sh) # compute splits ../../SANS -i WG_list.kmt -o WG_weakly.splits -f weakly -v (../../SANS -i CDS_list.kmt -o CDS_weakly.splits -f weakly -v -c) # generate PDF (if SplitsTree installed) ../../SANS -i WG_list.kmt -s WG_weakly.splits -p WG_weakly.pdf (../../SANS -i CDS_list.kmt -s CDS_weakly.splits -p CDS_weakly.pdf) # generate labeled PDF (if SplitsTree installed) ../../SANS -i WG_list.kmt -s WG_weakly.splits -l groups.tsv -p WG_weakly_groups.pdf (../../SANS -i CDS_list.kmt -s CDS_weakly.splits -l groups.tsv -p CDS_weakly_groups.pdf) # filter for tree ../../SANS -i WG_list.kmt -s WG_weakly.splits -N WG.new -f strict (../../SANS -i CDS_list.kmt -s CDS_weakly.splits -N CDS.new -f strict) # generate consensus network from bootstrapped trees ../../SANS -i WG_list.kmt -f strict -b 1000 -C weakly -p WG_weakly.pdf -v (../../SANS -i CDS_list.kmt -f strict -b 1000 -C weakly -p CDS_weakly.pdf -v)
-
Virus example data
# go to example directory cd example_data/prasinoviruses # download data ./download.sh # compute splits ../../SANS -i list.txt -o weakly.splits -f weakly -k 11 -v # compute splits and generate PDF (if SplitsTree installed) ../../SANS -i list.txt -p weakly.pdf -f weakly -k 11 -v
SANS-serif can predict phylogenies based on amino acid sequences as input. Processing coding sequences is faster than processing whole genome data. Experiments show that the reconstruction quality is comparable.
If you want to use selected marker genes, the number of genes should be as high as possible to provide sufficient sequence information for extracting phylogenetic signals.
If the genomes at hand are not annotated, you can use a tool to predict open reading frames. The following experiment shows that the reconstruction quality does not suffer from such a simple pre-processing or even improves, while saving total running time, especially because genomes can easily be pre-processed in parallel.
The following tools have been used with the parameters shown in the table.
Tool | Parameters | Reference |
---|---|---|
SANS | -k (see below) -m geom2 -t 10n -filter strict [-a (if preprocessed)] | |
Getorf (EMBOSS) | -find (see below) -t 11 | Gary Williams, 2000 |
ORFfinder | -n true -g 11 | NCBI |
Prodigal | -q | V2.6.3,Doug Hyatt, 2016 |
For estimating the reconstruction accuracy, the (weighted) F1 score has been determined as follows.
Measure | |
---|---|
F1-score | harmonic mean of precision and recall |
precision | (number of called splits that are also in reference) / (total number of called splits) |
recall | (number of reference splits that are also called) / (total number of reference splits) |
weighted precision | (total weight of called splits that are also in reference) / (total weight of all called splits) |
weighted recall | (total weight of reference splits that are also called) / (total weight of all reference splits) |
Further information on the datasets can be found in the initial publication of SANS (Wittler, 2019), see above.
- Salmonella enterica Para C, 220 genomes, k=31
- Salmonella enterica subspecies enterica, 2964 genomes, k=21
Preprocessing | Para C F1-Score |
Para C weighted F1 |
Enterica F1-Score |
Enterica weighted F1 |
---|---|---|---|---|
none (whole genome) | 0.878 | 0.999 | 0.587 | 0.792 |
Getorf (-find 0) | 0.881 | 0.999 | 0.624 | 0.807 |
Getorf (-find 1) | 0.868 | 0.999 | 0.620 | 0.799 |
ORFfinder | 0.858 | 0.997 | 0.594 | 0.766 |
Prodigal | 0.853 | 0.998 | 0.587 | 0.792 |
For clustering of highly similar sequences, a tree can be constructed which is then chopped into many small subtrees such that the taxa in each subtree correspond to one cluster. This procedure has been successfully applied for dereplication of metagenome assembled genomes (MAGs). Here, the input are MAGs, and the goal is to cluster these such that the MAGs in each cluster belong to the same strain.
The general procedure is:
# reconstruct a tree
SANS --input <list_of_files> --newick <tree_to_cluster.new> --filter strict --kmer 15 (--verbose) --window W --top T (see below)
# determine clusters from tree
scripts/newick2clusters.py <tree_to_cluster.new> > <clusters.tsv>
Due to the usually very high number of input sequences, we recommend the usage of parameters --window
(-w
) and --top
(-t
) in order to save time and memory. (The experimental parameter --window
is not mentioned in the usage, because it can lower the accuracy of reconstructed phylogenies considerably. But in this case, the reconstructed tree does not need to be an accurate phylogeny and the parameter has only reasonable effect on the clustering.)
Setting | Parameters |
---|---|
quick | --window 25 --top 50n |
thoroughly | --window 10 --top 100n |
The tree is chopped into clusters as follows:
- Re-root tree to maximum degree node
- In post order traversal:
- ignore non-branching node (merge edges)
- get clusters from sub-trees (recursively)
- if edge longer than parent edge:
- remove found clusters from current leaf set
- remaining leaf set =: new cluster
This dereplication approach has been evaluated on a data set from the CAMI challenge [Meyer et al. Critical Assessment of Metagenome Interpretation - the second round of challenges, bioRxiv, 2021, doi: https://doi.org/10.1101/2021.07.12.451567], a simulated mouse gut metagenome representing 64 metagenome samples. Alexander Sczyrba and Peter Belmann filtered the MAGs, provided them for our clustering and compared the clustering to the gold standard.
Filter | # MAGs | # Strains |
---|---|---|
MIMAG medium | 5,786 | 686 |
MIMAG high | 2,510 | 349 |
no filter | 11,602 | 791 |
For the comparison, each called cluster is mapped to a gold standard cluster with maximum intersection, i.e., maximum agreement of contained MAGs. Then, the purity and completeness of the clusters are determined by investigating the number of correct (TP), false (FP) and missing (FN) MAGs in each cluster:
purity := TP / (TP + FP)
completeness := TP / (TP + FN)
These per-cluster measures were then averaged (weighted and unweighted). The following table shows the results for the different input and clustering settings. (The running times are out-dated!)
Input | Setting | Running time | Memory | average purity (weighted) |
average completeness (weighted) |
---|---|---|---|---|---|
MIMAG medium | quick | 11h | 56G | 0.973 (0.956) |
0.884 (0.998) |
thoroughly | 50h | 130G | 0.972 (0.952) |
0.881 (0.998) |
|
MIMAG high | quick | 2h | 16G | 0.983 (0.978) |
0.890 (0.991) |
thoroughly | 6h | 36G | 0.983 (0.979) |
0.912 (0.993) |
|
no filter | quick | 59h | 127G | 0.996 (0.983) |
0.173 (0.668) |
thoroughly | 185h | 290G | 0.995 (0.979) |
0.190 (0.700) |
For any question, feedback, or problem, please feel free to file an issue on this Git repository or write an email and we will get back to you as soon as possible.
pangenomics-service@cebitec.uni-bielefeld.de
SANS is provided as a service of the German Network for Bioinformatics Infrastructure (de.NBI). We would appriciate if you would participate in the evaluation of SANS by completing this very short survey.
- The sparse-map library is licensed under the MIT license.
- The Bifrost library is licensed under the BSD-2 license.
- SANS uses gzstream, licensed under the LGPL license.
- SANS is licensed under the GNU general public license.
We use the open source software Matomo for web analysis in order to collect anonymized usage statistics for this repository. Please refer to our Privacy Notice for details.