
III. Workflows overview

Benjamin Linard edited this page May 1, 2020 · 3 revisions

PEWO compiles a set of workflows dedicated to the evaluation of phylogenetic placement algorithms and their software implementation. It focuses on reporting placement accuracy under different conditions and associated computational costs.

Workflows are used in different evaluation procedures. Some procedures were first described in placement manuscripts, but without a public implementation. Others were developed in the context of PEWO.

Pruning-based accuracy evaluation (PAC)

This standard procedure was introduced with EPA (Berger et al., 2010) and reused in the original PPlacer and RAPPAS manuscripts (Matsen et al., 2011; Linard et al., 2019).

It aims to measure 2 topological metrics:

  • Node Distance (ND) : evaluates placement accuracy by counting, for each query sequence, the number of reference tree nodes separating the branch of the expected placement from the branch of the observed placement.
  • expected Node Distance (eND) : an improved version of ND that takes placement weights into account (e.g. Likelihood Weight Ratios, found in jplace files).
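The two metrics can be illustrated with a minimal Python sketch. It is not PEWO's implementation: the tree encoding (a list of node pairs), the function names and the toy tree are assumptions made for illustration, and eND is taken here as the LWR-weighted sum of the ND of each candidate branch.

```python
from collections import deque


def node_distance(edges, expected, observed):
    """Count the nodes crossed when walking from one branch to another.

    `edges` is a list of (node, node) pairs describing the reference tree;
    a branch is identified by the pair of nodes it connects.
    """
    expected, observed = frozenset(expected), frozenset(observed)
    # Index the branches touching each node.
    incident = {}
    for edge in map(frozenset, edges):
        for node in edge:
            incident.setdefault(node, set()).add(edge)
    # Breadth-first search over branches: stepping to a neighbouring
    # branch crosses exactly one node.
    distance = {expected: 0}
    queue = deque([expected])
    while queue:
        current = queue.popleft()
        if current == observed:
            return distance[current]
        for node in current:
            for neighbour in incident.get(node, ()):
                if neighbour not in distance:
                    distance[neighbour] = distance[current] + 1
                    queue.append(neighbour)
    raise ValueError("observed branch not reachable from expected branch")


def expected_node_distance(edges, expected, placements):
    """Weight the ND of each candidate branch by its LWR (assumed definition)."""
    return sum(lwr * node_distance(edges, expected, branch)
               for branch, lwr in placements)


# Hypothetical toy tree: leaves A and B attached to X, leaves C and D to Y.
tree = [("A", "X"), ("B", "X"), ("X", "Y"), ("C", "Y"), ("D", "Y")]
print(node_distance(tree, ("X", "Y"), ("A", "X")))  # 1
print(node_distance(tree, ("A", "X"), ("C", "Y")))  # 2
```

With this encoding, a placement on the expected branch itself gives ND = 0, so a query whose weight is concentrated on the expected branch gets an eND close to 0.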

Briefly, in each pruning, the reference tree is pruned by randomly selecting an internal node and removing the corresponding subtree. The pruned leaves are used as query sequences and are expected to be placed on the branch where the pruned subtree was originally attached. For each query, this branch is the expected placement, and the branch actually reported by the software is the observed placement. ND and eND are computed from these 2 branches.
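A single pruning step can be sketched as follows. This is only an illustration under stated assumptions, not the PEWO code: the tree is a hypothetical dictionary mapping each internal node to its children, and the names `prune_once` and `toy_tree` are invented for the example.

```python
import random


def subtree_nodes(children, node):
    """Return every node of the subtree rooted at `node` (inclusive)."""
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        found.append(current)
        stack.extend(children.get(current, []))
    return found


def prune_once(children, root, rng):
    """Remove one randomly chosen subtree from a rooted tree.

    `children` maps each internal node to the list of its children.
    Returns the selected node, the pruned leaves (the future queries)
    and the pruned tree.
    """
    candidates = [n for n, kids in children.items() if kids and n != root]
    selected = rng.choice(candidates)
    removed = set(subtree_nodes(children, selected))
    queries = sorted(n for n in removed if not children.get(n))
    pruned = {
        node: [k for k in kids if k not in removed]
        for node, kids in children.items()
        if node not in removed
    }
    # Note: the parent of `selected` is left with one child fewer; a real
    # implementation would collapse it so that the branches around it
    # become one, and that merged branch is the expected placement.
    return selected, queries, pruned


# Hypothetical 4-leaf reference tree.
toy_tree = {"root": ["n1", "n2"], "n1": ["A", "B"], "n2": ["C", "D"]}
selected, queries, pruned = prune_once(toy_tree, "root", random.Random(0))
print(selected, queries)
```

Passing a seeded `random.Random` instance keeps the prunings reproducible between runs, which matters when results for different software are compared on the same pruned trees.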

Below is a detailed example of the Snakemake workflow that is generated in tutorials, example 2.

The workflow associated with this procedure can be visualised with Snakemake (using the --dag option). To generate the following graph, the analysis detailed in example 2 was restricted to 2 prunings and default software parameters:

Workflow operations can be grouped into 4 categories:

  • Prunings : The prunings themselves.
  • Placements : Various operations required to place the pruned leaves on the pruned trees, including pre-placement requirements (query to reference alignments, phylo-kmer database build...). Subgraph structures related to the tested software are highlighted in red. For instance, alignment-based methods share the alignment-related operations. Each block of placement operations is repeated for each pruning (in blue).
  • Metrics : Metric computations based on the jplace outputs of the different software for the different tested parameters. Produces tsv output tables (see tutorials).
  • Plots : Dynamic plot generation. Produces svg output images (see tutorials).

Likelihood-based accuracy evaluation (LAC)

This procedure was developed for PEWO and offers a rapid evaluation of phylogenetic placements. It is mainly designed for developers, as a faster alternative to the heavier PAC procedure when evaluating small changes in code or algorithms.

Briefly, a set of query reads is placed. For each placement (each query), the resulting jplace is read and a new tree combining the reference tree and the query is built and reoptimized under the same model of evolution. The likelihood gain or loss is then reported.
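The first step, reading the jplace output, can be sketched in Python as below. The file content is a made-up example following the usual jplace layout (`fields`, `placements`, `p`, `n`/`nm`). The helper name `best_placements` is hypothetical, and the tree rebuilding and reoptimization steps are not shown.

```python
import json

# Hypothetical jplace content, embedded as a string to keep the example
# self-contained; real files are produced by the placement software.
JPLACE_TEXT = """{
  "version": 3,
  "tree": "((A:0.1{0},B:0.2{1}):0.05{2},C:0.3{3});",
  "fields": ["edge_num", "likelihood", "like_weight_ratio",
             "distal_length", "pendant_length"],
  "placements": [
    {"n": ["query_1"],
     "p": [[2, -1234.5, 0.8, 0.01, 0.02],
           [0, -1236.1, 0.2, 0.03, 0.02]]}
  ],
  "metadata": {"invocation": "hypothetical example"}
}"""


def best_placements(jplace):
    """Map each query name to its placement with the highest LWR."""
    fields = jplace["fields"]
    best = {}
    for entry in jplace["placements"]:
        # Each row of "p" follows the column order given in "fields".
        rows = [dict(zip(fields, row)) for row in entry["p"]]
        top = max(rows, key=lambda row: row["like_weight_ratio"])
        # Query names are stored under "n", or under "nm" with multiplicities.
        names = entry.get("n") or [name for name, _ in entry.get("nm", [])]
        for name in names:
            best[name] = top
    return best


placements = best_placements(json.loads(JPLACE_TEXT))
print(placements["query_1"]["edge_num"])  # 2
```

The `edge_num` of the retained placement identifies the branch of the reference tree (the `{2}` label in the Newick string) on which the query would be grafted before reoptimization.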

Below is a detailed example of the workflow that is generated in tutorials, example 6. To display a smaller graph (using the --dag option of Snakemake), the analysis is restricted to a single query.

Workflow operations can be grouped into 3 categories:

  • Placements : Various operations required to place the pruned leaves on the pruned trees, including pre-placement requirements (query to reference alignments, phylo-kmer database build...). Subgraph structures related to the tested software are highlighted in red. For instance, alignment-based methods share the alignment-related operations. Each block of placement operations is repeated for each pruning (in blue).
  • Likelihood computations : Tree aggregations and reoptimisations. Produces tsv output tables (see tutorials).
  • Plots : Dynamic plot generation. Produces svg output images (see tutorials).

Resources evaluation (RES)

CPU time and peak RAM consumption are measured for the operations that are compulsory for phylogenetic placement, which include alignment in alignment-based methods and ancestral state reconstruction + database build in alignment-free methods. This procedure mainly aims to evaluate the scalability of the methods, as one-off analyses and routine placement of large sequence volumes do not impose the same constraints. Resource consumption is measured via the benchmark tools integrated in Snakemake.
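Snakemake writes its benchmark measurements to tab-separated files, which can be summarised with a few lines of Python. The sketch below is an illustration only: the numeric values are made up, the column names (`s`, `max_rss`, ...) are those commonly written by Snakemake and may vary between versions, and `summarise` is a hypothetical helper rather than part of PEWO.

```python
import csv
import io

# Made-up benchmark content with two repeats of the same operation.
BENCHMARK_TSV = (
    "s\th:m:s\tmax_rss\tmax_vms\tmax_uss\tmax_pss\tio_in\tio_out\tmean_load\tcpu_time\n"
    "12.5\t0:00:12\t350.2\t900.1\t340.0\t341.5\t0.0\t1.2\t98.0\t12.3\n"
    "13.1\t0:00:13\t352.8\t901.0\t342.1\t343.0\t0.0\t1.2\t97.5\t12.8\n"
)


def summarise(tsv_text):
    """Return the mean wall-clock time (s) and the peak RSS (MB) over repeats."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    seconds = [float(row["s"]) for row in rows]
    peak_rss = [float(row["max_rss"]) for row in rows]
    return {
        "mean_seconds": sum(seconds) / len(seconds),
        "peak_rss_mb": max(peak_rss),
    }


print(summarise(BENCHMARK_TSV))
```

Taking the maximum of `max_rss` across repeats gives a conservative estimate of the memory a machine needs to run the operation.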

Below is a detailed example of the workflow that is generated in tutorials, example 5. To display a smaller graph (using the --dag option of Snakemake), the analysis is restricted to only 2 parameter combinations for each tested software.

Workflow operations can be grouped into 2 categories:

  • Placements : Various operations required to place the pruned leaves on the pruned trees, including pre-placement requirements (query to reference alignments, phylo-kmer database build...). Subgraph structures related to the tested software are highlighted in red. For instance, alignment-based methods share the alignment-related operations. Each block of placement operations is repeated for each pruning (in blue).
  • Plots & metrics : Produces tsv output tables using the Snakemake benchmark tools and dynamically generates plots from the results (see tutorials).

Plots focus on the successive operations required for placement, which differ between alignment-based and alignment-free methods, and on how this difference affects the resources needed to analyse a few samples or many samples on a daily basis. More precisely, the plotted times for 1000 samples are computed as follows :

  • RAPPAS (alignment-free) : 1 ancestral reconstruction (ansrec) + 1 database build (rappas-dbbuild) + 1000 placements (rappas-placement)
  • alignment-based methods : 1000 alignments (hmmalign) + 1000 placements (epa, epang, pplacer or apples)
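The two formulas above translate directly into code. In this minimal sketch the function names are hypothetical and the timings in the usage lines are made-up values in seconds, not measurements.

```python
def rappas_time(n_samples, ansrec, dbbuild, placement):
    """Alignment-free: one-off preprocessing, then one placement per sample."""
    return ansrec + dbbuild + n_samples * placement


def alignment_based_time(n_samples, alignment, placement):
    """Alignment-based: one alignment and one placement per sample."""
    return n_samples * (alignment + placement)


# Made-up per-operation timings (seconds), for illustration only.
print(rappas_time(1000, ansrec=600, dbbuild=300, placement=2))  # 2900
print(alignment_based_time(1000, alignment=5, placement=3))     # 8000
```

The one-off cost of the alignment-free approach is spread over all samples, so its weight in the total shrinks as the number of samples grows, whereas the alignment cost is paid again for every sample.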