Things to consider

Mike Lee edited this page Jul 28, 2022 · 24 revisions

An important caveat on the idea of a “workflow for phylogenomics”

Phylogenetics is an incredibly complicated and well-researched field, and things become even more complicated when working with many concatenated genes, as is the case with phylogenomics. GToTree is meant to be a relatively high-throughput, user-friendly, and reproducible workflow, something I believe is useful given the high volumes of sequencing data and genomes we often work with these days. But anything designed this way inherently sacrifices some flexibility, options, and precision. It is important that users new to this arena understand that many things affect the outcome of a phylogenetic/genomic analysis, particularly the alignment algorithm used and the model and program used for tree construction. Currently, GToTree employs only one alignment tool and offers two options for tree construction. Users can also take the concatenated alignment output by GToTree (and the partitions file if they'd like) and use it with many other tree-construction tools. But please keep in mind that phylogenetic analysis is complicated, and no one program or tool is an "absolute answer" or the "truth". Another way to think of this is with the old adage "all models are wrong".


When to use GToTree and when not?

GToTree is very useful in many situations, but not all. For example, it is useful if you want to make a large-scale Tree of Life spanning all 3 domains and including a lot of genomes (as demonstrated here). It is also useful if you want to infer evolutionary relationships between some newly recovered genomes and references on a smaller scale (like the Alteromonas example here). But even if you use a specific marker-gene set to make a tree of all the organisms of interest (like the Gammaproteobacteria set in the Alteromonas example), the tree is only useful at the level of resolution those marker genes provide. Often that may be enough for your purposes, but sometimes you might need or want to go deeper. In cases like this, you may want to use GToTree to figure out where your new genomes fit in with, say, 500 reference genomes, and then use that tree to identify which reference genomes you actually want to include in a pangenomic analysis with your new genomes.

Is GToTree useful for assigning taxonomy?

No, GToTree is not a tool for taxonomic assignment. For assigning taxonomy to genomes, there are dedicated programs that work very well, like the Genome Taxonomy Database Toolkit (GTDB-Tk). I would typically assign taxonomy to my new genomes with GTDB-Tk, and then use that information to figure out which reference genomes I'd want to include in a de novo phylogenomic tree I'd build with GToTree. You can, however, make a de novo tree, look at where your new genomes fall on that tree, and see which references they are more closely related to based on that tree. For example, with the Alteromonas example here, we start there "knowing" the new genome is an Alteromonas, and we build a de novo tree with all RefSeq reference Alteromonas genomes and our new one. Ahead of where that example starts, I may have figured out that my new genome was an Alteromonas by using GTDB-Tk.

Should GToTree be used for estimating genome/MAG/bin quality?

No. While GToTree does calculate some rough idea of estimated completion/redundancy based on the SCG-set being used, this is a very rudimentary view, and just provided because it's basically free information based on what we're looking for. For estimating genome/MAG/bin quality, I would recommend using CheckM or, better yet, CheckM2 👍


Consider using "representative" genomes

It is not often that we want to cover a large breadth of diversity while also needing multiple very closely related genomes in our tree. If we are trying to cover a large breadth of diversity, it can be very helpful to use "representative" reference genomes – a slimmed set of manually and computationally derived genomes that are designed to capture the breadth of microbial diversity by choosing representatives in genomic lineage space. I regularly make use of both NCBI's representative genomes and GTDB's species representatives.

For example, if we were to try to pull reference genomes to build a tree of the Staphylococcaceae family, at the time of putting this together this would be 15,871 genomes in NCBI's RefSeq:

# pulling accessions from NCBI
esearch -query '"Staphylococcaceae"[ORGN] AND "latest refseq"[filter]' -db assembly | esummary | \
        xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession > staphylococcaceae-refseq-accs.txt

wc -l staphylococcaceae-refseq-accs.txt
      # 15871 staphylococcaceae-refseq-accs.txt

This is generally far too many to practically build a phylogenomic tree from. But if we limit the search to NCBI representative genomes only, it cuts them down to 83:

esearch -query '"Staphylococcaceae"[ORGN] AND "latest refseq"[filter] AND "representative genome"[filter]' -db assembly | \
        esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession \
        > staphylococcaceae-refseq-representative-accs.txt

wc -l staphylococcaceae-refseq-representative-accs.txt
      # 83 staphylococcaceae-refseq-representative-accs.txt

This is much more manageable, while still covering the breadth of diversity within the Staphylococcaceae family, just with fewer closely related genomes.

Looking for the same at GTDB, we can see there are a total of 12,904 genomes present within their Staphylococcaceae collection:

gtt-get-accessions-from-GTDB -t "Staphylococcaceae" --get-taxon-counts

#  Reading in the GTDB info table...
#    Using GTDB v95: Released July 17, 2020

#    The rank 'family' has 12904 Staphylococcaceae entries.

Of these, only 80 are designated as representatives of GTDB's species clusters:

gtt-get-accessions-from-GTDB -t "Staphylococcaceae" --get-taxon-counts --GTDB-representatives-only

#  Reading in the GTDB info table...
#    Using GTDB v95: Released July 17, 2020

#    The rank 'family' has 12904 Staphylococcaceae entries.

#  In considering only GTDB representative genomes:

#    The rank 'family' has 80 Staphylococcaceae representative genome entries.

When working on a broad scale, using representative genomes like this can often allow us to cover the diversity we want without getting bogged down with an intractable number of very closely related genomes 🙂

Subsetting the GTDB further

If the above isn't enough, we can also consider using the GToTree helper program gtt-subset-GTDB-accessions on the output table that was produced with gtt-get-accessions-from-GTDB. I often use it to get one representative genome of each Order when making a tree spanning an entire Domain, e.g.:

# searching GTDB for all representative archaea
gtt-get-accessions-from-GTDB -t Archaea --GTDB-representatives-only

# randomly choosing 1 representative per Order
gtt-subset-GTDB-accessions -i GTDB-Archaea-domain-GTDB-rep-metadata.tsv --get-Order-representatives-only

The above drastically cuts the number of reference genomes to include. In my experience, one per Order is generally fine for showing where our new MAGs fit into a Domain-level tree.
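
To make the idea of "one randomly chosen representative per Order" concrete, here is a minimal sketch. It is not how gtt-subset-GTDB-accessions is implemented. The function name `pick_one_per_order` is made up, and the two-column input (accession, then order) is a simplified, hypothetical layout rather than the real GTDB metadata table.

```shell
# Hypothetical helper, NOT part of GToTree.
# stdin:  simplified table, "accession<TAB>order" (made-up layout for illustration)
# stdout: one randomly chosen row per order
pick_one_per_order() {
    awk -F'\t' 'BEGIN { srand() } { printf "%.10f\t%s\n", rand(), $0 }' |
        LC_ALL=C sort -k1,1n |       # shuffle rows by their random key
        cut -f2- |                   # drop the random key again
        awk -F'\t' '!seen[$2]++'     # keep the first row seen for each order
}

# toy input with made-up accessions and order names
printf 'acc_1\tOrder_X\nacc_2\tOrder_X\nacc_3\tOrder_Y\nacc_4\tOrder_Y\nacc_5\tOrder_Z\n' |
    pick_one_per_order
```

Which accession is printed for Order_X and Order_Y can differ between runs, but there is always exactly one row per order.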


Working with many genomes

When working with many thousands of genomes, the time the individual gene alignments take can quickly become prohibitive. First, I'd suggest working with representative genomes as discussed above, if not doing so already. Beyond that, if we are targeting more than 1,000 genomes, GToTree will by default use muscle's super5 algorithm to help speed up the alignments. A message notifies you of this when the run starts. If you'd still like to use muscle's regular PPP algorithm despite having many sequences to align, pass the -X option with no arguments to the GToTree call.
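
If it helps to know ahead of time which side of that threshold a run falls on, a rough check is to count the input genomes. The sketch below assumes all input genomes are listed one per line in a single file and takes "greater than 1,000" literally. The function name `expected_alignment_mode` is made up for illustration, and this is not GToTree's own logic.

```shell
# Hypothetical helper, NOT part of GToTree.
# $1: file listing one input genome (e.g., an accession) per line
# Prints which muscle algorithm the text above says would be used by default.
expected_alignment_mode() {
    # count non-empty lines
    n_genomes="$(awk 'NF { n++ } END { print n + 0 }' "$1")"
    if [ "$n_genomes" -gt 1000 ]; then
        echo "super5"   # default above 1,000 genomes; -X would switch back to PPP
    else
        echo "PPP"
    fi
}
```

For example, `expected_alignment_mode staphylococcaceae-refseq-representative-accs.txt` would print `PPP` for the 83 representative accessions pulled earlier.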


Filtering hits by gene-length

The default setting for this value (set with -c) is 0.2. This means if the median length of all genes selected as best hits to marker-gene A is 100, genes that were hits to marker-gene A that are greater in length than 120 or shorter in length than 80 will be removed from the analysis. This seems to work well in my experience, but only when there are enough genes in the gene set to give somewhat of a representative distribution of the lengths of genes that exist within that target gene set. Meaning, at the extreme end, if we only had 3 genes to consider, and their lengths were 100, 100, and 121, the 121-length gene would be filtered out, but maybe it shouldn't be. If running GToTree with very few genomes, you might consider increasing this threshold and/or visually inspecting some of the alignments.
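
The arithmetic above can be sketched in a few lines of shell and awk. This is only an illustration of the median-based cutoff as described in the text, not GToTree's actual code, and `filter_by_length` is a made-up name.

```shell
# Hypothetical helper, NOT part of GToTree.
# stdin:  one gene length per line (all hits to a single target gene)
# $1:     allowed fractional deviation from the median (default 0.2)
# stdout: "length<TAB>kept" or "length<TAB>removed"
filter_by_length() {
    cutoff="${1:-0.2}"
    sort -n | awk -v c="$cutoff" '
        { len[NR] = $1 }
        END {
            if (NR == 0) exit
            # median of the sorted lengths
            if (NR % 2) med = len[(NR + 1) / 2]; else med = (len[NR / 2] + len[NR / 2 + 1]) / 2
            lo = med * (1 - c)
            hi = med * (1 + c)
            for (i = 1; i <= NR; i++) {
                verdict = (len[i] < lo || len[i] > hi) ? "removed" : "kept"
                print len[i] "\t" verdict
            }
        }'
}

# the 3-gene example from the text
printf '100\n100\n121\n' | filter_by_length 0.2
```

With the default of 0.2, the 121-length gene is marked as removed. Rerunning with `filter_by_length 0.25` widens the upper bound to 125 and keeps it, which is the kind of adjustment to consider when only a few genomes are involved.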


Filtering genomes by fraction of hits to targets

The default setting for this value (set with -G) is 0.5, meaning that if you are searching for 100 genes, genomes with hits to fewer than 50 will be dropped from the analysis. This seems reasonable to me when creating a tree that spans a lot of diversity, like all 3 domains, but you may want to increase this threshold when working with a more closely related set of organisms.
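
As a rough illustration of that rule, the sketch below flags genomes from a small table of hit counts. The input layout (genome name, then number of target genes hit) and the name `flag_genomes_by_hit_fraction` are assumptions made for this example. They do not correspond to a real GToTree output file or command.

```shell
# Hypothetical helper, NOT part of GToTree.
# stdin:  "genome<TAB>number_of_target_genes_hit" (made-up layout)
# $1:     total number of target genes searched for
# $2:     minimum fraction of targets a genome needs (default 0.5)
# stdout: "genome<TAB>fraction<TAB>kept|dropped"
flag_genomes_by_hit_fraction() {
    n_targets="$1"
    min_fraction="${2:-0.5}"
    awk -F'\t' -v n="$n_targets" -v g="$min_fraction" '
        {
            frac = $2 / n
            verdict = (frac < g) ? "dropped" : "kept"
            printf "%s\t%.2f\t%s\n", $1, frac, verdict
        }'
}

# 100 target genes: a genome with 50 hits is kept, one with 49 is dropped
printf 'genome_A\t97\ngenome_B\t50\ngenome_C\t49\n' | flag_genomes_by_hit_fraction 100 0.5
```

Raising the second argument (for instance to 0.9) mimics the stricter setting suggested for closely related organisms.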


Best-hit mode

By default, if a given genome has more than one hit to a specific HMM profile (target gene), GToTree won't include a sequence for that target gene from that genome in the final concatenated alignment (it will insert a gap-sequence just as would be the case if that genome had 0 hits to the target). This is a conservative way to go, because if there are multiple copies of a target SCG present within a genome, the copies may not all be under the same evolutionary pressures, and which one we choose may impact the alignment and tree in ways we do not want it to. So I figure in general, being conservative is better for default settings. But if you'd like, you can specify the -B flag with no arguments to tell GToTree to run in "best-hit" mode. In this case, when a given genome has more than one hit to a specific target gene, GToTree will take the best hit and add it to the alignment.
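
The difference between the two modes can be sketched as follows. This is an illustration only. It assumes a made-up four-column table of hits and assumes that "best" means the highest value in a hypothetical score column, and `choose_hits` is not a GToTree command.

```shell
# Hypothetical helper, NOT part of GToTree.
# stdin:  "genome<TAB>target_gene<TAB>hit_id<TAB>score" (made-up layout)
# $1:     "default" (multi-hit targets become a gap) or "best-hit"
# stdout: "genome<TAB>target_gene<TAB>chosen_hit_id_or_GAP", sorted
choose_hits() {
    mode="${1:-default}"
    awk -F'\t' -v mode="$mode" '
        {
            key = $1 "\t" $2
            n[key]++
            # remember the highest-scoring hit for each genome/target pair
            if (!(key in best) || $4 + 0 > score[key]) {
                best[key] = $3
                score[key] = $4 + 0
            }
        }
        END {
            for (key in n) {
                if (n[key] == 1 || mode == "best-hit") print key "\t" best[key]
                else print key "\tGAP"
            }
        }' | LC_ALL=C sort
}

# genome_A has one hit to rpsC but two hits to rplB (made-up IDs and scores)
hits="$(printf 'genome_A\trpsC\tA_0012\t210.5\ngenome_A\trplB\tA_0100\t180.0\ngenome_A\trplB\tA_0455\t95.2\n')"
printf '%s\n' "$hits" | choose_hits default
printf '%s\n' "$hits" | choose_hits best-hit
```

In the first call rplB is reported as GAP for genome_A. In the second it is reported as A_0100, the higher-scoring of the two hits.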


Are eukaryotic genomes appropriate for use with GToTree?

If only using highly conserved ribosomal proteins (like those in the Tree of Life example using the Hug et al. 2016 SCG-set), and/or if all genes are already identified (e.g., the input source is an NCBI accession with gene calls or a GenBank file with gene calls), then GToTree is suitable for working with Eukaryotes in addition to Bacteria and Archaea. If no gene calls are available, then GToTree is likely not suitable for eukaryotic genomes, as the only gene-caller currently implemented is prodigal, which is designed for prokaryotic genomes. Also keep in mind that the estimated completion and redundancy metrics are probably less meaningful for Eukaryotes, so I wouldn't pay much attention to them.
