User Guide

Mike Lee edited this page Oct 7, 2024 · 63 revisions

This page serves as the general user guide for GToTree, which may be helpful if you're looking for something specific. To jump right into practical ways GToTree can be helpful, it may be more useful to start with the Example-usage page :)



NOTE: Running GToTree with no arguments will provide the help menu.


Required Inputs

The minimum required inputs to GToTree are specifying the genomes you want to incorporate (provided via any combination of NCBI Accessions, GenBank files, and/or nucleotide or amino-acid fasta files) and specifying which single-copy gene-set to use.

Input Genomes

Input genomes can be specified as any combination of NCBI assembly accessions, GenBank files, and/or fasta files.

NCBI Accessions

You can specify which NCBI-archived genomes you'd like to incorporate by providing a single-column file holding NCBI assembly accessions to the -a argument. This file can be created "manually" by searching NCBI's website and downloading a results table, or it can be generated at the command line by using Entrez-Direct – examples of doing both are presented on the examples page here.

  • Those provided can have version numbers (what comes after the "." in the accession, e.g. GCF_000153765.1), or they can be version-less (e.g. GCF_000153765). In the case where no version is provided, GToTree will automatically take the newest released version of that accession.
  • If any of the provided accessions cannot be found at NCBI, they will be printed to the screen at the start of the run and will be reported in the output directory in the file "NCBI_accessions_not_found.txt".
  • An example input accessions file can be found in the GToTree sub-directory here: GToTree/test_data/ncbi_accessions.txt.
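The difference between versioned and version-less accessions can be checked before a run. The following is a minimal Python sketch, not GToTree's own parsing: the function name `split_accession` is hypothetical, and the pattern assumes the usual NCBI assembly accession layout (GCA_ or GCF_, nine digits, optional version).

```python
import re

# Assumed layout of an NCBI assembly accession: GCA_ or GCF_, nine digits,
# and an optional ".<version>" suffix.
ACCESSION_RE = re.compile(r"^(GC[AF]_\d{9})(?:\.(\d+))?$")

def split_accession(accession):
    """Return (base, version); version is None for a version-less accession."""
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        raise ValueError(f"not an NCBI assembly accession: {accession!r}")
    base, version = match.groups()
    return base, (int(version) if version is not None else None)

print(split_accession("GCF_000153765.1"))  # ('GCF_000153765', 1)
print(split_accession("GCF_000153765"))    # ('GCF_000153765', None)
```

A version of `None` corresponds to the case where GToTree would take the newest released version.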

GenBank files

To specify which GenBank files to include, you need to provide a single-column file that holds the file names (or paths) to each of the GenBank files you'd like incorporated. This is passed to the -g argument.

  • An example file can be found in the GToTree sub-directory here: GToTree/test_data/genbank_files.txt.

Fasta files

Nucleotide fasta files are provided similarly to the GenBank files, but passed to the -f argument. You need to provide a single-column file that holds the file names (or paths) to each of the fasta files you'd like incorporated.

  • An example file can be found in the GToTree sub-directory here: GToTree/test_data/fasta_files.txt.

Amino-acid files

Amino-acid files can be provided similarly to nucleotide fasta files, but passed to the -A argument. You need to provide a single-column file that holds the file names (or paths) to each of the amino-acid fasta files you'd like incorporated.

  • An example file can be found in the GToTree sub-directory here: GToTree/test_data/amino_acid_files.txt.

Specifying which single-copy gene-set to use

GToTree also needs to know which SCG-set to use – passed with the -H flag. There are 14 provided with the program that are stored in the hmm_sets sub-directory (discussed in some more detail here). If you followed the conda quick-start installation instructions, or set up the appropriate environment variable yourself as detailed here, you can view which HMM files are available by running gtt-hmms by itself (and you don't need to specify the full path to the HMM file, just the name as printed by gtt-hmms, e.g. -H Bacterial.hmm).
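Putting the two required inputs together, here is a small Python sketch that writes a single-column accessions file and assembles (without running) a minimal call. The helper names are hypothetical, the scratch directory is only for illustration, and the gene-set name "Bacteria" is taken from the example usage in the help menu; use whichever name gtt-hmms prints on your system.

```python
import shlex
import tempfile
from pathlib import Path

def write_single_column(path, entries):
    """Write one entry per line, the layout used for -a, -g, -f, and -A inputs."""
    Path(path).write_text("".join(f"{entry}\n" for entry in entries))

def build_command(accessions_file, hmm_name):
    """Assemble (without running) a minimal GToTree call as an argument list."""
    return ["GToTree", "-a", str(accessions_file), "-H", hmm_name]

# Write the accessions file into a scratch directory (illustrative location).
workdir = Path(tempfile.mkdtemp())
accessions_file = workdir / "ncbi_accessions.txt"
write_single_column(accessions_file, ["GCF_000153765.1"])

command = build_command(accessions_file.name, "Bacteria")
print(shlex.join(command))  # GToTree -a ncbi_accessions.txt -H Bacteria
```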

Outputs

Each GToTree run creates an output directory to hold all of the output files. This defaults to "GToTree_output", but can be specified with the -o argument, and the names of the files below that include "GToTree_output" would be changed accordingly.

Primary output files

Tree

  • Aligned_SCGs.tre
    • The final tree file in newick format.
    • FastTree reports "local support values" that appear as labels on internal nodes to estimate the reliability of each split in the tree. You can find more information about this at their user page here.
    • IQ-TREE reports ultrafast bootstrap (UFBoot) support values. Their help pages state that a support value of 95% corresponds to roughly a 95% probability that the clade is true.
    • If run with the -N option, no tree will be produced, and only the alignment will be generated.

Alignment files

  • Aligned_SCGs.faa
    • Alignment file in fasta format.
  • Aligned_SCGs_mod_names.faa (if TaxonKit was used to add lineage info to labels – specified with the -t flag)

Partitions file

  • Partitions.txt
    • A partitions file compatible with treeing programs capable of using different models for each gene. See, for example, iqtree's info here.

Genomes summary info

  • Genomes_summary_info.tsv
    • A tab-delimited table of summary information for each genome including the following columns:
| Column | Name | Contents |
|---|---|---|
| 1 | assembly_id | the input assembly ID (either the accession or base file name depending on input source) |
| 2 | label | the label assigned to the genome in the output tree file |
| 3 | label_source | where the label came from |
| 4 | taxid | the NCBI taxid if genome was provided by NCBI accession or GenBank with taxid information |
| 5 | num_SCG_hits | number of gene hits to the target HMMs |
| 6 | uniq_SCG_hits | number of unique gene hits to the target HMMs |
| 7 | perc_comp | estimated percent completion based on the target HMMs |
| 8 | perc_redund | estimated percent redundancy based on the target HMMs |
| 9 | num_SCG_hits_after_len_filt | number of gene hits to the target HMMs following length filtering |
| 10 | in_final_tree | Yes or No, did this genome end up in the final tree |

Depending on whether taxonomy info was added, there may be additional columns with taxonomic info.
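Because the summary table is tab-delimited, it can be read with standard tooling. Here is a minimal Python sketch, assuming the file starts with a header row carrying the column names listed above; the function name and the two example rows are made up for illustration.

```python
import csv
import io

def genomes_not_in_tree(handle):
    """Return the assembly_id values whose in_final_tree column is 'No'."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["assembly_id"] for row in reader if row["in_final_tree"] == "No"]

# Made-up two-genome table using only a subset of the documented columns.
example = io.StringIO(
    "assembly_id\tlabel\tperc_comp\tin_final_tree\n"
    "genome_A\tgenome_A\t98.0\tYes\n"
    "genome_B\tgenome_B\t41.0\tNo\n"
)
print(genomes_not_in_tree(example))  # ['genome_B']
```

For a real run, the same function could be given an open handle to Genomes_summary_info.tsv in the output directory.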

SCG-hit counts per genome

  • All_genomes_SCG_hit_counts.tsv
    • A tab-delimited file where the first column holds each genome ID, and the rest of the columns hold counts for how many hits there were to each target gene for each genome.

Report output files

Report files will only be written if they are needed. For instance if a genome is dropped from analysis due to having too few hits to the target genes, the file "Genomes_removed_for_too_few_hits.tsv" will be created. But if no genomes were removed for this reason, the file will not be generated. So you should not expect to find all of these files after any particular run. These will be included in the output sub-directory run_files/.

Redundant_input_accessions.txt

  • If there were duplicate accessions in the input NCBI accessions file, they will be reported here.

NCBI_accessions_not_found.txt

  • If any of the provided NCBI accessions were not found at NCBI, they will be reported here.

NCBI_accessions_not_downloaded.txt

  • If any NCBI accessions were found at NCBI but neither their genes nor genome could be downloaded, they will be reported here.

Genomes_removed_for_too_few_hits.tsv

  • If any genomes were removed from analysis due to having too few hits to the target genes (set with -G argument), they will be listed here along with how many hits they had.

Genes_with_no_hits_to_any_genomes.txt

  • If any genes didn't have hits in any of the input genomes, they will be reported here.

Genbank_files_with_no_CDSs.txt

  • If any of the provided GenBank files have no annotated genes, GToTree calls their genes with Prodigal and retains the genomes in the analysis. They are also reported here in case this is a red flag for you (for example, if you intended to use only fully annotated GenBank files).

Optional arguments and parameters

The GToTree help menu can be viewed by running GToTree with no arguments.


Output directory

  • [-o <str>] default: GToTree_output

GToTree writes all output files to an output directory. By default this is set to "GToTree_output", but you can specify it by passing an argument to the -o flag. (E.g.: -o Alteromonas_output)


Specify desired genome labels

  • [-m <file>] specify desired genome labels

It is often helpful to have specific labels for specific genomes in a tree (as exemplified in the Alteromonas example). GToTree uses TaxonKit to add NCBI lineage information to any genomes that have it associated with them (whether provided as NCBI accessions or GenBank files) when the -t flag is set, and it uses internal scripts to add GTDB taxonomy when the -D flag is set. We can also swap in our own labels for specific genomes we care about and want to find more easily, or just append certain information to a label. For example, we may want the lineage information added, but also want to mark specific genomes on the tree (say they all possess a certain gene type we are interested in, and we want to be able to quickly search for and highlight them).

Either or both of these can be done by providing a mapping file to the -m argument. It should be a 2- or 3-column tab-delimited file that has the initial genome ID in the first column (this will be either the NCBI accession or the file name, depending on how the input genome was provided). The second column may or may not be empty. If we want to specify the complete label ourselves for that genome, then we should put that new label in column 2. If we don't want to specify the complete label, leave column 2 empty. Column 3 may or may not be empty. If we'd like to append something to the label (whether that's the initial label, the modified lineage label, or the label we may have specified in column 2), then add that text to column 3. If there is nothing we want to append, we should leave column 3 empty.

NOTE: Not all input genomes need to be provided in the file being passed to -m.
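As an illustration of the mapping-file layout described above, here is a short Python sketch that writes a three-column tab-delimited file. The function name, labels, tag, and fasta file name are all hypothetical; only the column layout comes from the description.

```python
import csv
import io

def write_mapping(handle, rows):
    """Write (genome_id, new_label, appended_text) rows as tab-delimited lines.

    Use an empty string for a column that should stay empty.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for genome_id, new_label, appended_text in rows:
        writer.writerow([genome_id, new_label, appended_text])

rows = [
    ("GCF_000153765.1", "My_reference_genome", ""),  # replace the whole label
    ("my_genome.fasta", "", "has_gene_X"),           # only append a tag
]
buffer = io.StringIO()
write_mapping(buffer, rows)
print(repr(buffer.getvalue()))
```

Writing to an opened file instead of the in-memory buffer would give a file that could be passed to -m.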


Specify to add NCBI lineage info to genome labels

  • [-t ] default: false

By setting the -t flag, GToTree will: get strain information if it is available for those provided by NCBI accession; get the NCBI taxids for any genomes that possess them (either from the NCBI accessions provided or if they are present in any GenBank files provided) and use TaxonKit to convert them into lineage information; and add this information to the genome labels – making the output tree much more useful than just a collection of odd identifiers. Which specific taxonomic ranks get added can be specified with the -L argument.


Specify to add GTDB lineage info to genome labels

  • [-D ] default: false

By setting the -D flag, GToTree will add GTDB taxonomy information to the labels that appear in the final alignment and in the tree. Which specific taxonomic ranks get added can be specified with the -L argument. If the -D flag and the -t flag are both specified, GTDB taxonomy info will take precedence over NCBI taxonomy info when possible, and if a given accession isn't present in GTDB, the NCBI lineage info will be used (and "_NCBI" will be appended to it).
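The precedence rule when both -D and -t are set can be sketched as below. This is only an illustration of the rule as described, not GToTree's code: the function name is hypothetical, and the lineage strings are placeholders whose exact label layout is an assumption.

```python
def choose_lineage(gtdb_lineage, ncbi_lineage):
    """Pick the lineage text for a label when both -D and -t are set.

    Pass None for a source that has no entry for the accession.
    """
    if gtdb_lineage is not None:
        return gtdb_lineage            # GTDB takes precedence when available
    if ncbi_lineage is not None:
        return ncbi_lineage + "_NCBI"  # fall back to NCBI and mark it
    return None                        # no lineage info from either source

print(choose_lineage("Bacteria_Proteobacteria", "Bacteria_Pseudomonadota"))
print(choose_lineage(None, "Bacteria_Pseudomonadota"))
```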


Specify which taxonomic ranks to add to genome labels

  • [-L <str>] default: Domain,Phylum,Class,Species,Strain

Provide the -t flag with no arguments in order to add lineage info to the genome labels. By default this will add Domain, Phylum, Class, Species, and Strain info, where available. This may be suitable when making a tree across multiple domains, but may be unnecessarily cumbersome when making a tree of just one genus, for instance, as shown in the Alteromonas example. You can specify which ranks you'd like added to the labels with the -L argument as a comma-separated list. For instance, adding all ranks would look like this: -L Domain,Phylum,Class,Order,Family,Genus,Species,Strain.
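A quick way to sanity-check a rank list before passing it to -L is sketched below in Python. The function name is hypothetical, and the check is case-sensitive, which is an assumption about how strict the input needs to be; the rank names and default come from this page.

```python
ALL_RANKS = ["Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species", "Strain"]
DEFAULT_RANKS = "Domain,Phylum,Class,Species,Strain"

def parse_ranks(value=DEFAULT_RANKS):
    """Split a comma-separated rank string and reject unrecognized ranks."""
    ranks = [rank.strip() for rank in value.split(",") if rank.strip()]
    unknown = [rank for rank in ranks if rank not in ALL_RANKS]
    if unknown:
        raise ValueError(f"unrecognized rank(s): {', '.join(unknown)}")
    return ranks

print(parse_ranks())                 # ['Domain', 'Phylum', 'Class', 'Species', 'Strain']
print(parse_ranks("Genus,Species"))  # ['Genus', 'Species']
```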


Filtering gene-hits by length

  • [-c <float>] default: 0.2

When scanning many genomes for many genes, it becomes harder or completely impractical to visually inspect alignments of everything. One way to filter out potentially spurious gene hits is to filter by expected length. The -c parameter uses the median length of each gene-set to calculate upper and lower length thresholds. It takes a float between 0 and 1 specifying the range about the median of sequences to be retained. The default is 0.2. For example, under the default setting, if the median length of a set of sequences is 100 AAs, sequences longer than 120 or shorter than 80 will be filtered out before alignment of that gene set. This becomes less useful when using very few genomes, however (see note here).
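The length filter can be sketched in a few lines of Python. This is an illustration of the arithmetic described above, not GToTree's implementation; the function names are hypothetical, and treating lengths exactly on a threshold as retained is an assumption.

```python
from statistics import median

def length_bounds(lengths, cutoff=0.2):
    """Return (lower, upper) thresholds as median * (1 -/+ cutoff)."""
    med = median(lengths)
    return med * (1 - cutoff), med * (1 + cutoff)

def filter_by_length(lengths, cutoff=0.2):
    """Keep lengths within the thresholds (boundary values kept by assumption)."""
    low, high = length_bounds(lengths, cutoff)
    return [n for n in lengths if low <= n <= high]

lengths = [75, 90, 100, 110, 130]  # made-up lengths with a median of 100
print(filter_by_length(lengths))   # [90, 100, 110]
```

With only a handful of sequences the median itself is poorly estimated, which is one way to see why the filter is less useful with very few genomes.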


Filtering genomes based on hits to target genes

  • [-G <float>] default: 0.5

The -G parameter allows you to filter out genomes that have too few hits to the target genes. It takes a float between 0 and 1 specifying the minimum fraction of hits a genome must have of the SCG-set. The default is 0.5. For example, under the default setting, if there are 100 target genes in the HMM profile, and genome X only has hits to 49 of them, it will be removed from analysis. How you want this set may depend on the breadth of diversity of the tree you are making (see note here).
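The genome filter comes down to one comparison, sketched here in Python. The function name is hypothetical, and treating a genome that sits exactly on the cutoff as retained is an assumption based on the phrase "minimum fraction".

```python
def passes_genome_filter(genes_hit, num_targets, min_fraction=0.5):
    """True if the fraction of target genes with a hit meets the cutoff.

    A fraction exactly equal to min_fraction is kept (assumed behavior).
    """
    return genes_hit / num_targets >= min_fraction

print(passes_genome_filter(49, 100))  # False
print(passes_genome_filter(75, 100))  # True
```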


Number of cpus to use during HMM searches

  • [-n <int>] default: 2

The number of cpus you'd like to use during the HMM searches.


Number of jobs to run in parallel where possible

  • [-j ] default: 1

This determines how many jobs to run in parallel during steps that are parallelizable – such as the processing/searching of each individual genome, the filtering of genes and genomes, and alignment of each individual gene-set.


Best-hit mode

  • [-B ] default: false

Provide the -B flag with no arguments if you'd like to run GToTree in "best-hit" mode. By default, if a target gene has more than one hit in a given genome, GToTree won't include a sequence for that target gene from that genome in the final alignment. With this flag provided, GToTree will take the best hit and incorporate it into the alignment, even if that genome has more than one hit to the target gene. See here for more discussion on this.
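The difference between the default behavior and best-hit mode can be sketched as follows. This is a simplified illustration with hypothetical names and made-up scores; in particular, ranking hits by a single numeric score where higher is better is an assumption about how "best" is decided.

```python
def select_hits(hits_by_gene, best_hit=False):
    """hits_by_gene maps a target gene to a list of (sequence_id, score) hits.

    Returns a dict of gene -> chosen sequence_id for one genome.
    """
    selected = {}
    for gene, hits in hits_by_gene.items():
        if len(hits) == 1:
            selected[gene] = hits[0][0]
        elif len(hits) > 1 and best_hit:
            # Assumption: the hit with the highest score counts as the best.
            selected[gene] = max(hits, key=lambda hit: hit[1])[0]
        # Default mode: genes with more than one hit are left out.
    return selected

hits = {
    "geneA": [("seq1", 250.0)],
    "geneB": [("seq2", 80.5), ("seq3", 120.0)],
    "geneC": [],
}
print(select_hits(hits))                 # {'geneA': 'seq1'}
print(select_hits(hits, best_hit=True))  # {'geneA': 'seq1', 'geneB': 'seq3'}
```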


Keep temporary directory

  • [-d ] default: false

Provide the -d flag with no arguments if you'd like to keep the temporary directory that is used during the run. This is mostly useful for debugging purposes.


Use http instead of ftp

  • [-P ] default: false

Provide the -P flag with no arguments if your system can't utilize ftp, and GToTree will use http to try to download any needed files.


Generate nucleotide alignment and tree

  • [-z ] default: false

Provide the -z flag with no arguments if you'd like to make the alignment and tree based on nucleotide sequences instead of amino-acids (which can provide greater resolution on closely related input genomes). Note this mode can only accept NCBI accessions (passed to -a) and genome fasta files (passed to -f) as input sources (since we can't confidently reverse-translate amino-acid seqs). GToTree still finds the target genes based on amino-acid HMM searches.


If you run GToTree with no arguments you can see the help menu:


                                  GToTree v1.8.8
                         (github.com/AstrobioMike/GToTree)


 ----------------------------------  HELP INFO  ---------------------------------- 

  This program takes input genomes from various sources and ultimately produces
  a phylogenomic tree. You can find detailed usage information at:
                                  github.com/AstrobioMike/GToTree/wiki


 -------------------------------  REQUIRED INPUTS  ------------------------------- 

      1) Input genomes in one or any combination of the following formats:
        - [-a <file>] single-column file of NCBI assembly accessions
        - [-g <file>] single-column file with the paths to each GenBank file
        - [-f <file>] single-column file with the paths to each fasta file
        - [-A <file>] single-column file with the paths to each amino acid file,
                      each file should hold the coding sequences for just one genome

      2)  [-H <file>] location of the uncompressed target SCGs HMM file
                      being used, or just the HMM name if the 'GToTree_HMM_dir' env variable
                      is set to the appropriate location (which is done by conda install), run
                      'gtt-hmms' by itself to view the available gene-sets)


 -------------------------------  OPTIONAL INPUTS  ------------------------------- 


      Output directory specification:

        - [-o <str>] default: GToTree_output
                  Specify the desired output directory.


      User-specified modification of genome labels:

        - [-m <file>] mapping file specifying desired genome labels
                  A two- or three-column tab-delimited file where column 1 holds either
                  the file name or NCBI accession of the genome to name (depending
                  on the input source), column 2 holds the desired new genome label,
                  and column 3 holds something to be appended to either initial or
                  modified labels (e.g. useful for "tagging" genomes in the tree based
                  on some characteristic). Columns 2 or 3 can be empty, and the file does
                  not need to include all input genomes.


      Options for adding taxonomy information:

        - [-t ] add NCBI taxonomy; default: false
                  Provide this flag with no arguments if you'd like to add NCBI taxonomy
                  info to the sequence headers for any genomes with NCBI taxids. This will
                  will largely be effective for input genomes provided as NCBI accessions
                  (provided to the `-a` argument), but any input GenBank files will also
                  be searched for an NCBI taxid. See `-L` argument for specifying desired
                  ranks.

        - [-D ] add GTDB taxonomy; default: false
                  Provide this flag with no arguments if you'd like to add taxonomy from the
                  Genome Taxonomy Database (GTDB; gtdb.ecogenomic.org). This will only be
                  effective for input genomes provided as NCBI accessions (provided to the
                  `-a` argument). This can be used in combination with the `-t` flag, in
                  which case any input accessions not represented in the GTDB will have NCBI
                  taxonomic infomation added (with '_NCBI' appended). See `-L` argument for
                  specifying desired ranks, and see helper script `gtt-get-accessions-from-GTDB`
                  for help getting input accessions based on GTDB taxonomy searches.

        - [-L <str>] specify wanted lineage ranks; default: Domain,Phylum,Class,Species,Strain
                  A comma-separated list of the taxonomic ranks you'd like added to
                  the labels if adding taxonomic information. E.g., all would be
                  "-L Domain,Phylum,Class,Order,Family,Genus,Species,Strain". Note that
                  strain-level information is available through NCBI, but not GTDB.


      Filtering settings:

        - [-c <float>] sequence length cutoff; default: 0.2
                  A float between 0-1 specifying the range about the median of
                  sequences to be retained. For example, if the median length of a
                  set of sequences is 100 AAs, those seqs longer than 120 or shorter
                  than 80 will be filtered out before alignment of that gene set
                  with the default 0.2 setting.

        - [-G <float>] genome hits cutoff; default: 0.5
                  A float between 0-1 specifying the minimum fraction of hits a
                  genome must have of the SCG-set. For example, if there are 100
                  target genes in the HMM profile, and Genome X only has hits to 49
                  of them, it will be removed from analysis with default value 0.5.

        - [-B ] best-hit mode; default: false
                  Provide this flag with no arguments if you'd like to run GToTree
                  in "best-hit" mode. By default, if a SCG has more than one hit
                  in a given genome, GToTree won't include a sequence for that target
                  from that genome in the final alignment. With this flag provided,
                  GToTree will use the best hit. See here for more discussion:
                  github.com/AstrobioMike/GToTree/wiki/things-to-consider


      KO searching:

        - [-K <file>] single-column file of KO targets to search each genome for
                  Table of hit counts, fastas of hit sequences, and files compatible
                  with the iToL web-based tree-viewer will be generated for each
                  target. See visualization of gene presence/absence example at
                  github.com/AstrobioMike/GToTree/wiki/example-usage for example.


      Pfam searching:

        - [-p <file>] single-column file of Pfam targets to search each genome for
                  Table of hit counts, fastas of hit sequences, and files compatible
                  with the iToL web-based tree-viewer will be generated for each
                  target. See visualization of gene presence/absence example at
                  github.com/AstrobioMike/GToTree/wiki/example-usage for example.


      General run settings:

        - [-z ] nucleotide mode; default: false
                  Make alignment and/or tree with nucleotide sequences instead of amino-acid
                  sequences. Note this mode can only accept NCBI accessions (passed to `-a`)
                  and genome fasta files (passed to `-f` as input sources. (GToTree still
                  finds target genes based on amino-acid HMM searches.)

        - [-N ] do not make a tree; default: false
                  No tree. Generate alignment only.

        - [-k ] keep individual target gene alignments; default: false
                  Keep individual alignment files.

        - [-T <str>] tree program to use; default: FastTreeMP if available, FastTree if not
                  Which program to use for tree generation. Currently supported are
                  "FastTree", "FastTreeMP", and "IQ-TREE". As of now, these run with
                  default settings only (and IQ-TREE includes "-m MFP" and "-B 1000"). To
                  run either with more specific options (and there is a lot of room for
                  variation here), you can use the output alignment file from GToTree (and
                  the partitions file if wanted for mixed-model specification) as input into
                  a dedicated treeing program.
                  Note on FastTreeMP (http://www.microbesonline.org/fasttree/#OpenMP). FastTreeMP
                  parallelizes some steps of the treeing step. Currently, conda installs
                  FastTreeMP with FastTree on linux systems, but not on Mac OSX systems.
                  So if using the conda installation, you may not have FastTreeMP if on a Mac,
                  in which case FastTree will be used instead - this will be reported when the
                  program starts, and be in the log file.

        - [-n <int> ] num cpus; default: 2
                  The number of cpus you'd like to use during the HMM search. (Given
                  these are individual small searches on single genomes, 2 is probably
                  always sufficient. Keep in mind this will be multiplied by the number of jobs
                  running concurrently if also modifying the `-j` parameter.)

        - [-M <int> ] num muscle threads; default: 5
                  The number of threads muscle will use during alignment. (Keep in mind
                  this will be multiplied by the number of jobs running concurrently
                  if also modifying the `-j` parameter.)

        - [-j ] num jobs; default: 1
                  The number of jobs you'd like to run in parallel during steps
                  that are parallelizable. This includes things like downloading input
                  accession genomes and running parallel alignments, and portions of the
                  tree step if using FastTree on a Linux system (e.g. see FastTree docs
                  here: http://www.microbesonline.org/fasttree/#OpenMP).

                  Note that I've occassionally noticed NCBI not being happy with over ~50
                  downloads being attempted concurrently. So if using a `-j` setting around
                  there or higher, and GToTree is saying a lot of input accessions were not
                  successfully downloaded, consider trying with fewer.

        - [-X ] override super5 alignment; default: false
                  If working with greater than 1,000 target genomes, GToTree will by default
                  use the 'super5' muscle alignment algorithm to increase the speed of the alignments (see
                  github.com/AstrobioMike/GToTree/wiki/things-to-consider#working-with-many-genomes
                  for more details and the note just above there on using representative genomes).
                  Anyway, provide this flag with no arguments if you don't want to speed up
                  the alignments.

        - [-P ] use http instead of ftp; default: false
                  Provide this flag with no arguments if your system can't use ftp,
                  and you'd like to try using http.

        - [-F ] force overwrite; default: false
                  Provide this flag with no arguments if you'd like to force
                  overwriting the output directory if it exists.

        - [-d ] debug mode; default: false
                  Provide this flag with no arguments if you'd like to keep the
                  temporary directory. (Mostly useful for debugging.)


 --------------------------------  EXAMPLE USAGE  -------------------------------- 

        GToTree -a ncbi_accessions.txt -f fasta_files.txt -H Bacteria -D -j 4

Options set for programs run

NOTE
In order to give conda more freedom in managing its environments, specific versions have been removed from the conda installation. If you installed with conda, and want to know the specific versions, you can check in the environment with conda list.

prodigal

Prodigal is run with default settings other than setting the -c flag, which restricts predictions to complete genes (genes are not allowed to run off the edges of a sequence).

hmmsearch

Hmmsearch is run with default settings other than setting the --cut_ga flag, which uses the gathering score stored in the HMM profile being used for cutoff values.

muscle

Muscle is run with default settings using the -align PPP algorithm when working with fewer than 1,000 target genomes. When run with greater than 1,000 input genomes, the -super5 algorithm is used. This can be overridden by adding -X to the GToTree call – see here for more on this.

trimal

Trimal is run with default settings other than setting the -automated1 flag, which performs "a heuristic selection of the automatic method based on similarity statistics. (Optimized for Maximum Likelihood phylogenetic tree reconstruction)."

FastTree

FastTree and FastTreeMP are run with default settings for amino-acid alignments, and with the flags -nt and -gtr for nucleotide alignments.

iq-tree

IQ-TREE is currently run with default settings other than -nt, -bb 1000, and -mset WAG,LG.


Genome completeness and redundancy estimations

Estimated genome completeness is calculated as the number of unique target genes identified divided by the total number of target genes searched (* 100 to make it a percentage). So, if 100 single-copy genes (SCGs) were searched, and 95 of those were found at least once, the estimated completeness would be 95%.

Estimated genome redundancy is calculated as the number of copies greater than 1 that were identified for all target genes divided by the total number of target HMM genes searched (* 100 to make it a percentage). So, if 100 SCGs were searched, and 1 gene was detected in 2 copies, that would be 1 copy over the target: 1 / 100 * 100 = 1% estimated redundancy. If 1 gene was detected in 11 copies, that would be 10 over the target: 10 / 100 * 100 = 10% estimated redundancy. This would come out to the same estimate if 10 individual genes were detected in 2 copies each. This would still be 10 over the target: 10 / 100 * 100 = 10% estimated redundancy.
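Both estimates can be reproduced from a list of per-gene copy counts. The Python sketch below follows the arithmetic described above; the function name and the example counts are made up for illustration.

```python
def completeness_and_redundancy(copy_counts):
    """copy_counts holds one entry per target gene: the number of copies found.

    Returns (percent completeness, percent redundancy).
    """
    total = len(copy_counts)
    found = sum(1 for count in copy_counts if count >= 1)
    extra = sum(count - 1 for count in copy_counts if count > 1)
    return 100 * found / total, 100 * extra / total

# 100 target genes: 95 found once, 5 not found.
print(completeness_and_redundancy([1] * 95 + [0] * 5))  # (95.0, 0.0)
# One gene found in 11 copies, the other 99 found once.
print(completeness_and_redundancy([11] + [1] * 99))     # (100.0, 10.0)
# Ten genes found in 2 copies each, the other 90 found once.
print(completeness_and_redundancy([2] * 10 + [1] * 90)) # (100.0, 10.0)
```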

Citation information

GToTree relies on many great programs. Along with all other outputs, it will generate a citations.txt file with citation information specific for every run that accounts for all programs it relies upon. Please be sure to cite the developers appropriately :)

Here is an example output citations.txt file from a run, and how I'd cite it in the methods:

GToTree v1.6.31
Lee MD. GToTree: a user-friendly workflow for phylogenomics. Bioinformatics. 2019; (March):1-3. doi:10.1093/bioinformatics/btz188

Prodigal v2.6.3
Hyatt, D. et al. Gene and translation initiation site prediction in metagenomic sequences. Bioinformatics. 2010; 28, 2223–2230. doi.org/10.1186/1471-2105-11-119

HMMER3 v3.3.2
Eddy SR. Accelerated profile HMM searches. PLoS Comput. Biol. 2011; (7)10. doi:10.1371/journal.pcbi.1002195

Muscle v5.1
Edgar RC. MUSCLE v5 enables improved estimates of phylogenetic tree confidence by ensemble bootstrapping. bioRxiv. 2021. doi.org/10.1101/2021.06.20.449169

TrimAl v1.4.rev15
Gutierrez SC. et al. TrimAl: a Tool for automatic alignment trimming. Bioinformatics. 2009; 25, 1972–1973. doi:10.1093/bioinformatics/btp348

TaxonKit v0.9.0
Shen W and Ren H. TaxonKit: a practical and efficient NCBI Taxonomy toolkit. Journal of Genetics and Genomics. 2021. doi.org/10.1016/j.jgg.2021.03.006

FastTree 2 v2.1.11
Price MN et al. FastTree 2 - approximately maximum-likelihood trees for large alignments. PLoS One. 2010; 5. doi:10.1371/journal.pone.0009490

Example methods text based on above citation output (be sure to modify as appropriate for your run)

The archaeal phylogenomic tree was produced with GToTree v1.6.31 (Lee 2019), using the prepackaged single-copy gene-set for archaea (76 target genes). Briefly, prodigal v2.6.3 (Hyatt et al. 2010) was used to predict genes on input genomes provided as fasta files. Target genes were identified with HMMER3 v3.3.2 (Eddy 2011), individually aligned with muscle v5.1 (Edgar 2021), trimmed with trimal v1.4.rev15 (Capella-Gutiérrez et al. 2009), and concatenated prior to phylogenetic estimation with FastTree2 v2.1.11 (Price et al. 2010). TaxonKit (Shen and Ren 2021) was used to connect full lineages to taxonomic IDs.
