User Guide
This page serves as the general user guide for GToTree, which may be helpful if you're looking for something specific. To jump right into practical ways GToTree can be helpful it may be more useful to start with the Example-usage page :)
- Required inputs
- Outputs
- Optional arguments and parameters
- Options set for programs run
- Genome completeness and redundancy estimations
- All programs used citation info
NOTE: Running `GToTree` with no arguments will provide the help menu.
The minimum required inputs to GToTree are specifying the genomes you want to incorporate (provided via any combination of NCBI Accessions, GenBank files, and/or nucleotide or amino-acid fasta files) and specifying which single-copy gene-set to use.
Input genomes can be specified as any combination of NCBI assembly accessions, GenBank files, and/or fasta files.
You can specify which NCBI-archived genomes you'd like to incorporate by providing a single-column file holding NCBI assembly accessions to the `-a` argument. This file can be created "manually" by searching NCBI's website and downloading a results table, or it can be generated at the command line by using Entrez-Direct; examples of doing both are presented on the Example-usage page.
- Those provided can have version numbers (what comes after the "." in the accession, e.g. GCF_000153765.1), or they can be version-less (e.g. GCF_000153765). In the case where no version is provided, GToTree will automatically take the newest released version of that accession.
- If any of the provided accessions cannot be found at NCBI, they will be printed to the screen at the start of the run and will be reported in the output directory in the file "NCBI_accessions_not_found.txt".
- An example input accessions file can be found in the GToTree sub-directory here: `GToTree/test_data/ncbi_accessions.txt`.
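As a rough illustration of the versioned and version-less forms described above, the following Python sketch splits an assembly accession into its base and optional version. The accession pattern (`GCA_` or `GCF_` followed by nine digits) and the helper name `split_accession` are assumptions made for this example; this is not GToTree's own parsing code.

```python
import re

# Assumed pattern: "GCA_" or "GCF_", nine digits, then an optional ".<version>"
ACCESSION_RE = re.compile(r"^(GC[AF]_\d{9})(?:\.(\d+))?$")


def split_accession(accession):
    """Return (base, version); version is None for a version-less accession."""
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        raise ValueError(f"not an assembly accession: {accession!r}")
    base, version = match.groups()
    return base, (int(version) if version is not None else None)


print(split_accession("GCF_000153765.1"))  # versioned
print(split_accession("GCF_000153765"))    # version-less
```

A check like this can be handy for catching malformed lines in an accessions file before starting a run.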
To specify which GenBank files to include, you need to provide a single-column file that holds the file names (or paths) to each of the GenBank files you'd like incorporated. This is passed to the `-g` argument.
- An example file can be found in the GToTree sub-directory here: `GToTree/test_data/genbank_files.txt`.
Nucleotide fasta files are provided similarly to the GenBank files, but passed to the `-f` argument. You need to provide a single-column file that holds the file names (or paths) to each of the fasta files you'd like incorporated.
- An example file can be found in the GToTree sub-directory here: `GToTree/test_data/fasta_files.txt`.
Amino-acid files can be provided similarly to nucleotide fasta files, but passed to the `-A` argument. You need to provide a single-column file that holds the file names (or paths) to each of the amino-acid fasta files you'd like incorporated.
- An example file can be found in the GToTree sub-directory here: `GToTree/test_data/amino_acid_files.txt`.
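The single-column files passed to `-g`, `-f`, and `-A` are just lists of paths, one per line. A minimal Python sketch for building one is shown below; the directory name, the `*.fa` pattern, and the helper name `write_path_list` are hypothetical and only serve the example.

```python
from pathlib import Path


def write_path_list(genome_dir, pattern, out_file):
    """Write one matching file path per line and return the paths written."""
    paths = sorted(str(p) for p in Path(genome_dir).glob(pattern))
    Path(out_file).write_text("".join(f"{p}\n" for p in paths))
    return paths


# Hypothetical usage: list every *.fa file under "my_genomes/" for the -f argument
# write_path_list("my_genomes", "*.fa", "fasta_files.txt")
```

The same helper could be pointed at GenBank or amino-acid files by changing the pattern and output file name.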
GToTree also needs to know which single-copy gene (SCG) set to use, passed with the `-H` flag. There are 14 provided with the program that are stored in the `hmm_sets` sub-directory (discussed in some more detail here). If you followed the conda quick-start installation instructions, or set up the appropriate environment variable yourself as detailed here, you can view which HMM files are available by running `gtt-hmms` by itself (and you don't need to specify the full path to the HMM file, just the name as printed by `gtt-hmms`, e.g. `-H Bacterial.hmm`).
Each GToTree run creates an output directory to hold all of the output files. This defaults to "GToTree_output", but can be specified with the `-o` argument, and the names of the files below that include "GToTree_output" would be changed accordingly.
- Aligned_SCGs.tre
  - The final tree file in newick format.
  - FastTree reports "local support values" that appear as labels on internal nodes to estimate the reliability of each split in the tree. You can find more information about this at their user page here.
  - IQ-TREE reports ultrafast bootstrap (UFBoot) support values. Their help pages state that a support value of 95% corresponds roughly to a 95% probability that the clade is true.
  - If run with the `-N` option, no tree will be produced, and only the alignment will be generated.
- Aligned_SCGs.faa
  - Alignment file in fasta format.
- Aligned_SCGs_mod_names.faa (if TaxonKit was used to add lineage info to labels, specified with the `-t` flag)
- Partitions.txt
  - A partitions file compatible with treeing programs capable of using different models for each gene. See, for example, iqtree's info here.
- Genomes_summary_info.tsv
  - A tab-delimited table of summary information for each genome including the following columns:
Column | Name | Contents |
---|---|---|
1 | assembly_id | the input assembly ID (either the accession or base file name depending on input source) |
2 | label | the label assigned to the genome in the output tree file |
3 | label_source | where the label came from |
4 | taxid | the NCBI taxid if genome was provided by NCBI accession or GenBank with taxid information |
5 | num_SCG_hits | number of gene hits to the target HMMs |
6 | uniq_SCG_hits | number of unique gene hits to the target HMMs |
7 | perc_comp | estimated percent completion based on the target HMMs |
8 | perc_redund | estimated percent redundancy based on the target HMMs |
9 | num_SCG_hits_after_len_filt | number of gene hits to the target HMMs following length filtering |
10 | in_final_tree | Yes or No, did this genome end up in the final tree |
Depending on whether taxonomy info is added, there may be additional columns with taxonomic info.
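Because the summary table is plain tab-delimited text, it is easy to query after a run. The sketch below lists genomes that did not make it into the final tree. It assumes the file has a header row whose names match the "Name" column of the table above, and the two data rows are made up for illustration.

```python
import csv
import io


def genomes_not_in_tree(handle):
    """Return the assembly_id of every row whose in_final_tree value is "No"."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["assembly_id"] for row in reader if row["in_final_tree"] == "No"]


# Made-up table holding only the columns this sketch uses
example = io.StringIO(
    "assembly_id\tperc_comp\tin_final_tree\n"
    "GCF_000153765.1\t98.0\tYes\n"
    "my_genome\t41.0\tNo\n"
)
print(genomes_not_in_tree(example))
```

For a real run, the open file handle for Genomes_summary_info.tsv in the output directory would be passed instead of the in-memory example.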
- All_genomes_SCG_hit_counts.tsv
  - A tab-delimited file where the first column holds each genome ID, and the rest of the columns hold counts for how many hits there were to each target gene for each genome.
Report files will only be written if they are needed. For instance, if a genome is dropped from analysis due to having too few hits to the target genes, the file "Genomes_removed_for_too_few_hits.tsv" will be created. But if no genomes were removed for this reason, the file will not be generated. So you should not expect to find all of these files after any particular run. These will be included in the output sub-directory `run_files/`.
Redundant_input_accessions.txt
- If there were duplicate accessions in the input NCBI accessions file, they will be reported here.
NCBI_accessions_not_found.txt
- If any of the provided NCBI accessions were not found at NCBI, they will be reported here.
NCBI_accessions_not_downloaded.txt
- If any NCBI accessions were found at NCBI but neither their genes nor genome could be downloaded, they will be reported here.
Genomes_removed_for_too_few_hits.tsv
- If any genomes were removed from analysis due to having too few hits to the target genes (set with the `-G` argument), they will be listed here along with how many hits they had.
Genes_with_no_hits_to_any_genomes.txt
- If any genes didn't have hits in any of the input genomes, they will be reported here.
Genbank_files_with_no_CDSs.txt
- If any GenBank files were provided that didn't have genes annotated, their genes are called with Prodigal and the genomes are retained in the analysis, but they are reported here in case this is a red flag for you (e.g. if you intended to use only fully annotated GenBank files).
The GToTree help menu can be viewed by running `GToTree` with no arguments.
- [-o <str>] default: GToTree_output
GToTree writes all output files to an output directory. By default this is set to "GToTree_output", but you can specify it by passing an argument to the `-o` flag (e.g. `-o Alteromonas_output`).
- [-m <file>] specify desired genome labels
Often it is helpful to have specific labels for specific genomes in a tree (as exemplified in the Alteromonas example). GToTree uses TaxonKit to add lineage information to any genomes that have such information associated with them (whether provided as NCBI accessions or GenBank files) if we specify that we want NCBI taxonomies added (with the `-t` flag), and it uses internal scripts to add GTDB taxonomy if specified (with the `-D` flag). We can also swap the labels of specific genomes we care about so they are easier to find, or just append certain information to a label. For example, we may want the lineage information added but also want to mark specific genomes on the tree (say they all possess a gene type we are interested in, and we want to be able to quickly search for and highlight them).
Either or both of these can be done by providing a mapping file to the `-m` argument. It should be a 2- or 3-column tab-delimited file with the initial genome ID in the first column (either the NCBI accession or the file name, depending on how the input genome was provided). Columns 2 and 3 may each be empty. To specify the complete label for a genome ourselves, put the new label in column 2; otherwise leave column 2 empty. To append something to the label (whether that's the initial label, the modified lineage label, or the label specified in column 2), put that text in column 3; otherwise leave column 3 empty.
NOTE: Not all input genomes need to be provided in the file being passed to `-m`.
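To make the three-column layout concrete, here is a minimal Python sketch that writes such a mapping file. The genome IDs, labels, and output file name are made up, and `write_label_map` is a hypothetical helper, not part of GToTree.

```python
import csv


def write_label_map(rows, out_file):
    """rows: (genome_id, new_label, text_to_append); use "" to leave a column empty."""
    with open(out_file, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)


# Made-up entries covering the three cases described above
rows = [
    ("GCF_000153765.1", "My_favorite_genome", ""),       # replace the label only
    ("my_genome.fa", "", "has_gene_X"),                  # keep the label, append a tag
    ("other_genome.gbk", "Other_label", "has_gene_X"),   # replace and append
]

# Hypothetical usage; the resulting file would be passed to -m
# write_label_map(rows, "genome_labels.tsv")
```

Leaving a column as an empty string keeps the tab separators in place, so column 3 is still read as column 3 when column 2 is empty.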
- [-t ] default: false
By setting the `-t` flag, GToTree will: get strain information if it is available for those provided by NCBI accession; get the NCBI taxids for any genomes that possess them (either from the NCBI accessions provided or if they are present in any GenBank files provided) and use TaxonKit to convert them into lineage information; and add this information to the genome labels, making the output tree much more useful than just a collection of odd identifiers. Which specific taxonomic ranks get added can be specified with the `-L` argument.
Specify to add GTDB lineage info to genome labels
- [-D ] default: false
By setting the `-D` flag, GToTree will add GTDB taxonomy information to the labels that appear in the final alignment and in the tree. Which specific taxonomic ranks get added can be specified with the `-L` argument. If the `-D` flag and the `-t` flag are both specified, GTDB taxonomy info will take precedence over NCBI taxonomy info when possible, and if a given accession isn't present in GTDB, the NCBI lineage info will be used (and "_NCBI" will be appended to it).
- [-L <str>] default: Domain,Phylum,Class,Species,Strain
Provide the `-t` flag with no arguments in order to add lineage info to the genome labels. By default this will add Domain, Phylum, Class, Species, and Strain info, where available. This may be suitable when making a tree across multiple domains, but may be unnecessarily cumbersome when just making a tree of one genus, for instance as shown in the Alteromonas example. You can specify which ranks you'd like added to the labels with the `-L` argument as a comma-separated list. For instance, adding all of them would look like this: `-L Domain,Phylum,Class,Order,Family,Genus,Species,Strain`.
- [-c <float>] default: 0.2
When scanning many genomes for many genes, it becomes harder or completely impractical to visually inspect alignments of everything. One way to try to filter out potential spurious gene hits is to filter by some expected length. The `-c` parameter uses the median length of each particular gene-set to calculate an upper- and lower-length threshold to filter out potentially spurious genes. It takes a float between 0 and 1 specifying the range about the median of sequences to be retained. The default is 0.2. For example, under the default setting, if the median length of a set of sequences is 100 AAs, those genes with sequences longer than 120 or shorter than 80 will be filtered out before alignment of that gene set. This becomes less useful when using very few genomes, however (see note here).
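The length filter can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not GToTree's implementation; in particular, treating sequences exactly at the thresholds as retained is an assumption, and the function names and sequence IDs are hypothetical.

```python
from statistics import median


def length_bounds(lengths, cutoff=0.2):
    """Return (lower, upper) length thresholds around the median."""
    med = median(lengths)
    return med * (1 - cutoff), med * (1 + cutoff)


def filter_by_length(seq_lengths, cutoff=0.2):
    """seq_lengths: dict of sequence ID -> length. Return the IDs retained."""
    low, high = length_bounds(seq_lengths.values(), cutoff)
    # Assumption: lengths equal to a threshold are kept
    return [seq_id for seq_id, n in seq_lengths.items() if low <= n <= high]


# Made-up lengths for one gene-set; the median is 100
lengths = {"gA": 100, "gB": 95, "gC": 105, "gD": 60, "gE": 150}
print(filter_by_length(lengths))  # gD and gE fall outside the 80-120 window
```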
- [-G <float>] default: 0.5
The `-G` parameter allows you to filter out genomes that have too few hits to the target genes. It takes a float between 0 and 1 specifying the minimum fraction of hits a genome must have of the SCG-set. The default is 0.5. For example, under the default setting, if there are 100 target genes in the HMM profile, and genome X only has hits to 49 of them, it will be removed from analysis. How you want this set may depend on the breadth of diversity of the tree you are making (see note here).
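The genome-level cutoff boils down to a simple fraction check, sketched here in Python. The function name is hypothetical, and treating a genome that sits exactly at the minimum fraction as retained is an assumption based on the "minimum fraction" wording; which hit count GToTree uses internally is not specified here.

```python
def passes_hit_cutoff(num_genes_hit, num_targets, min_fraction=0.5):
    """True if the genome has hits to at least min_fraction of the target genes."""
    return num_genes_hit / num_targets >= min_fraction


print(passes_hit_cutoff(49, 100))  # the example above: removed under the default
print(passes_hit_cutoff(75, 100))  # comfortably above the default cutoff
```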
- [-n <int>] default: 2
The number of CPUs you'd like to use during the HMM searches.
- [-j <int>] default: 1
This determines how many jobs to run in parallel during steps that are parallelizable, such as the processing/searching of each individual genome, the filtering of genes and genomes, and alignment of each individual gene-set.
- [-B ] default: false
Provide the `-B` flag with no arguments if you'd like to run GToTree in "best-hit" mode. By default, if a target gene has more than one hit in a given genome, GToTree won't include a sequence for that target gene from that genome in the final alignment. With this flag provided, GToTree will take the best hit and incorporate it into the alignment, even if that genome has more than one hit to the target gene. See here for more discussion on this.
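The difference between the default behavior and best-hit mode can be sketched as follows. This is a conceptual illustration only: the data layout, the function name, and ranking hits by a numeric score where higher is better are all assumptions, not a description of how GToTree picks the best hit.

```python
def select_sequences(hits_per_gene, best_hit=False):
    """hits_per_gene: dict of target gene -> list of (seq_id, score) for one genome.

    Return a dict of target gene -> chosen seq_id.
    """
    selected = {}
    for gene, hits in hits_per_gene.items():
        if len(hits) == 1:
            selected[gene] = hits[0][0]
        elif len(hits) > 1 and best_hit:
            # Assumption: the "best" hit is the one with the highest score
            selected[gene] = max(hits, key=lambda hit: hit[1])[0]
        # Otherwise (no hits, or multiple hits in default mode) the gene is skipped
    return selected


# Made-up hits for one genome
hits = {
    "geneA": [("seq1", 50.0)],
    "geneB": [("seq2", 30.0), ("seq3", 80.0)],
    "geneC": [],
}
print(select_sequences(hits))                 # geneB is left out by default
print(select_sequences(hits, best_hit=True))  # geneB contributes seq3
```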
- [-d ] default: false
Provide the `-d` flag with no arguments if you'd like to keep the temporary directory that is used during the run. This is mostly useful for debugging purposes.
- [-P ] default: false
Provide the `-P` flag with no arguments if your system can't utilize ftp, and GToTree will use http to try to download any needed files.
- [-z ] default: false
Provide the `-z` flag with no arguments if you'd like to make the alignment and tree based on nucleotide sequences instead of amino-acids (which can provide greater resolution on closely related input genomes). Note this mode can only accept NCBI accessions (passed to `-a`) and genome fasta files (passed to `-f`) as input sources (since we can't confidently reverse-translate amino-acid seqs). GToTree still finds the target genes based on amino-acid HMM searches.
If you run `GToTree` with no arguments you can see the help menu:
GToTree v1.8.8
(github.com/AstrobioMike/GToTree)
---------------------------------- HELP INFO ----------------------------------
This program takes input genomes from various sources and ultimately produces
a phylogenomic tree. You can find detailed usage information at:
github.com/AstrobioMike/GToTree/wiki
------------------------------- REQUIRED INPUTS -------------------------------
1) Input genomes in one or any combination of the following formats:
- [-a <file>] single-column file of NCBI assembly accessions
- [-g <file>] single-column file with the paths to each GenBank file
- [-f <file>] single-column file with the paths to each fasta file
- [-A <file>] single-column file with the paths to each amino acid file,
each file should hold the coding sequences for just one genome
2) [-H <file>] location of the uncompressed target SCGs HMM file
being used, or just the HMM name if the 'GToTree_HMM_dir' env variable
is set to the appropriate location (which is done by conda install), run
'gtt-hmms' by itself to view the available gene-sets)
------------------------------- OPTIONAL INPUTS -------------------------------
Output directory specification:
- [-o <str>] default: GToTree_output
Specify the desired output directory.
User-specified modification of genome labels:
- [-m <file>] mapping file specifying desired genome labels
A two- or three-column tab-delimited file where column 1 holds either
the file name or NCBI accession of the genome to name (depending
on the input source), column 2 holds the desired new genome label,
and column 3 holds something to be appended to either initial or
modified labels (e.g. useful for "tagging" genomes in the tree based
on some characteristic). Columns 2 or 3 can be empty, and the file does
not need to include all input genomes.
Options for adding taxonomy information:
- [-t ] add NCBI taxonomy; default: false
Provide this flag with no arguments if you'd like to add NCBI taxonomy
info to the sequence headers for any genomes with NCBI taxids. This will
will largely be effective for input genomes provided as NCBI accessions
(provided to the `-a` argument), but any input GenBank files will also
be searched for an NCBI taxid. See `-L` argument for specifying desired
ranks.
- [-D ] add GTDB taxonomy; default: false
Provide this flag with no arguments if you'd like to add taxonomy from the
Genome Taxonomy Database (GTDB; gtdb.ecogenomic.org). This will only be
effective for input genomes provided as NCBI accessions (provided to the
`-a` argument). This can be used in combination with the `-t` flag, in
which case any input accessions not represented in the GTDB will have NCBI
taxonomic infomation added (with '_NCBI' appended). See `-L` argument for
specifying desired ranks, and see helper script `gtt-get-accessions-from-GTDB`
for help getting input accessions based on GTDB taxonomy searches.
- [-L <str>] specify wanted lineage ranks; default: Domain,Phylum,Class,Species,Strain
A comma-separated list of the taxonomic ranks you'd like added to
the labels if adding taxonomic information. E.g., all would be
"-L Domain,Phylum,Class,Order,Family,Genus,Species,Strain". Note that
strain-level information is available through NCBI, but not GTDB.
Filtering settings:
- [-c <float>] sequence length cutoff; default: 0.2
A float between 0-1 specifying the range about the median of
sequences to be retained. For example, if the median length of a
set of sequences is 100 AAs, those seqs longer than 120 or shorter
than 80 will be filtered out before alignment of that gene set
with the default 0.2 setting.
- [-G <float>] genome hits cutoff; default: 0.5
A float between 0-1 specifying the minimum fraction of hits a
genome must have of the SCG-set. For example, if there are 100
target genes in the HMM profile, and Genome X only has hits to 49
of them, it will be removed from analysis with default value 0.5.
- [-B ] best-hit mode; default: false
Provide this flag with no arguments if you'd like to run GToTree
in "best-hit" mode. By default, if a SCG has more than one hit
in a given genome, GToTree won't include a sequence for that target
from that genome in the final alignment. With this flag provided,
GToTree will use the best hit. See here for more discussion:
github.com/AstrobioMike/GToTree/wiki/things-to-consider
KO searching:
- [-K <file>] single-column file of KO targets to search each genome for
Table of hit counts, fastas of hit sequences, and files compatible
with the iToL web-based tree-viewer will be generated for each
target. See visualization of gene presence/absence example at
github.com/AstrobioMike/GToTree/wiki/example-usage for example.
Pfam searching:
- [-p <file>] single-column file of Pfam targets to search each genome for
Table of hit counts, fastas of hit sequences, and files compatible
with the iToL web-based tree-viewer will be generated for each
target. See visualization of gene presence/absence example at
github.com/AstrobioMike/GToTree/wiki/example-usage for example.
General run settings:
- [-z ] nucleotide mode; default: false
Make alignment and/or tree with nucleotide sequences instead of amino-acid
sequences. Note this mode can only accept NCBI accessions (passed to `-a`)
and genome fasta files (passed to `-f` as input sources. (GToTree still
finds target genes based on amino-acid HMM searches.)
- [-N ] do not make a tree; default: false
No tree. Generate alignment only.
- [-k ] keep individual target gene alignments; default: false
Keep individual alignment files.
- [-T <str>] tree program to use; default: FastTreeMP if available, FastTree if not
Which program to use for tree generation. Currently supported are
"FastTree", "FastTreeMP", and "IQ-TREE". As of now, these run with
default settings only (and IQ-TREE includes "-m MFP" and "-B 1000"). To
run either with more specific options (and there is a lot of room for
variation here), you can use the output alignment file from GToTree (and
the partitions file if wanted for mixed-model specification) as input into
a dedicated treeing program.
Note on FastTreeMP (http://www.microbesonline.org/fasttree/#OpenMP). FastTreeMP
parallelizes some steps of the treeing step. Currently, conda installs
FastTreeMP with FastTree on linux systems, but not on Mac OSX systems.
So if using the conda installation, you may not have FastTreeMP if on a Mac,
in which case FastTree will be used instead - this will be reported when the
program starts, and be in the log file.
- [-n <int> ] num cpus; default: 2
The number of cpus you'd like to use during the HMM search. (Given
these are individual small searches on single genomes, 2 is probably
always sufficient. Keep in mind this will be multiplied by the number of jobs
running concurrently if also modifying the `-j` parameter.)
- [-M <int> ] num muscle threads; default: 5
The number of threads muscle will use during alignment. (Keep in mind
this will be multiplied by the number of jobs running concurrently
if also modifying the `-j` parameter.)
- [-j ] num jobs; default: 1
The number of jobs you'd like to run in parallel during steps
that are parallelizable. This includes things like downloading input
accession genomes and running parallel alignments, and portions of the
tree step if using FastTree on a Linux system (e.g. see FastTree docs
here: http://www.microbesonline.org/fasttree/#OpenMP).
Note that I've occassionally noticed NCBI not being happy with over ~50
downloads being attempted concurrently. So if using a `-j` setting around
there or higher, and GToTree is saying a lot of input accessions were not
successfully downloaded, consider trying with fewer.
- [-X ] override super5 alignment; default: false
If working with greater than 1,000 target genomes, GToTree will by default
use the 'super5' muscle alignment algorithm to increase the speed of the alignments (see
github.com/AstrobioMike/GToTree/wiki/things-to-consider#working-with-many-genomes
for more details and the note just above there on using representative genomes).
Anyway, provide this flag with no arguments if you don't want to speed up
the alignments.
- [-P ] use http instead of ftp; default: false
Provide this flag with no arguments if your system can't use ftp,
and you'd like to try using http.
- [-F ] force overwrite; default: false
Provide this flag with no arguments if you'd like to force
overwriting the output directory if it exists.
- [-d ] debug mode; default: false
Provide this flag with no arguments if you'd like to keep the
temporary directory. (Mostly useful for debugging.)
-------------------------------- EXAMPLE USAGE --------------------------------
GToTree -a ncbi_accessions.txt -f fasta_files.txt -H Bacteria -D -j 4
NOTE: In order to give conda more freedom in managing its environments, specific versions have been removed from the conda installation. If you installed with conda and want to know the specific versions, you can check in the environment with `conda list`.
Prodigal is run with default settings other than setting the `-c` flag, which means only complete genes are included.
Hmmsearch is run with default settings other than setting the `--cut_ga` flag, which uses the gathering score stored in the HMM profile being used for cutoff values.
Muscle is run with default settings using the `-align` PPP algorithm when working with fewer than 1,000 target genomes. When run with more than 1,000 input genomes, the `-super5` algorithm is used. This can be overridden by adding `-X` to the GToTree call (see here for more on this).
Trimal is run with default settings other than setting the `-automated1` flag, which performs "a heuristic selection of the automatic method based on similarity statistics. (Optimized for Maximum Likelihood phylogenetic tree reconstruction)."
FastTree and FastTreeMP are run with default settings for amino-acid alignments, and with the flags `-nt` and `-gtr` for nucleotide alignments.
IQ-TREE is currently run with default settings other than `-nt`, `-bb 1000`, and `-mset WAG,LG`.
Estimated genome completeness is calculated as the number of unique target genes identified divided by the total number of target genes searched (* 100 to make it a percentage). So, if 100 single-copy genes (SCGs) were searched, and 95 of those were found at least once, the estimated completeness would be 95%.
Estimated genome redundancy is calculated as the number of copies greater than 1 that were identified for all target genes divided by the total number of target HMM genes searched (* 100 to make it a percentage). So, if 100 SCGs were searched, and 1 gene was detected in 2 copies, that would be 1 copy over the target: 1 / 100 * 100 = 1% estimated redundancy. If 1 gene was detected in 11 copies, that would be 10 over the target: 10 / 100 * 100 = 10% estimated redundancy. This would come out to the same estimate if 10 individual genes were detected in 2 copies each. This would still be 10 over the target: 10 / 100 * 100 = 10% estimated redundancy.
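The two estimates above can be reproduced with a short Python sketch. It follows the formulas as described in the text; the function name and the made-up hit counts are for illustration only.

```python
def completeness_and_redundancy(hit_counts):
    """hit_counts: dict of target gene -> number of hits in one genome (0 if none).

    Return (percent completeness, percent redundancy).
    """
    total = len(hit_counts)
    found = sum(1 for n in hit_counts.values() if n >= 1)
    extra_copies = sum(n - 1 for n in hit_counts.values() if n > 1)
    return 100 * found / total, 100 * extra_copies / total


# 100 made-up target genes, each found exactly once
counts = {f"gene_{i}": 1 for i in range(100)}

# One gene detected in 11 copies: 10 copies over the target
one_gene_11_copies = dict(counts, gene_0=11)
print(completeness_and_redundancy(one_gene_11_copies))

# Ten genes detected in 2 copies each: also 10 copies over the target
ten_genes_2_copies = dict(counts, **{f"gene_{i}": 2 for i in range(10)})
print(completeness_and_redundancy(ten_genes_2_copies))
```

Both cases give the same redundancy estimate, which mirrors the point made above: the estimate counts extra copies in total, not how they are distributed across genes.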
GToTree relies on many great programs. Along with all other outputs, it will generate a `citations.txt` file with citation information specific for every run that accounts for all programs it relies upon. Please be sure to cite the developers appropriately :)
Here is an example output `citations.txt` file from a run, and how I'd cite it in the methods:
GToTree v1.6.31
Lee MD. GToTree: a user-friendly workflow for phylogenomics. Bioinformatics. 2019; (March):1-3. doi:10.1093/bioinformatics/btz188
Prodigal v2.6.3
Hyatt, D. et al. Gene and translation initiation site prediction in metagenomic sequences. Bioinformatics. 2010; 28, 2223–2230. doi.org/10.1186/1471-2105-11-119
HMMER3 v3.3.2
Eddy SR. Accelerated profile HMM searches. PLoS Comput. Biol. 2011; (7)10. doi:10.1371/journal.pcbi.1002195
Muscle v5.1
Edgar RC. MUSCLE v5 enables improved estimates of phylogenetic tree confidence by ensemble bootstrapping. bioRxiv. 2021. doi.org/10.1101/2021.06.20.449169
TrimAl v1.4.rev15
Gutierrez SC. et al. TrimAl: a Tool for automatic alignment trimming. Bioinformatics. 2009; 25, 1972–1973. doi:10.1093/bioinformatics/btp348
TaxonKit v0.9.0
Shen W and Ren H. TaxonKit: a practical and efficient NCBI Taxonomy toolkit. Journal of Genetics and Genomics. 2021. doi.org/10.1016/j.jgg.2021.03.006
FastTree 2 v2.1.11
Price MN et al. FastTree 2 - approximately maximum-likelihood trees for large alignments. PLoS One. 2010; 5. doi:10.1371/journal.pone.0009490
Example methods text based on above citation output (be sure to modify as appropriate for your run)
The archaeal phylogenomic tree was produced with GToTree v1.6.31 (Lee 2019), using the prepackaged single-copy gene-set for archaea (76 target genes). Briefly, prodigal v2.6.3 (Hyatt et al. 2010) was used to predict genes on input genomes provided as fasta files. Target genes were identified with HMMER3 v3.3.2 (Eddy 2011), individually aligned with muscle v5.1 (Edgar 2021), trimmed with trimal v1.4.rev15 (Capella-Gutiérrez et al. 2009), and concatenated prior to phylogenetic estimation with FastTree2 v2.1.11 (Price et al. 2010). TaxonKit (Shen and Ren 2021) was used to connect full lineages to taxonomic IDs.