Installation
GToTree runs in a Unix-like command-line environment, so it works on Mac and Linux computers in the standard terminal programs available with them. To use GToTree on a Windows computer, I would recommend installing the Windows Subsystem for Linux (WSL) and then, from the WSL terminal, installing a Linux version of miniconda. Installing with conda as shown below will then work in the WSL environment 👍
If you don't already have the glorious package manager conda, I highly recommend you get it. This really isn't the venue to go into why it's so helpful, but it really is, I promise 🙂
To get conda up and running (which is very quick), you can follow the instructions to install miniconda (a light-weight version) for your appropriate system starting from here. You will want a python 3.X version, and more than likely a 64-bit version. And if you'd like to learn more about conda sometime, I have an introduction page here 🙂
The following lines will create a gtotree conda environment and install GToTree. You want to run these in the base conda environment:
# installing mamba if needed first (for faster conda installs)
conda install -n base -c conda-forge mamba
mamba create -y -n gtotree -c astrobiomike -c conda-forge -c bioconda -c defaults gtotree
DONE!
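If you'd like to confirm the environment was actually created, here is a minimal sketch that looks for it in the output of `conda env list`. The helper name `env_listed` is just something made up for this example, and it assumes the environment was named gtotree as in the install above.

```shell
# Succeeds if the given environment name is the first field of any line
# of `conda env list`-style text read from stdin.
# (env_listed is a hypothetical helper, not part of conda or GToTree.)
env_listed() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

if command -v conda >/dev/null 2>&1; then
    if conda env list | env_listed gtotree; then
        echo "The gtotree environment exists"
    else
        echo "No gtotree environment found; the install may not have finished"
    fi
else
    echo "conda was not found on the PATH"
fi
```

The same check can be handy before updating, to see whether there is an old environment to remove first.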
Now you should be able to enter and exit the environment with `conda activate gtotree` and `conda deactivate` (which takes no environment name). If you enter the environment and run the following:
gtt-hmms
It will print out where the GToTree default HMMs directory is located, and list the available pre-built HMMs. And if you enter `GToTree` with no arguments, you can see the help menu.
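If any of those come back as "command not found", a quick way to see which programs are visible on your PATH is something like the sketch below. The function name `check_commands` is only for this example; the three program names are the ones used on this page.

```shell
# Print a found/missing line for each command name given, and count the
# missing ones in the variable "missing".
# (check_commands is a hypothetical helper, not something GToTree provides.)
check_commands() {
    missing=0
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            printf 'found:   %s -> %s\n' "$cmd" "$(command -v "$cmd")"
        else
            printf 'missing: %s\n' "$cmd"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

check_commands GToTree gtt-hmms gtt-test.sh ||
    echo "$missing command(s) not found; is the gtotree environment activated?"
```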
You can run a test that takes about 3 minutes like so:
gtt-test.sh
For which the end of the standard output should look like this:
#################################################################################
#### Done!! ####
#################################################################################
Overall, 12 genomes of the input 14 were retained (see notes below).
Tree written to:
GToTree-test-output/GToTree-test-output.tre
Alignment written to:
GToTree-test-output/Aligned_SCGs_mod_names.faa
Main genomes summary table written to:
GToTree-test-output/Genomes_summary_info.tsv
Summary table with hits per target gene per genome written to:
GToTree-test-output/SCG_hit_counts.tsv
Outputs from Pfam searching written to:
GToTree-test-output/Pfam_search_results/
Partitions file (for downstream use with mixed-model treeing) written to:
GToTree-test-output/run_files/Partitions.txt
_______________________________________________________________________________
Notes:
1 accession(s) not successfully found at NCBI.
1 genome(s) removed due to having too few hits to the targeted SCGs.
2 gene(s) either had no hits or only multiple hits in each genome.
Reported along with additional informative run files in:
GToTree-test-output/run_files/
_______________________________________________________________________________
Log file written to:
GToTree-test-output/gtotree-runlog.txt
_______________________________________________________________________________
Programs used and their citations have been written to:
GToTree-test-output/citations.txt
_______________________________________________________________________________
Total process runtime: 0 hours and 2 minutes.
And if you took that output tree file "GToTree-test-output/GToTree-test-output.tre" and threw it into a tree viewer, such as uploading it to the Interactive Tree of Life site, rooting it at the included archaeal sequence, and dragging and dropping in the "GToTree-test-output/Pfam_search_results/iToL_files/PF05400.*-iToL.txt" file, it would look something like this:
Where the blue branches go to those genomes in which the FliT protein involved in flagellar biosynthesis was detected (searched for by its Pfam, PF05400, being specified in the "pfam_targets.txt" input file).
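If you'd rather not eyeball the summary, here is a rough sketch that checks whether the files named in the test output above exist and are non-empty. The helper name `check_outputs` is made up for this example, and it assumes you are in the directory where you ran the test.

```shell
# Report whether each file (given relative to a directory) exists and is
# non-empty; returns non-zero if any is missing or empty.
# (check_outputs is a hypothetical helper for illustration.)
check_outputs() {
    dir=$1
    shift
    rc=0
    for f in "$@"; do
        if [ -s "$dir/$f" ]; then
            echo "ok:               $dir/$f"
        else
            echo "missing or empty: $dir/$f"
            rc=1
        fi
    done
    return "$rc"
}

# File names taken from the test summary shown above.
check_outputs GToTree-test-output \
    GToTree-test-output.tre \
    Aligned_SCGs_mod_names.faa \
    Genomes_summary_info.tsv \
    SCG_hit_counts.tsv \
    run_files/Partitions.txt \
    gtotree-runlog.txt \
    citations.txt ||
    echo "Some expected test outputs were not found"
```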
You can clean out the test data and results by running:
gtt-clean-after-test.sh
To update to the latest GToTree version, it is best to remove the previous conda environment and install fresh. This can be done as follows:
# from outside the gtotree conda environment (assuming that's what it was named like the install above)
conda env remove -n gtotree
# then re-installing in a new environment same as above
# installing mamba if needed first (for faster conda installs)
conda install -n base -c conda-forge mamba
mamba create -y -n gtotree -c astrobiomike -c conda-forge -c bioconda -c defaults gtotree
Then the new environment can be activated with `conda activate gtotree`.
Again, the conda installation is highly recommended as it is more robust across different systems. But to try installing without conda, download and unpack/decompress GToTree wherever you'd like it to live on your system (be sure to change the versions below to the latest found here):
curl -L https://github.com/AstrobioMike/GToTree/archive/v1.5.22.tar.gz -o GToTree-v1.5.22.tar.gz
tar -xzvf GToTree-v1.5.22.tar.gz
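Since the version number shows up in several places, one option is to keep it in a single variable, as in the sketch below. The names `VERSION`, `URL`, `TARBALL`, `SRC_DIR`, and `fetch_gtotree` are just for this example, and 1.5.22 is only the version used above, not necessarily the latest.

```shell
# Set this to the release you want (1.5.22 is just the one used above).
VERSION="1.5.22"

# Everything else is derived from VERSION, following the commands above.
URL="https://github.com/AstrobioMike/GToTree/archive/v${VERSION}.tar.gz"
TARBALL="GToTree-v${VERSION}.tar.gz"
SRC_DIR="GToTree-${VERSION}"   # directory the tarball unpacks to

# Download and unpack; same two commands as above, just parameterized.
fetch_gtotree() {
    curl -L "$URL" -o "$TARBALL" && tar -xzvf "$TARBALL"
}

echo "Would download $URL to $TARBALL and unpack into $SRC_DIR"
# fetch_gtotree   # uncomment to actually download and unpack
```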
Now we need to add the "bin" directory to our PATH (see here if you are unfamiliar with what the PATH is and you'd like to know more).
One way we can do this is to change directories into the bin, and use `pwd` inside an `echo` command to put the full path into our PATH:
cd GToTree-1.5.22/bin # make sure you are in this bin directory
echo "export PATH=\"$(pwd):\$PATH\"" >> ~/.bash_profile
If you'd like to more easily be able to use the included single-copy gene HMM profiles, you can also add a variable to your bash profile so that you don't need to provide the full path to them whenever you use them. If you change directories into the "hmm_sets" directory, this can be done in a similar way as above:
cd ../hmm_sets/ # from where we were above
echo "export GToTree_HMM_dir=\"$(pwd)/\"" >> ~/.bash_profile
Last thing to do is `source` the ~/.bash_profile we just modified so those changes take effect in our current session:
source ~/.bash_profile
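One thing to be aware of is that each `echo ... >> ~/.bash_profile` above adds a new line every time it is run, so repeating the steps leaves duplicates. Here is a small sketch of a way around that. The function `append_once` and the `/path/to/...` value are made up for illustration, and the demo writes to a temporary file rather than your real profile.

```shell
# Append a line to a file only if that exact line is not already there.
# (append_once is a hypothetical helper, not part of GToTree.)
append_once() {
    line=$1
    file=$2
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Demo on a temporary file standing in for ~/.bash_profile.
demo_profile=$(mktemp)
append_once 'export GToTree_HMM_dir="/path/to/GToTree/hmm_sets/"' "$demo_profile"
append_once 'export GToTree_HMM_dir="/path/to/GToTree/hmm_sets/"' "$demo_profile"
cat "$demo_profile"   # the export line appears once, despite two calls
```

To use it for real, you would pass the same strings built with `$(pwd)` in the commands above, and `~/.bash_profile` as the file.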
You can run `gtt-hmms` with no arguments to make sure the default HMM directory is set, and see what taxa the currently available HMM files can more specifically target.
And now if you type `GToTree` with no arguments, you should see the help menu (but note that you still need to take care of the dependencies presented below before you're ready to rock):
GToTree v1.6.20
(github.com/AstrobioMike/GToTree)
---------------------------------- HELP INFO ----------------------------------
This program takes input genomes from various sources and ultimately produces
a phylogenomic tree. You can find detailed usage information at:
github.com/AstrobioMike/GToTree/wiki
------------------------------- REQUIRED INPUTS -------------------------------
1) Input genomes in one or any combination of the following formats:
- [-a <file>] single-column file of NCBI assembly accessions
- [-g <file>] single-column file with the paths to each GenBank file
- [-f <file>] single-column file with the paths to each fasta file
- [-A <file>] single-column file with the paths to each amino acid file,
each file should hold the coding sequences for just one genome
2) [-H <file>] location of the uncompressed HMM file being used, or just the
HMM name if you've set the environment variable 'GToTree_HMM_dir'
to the appropriate location or installed via conda (run 'gtt-hmms'
by itself to view the available gene-sets)
------------------------------- OPTIONAL INPUTS -------------------------------
Output directory specification:
- [-o <str>] default: GToTree_output
Specify the desired output directory.
User-specified modification of genome labels:
- [-m <file>] specify desired genome labels
A two- or three-column tab-delimited file where column 1 holds either
the file name or NCBI accession of the genome to name (depending
on the input source), column 2 holds the desired new genome label,
and column 3 holds something to be appended to either initial or
modified labels (e.g. useful for "tagging" genomes in the tree based
on some characteristic). Columns 2 or 3 can be empty, and the file does
not need to include all input genomes.
Options for adding taxonomy information:
- [-t ] default: false
Provide this flag with no arguments if you'd like to add NCBI taxonomy
info to the sequence headers for any genomes with NCBI taxids. This will
will largely be effective for input genomes provided as NCBI accessions
(provided to the `-a` argument), but any input GenBank files will also
be searched for an NCBI taxid. See `-L` argument for specifying desired
ranks.
- [-D ] default: false
Provide this flag with no arguments if you'd like to add taxonomy from the
Genome Taxonomy Database (GTDB; gtdb.ecogenomic.org). This will only be
effective for input genomes provided as NCBI accessions (provided to the
`-a` argument). This can be used in combination with the `-t` flag, in
which case any input accessions not represented in the GTDB will have NCBI
taxonomic infomation added (with '_NCBI' appended). See `-L` argument for
specifying desired ranks, and see helper script `gtt-get-accessions-from-GTDB`
for help getting input accessions based on GTDB taxonomy searches.
- [-L <str>] default: Domain,Phylum,Class,Species,Strain
A comma-separated list of the taxonomic ranks you'd like added to
the labels if adding taxonomic information. E.g., all would be
"-L Domain,Phylum,Class,Order,Family,Genus,Species". Note that
strain-level information is available through NCBI, but not GTDB.
Filtering settings:
- [-c <float>] default: 0.2
A float between 0-1 specifying the range about the median of
sequences to be retained. For example, if the median length of a
set of sequences is 100 AAs, those seqs longer than 120 or shorter
than 80 will be filtered out before alignment of that gene set
with the default 0.2 setting.
- [-G <float>] default: 0.5
A float between 0-1 specifying the minimum fraction of hits a
genome must have of the SCG-set. For example, if there are 100
target genes in the HMM profile, and Genome X only has hits to 49
of them, it will be removed from analysis with default value 0.5.
- [-B ] default: false
Provide this flag with no arguments if you'd like to run GToTree
in "best-hit" mode. By default, if a SCG has more than one hit
in a given genome, GToTree won't include a sequence for that target
from that genome in the final alignment. With this flag provided,
GToTree will use the best hit. See here for more discussion:
github.com/AstrobioMike/GToTree/wiki/things-to-consider
Additional PFam searching:
- [-p <file>] single-column file of additional PFam targets to search for.
Table of hit counts, fasta of hit sequences, and files compatible
with the iToL web-based tree-viewer will be generated for each
target. See visualization of gene presence/absence example at
github.com/AstrobioMike/GToTree/wiki/example-usage for example.
General run settings:
- [-N ] default: false
No tree. Generate alignment only.
- [-k ] default: false
Keep individual protein alignment files.
- [-T <str>] default: FastTreeMP if available, FastTree if not
Which program to use for tree generation. Currently supported are
"FastTree", "FastTreeMP", and "IQ-TREE". As of now, these run
with default settings only (and IQ-TREE includes "-mset WAG,LG"). To
run either with more specific options (and there is a lot of room for
variation here), you can use the output alignment file from GToTree (and
partitions file if wanted for mixed-model specification) as input into
a dedicated treeing program.
Note on FastTreeMP (http://www.microbesonline.org/fasttree/#OpenMP). FastTreeMP
parallelizes some steps of the treeing step. Currently, conda installs
FastTreeMP with FastTree on linux systems, but not on Mac OSX systems.
So if using the conda installation, you may not have FastTreeMP if on a Mac,
in which case FastTree will be used instead – this will be reported when the
program starts, and be in the log file.
- [-n <int> ] default: 2
The number of cpus you'd like to use during the HMM search. (Given
these are individual small searches on single genomes, 2 is probably
always sufficient.)
- [-j <int> ] default: 1
The number of jobs you'd like to run in parallel during steps
that are parallelizable. This includes things like downloading input
accession genomes and running parallel alignments, and portions of the
tree step if using FastTree on a Linux system (e.g. see FastTree docs
here: http://www.microbesonline.org/fasttree/#OpenMP).
Note that I've occassionally noticed NCBI not being happy with over ~50
downloads being attempted concurrently. So if using a `-j` setting around
there or higher, and GToTree is saying a lot of input accessions were not
successfully downloaded, consider trying with fewer.
- [-X ] default: false
If working with greater than 1,000 target genomes, GToTree will by default
use the 'super5' muscle alignment algorithm to increase the speed of the alignments (see
github.com/AstrobioMike/GToTree/wiki/things-to-consider#working-with-many-genomes
for more details and the note just above there on using representative genomes).
Anyway, provide this flag with no arguments if you don't want to speed up
the alignments.
- [-P ] default: false
Provide this flag with no arguments if your system can't use ftp,
and you'd like to try using http.
- [-F ] default: false
Provide this flag with no arguments if you'd like to force
overwriting the output directory if it exists.
- [-d ] default: false
Provide this flag with no arguments if you'd like to keep the
temporary directory. (Mostly useful for debugging.)
-------------------------------- EXAMPLE USAGE --------------------------------
GToTree -a ncbi_accessions.txt -f fasta_files.txt -H Bacteria -D -j 4
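The `-f` input in that example is just a single-column file of paths to fasta files. As a rough sketch, one way to build it from a directory is below. The directory name "genomes", the ".fa" extension, and the function name `make_fasta_list` are all assumptions for this example, so adjust them to your own data.

```shell
# List every file with the given extension under a directory, one path
# per line, sorted so the order is reproducible.
# (make_fasta_list is a hypothetical helper for illustration.)
make_fasta_list() {
    find "$1" -type f -name "*.$2" | sort
}

genome_dir="genomes"   # assumed location of your fasta files
if [ -d "$genome_dir" ]; then
    make_fasta_list "$genome_dir" fa > fasta_files.txt
    echo "Wrote $(wc -l < fasta_files.txt) path(s) to fasta_files.txt"
else
    echo "No '$genome_dir' directory here; nothing written"
fi
```

The resulting file could then be given to `-f` as in the example usage above, alongside `-H` and any of the optional settings.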
By far, the easiest way to get all the dependencies up and running is with conda as done above. But if you don't want to use conda, here are links to installing all the dependencies (be sure to install Easel along with HMMER3 as well if you are doing things the non-conda way).
If you use GToTree, please be sure to cite these folks – a citations.txt file including used programs is produced with each run to help 🙂
Note on versions: The versions listed below were used specifically at one point in GToTree's history, and are left here as a reference if someone is trying to install without conda. But with the conda installation, it can sometimes be better to be more flexible with regard to versions. We can check specific versions in our conda installation manually, and/or the citations.txt file produced by a GToTree run will list the versions of programs used for that run.
- Biopython - citation
- HMMER3 v3.2.1 - citation: they note in the user manual to cite the website, but there is also this paper (be sure to install Easel along with HMMER3 as well, see more at the HMMER3 install page here)
- Muscle v5.1 - citation
- Trimal v1.4.1 - citation
- FastTree v2.1.10 - citation
If you use GToTree in a manner that uses these tools, please cite these folks – a citations.txt file including used programs is produced with each run to help 🙂
- Prodigal v2.6.3 - citation
  - if providing input genomes in fasta format, or GenBank format with no CDS annotations, or NCBI accessions to genomes with no gene calls
  - if providing input genomes as NCBI assembly accessions
- TaxonKit v0.6.0 - citation
  - if adding NCBI taxonomy information to input genomes
- Genome Taxonomy Database Release R05-RS95 - citation
  - if adding GTDB taxonomy information to input genomes
- GNU Parallel v20161122 - citation info
  - if running things in parallel (specifically set with the `-j` argument)
- IQ-TREE v2.0.3 - citation
Note on versions: The versions listed above were used specifically at one point in GToTree's history, and are left here as a reference if someone is trying to install without conda. But with the conda installation, it can sometimes be better to be more flexible with regard to versions. We can check specific versions in our conda installation manually, and/or the citations.txt file produced by a GToTree run will list the versions of programs used for that run.
NOTE: If doing a non-conda installation, you may also need to temporarily change your terminal's localization settings if you're not in the United States or Australia, as GToTree expects things to be encoded a certain way. If you run `locale` in the terminal, you will get a list of these. If any do not say "en_US.UTF-8", then you can run these two commands to temporarily change them (for the current terminal session): `export LC_ALL="en_US.UTF-8"` and `export LANG="en_US.UTF-8"`. Now in this terminal window, GToTree will run appropriately. When you open a new terminal, your settings will be back to the way they were.
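The check-then-export steps in that note could be wrapped up roughly as in the sketch below. The function names `needs_locale_change` and `set_us_locale_if_needed` are made up for this example, and it treats an empty value (how an unset LC_ALL usually shows up in `locale` output) as not matching.

```shell
# Succeeds if any NAME=value line on stdin does not mention en_US.UTF-8.
# (Both functions here are hypothetical helpers, not part of GToTree.)
needs_locale_change() {
    awk '/=/ && !/en_US\.UTF-8/ { bad = 1 } END { exit !bad }'
}

# Run the two exports from the note above, but only when they are needed.
set_us_locale_if_needed() {
    if locale 2>/dev/null | needs_locale_change; then
        export LC_ALL="en_US.UTF-8"
        export LANG="en_US.UTF-8"
        echo "Locale set to en_US.UTF-8 for this terminal session"
    else
        echo "No locale change made"
    fi
}
```

After defining these in a terminal, running `set_us_locale_if_needed` before `GToTree` applies the change for that session only, same as running the two exports by hand.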