Skip to content

Commit

Permalink
Update documentation, Compress MSA and use compressed MSA in de_novo …
Browse files Browse the repository at this point in the history
…and classify, change AF_THRESHOLD to 0.5, change split tree names, add keep_intermediates flag,
  • Loading branch information
pchaumeil committed Apr 4, 2022
1 parent da37e58 commit 9a4c1b5
Show file tree
Hide file tree
Showing 11 changed files with 236 additions and 117 deletions.
18 changes: 15 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,14 +7,26 @@
[![Docker Image Version (latest by date)](https://img.shields.io/docker/v/ecogenomic/gtdbtk?sort=date&color=299bec&label=docker)](https://hub.docker.com/r/ecogenomic/gtdbtk)
[![Docker Pulls](https://img.shields.io/docker/pulls/ecogenomic/gtdbtk?color=299bec&label=pulls)](https://hub.docker.com/r/ecogenomic/gtdbtk)

<b>[GTDB-Tk v1.5.0](https://ecogenomics.github.io/GTDBTk/announcements.html) was released on April 23, 2021 along with new reference data for [GTDB R06-RS202](https://gtdb.ecogenomic.org/). Upgrading is recommended.</b>
<b> Please note v1.5.0+ is not compatible with GTDB R05-RS95. </b>
<b>[GTDB-Tk v2.0.1](https://ecogenomics.github.io/GTDBTk/announcements.html) was released on April xx, 2022 along with new reference data for [GTDB R07-RS207](https://gtdb.ecogenomic.org/). Upgrading is recommended.</b>
<b> Please note v2.0.1+ is not compatible with GTDB R06-RS202. </b>

GTDB-Tk is a software toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes based on the Genome Database Taxonomy [GTDB](https://gtdb.ecogenomic.org/). It is designed to work with recent advances that allow hundreds or thousands of metagenome-assembled genomes (MAGs) to be obtained directly from environmental samples. It can also be applied to isolate and single-cell genomes. The GTDB-Tk is open source and released under the [GNU General Public License (Version 3)](https://www.gnu.org/licenses/gpl-3.0.en.html).

Notifications about GTDB-Tk releases will be available through the GTDB Twitter account (https://twitter.com/ace_gtdb).

Please post questions and issues related to GTDB-Tk on the Issues section of the GitHub repository. Questions related to the [GTDB](https://gtdb.ecogenomic.org/) should be sent to the [GTDB team](https://gtdb.ecogenomic.org/about).
Please post questions and issues related to GTDB-Tk on the Issues section of the GitHub repository. Questions related to the [GTDB](https://gtdb.ecogenomic.org/) should be sent to the [GTDB team](https://gtdb.ecogenomic.org/about).

## New Features
GTDB-Tk v2.0.1+ includes the following new features:
- Classification is done by default using a divide-and-conquer strategy to systematically reduce the size of the reference tree and associated memory requirements.
When runnning with R07-RS207, GTDB-Tk requieres **320GB** or RAM when running pplacer with the full bacterial tree. The divide and conquer approach reduve this requirement to around **20GB** of RAM.
**This is now the default option strategy in GTDB-Tk.**
- To use the full reference tree in the classification step, use the `-f,--full-tree` option.
- Use of a refined set of 53 archaeal-specific marker genes based on a recent published analysis of archaeal markers.
- To reduce the size of the output directory,
- all intermediate_results folders ( in _identify,align,classify,infer_) are **now removed** after the end of the `classify_wf` and `de_novo_wf` pipelines. To keep intermediates files use the flag `--keep-intermediates`.
- all msa output from the align step are now automatically archived.


## Documentation
https://ecogenomics.github.io/GTDBTk/
Expand Down
11 changes: 11 additions & 0 deletions docs/src/announcements.rst
Original file line number Diff line number Diff line change
@@ -1,6 +1,17 @@
Announcements
=============


GTDB R207 available
------------------

*April xx, 2022*

* GTDB Release 202 is now available and will be used from version ``2.0.1`` and up.
* This version of GTDB-Tk requires a new version of the GTDB-Tk reference package
`gtdbtk_r207_data.tar.gz <https://data.ace.uq.edu.au/public/gtdb/data/releases/release207/207.0/auxillary_files>`_.


GTDB R202 available
------------------

Expand Down
8 changes: 4 additions & 4 deletions docs/src/installing/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -34,12 +34,12 @@ Hardware requirements
- Storage
- Time
* - Archaea
- ~13 GB
- ~27 GB
- ~34 GB
- ~30 GB
- ~1 hour / 1,000 genomes @ 64 CPUs
* - Bacteria
- ~215 GB
- ~27 GB
- ~320 GB ( 20GB for divide-and-conquer)
- ~30 GB
- ~1 hour / 1,000 genomes @ 64 CPUs

.. note::
Expand Down
6 changes: 5 additions & 1 deletion gtdbtk/biolib_lite/seq_io.py
Original file line number Diff line number Diff line change
Expand Up @@ -122,15 +122,19 @@ def read_fasta_seq(fasta_file, keep_annotation=False):

try:
open_file = open
mode = 'r'
if fasta_file.endswith('.gz'):
open_file = gzip.open
mode = 'rb'

seq_id = None
annotation = None
seq = None
with open_file(fasta_file, 'r') as f:
with open_file(fasta_file, mode) as f:

for line in f.readlines():
if isinstance(line, bytes):
line = line.decode()
# skip blank lines
if not line.strip():
continue
Expand Down
95 changes: 52 additions & 43 deletions gtdbtk/classify.py
Original file line number Diff line number Diff line change
Expand Up @@ -156,7 +156,7 @@ def place_genomes(self,
cur_gb=mem_total))

# rename user MSA file for compatibility with pplacer
if not user_msa_file.endswith('.fasta'):
if not user_msa_file.endswith('.fasta') and not user_msa_file.endswith('.gz'):
if marker_set_id == 'bac120':
t = PATH_BAC120_USER_MSA.format(prefix=prefix)
elif marker_set_id == 'ar53':
Expand Down Expand Up @@ -193,14 +193,14 @@ def place_genomes(self,
elif levelopt == 'high':
self.logger.log(Config.LOG_TASK,
f'Placing {num_genomes:,} bacterial genomes '
f'into high reference tree with pplacer using '
f'into backbone reference tree with pplacer using '
f'{self.pplacer_cpus} CPUs (be patient).')
pplacer_ref_pkg = os.path.join(Config.HIGH_PPLACER_DIR,
Config.HIGH_PPLACER_REF_PKG)
elif levelopt == 'low':
self.logger.log(Config.LOG_TASK,
f'Placing {num_genomes:,} bacterial genomes '
f'into low reference tree {tree_iter} ({idx_tree}/{number_low_trees}) with '
f'into order-level reference tree {tree_iter} ({idx_tree}/{number_low_trees}) with '
f'pplacer using {self.pplacer_cpus} CPUs '
f'(be patient).')
pplacer_ref_pkg = os.path.join(Config.LOW_PPLACER_DIR,
Expand Down Expand Up @@ -275,41 +275,41 @@ def place_genomes(self,
pplacer.tog(pplacer_json_out, tree_file)

# Symlink to the tree summary file
if marker_set_id == 'bac120' and levelopt is None:
symlink_f(PATH_BAC120_TREE_FILE.format(prefix=prefix),
os.path.join(out_dir, os.path.basename(PATH_BAC120_TREE_FILE.format(prefix=prefix))))
elif levelopt == 'high':
symlink_f(PATH_HIGH_BAC120_TREE_FILE.format(prefix=prefix),
os.path.join(out_dir, os.path.basename(PATH_HIGH_BAC120_TREE_FILE.format(prefix=prefix))))
elif levelopt == 'low':
symlink_f(PATH_LOW_BAC120_TREE_FILE.format(prefix=prefix, iter=tree_iter),
os.path.join(out_dir,
os.path.basename(PATH_LOW_BAC120_TREE_FILE.format(prefix=prefix, iter=tree_iter))))
elif marker_set_id == 'ar53':
symlink_f(PATH_AR53_TREE_FILE.format(prefix=prefix),
os.path.join(out_dir, os.path.basename(PATH_AR53_TREE_FILE.format(prefix=prefix))))
else:
self.logger.error('There was an error determining the marker set.')
raise GenomeMarkerSetUnknown
# if marker_set_id == 'bac120' and levelopt is None:
# symlink_f(PATH_BAC120_TREE_FILE.format(prefix=prefix),
# os.path.join(out_dir, os.path.basename(PATH_BAC120_TREE_FILE.format(prefix=prefix))))
# elif levelopt == 'high':
# symlink_f(PATH_HIGH_BAC120_TREE_FILE.format(prefix=prefix),
# os.path.join(out_dir, os.path.basename(PATH_HIGH_BAC120_TREE_FILE.format(prefix=prefix))))
# elif levelopt == 'low':
# symlink_f(PATH_LOW_BAC120_TREE_FILE.format(prefix=prefix, iter=tree_iter),
# os.path.join(out_dir,
# os.path.basename(PATH_LOW_BAC120_TREE_FILE.format(prefix=prefix, iter=tree_iter))))
# elif marker_set_id == 'ar53':
# symlink_f(PATH_AR53_TREE_FILE.format(prefix=prefix),
# os.path.join(out_dir, os.path.basename(PATH_AR53_TREE_FILE.format(prefix=prefix))))
# else:
# self.logger.error('There was an error determining the marker set.')
# raise GenomeMarkerSetUnknown

# Symlink to the tree summary file
if marker_set_id == 'bac120':
if levelopt is None:
symlink_f(PATH_BAC120_TREE_FILE.format(prefix=prefix),
os.path.join(out_dir, os.path.basename(PATH_BAC120_TREE_FILE.format(prefix=prefix))))
elif levelopt == 'high':
symlink_f(PATH_HIGH_BAC120_TREE_FILE.format(prefix=prefix),
os.path.join(out_dir, os.path.basename(PATH_HIGH_BAC120_TREE_FILE.format(prefix=prefix))))
elif levelopt == 'low':
symlink_f(PATH_LOW_BAC120_TREE_FILE.format(iter=tree_iter, prefix=prefix),
os.path.join(out_dir, os.path.basename(
PATH_LOW_BAC120_TREE_FILE.format(iter=tree_iter, prefix=prefix))))
elif marker_set_id == 'ar53':
symlink_f(PATH_AR53_TREE_FILE.format(prefix=prefix),
os.path.join(out_dir, os.path.basename(PATH_AR53_TREE_FILE.format(prefix=prefix))))
else:
self.logger.error('There was an error determining the marker set.')
raise GenomeMarkerSetUnknown
# if marker_set_id == 'bac120':
# if levelopt is None:
# symlink_f(PATH_BAC120_TREE_FILE.format(prefix=prefix),
# os.path.join(out_dir, os.path.basename(PATH_BAC120_TREE_FILE.format(prefix=prefix))))
# elif levelopt == 'high':
# symlink_f(PATH_HIGH_BAC120_TREE_FILE.format(prefix=prefix),
# os.path.join(out_dir, os.path.basename(PATH_HIGH_BAC120_TREE_FILE.format(prefix=prefix))))
# elif levelopt == 'low':
# symlink_f(PATH_LOW_BAC120_TREE_FILE.format(iter=tree_iter, prefix=prefix),
# os.path.join(out_dir, os.path.basename(
# PATH_LOW_BAC120_TREE_FILE.format(iter=tree_iter, prefix=prefix))))
# elif marker_set_id == 'ar53':
# symlink_f(PATH_AR53_TREE_FILE.format(prefix=prefix),
# os.path.join(out_dir, os.path.basename(PATH_AR53_TREE_FILE.format(prefix=prefix))))
# else:
# self.logger.error('There was an error determining the marker set.')
# raise GenomeMarkerSetUnknown

return tree_file

Expand Down Expand Up @@ -360,17 +360,27 @@ def run(self,
if marker_set_id == 'ar53':
marker_summary_fh = CopyNumberFileAR53(align_dir, prefix)
marker_summary_fh.read()
user_msa_file = os.path.join(align_dir,
PATH_AR53_USER_MSA.format(prefix=prefix))
if os.path.isfile(os.path.join(align_dir,
PATH_AR53_USER_MSA.format(prefix=prefix))):
user_msa_file = os.path.join(align_dir,
PATH_AR53_USER_MSA.format(prefix=prefix))
else:
user_msa_file = os.path.join(align_dir,
PATH_AR53_USER_MSA.format(prefix=prefix)+'.gz')
summary_file = ClassifySummaryFileAR53(out_dir, prefix)
red_dict_file = REDDictFileAR53(out_dir, prefix)
disappearing_genomes_file = DisappearingGenomesFileAR53(out_dir, prefix)
pplacer_classify_file = PplacerClassifyFileAR53(out_dir, prefix)
elif marker_set_id == 'bac120':
marker_summary_fh = CopyNumberFileBAC120(align_dir, prefix)
marker_summary_fh.read()
user_msa_file = os.path.join(align_dir,
PATH_BAC120_USER_MSA.format(prefix=prefix))
if os.path.isfile(os.path.join(align_dir,
PATH_BAC120_USER_MSA.format(prefix=prefix))):
user_msa_file = os.path.join(align_dir,
PATH_BAC120_USER_MSA.format(prefix=prefix))
else:
user_msa_file = os.path.join(align_dir,
PATH_BAC120_USER_MSA.format(prefix=prefix)+'.gz')
summary_file = ClassifySummaryFileBAC120(out_dir, prefix)
red_dict_file = REDDictFileBAC120(out_dir, prefix)
disappearing_genomes_file = DisappearingGenomesFileBAC120(out_dir, prefix)
Expand All @@ -396,8 +406,6 @@ def run(self,

msa_dict = read_fasta(user_msa_file)



if not fulltreeopt and marker_set_id == 'bac120':
splitter = Split(self.order_rank, self.gtdb_taxonomy, self.reference_ids)
# run pplacer to place bins in reference genome tree
Expand Down Expand Up @@ -531,7 +539,8 @@ def run(self,
tree_mapping_file.write()

# Write the summary file to disk.
disappearing_genomes_file.write()
if disappearing_genomes_file.data:
disappearing_genomes_file.write()
summary_file.write()

def _generate_summary_file(self, marker_set_id, prefix, out_dir, debugopt=None, fulltreeopt=None):
Expand Down
12 changes: 9 additions & 3 deletions gtdbtk/cli.py
Original file line number Diff line number Diff line change
Expand Up @@ -176,7 +176,7 @@ def __help(group):

def __pplacer_cpus(group):
group.add_argument('--pplacer_cpus', type=int, default=None,
help='use ``pplacer_cpus`` during placement (default: ``cpus``)')
help='number of CPUs to use during pplacer placement')


def __scratch_dir(group):
Expand Down Expand Up @@ -264,13 +264,17 @@ def __mash_db(group):

def __min_af(group):
group.add_argument('--min_af', type=float, default=AF_THRESHOLD,
help='minimum alignment fraction to consider closest genome')
help='minimum alignment fraction to assign genome to a species cluster')


def __untrimmed_msa(group, required):
group.add_argument('--untrimmed_msa', type=str, default=None, required=required,
help="path to the untrimmed MSA file")

def __keep_intermediates(group):
group.add_argument('--keep_intermediates', default=False, action='store_true',
help='keep intermediate files in the final directory')


def __output(group, required):
group.add_argument('--output', type=str, default=None, required=required,
Expand Down Expand Up @@ -335,6 +339,7 @@ def get_main_parser():
__cpus(grp)
__force(grp)
__temp_dir(grp)
__keep_intermediates(grp)
__debug(grp)
__help(grp)

Expand All @@ -346,6 +351,7 @@ def get_main_parser():
with arg_group(parser, 'required named arguments') as grp:
__out_dir(grp, required=True)
with arg_group(parser, 'optional arguments') as grp:
__full_tree(grp)
__extension(grp)
__min_perc_aa(grp)
__prefix(grp)
Expand All @@ -355,7 +361,7 @@ def get_main_parser():
__force(grp)
__scratch_dir(grp)
#__recalculate_red(grp)
__full_tree(grp)
__keep_intermediates(grp)
__min_af(grp)
__temp_dir(grp)
__debug(grp)
Expand Down
20 changes: 10 additions & 10 deletions gtdbtk/config/config.py
Original file line number Diff line number Diff line change
Expand Up @@ -281,7 +281,7 @@
BAC_MARKER_COUNT = 120

# Information about alignment Fraction to resolve fastANI results
AF_THRESHOLD = 0.65
AF_THRESHOLD = 0.5

# MSA file names
CONCAT_BAC120 = os.path.join(MSA_FOLDER, f"gtdb_{VERSION_DATA}_bac120.faa")
Expand Down Expand Up @@ -316,15 +316,15 @@
MRCA_RED_AR53 = os.path.join(RED_DIR, f"gtdbtk_{VERSION_DATA}_ar53.tsv")

# Hashing information for validating the reference package.
REF_HASHES = {PPLACER_DIR: '4d931b5109a240602f55228029b87ee768da8141',
MASK_DIR: '36d6ac371d247b2b952523b9798e78908ea323fa',
MARKER_DIR: '2ba5ae35fb272462663651d18fd9e523317e48cd',
RADII_DIR: '9f9a2e21e27b9049044d04d731795499414a365c',
MSA_FOLDER: 'b426865245c39ee9f01b0392fb8f7867a9f76f0a',
METADATA_DIR: '7640aed96fdb13707a2b79b746a94335faabd6df',
TAX_FOLDER: '4a7a1e4047c088e92dee9740206499cdb7e5beca',
FASTANI_DIR: '70439cf088d0fa0fdbb4f47b4a6b47e199912139',
RED_DIR: 'ad6a184150e7b6e58547912660a17999fadcfbff'}
REF_HASHES = {PPLACER_DIR: '20903925a856a58b102a7b0ce160c5cbd2cf675b',
MASK_DIR: '50e414a9de18170e8cb97f990f89ff60a0fe29d5',
MARKER_DIR: '163f542c3f0a40f59df45d453aa235b39aa96e27',
RADII_DIR: '8fd13b1c5d7a7b073ba96fb628581613b293a374',
MSA_FOLDER: '4bd032c90d5e5f0cbc96338445721a317f7d90b4',
METADATA_DIR: '9772fbeac1311b31e10293fa610eb33aa1ec8e15',
TAX_FOLDER: '6fb0233b05633242369b40c026fd1ee53e266afa',
FASTANI_DIR: '973c456c02f55bb82908a6811c7076e207e9b206',
RED_DIR: '7b8b67b3157204b470c9eb809d3c39c4effffabc'}

# Config values for checking GTDB-Tk on startup.
GTDBTK_VER_CHECK = True
Expand Down
1 change: 0 additions & 1 deletion gtdbtk/decorate.py
Original file line number Diff line number Diff line change
Expand Up @@ -289,7 +289,6 @@ def _leaf_taxa(self, leaf):

parent = parent.parent_node

print(leaf_taxa)
ordered_taxa = leaf_taxa[::-1]

# fill in missing ranks
Expand Down
Loading

0 comments on commit 9a4c1b5

Please sign in to comment.