Update documentation, Compress MSA and use compressed MSA in de_novo and classify, change AF_THRESHOLD to 0.5, change split tree names, add keep_intermediates flag
Ecogenomics · Apr 4, 2022 · 9a4c1b5
1 parent da37e58
Showing 11 changed files with 236 additions and 117 deletions.
diff --git a/README.md b/README.md
@@ -7,14 +7,26 @@
 [![Docker Image Version (latest by date)](https://img.shields.io/docker/v/ecogenomic/gtdbtk?sort=date&color=299bec&label=docker)](https://hub.docker.com/r/ecogenomic/gtdbtk)
 [![Docker Pulls](https://img.shields.io/docker/pulls/ecogenomic/gtdbtk?color=299bec&label=pulls)](https://hub.docker.com/r/ecogenomic/gtdbtk)
 
-<b>[GTDB-Tk v1.5.0](https://ecogenomics.github.io/GTDBTk/announcements.html) was released on April 23, 2021 along with new reference data for [GTDB R06-RS202](https://gtdb.ecogenomic.org/). Upgrading is recommended.</b>  
-<b> Please note v1.5.0+ is not compatible with GTDB R05-RS95. </b>
+<b>[GTDB-Tk v2.0.1](https://ecogenomics.github.io/GTDBTk/announcements.html) was released on April xx, 2022 along with new reference data for [GTDB R07-RS207](https://gtdb.ecogenomic.org/). Upgrading is recommended.</b>  
+<b> Please note v2.0.1+ is not compatible with GTDB R06-RS202. </b>
 
 GTDB-Tk is a software toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes based on the Genome Database Taxonomy [GTDB](https://gtdb.ecogenomic.org/). It is designed to work with recent advances that allow hundreds or thousands of metagenome-assembled genomes (MAGs) to be obtained directly from environmental samples. It can also be applied to isolate and single-cell genomes. The GTDB-Tk is open source and released under the [GNU General Public License (Version 3)](https://www.gnu.org/licenses/gpl-3.0.en.html).
 
 Notifications about GTDB-Tk releases will be available through the GTDB Twitter account (https://twitter.com/ace_gtdb).
 
-Please post questions and issues related to GTDB-Tk on the Issues section of the GitHub repository. Questions related to the [GTDB](https://gtdb.ecogenomic.org/) should be sent to the [GTDB team](https://gtdb.ecogenomic.org/about). 
+Please post questions and issues related to GTDB-Tk on the Issues section of the GitHub repository. Questions related to the [GTDB](https://gtdb.ecogenomic.org/) should be sent to the [GTDB team](https://gtdb.ecogenomic.org/about).
+
+## New Features
+GTDB-Tk v2.0.1+ includes the following new features:
+- Classification is done by default using a divide-and-conquer strategy to systematically reduce the size of the reference tree and associated memory requirements.
+When running with R07-RS207, GTDB-Tk requires **320GB** of RAM when running pplacer with the full bacterial tree. The divide-and-conquer approach reduces this requirement to around **20GB** of RAM.
+**This is now the default strategy in GTDB-Tk.**
+- To use the full reference tree in the classification step, use the `-f,--full-tree` option.
+- Use of a refined set of 53 archaeal-specific marker genes based on a recently published analysis of archaeal markers.
+- To reduce the size of the output directory,
+  - all intermediate_results folders (in _identify, align, classify, infer_) are **now removed** after the end of the `classify_wf` and `de_novo_wf` pipelines. To keep intermediate files, use the flag `--keep_intermediates`.
+  - all MSA output files from the align step are now automatically archived.
+
 
 ## Documentation
 https://ecogenomics.github.io/GTDBTk/

diff --git a/docs/src/announcements.rst b/docs/src/announcements.rst
@@ -1,6 +1,17 @@
 Announcements
 =============
 
+
+GTDB R207 available
+-------------------
+
+*April xx, 2022*
+
+* GTDB Release 207 is now available and will be used from version ``2.0.1`` and up.
+* This version of GTDB-Tk requires a new version of the GTDB-Tk reference package
+  `gtdbtk_r207_data.tar.gz <https://data.ace.uq.edu.au/public/gtdb/data/releases/release207/207.0/auxillary_files>`_.
+
+
 GTDB R202 available
 ------------------
 

diff --git a/docs/src/installing/index.rst b/docs/src/installing/index.rst
@@ -34,12 +34,12 @@ Hardware requirements
      - Storage
      - Time
    * - Archaea
-     - ~13 GB
-     - ~27 GB
+     - ~34 GB
+     - ~30 GB
      - ~1 hour / 1,000 genomes @ 64 CPUs
    * - Bacteria
-     - ~215 GB
-     - ~27 GB
+     - ~320 GB (~20 GB for divide-and-conquer)
+     - ~30 GB
      - ~1 hour / 1,000 genomes @ 64 CPUs
 
 .. note::

diff --git a/gtdbtk/biolib_lite/seq_io.py b/gtdbtk/biolib_lite/seq_io.py
@@ -122,15 +122,19 @@ def read_fasta_seq(fasta_file, keep_annotation=False):
 
     try:
         open_file = open
+        mode = 'r'
         if fasta_file.endswith('.gz'):
             open_file = gzip.open
+            mode = 'rb'
 
         seq_id = None
         annotation = None
         seq = None
-        with open_file(fasta_file, 'r') as f:
+        with open_file(fasta_file, mode) as f:
 
             for line in f.readlines():
+                if isinstance(line, bytes):
+                    line = line.decode()
                 # skip blank lines
                 if not line.strip():
                     continue
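
Note on the hunk above: it opens gzip files in binary mode and decodes each line by hand. A possible alternative, shown here only as a sketch and not as the project's implementation, is to open both plain and compressed files in text mode so the parser always sees `str`. The helper names `open_text` and `read_fasta_sketch` are hypothetical and do not exist in GTDB-Tk.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed file in text mode (hypothetical helper)."""
    if path.endswith('.gz'):
        # 'rt' asks gzip.open to decode bytes to str while reading.
        return gzip.open(path, 'rt')
    return open(path, 'r')


def read_fasta_sketch(path):
    """Return {seq_id: sequence} from a FASTA file, plain or gzipped."""
    chunks = {}
    seq_id = None
    with open_text(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, as the original reader does.
            if not line:
                continue
            if line.startswith('>'):
                # Keep only the identifier, dropping any annotation.
                parts = line[1:].split(None, 1)
                seq_id = parts[0] if parts else ''
                chunks[seq_id] = []
            elif seq_id is not None:
                chunks[seq_id].append(line)
    return {key: ''.join(parts) for key, parts in chunks.items()}
```

Iterating over the file handle instead of calling `readlines()` also avoids loading a large MSA into memory at once.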

diff --git a/gtdbtk/classify.py b/gtdbtk/classify.py
@@ -156,7 +156,7 @@ def place_genomes(self,
                                                        cur_gb=mem_total))
 
         # rename user MSA file for compatibility with pplacer
-        if not user_msa_file.endswith('.fasta'):
+        if not user_msa_file.endswith(('.fasta', '.gz')):
             if marker_set_id == 'bac120':
                 t = PATH_BAC120_USER_MSA.format(prefix=prefix)
             elif marker_set_id == 'ar53':
@@ -193,14 +193,14 @@ def place_genomes(self,
             elif levelopt == 'high':
                 self.logger.log(Config.LOG_TASK,
                                 f'Placing {num_genomes:,} bacterial genomes '
-                                f'into high reference tree with pplacer using '
+                                f'into backbone reference tree with pplacer using '
                                 f'{self.pplacer_cpus} CPUs (be patient).')
                 pplacer_ref_pkg = os.path.join(Config.HIGH_PPLACER_DIR,
                                                Config.HIGH_PPLACER_REF_PKG)
             elif levelopt == 'low':
                 self.logger.log(Config.LOG_TASK,
                                 f'Placing {num_genomes:,} bacterial genomes '
-                                f'into low reference tree {tree_iter} ({idx_tree}/{number_low_trees}) with '
+                                f'into order-level reference tree {tree_iter} ({idx_tree}/{number_low_trees}) with '
                                 f'pplacer using {self.pplacer_cpus} CPUs '
                                 f'(be patient).')
                 pplacer_ref_pkg = os.path.join(Config.LOW_PPLACER_DIR,
@@ -275,41 +275,41 @@ def place_genomes(self,
         pplacer.tog(pplacer_json_out, tree_file)
 
         # Symlink to the tree summary file
-        if marker_set_id == 'bac120' and levelopt is None:
-            symlink_f(PATH_BAC120_TREE_FILE.format(prefix=prefix),
-                      os.path.join(out_dir, os.path.basename(PATH_BAC120_TREE_FILE.format(prefix=prefix))))
-        elif levelopt == 'high':
-            symlink_f(PATH_HIGH_BAC120_TREE_FILE.format(prefix=prefix),
-                      os.path.join(out_dir, os.path.basename(PATH_HIGH_BAC120_TREE_FILE.format(prefix=prefix))))
-        elif levelopt == 'low':
-            symlink_f(PATH_LOW_BAC120_TREE_FILE.format(prefix=prefix, iter=tree_iter),
-                      os.path.join(out_dir,
-                                   os.path.basename(PATH_LOW_BAC120_TREE_FILE.format(prefix=prefix, iter=tree_iter))))
-        elif marker_set_id == 'ar53':
-            symlink_f(PATH_AR53_TREE_FILE.format(prefix=prefix),
-                      os.path.join(out_dir, os.path.basename(PATH_AR53_TREE_FILE.format(prefix=prefix))))
-        else:
-            self.logger.error('There was an error determining the marker set.')
-            raise GenomeMarkerSetUnknown
+        # if marker_set_id == 'bac120' and levelopt is None:
+        #     symlink_f(PATH_BAC120_TREE_FILE.format(prefix=prefix),
+        #               os.path.join(out_dir, os.path.basename(PATH_BAC120_TREE_FILE.format(prefix=prefix))))
+        # elif levelopt == 'high':
+        #     symlink_f(PATH_HIGH_BAC120_TREE_FILE.format(prefix=prefix),
+        #               os.path.join(out_dir, os.path.basename(PATH_HIGH_BAC120_TREE_FILE.format(prefix=prefix))))
+        # elif levelopt == 'low':
+        #     symlink_f(PATH_LOW_BAC120_TREE_FILE.format(prefix=prefix, iter=tree_iter),
+        #               os.path.join(out_dir,
+        #                            os.path.basename(PATH_LOW_BAC120_TREE_FILE.format(prefix=prefix, iter=tree_iter))))
+        # elif marker_set_id == 'ar53':
+        #     symlink_f(PATH_AR53_TREE_FILE.format(prefix=prefix),
+        #               os.path.join(out_dir, os.path.basename(PATH_AR53_TREE_FILE.format(prefix=prefix))))
+        # else:
+        #     self.logger.error('There was an error determining the marker set.')
+        #     raise GenomeMarkerSetUnknown
 
         # Symlink to the tree summary file
-        if marker_set_id == 'bac120':
-            if levelopt is None:
-                symlink_f(PATH_BAC120_TREE_FILE.format(prefix=prefix),
-                          os.path.join(out_dir, os.path.basename(PATH_BAC120_TREE_FILE.format(prefix=prefix))))
-            elif levelopt == 'high':
-                symlink_f(PATH_HIGH_BAC120_TREE_FILE.format(prefix=prefix),
-                          os.path.join(out_dir, os.path.basename(PATH_HIGH_BAC120_TREE_FILE.format(prefix=prefix))))
-            elif levelopt == 'low':
-                symlink_f(PATH_LOW_BAC120_TREE_FILE.format(iter=tree_iter, prefix=prefix),
-                          os.path.join(out_dir, os.path.basename(
-                              PATH_LOW_BAC120_TREE_FILE.format(iter=tree_iter, prefix=prefix))))
-        elif marker_set_id == 'ar53':
-            symlink_f(PATH_AR53_TREE_FILE.format(prefix=prefix),
-                      os.path.join(out_dir, os.path.basename(PATH_AR53_TREE_FILE.format(prefix=prefix))))
-        else:
-            self.logger.error('There was an error determining the marker set.')
-            raise GenomeMarkerSetUnknown
+        # if marker_set_id == 'bac120':
+        #     if levelopt is None:
+        #         symlink_f(PATH_BAC120_TREE_FILE.format(prefix=prefix),
+        #                   os.path.join(out_dir, os.path.basename(PATH_BAC120_TREE_FILE.format(prefix=prefix))))
+        #     elif levelopt == 'high':
+        #         symlink_f(PATH_HIGH_BAC120_TREE_FILE.format(prefix=prefix),
+        #                   os.path.join(out_dir, os.path.basename(PATH_HIGH_BAC120_TREE_FILE.format(prefix=prefix))))
+        #     elif levelopt == 'low':
+        #         symlink_f(PATH_LOW_BAC120_TREE_FILE.format(iter=tree_iter, prefix=prefix),
+        #                   os.path.join(out_dir, os.path.basename(
+        #                       PATH_LOW_BAC120_TREE_FILE.format(iter=tree_iter, prefix=prefix))))
+        # elif marker_set_id == 'ar53':
+        #     symlink_f(PATH_AR53_TREE_FILE.format(prefix=prefix),
+        #               os.path.join(out_dir, os.path.basename(PATH_AR53_TREE_FILE.format(prefix=prefix))))
+        # else:
+        #     self.logger.error('There was an error determining the marker set.')
+        #     raise GenomeMarkerSetUnknown
 
         return tree_file
 
@@ -360,17 +360,27 @@ def run(self,
             if marker_set_id == 'ar53':
                 marker_summary_fh = CopyNumberFileAR53(align_dir, prefix)
                 marker_summary_fh.read()
-                user_msa_file = os.path.join(align_dir,
-                                             PATH_AR53_USER_MSA.format(prefix=prefix))
+                if os.path.isfile(os.path.join(align_dir,
+                                             PATH_AR53_USER_MSA.format(prefix=prefix))):
+                    user_msa_file = os.path.join(align_dir,
+                                                 PATH_AR53_USER_MSA.format(prefix=prefix))
+                else:
+                    user_msa_file = os.path.join(align_dir,
+                                                 PATH_AR53_USER_MSA.format(prefix=prefix)+'.gz')
                 summary_file = ClassifySummaryFileAR53(out_dir, prefix)
                 red_dict_file = REDDictFileAR53(out_dir, prefix)
                 disappearing_genomes_file = DisappearingGenomesFileAR53(out_dir, prefix)
                 pplacer_classify_file = PplacerClassifyFileAR53(out_dir, prefix)
             elif marker_set_id == 'bac120':
                 marker_summary_fh = CopyNumberFileBAC120(align_dir, prefix)
                 marker_summary_fh.read()
-                user_msa_file = os.path.join(align_dir,
-                                             PATH_BAC120_USER_MSA.format(prefix=prefix))
+                if os.path.isfile(os.path.join(align_dir,
+                                             PATH_BAC120_USER_MSA.format(prefix=prefix))):
+                    user_msa_file = os.path.join(align_dir,
+                                                 PATH_BAC120_USER_MSA.format(prefix=prefix))
+                else:
+                    user_msa_file = os.path.join(align_dir,
+                                                 PATH_BAC120_USER_MSA.format(prefix=prefix)+'.gz')
                 summary_file = ClassifySummaryFileBAC120(out_dir, prefix)
                 red_dict_file = REDDictFileBAC120(out_dir, prefix)
                 disappearing_genomes_file = DisappearingGenomesFileBAC120(out_dir, prefix)
@@ -396,8 +406,6 @@ def run(self,
 
             msa_dict = read_fasta(user_msa_file)
 
-
-
             if not fulltreeopt and marker_set_id == 'bac120':
                 splitter = Split(self.order_rank, self.gtdb_taxonomy, self.reference_ids)
                 # run pplacer to place bins in reference genome tree
@@ -531,7 +539,8 @@ def run(self,
                 tree_mapping_file.write()
 
             # Write the summary file to disk.
-            disappearing_genomes_file.write()
+            if disappearing_genomes_file.data:
+                disappearing_genomes_file.write()
             summary_file.write()
 
     def _generate_summary_file(self, marker_set_id, prefix, out_dir, debugopt=None, fulltreeopt=None):
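
Note on the `run` hunk above: the same "use the plain MSA if it exists, otherwise the `.gz` one" logic is written out twice, once for `ar53` and once for `bac120`. A minimal sketch of how that could be factored out follows. `resolve_msa_path` is a hypothetical name, not a function in GTDB-Tk, and it mirrors the diff's behaviour of falling back to the `.gz` path without checking that it exists.

```python
import os


def resolve_msa_path(align_dir, rel_path):
    """Return the plain MSA path if present, else the '.gz' variant.

    Hypothetical helper: like the code in the diff, the fallback path
    is returned without verifying that the compressed file exists.
    """
    plain = os.path.join(align_dir, rel_path)
    if os.path.isfile(plain):
        return plain
    return plain + '.gz'
```

With such a helper each branch would reduce to a single call, for example `user_msa_file = resolve_msa_path(align_dir, PATH_AR53_USER_MSA.format(prefix=prefix))`.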

diff --git a/gtdbtk/cli.py b/gtdbtk/cli.py
@@ -176,7 +176,7 @@ def __help(group):
 
 def __pplacer_cpus(group):
     group.add_argument('--pplacer_cpus', type=int, default=None,
-                       help='use ``pplacer_cpus`` during placement (default: ``cpus``)')
+                       help='number of CPUs to use during pplacer placement')
 
 
 def __scratch_dir(group):
@@ -264,13 +264,17 @@ def __mash_db(group):
 
 def __min_af(group):
     group.add_argument('--min_af', type=float, default=AF_THRESHOLD,
-                       help='minimum alignment fraction to consider closest genome')
+                       help='minimum alignment fraction to assign genome to a species cluster')
 
 
 def __untrimmed_msa(group, required):
     group.add_argument('--untrimmed_msa', type=str, default=None, required=required,
                        help="path to the untrimmed MSA file")
 
+def __keep_intermediates(group):
+    group.add_argument('--keep_intermediates', default=False, action='store_true',
+                       help='keep intermediate files in the final directory')
+
 
 def __output(group, required):
     group.add_argument('--output', type=str, default=None, required=required,
@@ -335,6 +339,7 @@ def get_main_parser():
             __cpus(grp)
             __force(grp)
             __temp_dir(grp)
+            __keep_intermediates(grp)
             __debug(grp)
             __help(grp)
 
@@ -346,6 +351,7 @@ def get_main_parser():
         with arg_group(parser, 'required named arguments') as grp:
             __out_dir(grp, required=True)
         with arg_group(parser, 'optional arguments') as grp:
+            __full_tree(grp)
             __extension(grp)
             __min_perc_aa(grp)
             __prefix(grp)
@@ -355,7 +361,7 @@ def get_main_parser():
             __force(grp)
             __scratch_dir(grp)
             #__recalculate_red(grp)
-            __full_tree(grp)
+            __keep_intermediates(grp)
             __min_af(grp)
             __temp_dir(grp)
             __debug(grp)
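
Note on the `--keep_intermediates` flag added above: it uses `action='store_true'`, so it defaults to `False` and becomes `True` only when passed. The standalone sketch below shows that behaviour with plain `argparse`. The parser and its `prog` name are assumptions for illustration, not GTDB-Tk's actual parser setup.

```python
import argparse


def build_parser():
    """Build a toy parser carrying only the new flag (illustrative)."""
    parser = argparse.ArgumentParser(prog='sketch')
    # store_true: absent -> False, present -> True; the flag takes no value.
    parser.add_argument('--keep_intermediates', default=False,
                        action='store_true',
                        help='keep intermediate files in the final directory')
    return parser


if __name__ == '__main__':
    args = build_parser().parse_args()
    print(args.keep_intermediates)
```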

diff --git a/gtdbtk/config/config.py b/gtdbtk/config/config.py
@@ -281,7 +281,7 @@
 BAC_MARKER_COUNT = 120
 
 # Information about alignment Fraction to resolve fastANI results
-AF_THRESHOLD = 0.65
+AF_THRESHOLD = 0.5
 
 # MSA file names
 CONCAT_BAC120 = os.path.join(MSA_FOLDER, f"gtdb_{VERSION_DATA}_bac120.faa")
@@ -316,15 +316,15 @@
 MRCA_RED_AR53 = os.path.join(RED_DIR, f"gtdbtk_{VERSION_DATA}_ar53.tsv")
 
 # Hashing information for validating the reference package.
-REF_HASHES = {PPLACER_DIR: '4d931b5109a240602f55228029b87ee768da8141',
-              MASK_DIR: '36d6ac371d247b2b952523b9798e78908ea323fa',
-              MARKER_DIR: '2ba5ae35fb272462663651d18fd9e523317e48cd',
-              RADII_DIR: '9f9a2e21e27b9049044d04d731795499414a365c',
-              MSA_FOLDER: 'b426865245c39ee9f01b0392fb8f7867a9f76f0a',
-              METADATA_DIR: '7640aed96fdb13707a2b79b746a94335faabd6df',
-              TAX_FOLDER: '4a7a1e4047c088e92dee9740206499cdb7e5beca',
-              FASTANI_DIR: '70439cf088d0fa0fdbb4f47b4a6b47e199912139',
-              RED_DIR: 'ad6a184150e7b6e58547912660a17999fadcfbff'}
+REF_HASHES = {PPLACER_DIR: '20903925a856a58b102a7b0ce160c5cbd2cf675b',
+              MASK_DIR: '50e414a9de18170e8cb97f990f89ff60a0fe29d5',
+              MARKER_DIR: '163f542c3f0a40f59df45d453aa235b39aa96e27',
+              RADII_DIR: '8fd13b1c5d7a7b073ba96fb628581613b293a374',
+              MSA_FOLDER: '4bd032c90d5e5f0cbc96338445721a317f7d90b4',
+              METADATA_DIR: '9772fbeac1311b31e10293fa610eb33aa1ec8e15',
+              TAX_FOLDER: '6fb0233b05633242369b40c026fd1ee53e266afa',
+              FASTANI_DIR: '973c456c02f55bb82908a6811c7076e207e9b206',
+              RED_DIR: '7b8b67b3157204b470c9eb809d3c39c4effffabc'}
 
 # Config values for checking GTDB-Tk on startup.
 GTDBTK_VER_CHECK = True

diff --git a/gtdbtk/decorate.py b/gtdbtk/decorate.py
@@ -289,7 +289,6 @@ def _leaf_taxa(self, leaf):
 
             parent = parent.parent_node
 
-        print(leaf_taxa)
         ordered_taxa = leaf_taxa[::-1]
 
         # fill in missing ranks