docs(Writing documentation for 2.2.0 release):
pchaumeil committed Feb 13, 2023
1 parent 4ea7ce1 commit bc00670
Showing 12 changed files with 173 additions and 39 deletions.
20 changes: 11 additions & 9 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,8 +7,6 @@
[![Docker Image Version (latest by date)](https://img.shields.io/docker/v/ecogenomic/gtdbtk?sort=date&color=299bec&label=docker)](https://hub.docker.com/r/ecogenomic/gtdbtk)
[![Docker Pulls](https://img.shields.io/docker/pulls/ecogenomic/gtdbtk?color=299bec&label=pulls)](https://hub.docker.com/r/ecogenomic/gtdbtk)

<b>GTDB-Tk v2.1.0+ requires an updated reference package ([R207_v2](https://data.gtdb.ecogenomic.org/releases/latest/auxillary_files/gtdbtk_v2_data.tar.gz)), [read more](https://ecogenomics.github.io/GTDBTk/installing/index.html#gtdb-tk-reference-data).</b>

GTDB-Tk is a software toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes based
on the Genome Database Taxonomy ([GTDB](https://gtdb.ecogenomic.org/)). It is designed to work with recent advances that
allow hundreds or thousands of metagenome-assembled genomes (MAGs) to be obtained directly from environmental samples.
Expand Down Expand Up @@ -39,13 +37,17 @@ Documentation for GTDB-Tk can be found [here](https://ecogenomics.github.io/GTDB

## ✨ New Features

GTDB-Tk v2.1.0 includes the following new features:
- GTDB-TK now uses a **divide-and-conquer** approach where the bacterial reference tree is split into multiple **class**-level subtrees. This reduces the memory requirements of GTDB-Tk from **320 GB** of RAM when using the full GTDB R07-RS207 reference tree to approximately **55 GB**. A manuscript describing this approach is in preparation. If you wish to continue using the full GTDB reference tree use the `--full-tree` flag.
This is the main change from v2.0.0. The split tree approach has been modified from order-level trees to class-level trees to resolve specific classification issues (See [#383](https://github.com/Ecogenomics/GTDBTk/issues/383)).
- Genomes that cannot be assigned to a domain (e.g. genomes with no bacterial or archaeal markers or genomes with no genes called by Prodigal) are now reported in the `gtdbtk.bac120.summary.tsv` as 'Unclassified'
- Genomes filtered out during the alignment step are now reported in the `gtdbtk.bac120.summary.tsv` or `gtdbtk.ar53.summary.tsv` as 'Unclassified Bacteria/Archaea'
- The `--write_single_copy_genes` flag is now available in the `classify_wf` and `de_novo_wf` workflows.

GTDB-Tk v2.2.0+ includes the following new features:
- GTDB-Tk `classify` and `classify_wf` have changed in version 2.2.0+. There is an additional step before the final classification.
- **This is now the default behavior for `classify` and `classify_wf`.**
- In `classify`, user genomes are first compared against a Mash database comprising all GTDB representative genomes; the best hits are then verified using FastANI. User genomes classified with FastANI are not run through pplacer.
- In `classify_wf`, before the identify step, user genomes are first compared against a Mash database comprising all GTDB representative genomes; the best hits are then verified using FastANI. User genomes classified with FastANI are not run through the rest of the pipeline (identify, align, classify).
- To classify genomes without the additional `ani_screen` step, use the `--skip_ani_screen` flag.
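As a minimal sketch of how the two modes might be selected from a script, the helper below assembles a `gtdbtk classify_wf` argument list. The function name `build_classify_wf_command` is hypothetical. Only flags that appear in this commit (`--batchfile`, `--out_dir`, `--cpus`, `--mash_db`, `--skip_ani_screen`) are used, and treating `--mash_db` and `--skip_ani_screen` as alternatives is an assumption rather than documented behavior.

```python
from typing import List, Optional


def build_classify_wf_command(
    batchfile: str,
    out_dir: str,
    cpus: int = 1,
    mash_db: Optional[str] = None,
    skip_ani_screen: bool = False,
) -> List[str]:
    """Build a `gtdbtk classify_wf` argument list (hypothetical helper)."""
    # Assumption: a run either supplies a Mash database for the ANI screen
    # or opts out of the screen entirely, never both and never neither.
    if skip_ani_screen == (mash_db is not None):
        raise ValueError("pass exactly one of mash_db or skip_ani_screen")

    cmd = ["gtdbtk", "classify_wf", "--batchfile", batchfile,
           "--out_dir", out_dir, "--cpus", str(cpus)]
    if skip_ani_screen:
        cmd.append("--skip_ani_screen")
    else:
        cmd += ["--mash_db", mash_db]
    return cmd


# Print the command instead of running it, so the sketch has no side effects.
print(" ".join(build_classify_wf_command("genomes.tsv", "out", cpus=20,
                                         mash_db="mash_db_dir/")))
```

The resulting list can be passed to `subprocess.run` once the paths point at real data.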

## 📈 Performance
Using the ANI screen can reduce computation by more than 50%, although the gain depends on the input genomes. A set of input genomes containing many novel species will not benefit from the ANI screen
as much as a set of genomes dominated by known species. In the latter case, the ANI screen reduces the number of genomes that need to be classified by pplacer and reduces the computational time
substantially (between 25% and 60% in our testing).
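To make the dependence on the input set concrete, the sketch below computes the share of genomes that the ANI screen resolves without pplacer. The function name `fraction_skipping_pplacer` is hypothetical, and the calculation counts genomes only; it is not a model of runtime.

```python
def fraction_skipping_pplacer(total_genomes: int, ani_classified: int) -> float:
    """Return the share of genomes resolved by the ANI screen alone."""
    if total_genomes <= 0:
        raise ValueError("total_genomes must be positive")
    if not 0 <= ani_classified <= total_genomes:
        raise ValueError("ani_classified must be between 0 and total_genomes")
    return ani_classified / total_genomes


# Counts taken from the `classify` example log in this commit:
# 3 genomes in total, 2 classified by the ANI pre-screening step.
print(f"{fraction_skipping_pplacer(3, 2):.0%} of genomes skipped pplacer")
```

A set made up entirely of novel species would give a share of 0, in which case the screen adds work without removing any.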

## 📚 References

Expand Down
4 changes: 2 additions & 2 deletions docs/src/announcements.rst
Original file line number Diff line number Diff line change
Expand Up @@ -4,10 +4,10 @@ Announcements
GTDB-Tk 2.2.0 available
-----------------------

*February XX, 2023*
*February 14, 2023*

* GTDB-Tk version ``2.2.0`` is now available.
* This version of GTDB-Tk does not require a new version of the GTDB-Tk reference package
* This version of GTDB-Tk **does not** require a new version of the GTDB-Tk reference package.


GTDB-Tk 2.1.0 available
Expand Down
10 changes: 8 additions & 2 deletions docs/src/changelog.rst
Original file line number Diff line number Diff line change
Expand Up @@ -2,18 +2,24 @@
Change log
==========

2.2.x
2.2.0
-----

Minor changes:

* (`#433 <https://github.com/Ecogenomics/GTDBTk/issues/433>`_) Added additional checks to ensure that the `--outgroup_taxon` cannot be set to a domain (`root`, `de_novo_wf`).
* (`#459 <https://github.com/Ecogenomics/GTDBTk/issues/459>`_ / `#462 <https://github.com/Ecogenomics/GTDBTk/issues/462>`_ )
* (`#459 <https://github.com/Ecogenomics/GTDBTk/issues/459>`_ / `#462 <https://github.com/Ecogenomics/GTDBTk/issues/462>`_ ) Fix deprecated np.bool in prodigal_biolib.py. Special thanks to @neoformit for his contribution.
* (`#466 <http://github.com/Ecogenomics/GTDBTk/issues/466>`_) RED value is now rounded to 5 decimal places.
* (`#451 <http://github.com/Ecogenomics/GTDBTk/issues/451>`_) Extra checks have been added when Prodigal fails.
* (`#448 <http://github.com/Ecogenomics/GTDBTk/issues/448>`_) Warning has been added when all the genomes are filtered out and not classified.

Bug Fixes:

* (`#420 <https://github.com/Ecogenomics/GTDBTk/issues/420>`_) Fixed an issue where GTDB-Tk might hang when classifying TIGRFAM markers (`identify`, `classify_wf`, `de_novo_wf`). Special thanks to @lfenske-93 and @sjaenick for their contribution.
* (`#428 <https://github.com/Ecogenomics/GTDBTk/issues/428>`_) Fixed an issue where the `--gtdbtk_classification_file` would raise an error trying to read the `classify` summary (`root`, `de_novo_wf`).
* (`#439 <https://github.com/Ecogenomics/GTDBTk/issues/439>`_) Fixed the pipeline when using protein files instead of nucleotide files; the symlink now uses an absolute path.




2.1.1
Expand Down
12 changes: 7 additions & 5 deletions docs/src/commands/align.rst
Original file line number Diff line number Diff line change
Expand Up @@ -22,12 +22,14 @@ Files output


* :ref:`[prefix].log <files/gtdbtk.log>`
* :ref:`[prefix].json <files/gtdbtk.json>`
* :ref:`[prefix].warnings.log <files/gtdbtk.warnings.log>`
* :ref:`align/[prefix].[domain].msa.fasta.gz <files/msa.fasta>`
* :ref:`align/[prefix].[domain].user_msa.fasta.gz <files/user_msa.fasta>`
* :ref:`align/[prefix].[domain].filtered.tsv <files/filtered.tsv>`
* :ref:`align/intermediate_results/[prefix].[domain].marker_info.tsv <files/marker_info.tsv>`

* align
* :ref:`[prefix].[domain].msa.fasta.gz <files/msa.fasta>`
* :ref:`[prefix].[domain].user_msa.fasta.gz <files/user_msa.fasta>`
* :ref:`[prefix].[domain].filtered.tsv <files/filtered.tsv>`
* intermediate_results
* :ref:`[prefix].[domain].marker_info.tsv <files/marker_info.tsv>`

Example
-------
Expand Down
62 changes: 44 additions & 18 deletions docs/src/commands/classify.rst
Original file line number Diff line number Diff line change
Expand Up @@ -20,17 +20,33 @@ Files output
------------

* classify
* :ref:`[prefix].[domain].summary.tsv <files/summary.tsv>`
* :ref:`[prefix].backbone.[domain].classify.tree <files/classify.tree>`
* :ref:`[prefix].[domain].tree.mapping.tsv <files/tree_mapping.tsv>`
* :ref:`[prefix].[domain].classify.tree.[index].tree <files/classify.tree>`
* intermediate_results
* :ref:`[prefix].[domain].classification_pplacer.tsv <files/classification_pplacer.tsv>`
* :ref:`[prefix].[domain].classify.tree <files/classify.tree>`
* :ref:`[prefix].[domain].backbone.classification_pplacer.tsv <files/classification_pplacer.tsv>`
* :ref:`[prefix].[domain].class_level.classification_pplacer_tree_[index].tsv <files/classification_pplacer.tsv>`
* :ref:`[prefix].[domain].prescreened.msa.fasta <files/msa.fasta>`
* :ref:`[prefix].[domain].red_dictionary.tsv <files/red_dictionary.tsv>`
* pplacer
* :ref:`pplacer.[domain].json <files/pplacer.domain.json>`
* :ref:`pplacer.[domain].out <files/pplacer.domain.out>`
* :ref:`[prefix].[domain].red_dictionary.tsv <files/red_dictionary.tsv>`
* :ref:`pplacer.backbone.[domain].json <files/pplacer.domain.json>`
* :ref:`pplacer.backbone.[domain].out <files/pplacer.domain.out>`
* tree_[index]
* :ref:`[prefix].[domain].user_msa.fasta <files/user_msa.fasta>`
* :ref:`pplacer.class_level.[domain].out <files/pplacer.domain.out>`
* :ref:`pplacer.class_level.[domain].json <files/pplacer.domain.json>`
* ani_screen
* intermediate_results
* mash
* :ref:`[prefix].mash_distances.tsv <files/mash_distances.tsv>`
* :ref:`[prefix].user_query_sketch.msh <files/user_query_sketch.msh>`
* :ref:`[prefix].[domain].summary.tsv <files/summary.tsv>`
* :ref:`[prefix].log <files/gtdbtk.log>`
* :ref:`[prefix].json <files/gtdbtk.json>`
* :ref:`[prefix].warnings.log <files/gtdbtk.warnings.log>`


Example
-------

Expand All @@ -51,16 +67,26 @@ Output

.. code-block:: text
[2022-04-11 12:02:06] INFO: GTDB-Tk v2.0.0
[2022-04-11 12:02:06] INFO: gtdbtk classify --genome_dir /tmp/gtdbtk/genomes --align_dir /tmp/gtdbtk/align --out_dir /tmp/gtdbtk/classify -x gz --cpus 2
[2022-04-11 12:02:06] INFO: Using GTDB-Tk reference data version r207: /srv/db/gtdbtk/official/release207
[2022-04-11 12:02:07] TASK: Placing 2 archaeal genomes into reference tree with pplacer using 2 CPUs (be patient).
[2022-04-11 12:02:07] INFO: pplacer version: v1.1.alpha19-0-g807f6f3
[2022-04-11 12:07:06] INFO: Calculating RED values based on reference tree.
[2022-04-11 12:07:06] TASK: Traversing tree to determine classification method.
[2022-04-11 12:07:06] INFO: Completed 2 genomes in 0.00 seconds (18,558.87 genomes/second).
[2022-04-11 12:07:06] TASK: Calculating average nucleotide identity using FastANI (v1.32).
[2022-04-11 12:07:08] INFO: Completed 4 comparisons in 1.61 seconds (2.49 comparisons/second).
[2022-04-11 12:07:08] INFO: 2 genome(s) have been classified using FastANI and pplacer.
[2022-04-11 12:07:08] INFO: Note that Tk classification mode is insufficient for publication of new taxonomic designations. New designations should be based on one or more de novo trees, an example of which can be produced by Tk in de novo mode.
[2022-04-11 12:07:08] INFO: Done.
[2023-02-08 12:53:42] INFO: GTDB-Tk v2.2.0
[2023-02-08 12:53:42] INFO: gtdbtk classify --align_dir align_3lines/ --batchfile 3lines_batchfile.tsv --out_dir 3classify_ani --mash_db mash_db_dir/ --cpus 20
[2023-02-08 12:53:42] INFO: Using GTDB-Tk reference data version r207: /path/to/gtdbtk/database/release207_v2/
[2023-02-08 12:53:43] INFO: Loading reference genomes.
[2023-02-08 12:53:43] INFO: Using Mash version 2.2.2
[2023-02-08 12:53:43] INFO: Loading data from existing Mash sketch file: 3classify_ani/classify/ani_screen/intermediate_results/mash/gtdbtk.user_query_sketch.msh
[2023-02-08 12:53:43] INFO: Loading data from existing Mash sketch file: mash_db_dir/gtdb_ref_sketch.msh
[2023-02-08 12:53:46] INFO: Calculating Mash distances.
[2023-02-08 12:53:49] INFO: Calculating ANI with FastANI v1.3.
[2023-02-08 12:53:49] INFO: Completed 12 comparisons in 0.44 seconds (27.54 comparisons/second).
[2023-02-08 12:53:49] INFO: 2 genome(s) have been classified using the ANI pre-screening step.
[2023-02-08 12:53:49] TASK: Placing 1 bacterial genomes into backbone reference tree with pplacer using 20 CPUs (be patient).
[2023-02-08 12:53:49] INFO: pplacer version: v1.1.alpha19-0-g807f6f3
[2023-02-08 12:55:02] INFO: Calculating RED values based on reference tree.
[2023-02-08 12:55:03] INFO: 1 out of 1 have an class assignments. Those genomes will be reclassified.
[2023-02-08 12:55:03] TASK: Placing 1 bacterial genomes into class-level reference tree 5 (1/1) with pplacer using 20 CPUs (be patient).
[2023-02-08 12:57:38] INFO: Calculating RED values based on reference tree.
[2023-02-08 12:57:40] TASK: Traversing tree to determine classification method.
[2023-02-08 12:57:40] INFO: Completed 1 genome in 0.04 seconds (23.86 genomes/second).
[2023-02-08 12:57:40] INFO: 0 genome(s) have been classified using FastANI and pplacer.
[2023-02-08 12:57:40] WARNING: 1 of 3 genome has a warning (see summary file).
[2023-02-08 12:57:40] INFO: Note that Tk classification mode is insufficient for publication of new taxonomic designations. New designations should be based on one or more de novo trees, an example of which can be produced by Tk in de novo mode.
[2023-02-08 12:57:40] INFO: Done.
6 changes: 5 additions & 1 deletion docs/src/commands/classify_wf.rst
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,11 @@ For arguments and output files, see each of the individual steps:
* :ref:`commands/align`
* :ref:`commands/classify`

The classify workflow consists of three steps: ``identify``, ``align``, and ``classify``.
The classify workflow consists of four steps: ``ani_screen``, ``identify``, ``align``, and ``classify``.

The ``ani_screen`` step compares user genomes against a `Mash <https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x>`_ database composed of all GTDB representative genomes,
then verifies the best Mash hits using `FastANI <https://www.nature.com/articles/s41467-018-07641-9>`_. User genomes classified with FastANI are not run through the rest of the pipeline (``identify``, ``align``, ``classify``)
and are reported in the summary file.
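
The two-stage screen described above can be sketched as follows. This is an illustration under stated assumptions, not GTDB-Tk's implementation: the function `ani_screen`, its inputs, and the `MIN_ANI` cut-off are hypothetical, `MAX_MASH_DIST` simply mirrors the `mash_max_dist` value shown in the ``gtdbtk.json`` example, and any further criteria the real step applies are omitted.

```python
from typing import Callable, Dict, List, Tuple

# Assumed cut-offs for illustration only.
MAX_MASH_DIST = 0.1   # mirrors "mash_max_dist" in the gtdbtk.json example
MIN_ANI = 95.0        # hypothetical ANI threshold, in percent


def ani_screen(
    mash_dist: Dict[str, Dict[str, float]],
    ani: Callable[[str, str], float],
) -> Tuple[Dict[str, str], List[str]]:
    """Split genomes into ANI-classified ones and ones left for the pipeline.

    mash_dist maps each user genome to {reference genome: Mash distance}.
    ani(user, ref) returns the ANI (percent) between two genomes.
    """
    classified: Dict[str, str] = {}
    remaining: List[str] = []
    for genome, dists in mash_dist.items():
        # Stage 1: keep only references that are close by Mash distance.
        candidates = [ref for ref, d in dists.items() if d <= MAX_MASH_DIST]
        # Stage 2: verify the candidates with the slower ANI comparison.
        scored = [(ani(genome, ref), ref) for ref in candidates]
        best = max(scored, default=None)
        if best is not None and best[0] >= MIN_ANI:
            classified[genome] = best[1]
        else:
            # These genomes continue to identify, align, and classify.
            remaining.append(genome)
    return classified, remaining
```

Passing the ANI calculation in as a callable keeps the sketch independent of how Mash or FastANI are actually invoked.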

The ``identify`` step calls genes using `Prodigal <http://compbio.ornl.gov/prodigal/>`_,
and uses HMM models and the `HMMER <http://hmmer.org/>`_ package to identify the
Expand Down
22 changes: 22 additions & 0 deletions docs/src/commands/failed_genomes.tsv.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
.. _files/failed_genomes.tsv:

failed_genomes.tsv
==================

File reporting failed genomes which have been excluded from analysis due to Prodigal failing to call any genes.

Produced by
-----------
* :ref:`commands/identify`
* :ref:`commands/classify_wf`

Example
-------

.. code-block:: text
GCA_000002165.1,No genes were called by Prodigal
GCA_000002175.1,No genes were called by Prodigal
GCA_000002185.1,No genes were called by Prodigal
GCA_000002195.1,No genes were called by Prodigal
GCA_000002205.1,No genes were called by Prodigal
6 changes: 4 additions & 2 deletions docs/src/commands/identify.rst
Original file line number Diff line number Diff line change
Expand Up @@ -19,11 +19,13 @@ Arguments
## Files output

* :ref:`[prefix].log <files/gtdbtk.log>`
* :ref:`[prefix].json <files/gtdbtk.json>`
* :ref:`[prefix].warnings.log <files/gtdbtk.warnings.log>`
* identify/
* identify
* :ref:`[prefix].[domain].markers_summary.tsv <files/markers_summary.tsv>`
* :ref:`[prefix].translation_table_summary.tsv <files/translation_table_summary.tsv>`
* identify/intermediate_results/marker_genes/[genome_id]/
* :ref:`[prefix].failed_genomes.tsv <files/failed_genomes.tsv>`
* intermediate_results/marker_genes/[genome_id]/
* :ref:`[genome_id]_pfam_tophit.tsv <files/pfam_tophit.tsv>`
* :ref:`[genome_id]_pfam.tsv <files/pfam.tsv>`
* :ref:`[genome_id]_protein.faa <files/protein.faa>`
Expand Down
46 changes: 46 additions & 0 deletions docs/src/files/gtdbtk.json.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,46 @@
.. _files/gtdbtk.json:

gtdbtk.json
===========

The console output of GTDB-Tk saved to disk in a JSON format.
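
A minimal sketch of reading this file from Python, assuming only the field names that appear in the example in this file (``steps``, ``name``, ``status``, ``duration``). The helper name ``summarise_steps`` is hypothetical, and the embedded sample is a trimmed illustration rather than real output.

```python
import json
from typing import List, Tuple


def summarise_steps(log_text: str) -> List[Tuple[str, str, str]]:
    """Return (name, status, duration) for each step in a gtdbtk.json document."""
    record = json.loads(log_text)
    return [
        (step.get("name", "?"), step.get("status", "?"), step.get("duration", "?"))
        for step in record.get("steps", [])
    ]


# Trimmed, illustrative input using field names from the example in this file.
sample = """
{
  "steps": [
    {"name": "ANI screen", "status": "completed", "duration": "0:00:09"}
  ]
}
"""
print(summarise_steps(sample))
```

Using ``dict.get`` with a placeholder keeps the summary from failing if a step lacks one of these fields.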

Produced by
-----------

* :ref:`commands/align`
* :ref:`commands/classify`
* :ref:`commands/classify_wf`
* :ref:`commands/de_novo_wf`
* :ref:`commands/identify`
* :ref:`commands/infer`

Example
-------

.. code-block:: text
{
"version": "2.1.1",
"command_line": "gtdbtk classify_wf --batchfile /srv/projects/gtdbtk/test_new_features/gems_benchmark/3lines_batchfile.tsv --out_dir /srv/projects/gtdbtk/test_new_features/gems_benchmark/classify_wf_outdir_prescreen_3lines/ --keep_intermediates --cpus 20 --mash_db /srv/projects/gtdbtk/test_new_features/gems_benchmark/mash_sketch/cli/",
"database_version": "r207",
"database_path": "/srv/projects/gtdbtk/test_new_features/release207_v2/",
"steps": [
{
"name": "ANI screen",
"input": "/srv/projects/gtdbtk/test_new_features/gems_benchmark/3lines_batchfile.tsv",
"output_dir": "/srv/projects/gtdbtk/test_new_features/gems_benchmark/classify_wf_outdir_prescreen_3lines/",
"output_files": {
"bac120": "/srv/projects/gtdbtk/test_new_features/gems_benchmark/classify_wf_outdir_prescreen_3lines/classify/ani_screen/gtdbtk.bac120.ani_summary.tsv"
},
"starts_at": "2023-02-01T08:02:17.814231",
"ends_at": "2023-02-01T08:02:27.782442",
"duration": "0:00:09",
"status": "completed",
"mash_k": 16,
"mash_s": 5000,
"mash_v": 1.0,
"mash_max_dist": 0.1,
"mash_db": "/srv/projects/gtdbtk/test_new_features/gems_benchmark/mash_sketch/cli/"
},
2 changes: 2 additions & 0 deletions docs/src/files/mash_distances.msh.rst
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,8 @@ The raw output of the distance from each genome to the GTDB-Tk reference genomes
Produced by
-----------
* :ref:`commands/ani_rep`
* :ref:`commands/classify`
* :ref:`commands/classify_wf`

Example
-------
Expand Down
20 changes: 20 additions & 0 deletions docs/src/files/tree_mapping.tsv.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
.. _files/tree_mapping.tsv:

tree_mapping.tsv
==================

Maps each user genome to the class-level tree used to classify it.

Produced by
-----------
* :ref:`commands/classify_wf`
* :ref:`commands/classify`


Example
-------

.. code-block:: text
user_genome is_ani_classification class_tree_mapped classification_rule
3300006853_26 False 5 Rule 3
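
A minimal sketch of loading this file, assuming it is tab-delimited with the header shown in the example. The helper name ``read_tree_mapping`` is hypothetical.

```python
import csv
import io
from typing import Dict, List


def read_tree_mapping(handle) -> Dict[str, List[str]]:
    """Group user genomes by the class-level tree index they were mapped to."""
    by_tree: Dict[str, List[str]] = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        by_tree.setdefault(row["class_tree_mapped"], []).append(row["user_genome"])
    return by_tree


# Illustrative input: the example row above, written with explicit tabs.
sample = (
    "user_genome\tis_ani_classification\tclass_tree_mapped\tclassification_rule\n"
    "3300006853_26\tFalse\t5\tRule 3\n"
)
print(read_tree_mapping(io.StringIO(sample)))
```

For a file on disk, the same function accepts an open handle, for example ``open(path, newline="")``.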
2 changes: 2 additions & 0 deletions docs/src/files/user_query_sketch.msh.rst
Original file line number Diff line number Diff line change
Expand Up @@ -9,3 +9,5 @@ Produced by
-----------

* :ref:`commands/ani_rep`
* :ref:`commands/classify`
* :ref:`commands/classify_wf`
