feat(rewrite documentation for skani, modify dockerfile):
pchaumeil committed Mar 26, 2024
1 parent a063d02 commit 7af0d56
Showing 10 changed files with 50 additions and 32 deletions.
22 changes: 19 additions & 3 deletions Dockerfile
@@ -4,6 +4,8 @@

FROM python:3.8-slim-bullseye

ARG SKANI_VER="0.2.1"

ARG VER

# ---------------------------------------------------------------------------- #
@@ -15,6 +17,7 @@ RUN apt-get update -y -m && \
libgomp1 \
libgsl25 \
libgslcblas0 \
cargo \
hmmer=3.* \
mash=2.2.* \
prodigal=1:2.6.* \
@@ -36,9 +39,22 @@ RUN wget https://github.com/matsen/pplacer/releases/download/v1.1.alpha19/pplace
# ---------------------------------------------------------------------------- #
# ----------------------------- INSTALL FASTANI ------------------------------ #
# ---------------------------------------------------------------------------- #
RUN wget https://github.com/ParBLiSS/FastANI/releases/download/v1.32/fastANI-Linux64-v1.32.zip -q && \
unzip fastANI-Linux64-v1.32.zip -d /usr/bin && \
rm fastANI-Linux64-v1.32.zip
#RUN wget https://github.com/ParBLiSS/FastANI/releases/download/v1.32/fastANI-Linux64-v1.32.zip -q && \
# unzip fastANI-Linux64-v1.32.zip -d /usr/bin && \
# rm fastANI-Linux64-v1.32.zip

# ---------------------------------------------------------------------------- #
# ------------------------------ INSTALL SKANI ------------------------------- #
# ---------------------------------------------------------------------------- #

ARG SKANI_VER

RUN wget https://github.com/bluenote-1577/skani/archive/refs/tags/v${SKANI_VER}.tar.gz &&\
tar -xvf v${SKANI_VER}.tar.gz &&\
cd skani-${SKANI_VER} &&\
cargo install --path . --root ~/.cargo &&\
chmod +x /root/.cargo/bin/skani &&\
rm v${SKANI_VER}.tar.gz

# ---------------------------------------------------------------------------- #
# --------------------- SET GTDB-TK MOUNTED DIRECTORIES ---------------------- #
1 change: 1 addition & 0 deletions README.md
@@ -63,6 +63,7 @@ We strongly encourage you to cite the following 3rd party dependencies:

* Matsen FA, et al. 2010. [pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree](https://www.ncbi.nlm.nih.gov/pubmed/21034504). <i>BMC Bioinformatics</i>, 11:538.
* Jain C, et al. 2019. [High-throughput ANI Analysis of 90K Prokaryotic Genomes Reveals Clear Species Boundaries](https://www.nature.com/articles/s41467-018-07641-9). <i>Nat. Communications</i>, doi: 10.1038/s41467-018-07641-9.
* Shaw J. and Yu Y.W. 2023. [Fast and robust metagenomic sequence comparison through sparse chaining with skani](https://www.nature.com/articles/s41592-023-02018-3). <i>Nature Methods</i>, 20:1661–1665.
* Hyatt D, et al. 2010. [Prodigal: prokaryotic gene recognition and translation initiation site identification](https://www.ncbi.nlm.nih.gov/pubmed/20211023). <i>BMC Bioinformatics</i>, 11:119. doi: 10.1186/1471-2105-11-119.
* Price MN, et al. 2010. [FastTree 2 - Approximately Maximum-Likelihood Trees for Large Alignments](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2835736/). <i>PLoS One</i>, 5, e9490.
* Eddy SR. 2011. [Accelerated profile HMM searches](https://www.ncbi.nlm.nih.gov/pubmed/22039361). <i>PLOS Comp. Biol.</i>, 7:e1002195.
4 changes: 2 additions & 2 deletions docs/src/commands/check_install.rst
@@ -42,7 +42,7 @@ Output
[2020-11-04 09:35:16] INFO: Checking that all third-party software are on the system path:
[2020-11-04 09:35:16] INFO: |-- FastTree OK
[2020-11-04 09:35:16] INFO: |-- FastTreeMP OK
[2020-11-04 09:35:16] INFO: |-- fastANI OK
[2020-11-04 09:35:16] INFO: |-- skani OK
[2020-11-04 09:35:16] INFO: |-- guppy OK
[2020-11-04 09:35:16] INFO: |-- hmmalign OK
[2020-11-04 09:35:16] INFO: |-- hmmsearch OK
@@ -57,6 +57,6 @@ Output
[2020-11-04 09:35:20] INFO: |-- msa OK
[2020-11-04 09:35:20] INFO: |-- metadata OK
[2020-11-04 09:35:20] INFO: |-- taxonomy OK
[2020-11-04 09:47:36] INFO: |-- fastani OK
[2020-11-04 09:47:36] INFO: |-- skani OK
[2020-11-04 09:47:36] INFO: |-- mrca_red OK
[2020-11-04 09:47:36] INFO: Done.
2 changes: 1 addition & 1 deletion docs/src/commands/classify_wf.rst
@@ -15,7 +15,7 @@ For arguments and output files, see each of the individual steps:
The classify workflow consists of four steps: ``ani_screen``, ``identify``, ``align``, and ``classify``.

The ``ani_screen`` step compares user genomes against a `Mash <https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x>`_ database composed of all GTDB representative genomes,
then verify the best mash hits using `FastANI <https://www.nature.com/articles/s41467-018-07641-9>`_. User genomes classified with FastANI are not run through the rest of the pipeline (``identify``, ``align``, ``classify``)
then verifies the best Mash hits using `skani <https://www.nature.com/articles/s41592-023-02018-3>`_. User genomes classified with skani are not run through the rest of the pipeline (``identify``, ``align``, ``classify``)
and are reported in the summary file.

The ``identify`` step calls genes using `Prodigal <http://compbio.ornl.gov/prodigal/>`_,
28 changes: 15 additions & 13 deletions docs/src/faq.rst
@@ -75,15 +75,26 @@ Using the ``--scratch_dir`` parameter and ``--pplacer_cpus 1`` may help.
How is GTDB-Tk validating species assignments using average nucleotide identity?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

GTDB-Tk uses `FastANI <https://github.com/ParBLiSS/FastANI>`_ to estimate the ANI between genomes.
We recommend you have FastANI >= 1.32 as this version introduces a fix that makes the results deterministic.
GTDB-Tk uses `skani <https://github.com/bluenote-1577/skani>`_ to estimate the ANI between genomes (FastANI was used until v2.3.2).
A query genome is only classified as belonging to the same species as a reference genome if the ANI between the
genomes is within the species ANI circumscription radius (typically, 95%) and the alignment fraction (AF) is >=0.5.
In some circumstances, the phylogenetic placement of a query genome may not support the species assignment.
GTDB r207 strictly uses ANI to circumscribe species and GTDB-Tk follows this methodology.
GTDB r207+ strictly uses ANI to circumscribe species and GTDB-Tk follows this methodology.
The species-specific ANI circumscription radii are available from the `GTDB <https://gtdb.ecogenomic.org/>`_ website.
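
The species rule described above (ANI within the reference's circumscription radius and AF >= 0.5) can be sketched in a few lines of Python. This is a minimal illustration only, not GTDB-Tk's implementation: the ``AniHit`` container, the function names, and the tie-break on AF are all hypothetical, and ANI is assumed to be expressed in percent.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AniHit:
    """One query-vs-reference comparison (hypothetical container, not a GTDB-Tk type)."""
    reference: str   # accession of the reference genome
    ani: float       # average nucleotide identity, in percent
    af: float        # alignment fraction, between 0 and 1
    radius: float    # species-specific ANI circumscription radius, in percent


def same_species(hit: AniHit, min_af: float = 0.5) -> bool:
    """True if ANI is within the reference's radius and AF meets the minimum."""
    return hit.ani >= hit.radius and hit.af >= min_af


def best_species_match(hits: List[AniHit], min_af: float = 0.5) -> Optional[AniHit]:
    """Return the qualifying hit with the highest ANI, or None if no hit qualifies.

    Breaking ties on AF is an assumption made for this sketch.
    """
    qualifying = [h for h in hits if same_species(h, min_af)]
    if not qualifying:
        return None
    return max(qualifying, key=lambda h: (h.ani, h.af))
```

A hit with ANI 97.5% but AF 0.4 is rejected by this rule even though its ANI exceeds a 95% radius, which is why both values are reported in the summary file.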


What is the difference between the mutually exclusive options ``--mash_db`` and ``--skip_ani_screen``?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

| Starting with GTDB-Tk v2.2, the ``classify_wf`` and ``classify`` functions require one extra parameter to run: ``--mash_db`` or ``--skip_ani_screen``.
| From this version on, the first stage of the ``classify`` pipelines (``classify_wf`` and ``classify``) compares all user genomes to all reference genomes and annotates them, if possible, based on ANI matches.
| The ``--mash_db`` option gives GTDB-Tk the path of the sketched Mash database required for ANI screening.
| If no database is available (i.e. this is the first time running classify), the ``--mash_db`` option will sketch a new Mash database that can be used for subsequent calls.
| The ``--skip_ani_screen`` option skips the pre-screening step and classifies all genomes as in previous versions of GTDB-Tk.

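
Because the two options are mutually exclusive and one of them is required, a wrapper script can check the choice before launching GTDB-Tk. The helper below is a hypothetical sketch (the function name and its defaults are not part of GTDB-Tk); it only assembles an argument list for ``gtdbtk classify_wf`` and does not run anything.

```python
from typing import List, Optional


def build_classify_wf_command(genome_dir: str,
                              out_dir: str,
                              mash_db: Optional[str] = None,
                              skip_ani_screen: bool = False,
                              cpus: int = 1) -> List[str]:
    """Build a `gtdbtk classify_wf` argument list with exactly one ANI-screen option."""
    # Both options given, or neither given: reject before calling GTDB-Tk.
    if (mash_db is not None) == skip_ani_screen:
        raise ValueError("pass exactly one of mash_db or skip_ani_screen")

    cmd = ["gtdbtk", "classify_wf",
           "--genome_dir", genome_dir,
           "--out_dir", out_dir,
           "--cpus", str(cpus)]
    if skip_ani_screen:
        cmd.append("--skip_ani_screen")
    else:
        cmd.extend(["--mash_db", mash_db])
    return cmd
```

The returned list can be passed to ``subprocess.run(cmd, check=True)``; the directory and database paths are whatever the caller supplies.
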
Deprecated FAQ
---------------

Why is FastANI using more threads than allocated?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -108,13 +119,4 @@ From GTDB-Tk v2.0.0 the conda environment will automatically have FastANI v1.3 i

From GTDB-Tk v2.2.2 the Docker container will automatically have FastANI v1.32 installed. Otherwise, manually
build the container from the `Dockerfile <https://github.com/Ecogenomics/GTDBTk/blob/master/Dockerfile>`_, making
sure to specify FastANI v1.32.

What is the difference between the mutually exclusive options ``--mash_db`` and ``--skip_ani_screen``?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

| Starting with GTDB-Tk v2.2+, the ``classify_wf`` and ``classify`` function require an extra parameter to run: ``--mash_db`` or ``--skip_ani_screen``.
| With this new version of Tk, The first stage of ``classify`` pipelines (``classify_wf`` and ``classify``) is to compare all user genomes to all reference genomes and annotate them, if possible, based on ANI matches.
| Using the ``--mash_db`` option will indicate to GTDB-Tk the path of the sketched Mash database require for ANI screening.
| If no database are available ( i.e. this is the first time running classify ), the ``--mash_db`` option will sketch a new Mash database that can be used for subsequent calls.
| The ``--skip_ani_screen`` option will skip the pre-screening step and classify all genomes similar to previous versions of GTDB-Tk.
sure to specify FastANI v1.32.
10 changes: 5 additions & 5 deletions docs/src/files/summary.tsv.rst
@@ -8,11 +8,11 @@ Classifications provided by the GTDB-Tk are in the files \<prefix>.bac120.summar

* user_genome: Unique identifier of query genome taken from the FASTA file of the genome.
* classification: GTDB taxonomy string inferred by the GTDB-Tk. An unassigned species (i.e., ``s__``) indicates that the query genome is either i) placed outside a named genus or ii) the ANI to the closest intra-genus reference genome with an AF >=0.65 is not within the species-specific ANI circumscription radius.
* fastani_reference: indicates the accession number of the reference genome (species) to which a user genome was assigned based on ANI and AF. ANI values are only calculated when a query genome is placed within a defined genus and are evaluated for all reference genomes in that genus.
* fastani_reference_radius: indicates the species-specific ANI circumscription radius of the reference genomes used to determine if a query genome should be classified to the same species as the reference.
* fastani_taxonomy: indicates the GTDB taxonomy of the above reference genome.
* fastani_ani: indicates the ANI between the query and above reference genome.
* fastani_af: indicates the alignment fraction (AF) between the query and above reference genome.
* gtdb_reference: indicates the accession number of the reference genome (species) to which a user genome was assigned based on ANI and AF. ANI values are only calculated when a query genome is placed within a defined genus and are evaluated for all reference genomes in that genus.
* gtdb_reference_radius: indicates the species-specific ANI circumscription radius of the reference genomes used to determine if a query genome should be classified to the same species as the reference.
* gtdb_taxonomy: indicates the GTDB taxonomy of the above reference genome.
* gtdb_ani: indicates the ANI between the query and above reference genome.
* gtdb_af: indicates the alignment fraction (AF) between the query and above reference genome.
* closest_placement_reference: indicates the accession number of the reference genome when a genome is placed on a terminal branch.
* closest_placement_taxonomy: indicates the GTDB taxonomy of the above reference genome.
* closest_placement_ani: indicates the ANI between the query and above reference genome.
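
As an illustration of how the renamed columns might be consumed, the sketch below reads a summary file and keeps the genomes that were assigned to a reference by ANI. It assumes the ``gtdb_*`` column names listed in this diff and assumes that unassigned genomes carry an empty value or ``N/A`` in ``gtdb_reference``; both function names are hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, List, TextIO


def read_summary(handle: TextIO) -> List[Dict[str, str]]:
    """Parse a tab-separated summary file into one dict per genome."""
    return list(csv.DictReader(handle, delimiter="\t"))


def ani_classified_genomes(rows: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep rows whose gtdb_reference column holds a reference accession.

    Treating "" and "N/A" as "no reference" is an assumption of this sketch.
    """
    missing = {"", "N/A"}
    return [row for row in rows
            if (row.get("gtdb_reference") or "N/A").strip() not in missing]
```

With a real file, ``read_summary(open(path, newline=""))`` returns the rows, and ``gtdb_ani`` and ``gtdb_af`` can then be read from each kept row as strings.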
6 changes: 3 additions & 3 deletions docs/src/installing/index.rst
@@ -98,9 +98,9 @@ GTDB-Tk makes use of the following 3rd party dependencies and assumes they are o
* - `pplacer <http://matsen.fhcrc.org/pplacer/>`_
- >= 1.1
- Matsen FA, et al. 2010. `pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree <https://www.ncbi.nlm.nih.gov/pubmed/21034504>`_. *BMC Bioinformatics*, 11:538.
* - `FastANI <https://github.com/ParBLiSS/FastANI>`_
- >= 1.32
- Jain C, et al. 2019. `High-throughput ANI Analysis of 90K Prokaryotic Genomes Reveals Clear Species Boundaries <https://www.nature.com/articles/s41467-018-07641-9>`_. *Nat. Communications*, doi: 10.1038/s41467-018-07641-9.
* - `skani <https://github.com/bluenote-1577/skani/>`_
- >= 0.2.1
- Shaw J. and Yu Y.W. 2023. `Fast and robust metagenomic sequence comparison through sparse chaining with skani <https://www.nature.com/articles/s41592-023-02018-3>`_. *Nature Methods*, 20:1661–1665.
* - `FastTree <http://www.microbesonline.org/fasttree/>`_
- >= 2.1.9
- Price MN, et al. 2010. `FastTree 2 - Approximately Maximum-Likelihood Trees for Large Alignments <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2835736/>`_. *PLoS One*, 5, e9490.
6 changes: 3 additions & 3 deletions docs/src/references.rst
@@ -37,14 +37,14 @@ References
- Eddy SR. 2011. `Accelerated profile HMM searches <https://www.ncbi.nlm.nih.gov/pubmed/22039361>`_. *PLOS Comp. Biol.*, 7:e1002195.
* - `pplacer <http://matsen.fhcrc.org/pplacer/>`_
- Matsen FA, et al. 2010. `pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree <https://www.ncbi.nlm.nih.gov/pubmed/21034504>`_. *BMC Bioinformatics*, 11:538.
* - `FastANI <https://github.com/ParBLiSS/FastANI>`_
- Jain C, et al. 2019. `High-throughput ANI Analysis of 90K Prokaryotic Genomes Reveals Clear Species Boundaries <https://www.nature.com/articles/s41467-018-07641-9>`_. *Nat. Communications*, doi: 10.1038/s41467-018-07641-9.
* - `skani <https://github.com/bluenote-1577/skani>`_
- Shaw J. and Yu Y.W. 2023. `Fast and robust metagenomic sequence comparison through sparse chaining with skani <https://www.nature.com/articles/s41592-023-02018-3>`_. *Nature Methods*, 20:1661–1665.
* - `FastTree <http://www.microbesonline.org/fasttree/>`_
- Price MN, et al. 2010. `FastTree 2 - Approximately Maximum-Likelihood Trees for Large Alignments <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2835736/>`_. *PLoS One*, 5, e9490.
* - `Mash <https://github.com/marbl/Mash>`_
- Ondov BD, et al. 2016. `Mash: fast genome and metagenome distance estimation using MinHash <https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x>`_. *Genome Biol* 17, 132. doi: 10.1186/s13059-016-0997-x.
* - `DendroPy <https://dendropy.org/>`_
- Sukumaran, J. and Mark T. Holder. 2010. DendroPy: A Python library for phylogenetic computing. Bioinformatics 26: 1569-1571.
- Sukumaran J. and Holder M.T. 2010. DendroPy: A Python library for phylogenetic computing. *Bioinformatics*, 26:1569-1571.
* - `NumPy <https://numpy.org/>`_
- Harris, C.R., Millman, K.J., van der Walt, S.J. et al. Array programming with NumPy. Nature 585, 357–362 (2020). DOI: `10.1038/s41586-020-2649-2 <https://doi.org/10.1038/s41586-020-2649-2>`_
* - `tqdm <https://github.com/tqdm/tqdm>`_
1 change: 0 additions & 1 deletion gtdbtk/external/skani.py
@@ -44,7 +44,6 @@ def __init__(self, cpus, force_single):
self.force_single = force_single
self.logger = logging.getLogger('timestamp')
self.version = self._get_version()
self.minFrac = self._isMinFrac_present()
self._suppress_v1_warning = False

@staticmethod
2 changes: 1 addition & 1 deletion tests/test_gtdbtk/test_external/test_skani.py
@@ -34,7 +34,7 @@ def tearDown(self):
shutil.rmtree(self.dir_tmp)

def test_run(self):
"""Test that FastANI produces the expected output (version dependent)"""
"""Test that skani produces the expected output (version dependent)"""

fa = SkANI(self.cpus, force_single=True)
