NanoASV

NanoASV is a conda-environment, Snakemake-based workflow that uses state-of-the-art bioinformatics software to process full-length SSU rRNA (16S/18S) amplicons acquired with Oxford Nanopore sequencing technology. Its strengths are reproducibility, portability, and the ability to run offline. It can be installed on the Nanopore MK1C sequencing device to process data locally.

Options

Usage: nanoasv -d path/to/dir -o path/to/output [--options]

| Option               | Description                                                          |
| -------------------- | ---------------------------------------------------------------------|
| `-h`, `--help`       | Show help message                                                    |
| `-v`, `--version`    | Show version information                                             |
| `-d`, `--dir`        | Path to demultiplexed barcodes                                       |
| `-o`, `--out`        | Path for output directory                                            |
| `-db`, `--database`  | Path to reference fasta file                                         |
| `-q`, `--quality`    | Quality threshold for Chopper, default: 8                            |
| `-l`, `--minlength`  | Minimum amplicon length for Chopper, default: 1300                   |
| `-L`, `--maxlength`  | Maximum amplicon length for Chopper, default: 1700                   |
| `-i`, `--id-vsearch` | Identity threshold for vsearch clustering step, default: 0.7         |
| `-ab`, `--minab`     | Minimum unknown cluster total abundance to be kept                   |
| `-p`, `--num-process`| Number of cores for parallelization, default: 1                      |
| `--subsampling`      | Max number of sequences per barcode, default: 50,000                 |
| `--no-r-cleaning`    | Flag - to keep Eukaryota, Chloroplast, and Mitochondria sequences in the phyloseq object |
| `--metadata`         | Specify metadata.csv file directory, default is --dir                |
| `--notree`           | Flag - To remove phylogeny step and tree from phyloseq object        |
| `--sam-qual`         | To tune samtools filtering quality threshold, default: 30            |
| `--requirements`     | Flag - To display personal reference fasta requirements              |
| `--dry-run`          | Flag - NanoASV Snakemake dry run                                     |
| `--mock`             | Flag - Run mock dataset with NanoASV                                 |
| `--remove-tmp`       | Remove tmp data after execution. No snakemake resume option if set.  |
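
As an illustration of how the options above combine, the sketch below assembles one possible command line and prints it for review. The input and output paths and the chosen values are placeholders (assumptions), not recommendations.

```shell
# Placeholder paths (assumptions): replace with your own directories.
INPUT_DIR="path/to/fastq_pass"
OUTPUT_DIR="path/to/output"

# Stricter quality filter than the default (8), four cores, no phylogeny step.
NANOASV_CMD="nanoasv --dir ${INPUT_DIR} --out ${OUTPUT_DIR} --quality 10 --num-process 4 --notree"

# Print the command for review before running it.
echo "${NANOASV_CMD}"
```

Once the printed command looks right, paste it into a shell where the NanoASV environment is active.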

Installation with Conda

(to install NanoASV on Oxford Nanopore MK1C sequencing devices, see section ONT MK1C Installation)

Clone the repository from GitHub:

cd ${HOME}
git clone https://github.com/ImagoXV/NanoASV.git

Run the installation script:

bash ${HOME}/NanoASV/config/install.sh

Then activate the environment:

conda activate NanoASV

Don't forget to activate the environment before running nanoasv. It will not work otherwise.
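
A quick way to check this before launching a run is sketched below. It relies on the CONDA_DEFAULT_ENV variable that conda sets to the name of the active environment, and assumes the environment is named NanoASV as in the command above. The function name is hypothetical.

```shell
# Return 0 when the active conda environment is NanoASV, 1 otherwise.
# CONDA_DEFAULT_ENV is set by conda to the name of the active environment.
nanoasv_env_active() {
    [ "${CONDA_DEFAULT_ENV:-}" = "NanoASV" ]
}

if nanoasv_env_active; then
    echo "NanoASV environment is active."
else
    echo "NanoASV environment is not active; run: conda activate NanoASV"
fi
```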

Database setup

NanoASV can be used with any reference fasta file. If you want a broad overview of your community taxonomy, we recommend using the latest SILVA release.

Download the database and put it in ./resources/:

RELEASE=138.2
URL="https://www.arb-silva.de/fileadmin/silva_databases/release_${RELEASE}/Exports"
INPUT="SILVA_${RELEASE}_SSURef_NR99_tax_silva.fasta.gz"
OUTPUT="SINGLELINE_${INPUT/_NR99/}"
FOLDER="${HOME}/NanoASV/resources"

mkdir -p "${FOLDER}"

echo "downloading and formatting SILVA reference, this will take a few minutes."
wget --output-document - "${URL}/${INPUT}" | \
    gunzip --stdout | \
    awk '/^>/ {printf("%s%s\n", (NR == 1) ? "" : RS, $0) ; next} {printf("%s", $0)} END {printf("\n")}' | \
    gzip > "${FOLDER}/${OUTPUT}"

unset RELEASE URL INPUT OUTPUT
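
To see what the awk step in the pipeline above does, the sketch below runs the same awk program on a tiny made-up FASTA. It joins wrapped sequence lines so that each record becomes one header line followed by one sequence line. The function name and the toy records are illustrative only.

```shell
# Join wrapped sequence lines so that each record is one header line
# followed by one sequence line (same awk program as in the download step).
linearize_fasta() {
    awk '/^>/ {printf("%s%s\n", (NR == 1) ? "" : RS, $0) ; next} {printf("%s", $0)} END {printf("\n")}'
}

# Toy input: two records with wrapped sequences (made-up data).
printf '>seq1 Bacteria\nACGT\nTTGA\n>seq2 Archaea\nGGCC\nAA\n' | linearize_fasta
# Prints:
# >seq1 Bacteria
# ACGTTTGA
# >seq2 Archaea
# GGCCAA
```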

Test your installation

With a dry run

nanoasv --dry-run

With mock dataset

nanoasv --mock

You can inspect NanoASV's output structure in ./Mock_run_OUPUT/.

ONT MK1C Installation

You need to use the MK1C-specific branch (aarch64-MK1C-conda in the clone command below), otherwise it will not work.

You first need to install miniconda. Note that /data/ is used for the installation because of its storage capacity.

mkdir -p /data/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-aarch64.sh -O /data/miniconda3/miniconda.sh --no-check-certificate
bash /data/miniconda3/miniconda.sh -b -u -p /data/miniconda3
rm /data/miniconda3/miniconda.sh
source /data/miniconda3/bin/activate

Then proceed to install conda.

Chopper needs to be compiled for aarch64. You therefore need to download this specific archive, or a newer one if someone cross-compiles it.

Warning: don't set up the NanoASV environment from the conda (base) environment, otherwise you'll run into issues.

cd /data/
git clone \
    --branch aarch64-MK1C-conda \
    --single-branch https://github.com/ImagoXV/NanoASV.git
cd ./NanoASV/
conda deactivate
conda env create -f environment.yml
(
    cd ./config/
    wget https://github.com/wdecoster/chopper/releases/download/v0.7.0/chopper-aarch64.zip
    unzip chopper-aarch64.zip
)
ROOT_DIR="$(conda env list | grep -w 'NanoASV' | awk '{print $2}')"
ACTIVATE_DIR="${ROOT_DIR}/etc/conda/activate.d"
DEACTIVATE_DIR="${ROOT_DIR}/etc/conda/deactivate.d"
# Create the hook directories in case conda did not create them
mkdir -p "${ACTIVATE_DIR}" "${DEACTIVATE_DIR}"
cp ./config/{alias,paths}.sh "${ACTIVATE_DIR}/"
echo "export NANOASV_PATH=$(pwd)" >> "${ACTIVATE_DIR}/paths.sh"
cp ./config/unalias.sh "${DEACTIVATE_DIR}/"
chmod +x ./workflow/run.sh

Database setup

NanoASV can be used with any reference fasta file. If you want a broad overview of your community taxonomy, we recommend using the latest SILVA release.

Download the database and put it in ./resources/:

RELEASE=138.2
URL="https://www.arb-silva.de/fileadmin/silva_databases/release_${RELEASE}/Exports"
INPUT="SILVA_${RELEASE}_SSURef_NR99_tax_silva.fasta.gz"
OUTPUT="SINGLELINE_${INPUT/_NR99/}"
FOLDER="resources"

mkdir -p "${FOLDER}"

echo "downloading and formatting SILVA reference, this will take a few minutes."
wget --output-document - "${URL}/${INPUT}" | \
    gunzip --stdout | \
    awk '/^>/ {printf("%s%s\n", (NR == 1) ? "" : RS, $0) ; next} {printf("%s", $0)} END {printf("\n")}' | \
    gzip > "./${FOLDER}/${OUTPUT}"

unset RELEASE URL INPUT OUTPUT

R environment installation

conda create -y --name R-phyloseq -c bioconda -c conda-forge bioconductor-phyloseq
conda activate R-phyloseq
Rscript -e 'install.packages("dplyr", repos = "https://cran.r-project.org")'
conda deactivate

Test run on MK1C device

conda activate NanoASV
nanoasv --mock

How NanoASV works

Data preparation

Directly input your /path/to/sequence/data/fastq_pass directory. The fastq.gz files (4000 sequences each) are concatenated by barcode identity to make one barcodeXX.fastq.gz file.
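
The concatenation idea can be sketched as follows. This is an illustration under stated assumptions, not the workflow's own code: the function name and the destination directory are hypothetical, and it relies on the fact that concatenated gzip files form a valid gzip file.

```shell
# Concatenate every fastq.gz chunk of each barcode directory into one file.
# Concatenated gzip members form a valid gzip file, so nothing is decompressed.
concat_barcodes() {
    in_dir="$1"    # e.g. /path/to/sequence/data/fastq_pass
    out_dir="$2"   # hypothetical destination directory
    mkdir -p "${out_dir}"
    for bc_dir in "${in_dir}"/barcode*/; do
        [ -d "${bc_dir}" ] || continue
        bc="$(basename "${bc_dir}")"
        cat "${bc_dir}"*.fastq.gz > "${out_dir}/${bc}.fastq.gz"
    done
}

# Example call (paths are placeholders):
# concat_barcodes /path/to/sequence/data/fastq_pass ./concatenated
```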

About single barcode experiment

Note that NanoASV can handle a single-barcode experiment. Put your fastq file in a directory named barcode01; a metadata file with two lines (header and data) is still required. Results are produced as usual: a one-sample phyloseq object and a classical CSV for quick lookup.

Filtering

Chopper filters out inappropriate sequences. It is executed in parallel (default --num-process 1). With default parameters, it keeps sequences with quality > 8 and 1300 bp < length < 1700 bp.
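
To make the length window concrete, here is a minimal awk sketch that keeps FASTQ records whose length lies strictly between two bounds. It is not Chopper's implementation: the quality filter is not reproduced, the function name is hypothetical, and Chopper's own handling of the boundary values may differ.

```shell
# Keep FASTQ records whose sequence length lies strictly between min and max.
# Assumes 4-line FASTQ records (no wrapped sequences).
filter_fastq_by_length() {
    awk -v min="$1" -v max="$2" '
        { rec[NR % 4] = $0 }                 # buffer the lines of the current record
        NR % 4 == 0 {                        # 4th line: the record is complete
            len = length(rec[2])
            if (len > min && len < max)
                printf("%s\n%s\n%s\n%s\n", rec[1], rec[2], rec[3], rec[0])
        }'
}

# Example with toy bounds (the workflow defaults are 1300 and 1700):
printf '@short\nACGT\n+\nIIII\n@ok\nACGTACGT\n+\nIIIIIIII\n' | filter_fastq_by_length 5 10
```

With these toy bounds only the 8 bp record named @ok is kept.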

Chimera detection

There is no efficient chimera detection step at the moment.

Adapter trimming

Porechop trims known adapters. It is executed in parallel (default --num-process 1).

Subsampling

50 000 sequences per barcode is enough for most common questions, so the default is set to 50 000 sequences per barcode. This can be modified with --subsampling int.
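
A simple way to cap the number of reads per barcode is sketched below. Taking the first N records rather than a random sample is an assumption made for this sketch, as are the function name and the file name in the example call.

```shell
# Keep at most N reads from a gzipped FASTQ by taking the first N records.
# Assumes 4-line records; taking the first reads (rather than a random
# sample) is an assumption made for this sketch.
subsample_fastq() {
    n_reads="$1"
    in_fastq_gz="$2"
    gunzip -c "${in_fastq_gz}" | head -n "$((n_reads * 4))"
}

# Example call (file name is a placeholder):
# subsample_fastq 50000 barcode01.fastq.gz | gzip > barcode01.subsampled.fastq.gz
```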

Alignment

minimap2 aligns the previously filtered sequences against the reference dataset (SILVA 138.2 by default). It can be executed in parallel (default --num-process 1). The default minimap2 alignment model is map-ont; this can be changed with the --model option. Available options are map-ont, map-hifi, map-pb, asm5, asm10, asm20, splice, splice:hq, ava-pb and ava-ont. Changing the minimap2 model will have heavy consequences on your analysis, so we recommend checking the NanoASV Supplementary materials first.

NanoASV produces individual barcode abundance files (barcode*_abundance.tsv), taxonomy files (Taxonomy_barcode*.csv) and barcode*_exact_affiliations.tsv files. These files are then used to create the final phyloseq object and can be found in the ./Results/ directory.
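
How alignments turn into per-reference abundances can be sketched with awk on SAM records. This is an illustration, not NanoASV's own code: it uses the standard SAM columns (FLAG in column 2, RNAME in column 3, MAPQ in column 5), skips unmapped, secondary and supplementary alignments, and applies a mapping quality threshold that mirrors the --sam-qual default of 30. The function name and output layout are assumptions.

```shell
# Count alignments per reference from SAM records read on stdin.
# Output: reference name, tab, number of reads with MAPQ >= threshold.
count_reads_per_reference() {
    awk -v minq="${1:-30}" -F '\t' '
        /^@/ { next }                              # skip SAM header lines
        int($2 / 4) % 2 == 1 { next }              # FLAG 0x4: read unmapped
        int($2 / 256) % 2 == 1 { next }            # FLAG 0x100: secondary alignment
        int($2 / 2048) % 2 == 1 { next }           # FLAG 0x800: supplementary alignment
        ($5 + 0) >= (minq + 0) { count[$3]++ }     # MAPQ filter, tally by RNAME
        END { for (ref in count) printf("%s\t%d\n", ref, count[ref]) }
    ' | sort
}

# Example call (file name is a placeholder):
# samtools view -h alignments.bam | count_reads_per_reference 30
```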

Unknown sequences clustering

Non-matching sequences are extracted as fastq, then clustered with vsearch (default --id 0.7). Clusters with an abundance under 5 are discarded to avoid needlessly heavy computation. Outputs are written to ./Results/Unknown_clusters.
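
The abundance cut-off can be illustrated on a small cluster table. The sketch below assumes a tab-separated table with a header line, the cluster ID in the first column and per-sample counts in the remaining columns; the function name and this layout are assumptions, and the workflow applies its own filtering internally.

```shell
# Keep clusters whose summed abundance across samples reaches min_total.
# Assumes a tab-separated table: first column is the cluster ID, remaining
# columns are per-sample counts, and the first line is a header.
filter_clusters_by_abundance() {
    awk -v min_total="${1:-5}" -F '\t' '
        NR == 1 { print; next }                 # keep the header line
        {
            total = 0
            for (i = 2; i <= NF; i++) total += $i
            if (total >= min_total) print
        }'
}

# Toy table (made-up data): OTU_2 has a total of 3 and is dropped.
printf '#OTU ID\tbarcode01\tbarcode02\nOTU_1\t3\t4\nOTU_2\t1\t2\nOTU_3\t5\t0\n' | filter_clusters_by_abundance 5
```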

Phylogenetic tree generation

Reference ASV sequences are extracted from the reference fasta file according to the detected entities, and the seed sequences of unknown OTUs are added. The resulting file is fed to FastTree (default --fastest) to produce a tree file, which is then incorporated into the final phyloseq object. This provides a phylogeny for unknown OTUs and a 16S-based phylogenetic estimate of their taxonomy. This step can be skipped with the --notree option.

Phylosequization

Alignment results, taxonomy, clustered unknown entities and the 16S-based phylogenetic tree are used to produce a phyloseq object: NanoASV.rdata. Please refer to the metadata.csv file in Minimal dataset to make sure your input file has the format phyloseq needs to produce a correct object. Eukaryota, Chloroplast and Mitochondria sequences are pruned by default; you can keep them with the --no-r-cleaning flag. A CSV file encompassing taxonomy and abundance is also produced and stored in ./Results/CSV.

Output structure

OUTPUT/
└── Results
    ├── ASV
    │   ├── barcode01_abundance.tsv
    │   └── barcode02_abundance.tsv
    ├── CSV
    │   └── Taxonomy-Abundance_table.csv
    ├── Exact_affiliations
    │   ├── barcode01_Exact_affiliations.tsv
    │   └── barcode02_Exact_affiliations.tsv
    ├── Phylogeny
    │   └── ASV.tree
    ├── Rdata
    │   └── NanoASV.rdata
    ├── Tax
    │   ├── Taxonomy_barcode01.csv
    │   └── Taxonomy_barcode02.csv
    └── Unknown_clusters
        ├── Consensus_seq_OTU.fasta
        └── unknown_clusters.tsv

The most useful files for ecological analyses in NanoASV's output are Results/Rdata/NanoASV.rdata and Results/CSV/Taxonomy-Abundance_table.csv.

The Rdata-formatted phyloseq object encompasses abundance and taxonomy tables, the full-length 16S-based phylogeny and sample metadata. It is fully compatible with the phyloseq R package.
The Taxonomy-Abundance_table.csv file contains the data needed for analysis with spreadsheet software (Calc, Excel, etc.).

Unknown_clusters/ contains one fasta file with the cluster consensus sequences; these are the sequences used in the 16S phylogeny. unknown_clusters.tsv is the abundance table produced by vsearch and used in the phyloseq object generation.

Exact affiliations files contain individual read assignments with the associated SAM flag and mapping quality (MapQ). Header: Read ID - SAM flag - Assignment ID - MapQ value. This file is useful if you suspect strange assignments in your dataset: it gives an idea of the confidence behind each assignment, both in terms of alignment SAM flag and MapQ. Please note that MapQ is not an average nucleotide identity (ANI).

Results/Phylogeny/ASV.tree is the full-length 16S tree generated by FastTree in Newick format.

Results/ASV/ and Results/Tax/ hold intermediate per-barcode abundance and taxonomy tables. These files are not meant to be used directly but are the basis from which phyloseq produces the Rdata-formatted object. They are useful only if you have to rebuild the phyloseq object yourself. In any case, they do not contain data that is absent from the phyloseq object or the CSV file.
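
As one way to inspect such a file, the sketch below lists reads whose MapQ falls under a chosen threshold. It assumes tab-separated columns in the order given above (read ID, SAM flag, assignment ID, MapQ) and no header line; both are assumptions, as are the function name and the threshold used in the example.

```shell
# List reads whose MapQ is below a threshold, from a 4-column table:
# read ID, SAM flag, assignment ID, MapQ.
# Assumes tab-separated columns and no header line.
low_confidence_reads() {
    awk -v minq="$1" -F '\t' '($4 + 0) < (minq + 0) { print $1 "\t" $3 "\t" $4 }'
}

# Toy rows (made-up data); only read_b has a MapQ below 50.
printf 'read_a\t0\trefA\t60\nread_b\t16\trefB\t42\n' | low_confidence_reads 50
```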

Benchmark

Nygaard dataset

We ran the Nygaard et al. (2020) dataset through three different software solutions: SituSeq (Zorz et al. 2023), the Nygaard manual pipeline (Nygaard et al. 2020) and NanoASV. It should be noted that NanoASV is the only one that outputs a phyloseq (McMurdie and Holmes 2013) object; the two others require manual file manipulation to achieve the same result. NanoASV can natively output a 16S phylogeny thanks to MAFFT (Katoh and Standley 2013) and FastTree (Price, Dehal, and Arkin 2009). The same reference dataset was used in each pipeline: the DADA2-maintained training set silva_nr99_v138.1_train_set.fa. We made this choice as it was the easiest reference to use with SituSeq. Random UUIDs were assigned to each reference sequence to make it work with NanoASV.

Alpha Diversity

The alpha diversity figure above shows that the NanoASV and Nygaard pipeline outputs follow similar trends in terms of numerical richness and Shannon index. Despite lower values, the same trends are observed with SituSeq. This is shown in the following correlation matrix.

Taxonomic profile

The taxonomic profile above shows that the genus-level profiles of NanoASV and the Nygaard pipeline look very similar. SituSeq had difficulties recovering a precise taxonomic profile, with unassigned sequences representing 59 to 97% of a sample's total taxonomic assignments (any taxonomic level considered), while the NanoASV and Nygaard pipelines both assigned 100% of sequences against the silva_nr99_v138.1_train_set.fa reference dataset. A Mantel test (vegan::mantel() with 999 permutations) was performed to compare the Bray-Curtis dissimilarity matrices of the different pipelines. NanoASV and the Nygaard pipeline showed high similarity (Mantel statistic r: 0.8735, significance: 0.001). NanoASV and SituSeq showed a lower similarity (Mantel statistic r: 0.5459, significance: 0.005). Nygaard and SituSeq showed the lowest similarity (Mantel statistic r: 0.3464, significance: 0.021).

Pipeline Specs comparisons

Specs on the personal computer were obtained with /usr/bin/time -v (Maximum resident set size). On the Slurm cluster, the maximum resident set size was obtained through the Slurm command sacct. Memory consumption of multi-threaded jobs can be hard to track, so peak memory values are probably underestimated. The specs table shows that NanoASV is faster and more memory efficient than the Nygaard pipeline and SituSeq.
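
For readers who want to collect the same kind of measurement, the sketch below extracts the peak resident set size from a report written by GNU time with -v. The function name and the log file name are hypothetical, and the exact commands used for the benchmark are not reproduced here.

```shell
# Extract the peak resident set size (in kilobytes) from the report that
# GNU time writes with -v, e.g. after: /usr/bin/time -v <command> 2> time.log
peak_rss_kb() {
    awk -F ': ' '/Maximum resident set size/ { print $2 }' "$1"
}

# Example call (log file name is a placeholder):
# peak_rss_kb time.log
```

On a Slurm cluster, a comparable figure is reported in the MaxRSS field of sacct, for example with sacct -j <jobid> --format=JobID,MaxRSS,Elapsed.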

Benchmark conclusion

SituSeq's output proved very different from that of the Nygaard pipeline and NanoASV. SituSeq did not recover an extensive taxonomic profile, but global trends were still similar. NanoASV shows trends very similar to the Nygaard pipeline in terms of alpha diversity and taxonomic profile. NanoASV appeared around 6 times faster and more memory efficient than the Nygaard manual pipeline.

Benchmarking Bibliography

Katoh, K., and D. M. Standley. 2013. “MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability.” Molecular Biology and Evolution 30 (4): 772–80. https://doi.org/10.1093/molbev/mst010.
McMurdie, Paul J., and Susan Holmes. 2013. “Phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data.” Edited by Michael Watson. PLoS ONE 8 (4): e61217. https://doi.org/10.1371/journal.pone.0061217.
Nygaard, Anders B., Hege S. Tunsjø, Roger Meisal, and Colin Charnock. 2020. “A Preliminary Study on the Potential of Nanopore MinION and Illumina MiSeq 16S rRNA Gene Sequencing to Characterize Building-Dust Microbiomes.” Scientific Reports 10 (1): 3209. https://doi.org/10.1038/s41598-020-59771-0.
Price, M. N., P. S. Dehal, and A. P. Arkin. 2009. “FastTree: Computing Large Minimum Evolution Trees with Profiles Instead of a Distance Matrix.” Molecular Biology and Evolution 26 (7): 1641–50. https://doi.org/10.1093/molbev/msp077.
Zorz, Jackie, Carmen Li, Anirban Chakraborty, Daniel A Gittins, Taylor Surcon, Natasha Morrison, Robbie Bennett, Adam MacDonald, and Casey R J Hubert. 2023. “SituSeq : An Offline Protocol for Rapid and Remote Nanopore 16S rRNA Amplicon Sequence Analysis.” ISME Communications 3 (1): 33. https://doi.org/10.1038/s43705-023-00239-3.

Acknowledgments

We are grateful to the genotoul bioinformatics platform Toulouse Occitanie (Bioinfo Genotoul, https://doi.org/10.15454/1.5572369328961167E12) for providing help, computing and storage resources. We deeply thank Enrique Ortega for his valuable contribution to the very first steps of the project. We deeply thank Antoine Cousson, Fiona Elmaleh and Meren for their role in software beta testing.

Citation

Please don't forget to cite NanoASV and its dependencies if it helped you process your Nanopore data. Thank you!

Dependencies citations:

Danecek, Petr, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, et al. 2021. “Twelve Years of SAMtools and BCFtools.” GigaScience 10 (2): giab008. https://doi.org/10.1093/gigascience/giab008.

De Coster, Wouter, and Rosa Rademakers. 2023. “NanoPack2: Population-Scale Evaluation of Long-Read Sequencing Data.” Edited by Can Alkan. Bioinformatics 39 (5): btad311. https://doi.org/10.1093/bioinformatics/btad311.

Li, Heng. 2018. “Minimap2: Pairwise Alignment for Nucleotide Sequences.” Edited by Inanc Birol. Bioinformatics 34 (18): 3094–3100. https://doi.org/10.1093/bioinformatics/bty191.

Katoh, K., and D. M. Standley. 2013. “MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability.” Molecular Biology and Evolution 30 (4): 772–80. https://doi.org/10.1093/molbev/mst010.

McMurdie, Paul J., and Susan Holmes. 2013. “Phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data.” Edited by Michael Watson. PLoS ONE 8 (4): e61217. https://doi.org/10.1371/journal.pone.0061217.

Nygaard, Anders B., Hege S. Tunsjø, Roger Meisal, and Colin Charnock. 2020. “A Preliminary Study on the Potential of Nanopore MinION and Illumina MiSeq 16S rRNA Gene Sequencing to Characterize Building-Dust Microbiomes.” Scientific Reports 10 (1): 3209. https://doi.org/10.1038/s41598-020-59771-0.

Price, M. N., P. S. Dehal, and A. P. Arkin. 2009. “FastTree: Computing Large Minimum Evolution Trees with Profiles Instead of a Distance Matrix.” Molecular Biology and Evolution 26 (7): 1641–50. https://doi.org/10.1093/molbev/msp077.

Quast, Christian, Elmar Pruesse, Pelin Yilmaz, Jan Gerken, Timmy Schweer, Pablo Yarza, Jörg Peplies, and Frank Oliver Glöckner. 2012. “The SILVA Ribosomal RNA Gene Database Project: Improved Data Processing and Web-Based Tools.” Nucleic Acids Research 41 (D1): D590–96. https://doi.org/10.1093/nar/gks1219.

Rognes, Torbjørn, Tomáš Flouri, Ben Nichols, Christopher Quince, and Frédéric Mahé. 2016. “VSEARCH: A Versatile Open Source Tool for Metagenomics.” PeerJ 4 (October): e2584. https://doi.org/10.7717/peerj.2584.

Funding

The PhD grant that allowed this production is a Contrat Doctoral Spécifique Normalien (CDSN) granted by the École Normale Supérieure de Paris and the French Ministère de l’Enseignement Supérieur et de la Recherche. The present work was also funded by the French National Institute of Research for Development (IRD). This project was financially supported by ANR under the framework of the U2 Worm project (ANR-20-CE01-0015-01), and under the Investissements d’Avenir programme with the reference ANR-10-LABX-001-01 Labex Agro and coordinated by Agropolis Fondation, under the framework of the Innov’Earth Project (convention 2101-003) and the MetaCast project (convention 2202-214).
