Skip to content

marisalim/Everest-eDNA-biodiversity-survey

Repository files navigation

Everest eDNA Biodiversity Survey Analysis

DOI

This repository holds the scripts we used to analyze Everest WGS (for bacterial and eukaryote hits) and metabarcoding data.

Contents

  1. Required software
  2. Scripts
  3. WGS eukaryote pipeline details
  4. Metabarcoding blast script details
  5. Run WGS pipeline
  6. Run metabarcoding blast script
  7. Compare datasets

Required software

bwa [Version: 0.7.15-r1140]
samtools [Version: 1.9 (using htslib 1.9)]
spades [SPAdes genome assembler v3.10.1]
BBMerge [v38.22] (part of BBtools suite)
blastn [Nucleotide-Nucleotide BLAST 2.7.1+]
singleM [v0.13.2]

Scripts

These files were run on an HPC to map WGS reads to reference mitochondrial and chloroplast genomes. All python scripts written for Python3.

Disclaimer: these are the scripts we used; they were developed on a MacOS and are not meant to be an out-of-the-box pipeline. If you would like to use them and have any questions, please post an issue.

WGS-bacteria:

  • batch scripts for running default singleM ribosomal panel: singlem_batch.sh
  • batch scripts for running singleM with Greengenes database: singlem_16s_R1s_batch.sh and singlem_16s_R2s_batch.sh

WGS-eukaryotes:

  • BWA mapping commands are in bwa_batcher.sh
  • Blast parsing commands are in mapblast_parse.py
  • primary batch script is run_batcher.sh
  • add taxonomic info, plot results: wgs_blast_dataviz.r
  • an example snakemake workflow for this pipeline

metabarcoding:

  • Blast parsing commands are in metabar_blast_parse.py
  • add taxonomic info, plot results: metabar_blast_dataviz.r

Comparing results from metabarcoding vs. WGS:

  • compare datasets: compare_eDNAdat_splists.r
  • barplots and heatmaps: heatmap_plts.r

Asymptotic regression model script

WGS eukaryote pipeline details

This pipeline was designed to specifically pull out Eukaryotic taxa from the WGS sequencing data. Due to the incomplete status of Eukaryotic taxonomic representation in reference sequence databases and from the Everest region, this is an imperfect approach. However, it works reasonably well for taxon discovery and exploration. It would require further tuning to improve completeness in taxon identification. Here are some details about what the various pipeline scripts are doing:

1. Map to reference

  • map to mitochondrial reference or whatever is available (download from NCBI). The goal of this approach is to find Eukaryotic taxa. Since they represent a minority of the WGS data, the goal of this mapping step is to narrow the data down to reads more likely to match Eukaryote references of choice and filter out non-target reads (e.g., bacteria and other microorganisms).

    • note: this entire approach works better with small genomes like mitochondrial or chloroplast references. Quite slow on whole genomes.
    • e.g., mitochondrial refs take about 2 days or a little more to run all 22 samples
    • e.g., Juniperus chloroplast genome was better for pulling out plant reads, also had bacteria, so filtered blast results to only keep chloroplast gene hits
  • map paired-end reads with bwa mem

  • bwa_batcher.sh has the bwa commands

  • optional step: (not run in pipeline code, but you can run this separately to view mappings in terminal)

    convert .sam to .bam with samtools faidx and samtools import sort BAM file with samtools sort and samtools index view bam file in tview with samtools tview

2. Subset SAM files for 'good' hits

  • awk '$4 != "0"' [sample].sam > [sample]_hits.txt to check if there are any reads that mapped, plus associated mapping score info. if the file is blank, then no mapping. Thus, I'm filtering on this field, and removing the rows that have 0's.

The $4 field is the 1-based leftmost mapping position of SAM files. Note the following:

"The first base in a reference sequence has coordinate 1. POS is set as 0 for an unmapped read without coordinate. If POS is 0, no assumptions can be made about RNAME and CIGAR."

More about SAM files:

3. Convert filtered SAM file to interleaved fasta file

use: samtools fastq

4a. Assemble interleaved reads with spades

4b. Option: merge reads if no spades contigs

  • use BBmerge to produce merged and unmerged reads
  • for our 150bp paired-end WGS dataset, means you can get a max. 300bp merged read, but likely shorter, but better than the 150bp unpaired read

5. Nucleotide blast search for spades contigs and/or merged & unmerged reads

  • Edit per your study goals: note different blast thresholds for percent sequence similarity (-qcov_hsp_perc) and percent query coverage (-perc_identity), output has top 10 hits per query (-max_target_seqs) on lines 40 and 50.

6. Parse blast output per sample and reference, summarize

  • parse to keep best hit per query (sorted by bit score - mostly works, but in some cases, e.g., a tie, you need to manually check)

Metabarcoding blast script details

The metabarcoding data was primarily analyzed in Geneious to generate contigs. After blast search to NCBI nt database, the blast parsing step was conducted with metabar_blast_parse.py. Same parsing logic as for WGS data.

Running the WGS pipeline

  1. Set up blast reference database. For our Everest paper, I set up a local installation of the nt database (unstable connection errors on HPC made remote run option erratic) with this script. Blast db set up takes a long time (many hours) because nt db is very large; make sure you have enough space to save the database files. Do the following:
mkdir NCBI_blast_nt
cd NCBI_blast_nt
# download of nt.gz takes about 40 minutes (compressed file is 65Gb)
# makeblastdb part took about an hr (db files (nhr, nin, nog, nsd, nsi, nsq; index nal) total = 73 Gb)
  
sbatch nt_wrap.sh
  1. Edit input paths in the run_batcher.sh script. These 4 variables must be edited with correct path/name:
# path to reference genome fasta file
REF='EverestMetaGenomics/BWA_MAPPING/mito_refs/Chironomus_tepperi_mitogenome.fasta'
# abbreviated name for reference (used for output file naming)
REFNAME='chiro'
# path to sequence read fastq files
FQPATH='EverestMetaGenomics/METAGEN'
# path to NCBI blast database
NCBI='EverestMetaGenomics/NCBI_blast_nt'
  1. Run script. The blast parsing step, in particular, takes a while, so I recommend running this script on an HPC.
  • Note on timing: the entire pipeline can take 20-48+ hours to run with mitochondrial or chloroplast genome refs; takes upwards of a week or more for full genomes - the rate limiting step is my blast parser. This is not particularly efficient for whole genome refs (too many hits, so parsing is slow) but works quickly for small genomes like mito or chloroplast!

sbatch run_batcher.sh

Run metabarcoding blast script

Ran metabar_blast_parse.py as a bash loop, where --blastfile is the blast output (format 6) and the --queryseqfile is the contig or read input sequence file for blast search (used to add the sequence to parsed output file), e.g.,:

for i in {13..33} 35; do python metabar_blast_parse.py --blastfile ts${i}_barcontig_blastout --queryseqfile ts${i}_metabar_contigs.fasta; done

Compare datasets

At this point in the analysis, I did post-processing of the blast results in R to:

About

Pipeline scripts for analyzing Everest WGS and metabarcoding data

Resources

License

Stars

Watchers

Forks

Packages

No packages published