This pipeline is a free implementation of the PTR extraction pipeline presented by Korem et al. (Science, 2015), with many modifications and extra functionality to facilitate fast and easy analysis of relative cell periods of growing bacteria from metagenomics data.

Menace

This bundle of software is a basic implementation of the algorithm for extracting Peak-to-Trough Ratios (PTRs) from metagenomic data, as first described in Korem et al., Science, 2015.

Installation:

Pip

Make sure that "pip" is the pip command of your Python 2 installation, then:

pip install menace

Git

git clone git@github.com:zertan/Menace.git
cd Menace
python setup.py install

This should install the Python dependencies listed below. The other dependencies have to be installed manually (if you have questions about this, consult your cluster IT help desk).

The software has been tested on the "hebbe" cluster at C3SE, which uses the "slurm" system for resource management (slurm is therefore the only queueing system currently supported).

Dependencies:

Python2:
numpy
scipy
pandas
biopython
matplotlib
xmltodict
configparser
lmfit
newick
Jinja2
doric
-e git+https://github.com/PathoScope/PathoScope.git#egg=pathoscope

samtools

bamtools

bowtie2

Pathoscope 2.0 (should be installed by the above pip command, but make sure 'pathoscope ID' is accessible in the shell, i.e. is on the system path)

parallel

DoriC is a database of chromosome origin locations (OriCs) and is an optional (recommended) dependency for the pipeline. Please visit the DoriC website and enter your e-mail to download it.
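Since the external tools have to be installed manually, it can help to confirm that they are on the system path before starting a run. The following is a minimal sketch, not part of Menace. It assumes the executables are named as in the dependency list above, and it uses `shutil.which`, which requires Python 3, so run it with a separate Python 3 interpreter if your Menace environment is Python 2 only.

```python
import shutil

# Executable names are assumptions taken from the dependency list above;
# adjust them if your cluster installs the tools under different names.
REQUIRED_TOOLS = ["samtools", "bamtools", "bowtie2", "parallel", "pathoscope"]


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("All external tools were found on PATH.")
```

The `which` argument exists only so that the lookup can be replaced in tests; in normal use the defaults are enough.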

Usage

You can get an overview of the menace functionality by running menace -h.

  1. Initialize a project in the current directory by running menace init. Identify a set of NCBI genome reference accession numbers and put them in "./searchStrings" (or use the default one, which includes a minimal set of references to bacteria common in the human gut).

  2. Identify a metagenomic cohort of interest (download manually or add URLs as described below) and add to the Data folder. Supported input: raw/gzipped/bzipped ".fastq" files.

  3. Add information to the project.conf file.

  4. Edit loadmodules.sh to include the python2 module of the cluster (or comment out the lines if python2 is accessible by default).

  5. Run menace full (use "nohup {cmd} &" to keep alive after logout if on a cluster login node).

  6. Wait for the job to complete. Run menace collect in the project directory.
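Before running step 5 it may be worth checking that the Data folder only contains input the pipeline supports. The sketch below is not part of Menace. It assumes one sub-directory per sample and that raw, gzipped and bzipped reads use the suffixes `.fastq`, `.fastq.gz` and `.fastq.bz2`; both the layout and the exact suffix spellings are assumptions.

```python
from pathlib import Path

# Assumed spellings of the raw/gzipped/bzipped ".fastq" inputs.
SUPPORTED_SUFFIXES = (".fastq", ".fastq.gz", ".fastq.bz2")


def is_supported_read_file(name):
    """True if the file name ends with one of the assumed read suffixes."""
    return name.lower().endswith(SUPPORTED_SUFFIXES)


def list_samples(data_path):
    """Map each sample directory name to its sorted supported read files."""
    samples = {}
    for sample_dir in sorted(Path(data_path).iterdir()):
        if not sample_dir.is_dir():
            continue
        samples[sample_dir.name] = sorted(
            p.name
            for p in sample_dir.iterdir()
            if p.is_file() and is_supported_read_file(p.name)
        )
    return samples
```

A sample that maps to an empty list has no usable reads and is worth inspecting before submitting the job.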

Notes

The menace script is a common utility for all parts of the pipeline, including downloading references and metagenomic data, building a reference index, setting up the necessary file structure and submitting to slurm. Hence, all configuration is intended to be set up in project.conf (please see bin/project.conf.example for an example).

The default 'searchStrings' is only an example and will most probably not fit your purposes. A more comprehensive reference library will yield higher coverage and more accurate values. A more comprehensive list of human gut bacteria is available at 'extra/referenceACClong.txt'.

Directory structure (example)

With the usage example above, the directory structure will look something like this:

$DATA_PATH
  ├ "Sample01"                         (e.g. ERR525688)
  .  ├ {sample01_1.fastq.gz}
  .  └ {sample01_2.fastq.gz}           paired metagenomic reads
  .

$REF_PATH
  ├ Index
  |  └ {REF_NAME.*.bt2l}               bowtie2 index files
  ├ Fasta
  |  └ {accession.fasta}
  ├ Headers
  |  └ {accession.xml}                 xml files containing extra genome reference info
  └ taxIDs.txt

$DORIC_PATH
  ├ bacteria_record.dat
  └ bacteria_seq.fas

$OUTPUT_PATH
  ├ "Sample01"
  .  ├ depth
  .  |  └ {accession.depth}            coverage files for each reference
  .  ├ log
  .  |  └ {accession.log}              output logs from piecewiseFit
  .  ├ npy
  .  |  └ {accession_OriC_TerC.npy}    numpy files with origin/terminus locations and relative C periods
  .  ├ png
  .  |  └ {accession_fit.png}          images of piecewise fit of the smoothed coverage
  .  └ accession-sam-report.tsv        Pathoscope2 reassignment report
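To get a quick overview of which references produced coverage files in which samples, the output tree can be walked with a few lines of Python. This is a sketch that assumes the layout shown above (`<sample>/depth/<accession>.depth` under the output path); the function name is hypothetical and not part of Menace.

```python
from pathlib import Path


def collect_depth_files(output_path):
    """Return {sample: [accession, ...]} for every <sample>/depth/<accession>.depth file."""
    result = {}
    for depth_file in sorted(Path(output_path).glob("*/depth/*.depth")):
        sample = depth_file.parent.parent.name  # directory two levels above the file
        result.setdefault(sample, []).append(depth_file.stem)  # stem drops ".depth"
    return result
```

Samples that are missing from the returned dictionary produced no depth files at all, which usually means the run for that sample did not finish or no reads mapped.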

Contents

Below follows a description of the main scripts in the package.

jobscript

A submit script for sending a batch job to slurm for parallel processing on a computing cluster.

input: none

output: directory structure as specified in "project.conf"

mainBuild.sh

The main build script with commands intended to be executed on the cluster.

input: none

output: temporary paths and files on compute nodes

PTRMatrix.py

Traverses the specified directory generated by mainBuild.sh and assembles information from each sample into tabular form (e.g. averages origin locations from many samples for a better estimate).

input: $OUTPUT_PATH, $DORIC_PATH, $REF_PATH, bin/accLoc.csv

output: Abundance.csv, PTR.csv, DoublingTime.csv, Header.csv
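Averaging origin locations needs some care on a circular chromosome, because positions close to the two ends of the reference sequence are in fact neighbours. The sketch below shows one common way to handle this, a circular mean computed with numpy. It is an illustration only: the README does not state which averaging method PTRMatrix.py uses, and the function name is hypothetical.

```python
import numpy as np


def circular_mean_position(positions, genome_length):
    """Average positions on a circular genome of the given length (in bp)."""
    # Map each position to an angle on the unit circle.
    angles = 2.0 * np.pi * np.asarray(positions, dtype=float) / genome_length
    # Average the unit vectors and convert the mean direction back to an angle.
    mean_angle = np.arctan2(np.sin(angles).mean(), np.cos(angles).mean())
    # Wrap into [0, 2*pi) and scale back to base pairs.
    return (mean_angle % (2.0 * np.pi)) * genome_length / (2.0 * np.pi)
```

With this approach, origin estimates of 990 and 10 on a 1000 bp circle average to a position at the wrap-around point rather than to 500, which a plain arithmetic mean would give.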

piecewiseFit.py

Implements the piecewise linear fit and prior checks on the generated depth files, keeping only those instances in which enough data was generated to produce a reliable coverage signal for estimating replication origins. Once the origins have been estimated using the full cohort, this data can be used to produce PTR values for each sample.

input: {reference.depth}

output: {reference_OriC.npy}, {reference_TerC.npy}, {reference_coverage.png}, {reference_fit.log}
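The idea behind a piecewise linear fit of coverage can be illustrated with a simplified model. The sketch below is not the code in piecewiseFit.py: it assumes that log2 coverage falls linearly with circular distance from the origin, that the terminus lies exactly opposite the origin, and that positions are scaled to [0, 1). All function names are hypothetical, and a plain grid search over candidate origins replaces whatever optimiser the package uses.

```python
import numpy as np


def circular_distance(x, origin):
    """Distance from origin on a circle of circumference 1, scaled to [0, 1]."""
    d = np.abs(x - origin)
    return 2.0 * np.minimum(d, 1.0 - d)


def fit_symmetric_piecewise(x, log2_cov, n_candidates=200):
    """Grid-search the origin; return (origin, terminus, ptr)."""
    best = None
    for origin in np.linspace(0.0, 1.0, n_candidates, endpoint=False):
        d = circular_distance(x, origin)
        # Least-squares line through (distance, log2 coverage).
        slope, intercept = np.polyfit(d, log2_cov, 1)
        resid = np.sum((log2_cov - (slope * d + intercept)) ** 2)
        # Coverage must decrease away from the origin, so require slope < 0.
        if slope < 0 and (best is None or resid < best[0]):
            best = (resid, origin, slope)
    if best is None:
        raise ValueError("no candidate origin gave a decreasing coverage trend")
    _, origin, slope = best
    terminus = (origin + 0.5) % 1.0
    ptr = 2.0 ** (-slope)  # peak/trough ratio on the linear scale
    return origin, terminus, ptr


# Synthetic example: origin at 0.3, a two-fold peak-to-trough ratio, mild noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 500, endpoint=False)
y = 5.0 - 1.0 * circular_distance(x, 0.3) + rng.normal(0.0, 0.02, x.size)
print(fit_symmetric_piecewise(x, y))
```

Because the fitted line spans distance 0 (origin) to 1 (terminus), the negative slope equals the log2 difference between peak and trough coverage, which is why the PTR is recovered as 2 raised to minus the slope.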

fetchSeq.py

This utility can be used to download '.fasta' reference files from the NCBI servers.

input: searchStrings.txt

output: {reference.fasta}, {reference.xml}, taxIDs.txt
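For readers who want to see what such a download involves, the sketch below builds requests against the public NCBI E-utilities `efetch` endpoint using only the Python 3 standard library. It is not the code in fetchSeq.py. The helper names are hypothetical, and the input format (one accession per line, with blank lines and lines starting with `#` ignored) is an assumption about the search-string file.

```python
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def read_accessions(lines):
    """Strip whitespace; skip blank lines and '#' comments (assumed format)."""
    stripped = (line.strip() for line in lines)
    return [line for line in stripped if line and not line.startswith("#")]


def efetch_fasta_url(accession):
    """Build the efetch URL that returns one nucleotide record as FASTA text."""
    query = urlencode(
        {"db": "nuccore", "id": accession, "rettype": "fasta", "retmode": "text"}
    )
    return EFETCH_URL + "?" + query


def download_fasta(accession, out_dir):
    """Download one record to <out_dir>/<accession>.fasta and return the path."""
    target = Path(out_dir) / (accession + ".fasta")
    with urlopen(efetch_fasta_url(accession)) as response:
        target.write_bytes(response.read())
    return target
```

When downloading many references, keep the request rate low and consult the NCBI usage guidelines, since the service limits the number of requests per second.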
