DeCatCounter: Python pipeline for processing concatenated PacBio sequencing data from in vitro selection experiments

ichen-lab-ucsb/DeCatCounter

DeCatCounter

This is the README document for DeCatCounter, a Python pipeline for processing concatenated PacBio reads from in vitro selection experiments. The pipeline demultiplexes the amplicons, deconcatenates them into individual sequences, and counts the reads for each sequence. DeCatCounter can process nucleotide or amino acid sequencing data.

Usage

The pipeline is written in Python and runs with any version of Python 3. It can be launched from a command-line interface (the Terminal) or from another Python environment (e.g. PythonWin). To run the script from the Terminal, type:

python DeCatCounter.py input_file barcodes.txt bc_tol_f bc_tol_r constant.txt ct_tol_f ct_tol_r translation(y/n) low_len hi_len

sequences

  • input_file: name of input file (must include the full path to the directory where it's located).
  • barcodes.txt: text file with 3 columns: 1) sample name, 2) corresponding forward barcode, 3) reverse barcode.
  • bc_tol_f: error tolerance for the forward barcode search. In forward reads it applies to the forward barcode; in reverse reads it applies to the reverse complement of the reverse barcode.
  • bc_tol_r: error tolerance for reverse barcode search.
  • constant.txt: text file with 2 lines: 1) forward constant region, 2) reverse constant region.
  • ct_tol_f: error tolerance for forward constant region search.
  • ct_tol_r: error tolerance for reverse constant region search.
  • translation(y/n): whether to translate the sequences to amino acids; the value must be either y or n.
  • low_len: minimum length for final DNA variants.
  • hi_len: maximum length for final DNA variants.
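The tolerance arguments above can be illustrated with a small sketch. This README does not state how errors are measured; because python-Levenshtein is listed as a dependency, the sketch assumes the tolerance is a Levenshtein (edit) distance, and it compares fixed-width windows only, which is a simplification. The helper names (edit_distance, reverse_complement, find_with_tolerance) are hypothetical and are not taken from DeCatCounter.py.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (A<->T, C<->G)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def find_with_tolerance(read: str, pattern: str, tol: int) -> int:
    """Return the first index where a window of len(pattern) lies within
    `tol` edits of `pattern`, or -1 if no window qualifies.

    Assumption: only windows of exactly len(pattern) are compared.
    """
    width = len(pattern)
    for start in range(len(read) - width + 1):
        if edit_distance(read[start:start + width], pattern) <= tol:
            return start
    return -1


read = "TTACGTAGGGG"
barcode = "ACGTAC"
print(find_with_tolerance(read, barcode, 0))  # no exact match
print(find_with_tolerance(read, barcode, 1))  # one mismatch allowed
print(reverse_complement("AACG"))
```

With a tolerance of 0 only exact matches are accepted, as in the first test dataset command; larger values admit more sequencing errors at the cost of more ambiguous assignments.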

Environment setup

We recommend using Anaconda to create a virtual environment. Although this is not necessary, a virtual environment prevents version conflicts and unwanted upgrades or downgrades of existing packages.

Once the virtual environment is active, Python and the other required dependencies can be installed there. The following dependencies are needed: biopython, python-Levenshtein, tabulate and pandas. Dependencies can either be installed one by one, or using the requirements.txt file.

To create a virtual environment and install all dependencies:

# create environment
conda create -n myenv python=3

# activate environment
conda activate myenv

# install additional dependencies
conda install --file requirements.txt 

Alternatively (although not recommended), DeCatCounter can be run from a local environment, where Python must be installed. In this case, we recommend using the Anaconda distribution of Python, and adding the Bioconda channel to Anaconda's package manager, Conda. See the Anaconda documentation for installation.
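Before running the pipeline, it can be useful to confirm that the four dependencies are visible to the active interpreter. The following is a minimal sketch, not part of DeCatCounter; the helper find_missing is hypothetical. It relies on the usual import names of the packages (biopython is imported as Bio, python-Levenshtein as Levenshtein).

```python
import importlib.util

# Package name -> module name used at import time.
REQUIRED_MODULES = {
    "biopython": "Bio",
    "python-Levenshtein": "Levenshtein",
    "tabulate": "tabulate",
    "pandas": "pandas",
}


def find_missing(modules: dict) -> list:
    """Return the package names whose import module cannot be located."""
    return [pkg for pkg, mod in modules.items()
            if importlib.util.find_spec(mod) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing dependencies: " + ", ".join(missing))
else:
    print("All dependencies found.")
```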

Input

The script requires three input files: 1) sequencing reads, 2) barcodes text file and 3) constant regions text file. Sequencing reads are assumed to be either in FASTA or FASTQ format.

The barcodes file should be a text file with 3 columns: 1) sample name, 2) corresponding forward barcode, 3) reverse barcode.

The constant regions file should have 2 lines: 1) forward constant region, 2) reverse constant region.
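As an illustration of the two text formats described above, the sketch below parses them from strings. It assumes that the barcode columns are separated by whitespace (tabs or spaces), which this README does not specify. The functions parse_barcodes and parse_constants are hypothetical, and the sample names and sequences are made up; they are not the contents of the repository's test files.

```python
def parse_barcodes(text: str) -> dict:
    """Map sample name -> (forward barcode, reverse barcode)."""
    samples = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split()  # assumption: whitespace-separated columns
        if len(fields) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 columns, got {len(fields)}")
        name, forward, reverse = fields
        samples[name] = (forward, reverse)
    return samples


def parse_constants(text: str) -> tuple:
    """Return (forward constant region, reverse constant region)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) != 2:
        raise ValueError(f"expected 2 lines, got {len(lines)}")
    return lines[0], lines[1]


# Made-up file contents, for illustration only.
example_barcodes = "sampleA\tACGTAC\tTTGCAA\nsampleB\tGGATCC\tCCTAGG\n"
example_constants = "GGGAGACC\nTTCGGTAA\n"

print(parse_barcodes(example_barcodes))
print(parse_constants(example_constants))
```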

Output

The pipeline generates an output directory named output followed by the date and time of the run. Inside this directory there is a subdirectory for the count files and, if translation has been requested, a subdirectory for the amino acid count files.

The pipeline also creates a log file with the parameters used and a table listing the number of amplicons recovered after demultiplexing, and the number of sequences recovered after deconcatenation and after trimming and filtering by length.
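The deconcatenation, length filtering and counting steps reported in the log can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in DeCatCounter.py: constant regions are matched exactly (tolerance 0), the reverse constant region is assumed to be written as it appears in the read, and the length limits are treated as inclusive. The functions deconcatenate and count_variants and the toy sequences are hypothetical.

```python
import re
from collections import Counter


def deconcatenate(amplicon: str, fwd_const: str, rev_const: str) -> list:
    """Return every variant found between a forward constant region and
    the next reverse constant region (exact matching only)."""
    pattern = re.escape(fwd_const) + r"(.*?)" + re.escape(rev_const)
    return [m.group(1) for m in re.finditer(pattern, amplicon)]


def count_variants(amplicons, fwd_const, rev_const, low_len, hi_len) -> Counter:
    """Count variants whose length lies in [low_len, hi_len]."""
    counts = Counter()
    for amplicon in amplicons:
        for variant in deconcatenate(amplicon, fwd_const, rev_const):
            if low_len <= len(variant) <= hi_len:
                counts[variant] += 1
    return counts


# Toy data: each amplicon is a concatenation of two flanked variants.
FWD, REV = "GGGA", "TTCC"
amplicons = [
    "GGGA" + "ACGTAC" + "TTCC" + "GGGA" + "CATCAT" + "TTCC",
    "GGGA" + "ACGTAC" + "TTCC" + "GGGA" + "AC" + "TTCC",  # "AC" is too short
]

counts = count_variants(amplicons, FWD, REV, low_len=5, hi_len=50)
for variant, n in counts.most_common():
    print(variant, n)
```

In this toy run, four sequences are recovered after deconcatenation and three remain after the length filter, which mirrors the two stages listed in the log table.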

Test dataset #1

A mock test dataset (test_input.fasta) is provided in the test_data folder, together with barcodes and constant regions text files (barcodes.txt, constant.txt). To run the test dataset, place all files (DeCatCounter.py, test_input.fasta, barcodes.txt and constant.txt) in the same folder, and from within that folder type:

python DeCatCounter.py test_input.fasta barcodes.txt 0 0 constant.txt 0 0 y 5 50

If everything went well, the run completes without errors and a new output folder appears in your directory.
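For scripted runs, the same command can be assembled from Python. The sketch below only builds the argument list in the order given in the Usage section and performs two basic sanity checks; the helper build_command is hypothetical, and the checks are an assumption about what counts as sensible input, not behaviour documented for DeCatCounter.py.

```python
import sys


def build_command(input_file, barcodes, bc_tol_f, bc_tol_r,
                  constant, ct_tol_f, ct_tol_r, translate,
                  low_len, hi_len, script="DeCatCounter.py"):
    """Return the argument list for one DeCatCounter run."""
    if any(t < 0 for t in (bc_tol_f, bc_tol_r, ct_tol_f, ct_tol_r)):
        raise ValueError("error tolerances must be non-negative")
    if low_len > hi_len:
        raise ValueError("low_len must not exceed hi_len")
    return [
        sys.executable, script,
        input_file, barcodes, str(bc_tol_f), str(bc_tol_r),
        constant, str(ct_tol_f), str(ct_tol_r),
        "y" if translate else "n",
        str(low_len), str(hi_len),
    ]


# Same arguments as the test dataset #1 command.
cmd = build_command("test_input.fasta", "barcodes.txt", 0, 0,
                    "constant.txt", 0, 0, True, 5, 50)
print(" ".join(cmd[1:]))

# To launch the run from the folder holding the files:
#   import subprocess
#   subprocess.run(cmd, check=True)
```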

Test dataset #2

A more realistic test dataset (test_input_2.fasta) is provided in the test_data_2 folder, together with barcodes and constant regions text files (barcodes_2.txt, constant_2.txt). The test dataset corresponds to a subset of the raw dataset analyzed in the publication, and the barcodes and constant regions are those used for the analysis. To run the test dataset, place all files (DeCatCounter.py, test_input_2.fasta, barcodes_2.txt and constant_2.txt) in the same folder (note that test_input_2.fasta needs to be decompressed), and from within that folder type:

python DeCatCounter.py test_input_2.fasta barcodes_2.txt 2 2 constant_2.txt 4 2 y 707 825

If everything went well, the run completes without errors and a new output folder appears in your directory.

Reporting bugs

Please report any bugs to Celia Blanco (celiablanco@ucla.edu).

When reporting bugs, please include the full output printed in the terminal when running the pipeline.

Citation

Nisha Kanwar*, Celia Blanco*, Irene A. Chen and Burckhard Seelig. PacBio sequencing output increased through uniform and directional 5-fold concatenation. Submitted.
