This is the README for DeCatCounter, a Python pipeline for processing concatenated PacBio reads from in vitro selection experiments. The pipeline demultiplexes the amplicons, deconcatenates them into individual sequences and counts the reads for each sequence. DeCatCounter can be used to process nucleotide or amino acid sequencing data.
The pipeline is written in Python and runs with any version of Python 3. It can be run from the command line (the Terminal) or from any other Python environment (e.g. PythonWin). To run the script from the Terminal, type:
python DeCatCounter.py input_file barcodes.txt bc_tol_f bc_tol_r constant.txt ct_tol_f ct_tol_r translation(y/n) low_len hi_len
- input_file: name of input file (must include the full path to the directory where it's located).
- barcodes.txt: text file with 3 columns: 1) sample name, 2) corresponding forward barcode, 3) reverse barcode.
- bc_tol_f: error tolerance for forward barcode search in the forward reads (for the reverse reads, this is the error tolerance for the reverse complement of the reverse barcode).
- bc_tol_r: error tolerance for reverse barcode search.
- constant.txt: text file with 2 lines: 1) forward constant region, 2) reverse constant region.
- ct_tol_f: error tolerance for forward constant region search.
- ct_tol_r: error tolerance for reverse constant region search.
- translation(y/n): whether translation to amino acids should be performed, value should be either y or n.
- low_len: minimum length for final DNA variants.
- hi_len: maximum length for final DNA variants.
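The tolerance parameters above can be read as the maximum number of errors allowed when searching for a barcode or constant region in a read. The sketch below illustrates that idea, assuming the tolerance is a Levenshtein (edit) distance, which is plausible given the python-Levenshtein dependency but is not confirmed by this README. The helper names (`reverse_complement`, `edit_distance`, `find_with_tolerance`) and the example read are hypothetical and are not taken from DeCatCounter.py.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def edit_distance(a, b):
    """Levenshtein distance between two strings (pure-Python dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def find_with_tolerance(read, pattern, tolerance):
    """Return the first start index where a window of len(pattern) is within
    `tolerance` edits of `pattern`, or -1 if no window qualifies."""
    width = len(pattern)
    for start in range(len(read) - width + 1):
        if edit_distance(read[start:start + width], pattern) <= tolerance:
            return start
    return -1


# Hypothetical read and barcode, for illustration only.
read = "TTACGTAGGGG"
print(find_with_tolerance(read, "ACGTA", 0))  # 2: exact match
print(find_with_tolerance(read, "ACCTA", 0))  # -1: no exact match
print(find_with_tolerance(read, "ACCTA", 1))  # 2: one substitution allowed
print(reverse_complement("ACGTA"))            # TACGT
```

With a tolerance of 0 only exact matches are accepted, which is what the first test command in this README uses. For reverse reads, the same search would be run with the reverse complement of the reverse barcode, as described for bc_tol_f.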
We recommend using Anaconda to create a virtual environment. Although this is not necessary, using a virtual environment can prevent version conflicts and undesired upgrades/downgrades of existing packages.
Once the virtual environment is active, Python and the other required dependencies can be installed there. The following dependencies are needed: biopython, python-Levenshtein, tabulate and pandas. Dependencies can be installed either one by one or using the requirements.txt file.
To create a virtual environment and install all dependencies:
# create environment
conda create -n myenv python=3
# activate environment
source activate myenv
# install additional dependencies
conda install --file requirements.txt
Alternatively (although not recommended), DeCatCounter can be run from a local environment, where Python must be installed. In this case, we recommend using the Anaconda distribution of Python, and adding the Bioconda channel to Anaconda's package manager, Conda. See the Anaconda documentation for installation.
The script requires three input files: 1) sequencing reads, 2) barcodes text file and 3) constant regions text file. Sequencing reads are assumed to be either in FASTA or FASTQ format.
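As a quick check before running the pipeline, the format of a reads file can be guessed from its first record. This is a minimal sketch, assuming an uncompressed text file in which FASTA records start with `>` and FASTQ records start with `@`. The function name `detect_read_format` is hypothetical, and this is not necessarily how DeCatCounter itself decides.

```python
def detect_read_format(path):
    """Guess 'fasta' or 'fastq' from the first non-empty line of a text file."""
    with open(path, "r") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip leading blank lines
            if line.startswith(">"):
                return "fasta"
            if line.startswith("@"):
                return "fastq"
            break  # first non-empty line is neither header type
    raise ValueError(f"{path} does not look like FASTA or FASTQ")
```

The returned strings match the format names that Biopython's `SeqIO.parse` accepts, so the result could be passed straight to it.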
The barcodes file should be a text file with 3 columns: 1) sample name, 2) corresponding forward barcode, 3) reverse barcode. For example:
The constant regions file should have 2 lines: 1) forward constant region, 2) reverse constant region. For example:
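A minimal sketch of reading such a file in Python, assuming the columns are separated by whitespace (tabs or spaces) and that there is no header line. The function name `read_barcodes` is hypothetical, and the sample name and sequences in the comment are placeholders rather than values from the test data.

```python
def read_barcodes(path):
    """Parse a 3-column barcodes file into {sample: (forward, reverse)}.

    Hypothetical file content (placeholder name and sequences):
        sample_A  ACGTACGT  TGCATGCA
    """
    barcodes = {}
    with open(path) as handle:
        for line_number, line in enumerate(handle, start=1):
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            if len(fields) != 3:
                raise ValueError(
                    f"line {line_number}: expected 3 columns, got {len(fields)}"
                )
            sample, forward, reverse = fields
            barcodes[sample] = (forward.upper(), reverse.upper())
    return barcodes
```

Checking the column count up front gives a clearer error than a failure later in the run if a line is malformed.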
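A minimal sketch of reading the two constant regions in Python, assuming one sequence per line and ignoring blank lines. The function name `read_constant_regions` is hypothetical and is not taken from DeCatCounter.py.

```python
def read_constant_regions(path):
    """Return (forward_constant, reverse_constant) from a 2-line text file."""
    with open(path) as handle:
        lines = [line.strip() for line in handle if line.strip()]
    if len(lines) != 2:
        raise ValueError(f"expected 2 non-empty lines, got {len(lines)}")
    forward, reverse = lines
    return forward.upper(), reverse.upper()
```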
The pipeline will generate an output directory, called output+date&time. Inside this directory, there will be a subdirectory for the count files and, if translation has been requested, a subdirectory for the amino acid count files.
The pipeline also creates a log file with the parameters used and a table listing the number of amplicons recovered after demultiplexing, and the number of sequences recovered after deconcatenation and after trimming and filtering by length.
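The last two steps reported in that table, filtering by length and counting, can be sketched as follows. This is an illustration only: it assumes that low_len and hi_len are inclusive bounds, which this README does not state, and the function name `filter_and_count` and the example sequences are hypothetical.

```python
from collections import Counter


def filter_and_count(sequences, low_len, hi_len):
    """Keep sequences with low_len <= length <= hi_len and count each variant."""
    kept = [seq for seq in sequences if low_len <= len(seq) <= hi_len]
    return Counter(kept)


# Hypothetical trimmed sequences, for illustration only.
sequences = ["ACGTAC", "ACGTAC", "ACG", "ACGTACGTACGT", "TTGACA"]
counts = filter_and_count(sequences, low_len=5, hi_len=10)
print(counts.most_common())  # [('ACGTAC', 2), ('TTGACA', 1)]
print(sum(counts.values()))  # 3 sequences kept after length filtering
```

The total of the counts corresponds to the kind of "after filtering by length" number listed in the log table, while the individual entries correspond to the rows of a count file.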
A mock test dataset (test_input.fasta) is provided in the test_data folder, together with barcodes and constant regions text files (barcodes.txt, constant.txt). To run the test dataset, place all files (DeCatCounter.py, test_input.fasta, barcodes.txt and constant.txt) in the same folder, and from within that folder type:
python DeCatCounter.py test_input.fasta barcodes.txt 0 0 constant.txt 0 0 y 5 50
If everything went well, the terminal shows the pipeline's output without errors, and you should have a new output folder in your directory.
A more realistic test dataset (test_input_2.fasta) is provided in the test_data_2 folder, together with barcodes and constant regions text files (barcodes_2.txt, constant_2.txt). This dataset is a subset of the raw dataset analyzed in the publication, and the barcodes and constant regions are those used for that analysis. To run it, place all files (DeCatCounter.py, test_input_2.fasta, barcodes_2.txt and constant_2.txt) in the same folder (note that test_input_2.fasta needs to be decompressed), and from within that folder type:
python DeCatCounter.py test_input_2.fasta barcodes_2.txt 2 2 constant_2.txt 4 2 y 707 825
If everything went well, the terminal shows the pipeline's output without errors, and you should have a new output folder in your directory.
Please report any bugs to Celia Blanco (celiablanco@ucla.edu).
When reporting bugs, please include the full output printed in the terminal when running the pipeline.
Nisha Kanwar*, Celia Blanco*, Irene A. Chen and Burckhard Seelig. PacBio sequencing output increased through uniform and directional 5-fold concatenation. Submitted.