Maintainer: Ji-Ping Wang, <jzwang@northwestern.edu>; Keren Li, <keren.li@northwestern.edu>
Licence: GPLv3
Cite DNAcycP package:
Li, K., Carroll, M., Vafabakhsh, R., Wang, X.A. and Wang, J.-P., DNAcycP: A Deep Learning Tool for DNA Cyclizability Prediction, Nucleic Acids Research, 2021
DNAcycP, short for DNA cyclizablity Prediction, is a Python package for accurate predict of DNA intrinsic cyclizablity score. It was built upon a deep learning architecture with a hybrid of Inception and Residual network structure and an LSTM layer. DNAcycP was trained based on loop-seq data from Basu et al 2021 (see below). The predicted score, termed C-score achieves high accuracy compared to the experimentally measured cyclizablity score by loop-seq assay.
DANcycP is available in two formats: A web server available at http://DNAcycP.stats.northwestern.edu for real-time prediction and visualization of C-score up to 20K bp, and a standalone Python package available for free download from https://github.com/jipingw/DNAcycP.
The core of DNAcycP is a deep learning architecture mixed with an Inception-ResNet structure and an LSTM layer (IR+LSTM, Fig 1b) that processes the sequence and its reverse complement separately, the results from which are averaged and detrended to reach the predicted intrinsic score. (Fig 1a).
IR+LSTM starts with a convolutional layer for dimension reduction such that the encoded sequence space is reduced from 2D to 1D. The output is fed into an inception module that contains two parallel branches, each having two sequentially connected convolutional layers with branch-specific kernels to capture sequence features of different scale. The first branch has kernel dimension 3x1 for both layers and the second has kernel dimension 11x1 and 21x1 sequentially. The output of the inception module is combined by concatenation and added back to the input of the inception module to form a short circuit or residual network. Finally, the IR+LSTM concludes with a dense layer to predict output with linear activation function.
bio==1.3.3
tensorflow==2.7.0
keras==2.7.0
pandas==1.3.5
numpy==1.21.5
docopt==0.6.2
DNAcycP Python package requires specific versions of dependencies. We recommend to install and run DNAcycP in a virtual environment. For example, suppose the downloaded DNAcycP package is unpacked as a folder dnacycp-main
. We can install DNAcycP in a virtual environment as below:
cd dnacycp-main
python3 -m venv env
source env/bin/activate test
pip install -e .
Run dnacycp-cli
to see whether it is installed properly.
dnacycp-cli
Once done with DNAcycP for prediction, you can close the virtual environment by using:
deactivate
Once the virtual environment is deactivated, you need to re-activate it before you run another session of prediciotn as follows:
cd dnacycp-main
source env/bin/activate test
DNAcycP supports the input sequence in two formats: FASTA format (with sequence name line beginning with “>”) or plain TXT format. Unlike in the web server version where only one sequence is allowed in input for prediction, the Python package allows multiple sequences in the same input file. In particular for the TXT format, each line (can be of different length) in the file is regarded as one input sequence for prediction.
The main funciton in DNAcycP is dancycp-cli
, which can be called as follows:
dnacycp-cli -f/-t <inputfile> <basename>
where
-f/-t
: indicates the input file name in FASTA or TXT format respectively; either one must be specified.<inputfile>
: is the name of the intput file;<basename>
: is the name base for the output file.
dnacycp-cli -f ./data/raw/ex1.fasta ./data/raw/ex1
The -f
option specifies that the input file named "ex1.fasta" is in fasta format.
The ./data/raw/ex1.fasta
is the sequence file path and name, and ./data/raw/ex1
specifies the output file will be saved in the directory ./data/raw
with file name initialized with ex1
.
For example, ex1.fasta
contains two sequences named ">seq1" and ">myseq2" respectively.
The output file will be named as "ex1_cycle_seq1.txt", "ex1_cycle_myseq2.txt"for the first and second sequences respectively. Each file contains three columns: position
, C_score_norm
, C_score_unnorm
. The C_score_norm
is the predicted C-score from the model trained based on the standardized loop-seq score of the tiling library of Basu et al 2021 (i.e. 0 mean unit variance). The C_score_unnorm
is the predicted C-score recovered to the original scale of loop-seq score in the tiling library data from Basu et el 2021. The standardized loop-seq score provides two advantages. As loop-seq may be subject to a library-specific constant, standardized C-score is defined with a unified baseline as yeast genome (i.e. 0 mean in yeast genome). Secondly, the C-score provides statisitcal significance indicator, i.e. a C-score of 1.96 indicates 97.5% in the distribution.
dnacycp-cli -t ./data/raw/ex2.txt ./data/raw/ex2
With -t
option, the input file is regarded as in TXT format, each line representing a sequence without sequence name line that begins with ">".
The predicted C-scores will be saved into two files, one with _unnorm.txt
and the other with _norm.txt
for unnormalized and normalized C-score, with C-scores in each line corresponding to the sequence in the input file in the same order.
For any input sequence, DNAcycP predicts the C-score for every 50 bp. Regardless of the input sequence format the first C-score in the output file corresponds to the sequence from position 1-50, second for 2-51 and so forth.
from dnacycp import cycle_fasta, cycle_txt
cycle_fasta("data/raw/ex1.fasta","example1")
cycle_txt("data/raw/ex2.txt","example2")
- Basu, A., Bobrovnikov, D.G., Qureshi, Z., Kayikcioglu, T., Ngo, T.T.M., Ranjan, A., Eustermann, S., Cieza, B., Morgan, M.T., Hejna, M. et al. (2021) Measuring DNA mechanics on the genome scale. Nature, 589, 462-467.