CCMpredPy
Susann Vorberg edited this page Jun 11, 2018
·
7 revisions
Learn a Markov random field (MRF) model of evolutionary couplings between protein residues from a multiple sequence alignment (MSA) representative of the protein family.
ccmpred [options] alnfile
alnfile is the input multiple sequence alignment of protein sequences from which couplings are learned.
- MSA format (--aln-format <format>): Specify the format in which to parse input alignments. Supports all BioPython Bio.SeqIO file formats plus psicov.
- Number of threads (--num-threads <n>): The number of threads used to parallelize specific computations.
- Printing of logo (--no-logo): Disable showing the CCMpred logo.
- Number of iterations (--maxit <max_it>): The maximum number of iterations for which to optimize the MRF model. [default: 2000]
- Contact matrix output (-m, --mat-file <matfile>): Compute and write out contact matrix files.
- MessagePack binary raw output (-b, --write-binary-raw <rawfile>): Write out raw coupling potentials in a compact binary MessagePack representation.
- Optimization progress output (--plot-opt-progress): Continuously plot optimization progress as an interactive HTML plot.
- Initial potentials (-i, --init-from-raw <rawfile>): Load initial parameters from a raw coupling file and start optimizing from there.
- Skip Optimization (--do-not-optimize): Do not optimize model parameters. Requires providing initial model parameters with -i.
- Maximize Pseudo-Likelihood (--ofn-pll): Learn the MRF model by maximizing the pseudo-likelihood with the LBFGS algorithm.
- LBFGS Ftol (--lbfgs-ftol <ftol>): Convergence criterion ftol for the LBFGS algorithm. [default: 1e-4]
- LBFGS Max Linesearch (--lbfgs-max-linesearch <max_ln>): Maximum number of line search steps. [default: 5]
- LBFGS Maxcor (--lbfgs-maxcor <max_cor>): Maximum number of corrections kept in memory. [default: 5]
The LBFGS optimizer uses the scipy.optimize.minimize function with method='L-BFGS-B'.
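As a rough illustration of how such settings reach the optimizer, the sketch below calls scipy.optimize.minimize with method='L-BFGS-B' on a toy quadratic objective. The toy objective stands in for the pseudo-likelihood, and the mapping of the command-line options onto the SciPy option names (maxiter, ftol, maxls, maxcor) is an assumption made for this example, not taken from the CCMpredPy source.

```python
import numpy as np
from scipy.optimize import minimize

TARGET = np.array([1.0, -2.0, 3.0])


def objective(x):
    """Toy convex objective (NOT the pseudo-likelihood).

    Returns the function value and its gradient, as expected with jac=True.
    """
    diff = x - TARGET
    return 0.5 * float(diff @ diff), diff


result = minimize(
    objective,
    x0=np.zeros(3),
    jac=True,             # objective returns (value, gradient)
    method="L-BFGS-B",
    options={
        "maxiter": 2000,  # assumed counterpart of --maxit
        "ftol": 1e-4,     # assumed counterpart of --lbfgs-ftol
        "maxls": 5,       # assumed counterpart of --lbfgs-max-linesearch
        "maxcor": 5,      # assumed counterpart of --lbfgs-maxcor
    },
)
print(result.x, result.nit)
```

A larger maxcor keeps more correction pairs for the inverse-Hessian approximation, which costs memory in proportion to the number of parameters.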
- Maximize with Contrastive Divergence (--ofn-cd): Learn the MRF model by applying the contrastive divergence algorithm with the gradient descent (GD) algorithm.
- Persistent Markov Chains (--persistent): Switch to persistent contrastive divergence once the learning rate is small enough (< alpha_0 / 10). [default: False]
- Number of Markov Chains (--nr-markov-chains <nr_mc>): Number of parallel Markov chains used for sampling sequences in each iteration of the optimization. [default: 500]
- Number of Gibbs steps (--gibbs_steps <gibbs_steps>): Number of Gibbs steps used to evolve each Markov chain. [default: 1]
- GD: Initial learning rate (--alpha0 <alpha0>): Set the initial learning rate for the gradient descent optimizer. A value of 0 determines the learning rate as a function of Neff. [default: 0]
- GD: Do not use decaying learning rate (--no-decay): Do not use decaying learning rates. Otherwise, decay starts once the convergence criterion falls below the value of <decay_start>. [default: False]
- GD: Start of decay (--decay-start <decay_start>): Decay starts once the convergence criterion falls below the value of <decay_start>. [default: 1e-1]
- GD: Rate of decay (--decay-rate <decay_rate>): The rate of decay for the learning rate. [default: 5e-6]
- GD: Type of decay (--decay-type <decay_type>): The type of decay. One of: 'sig', 'sqrt', 'exp', 'lin'. [default: 'sig']
- Convergence criteria (--epsilon <eps>): Stop when the relative change of the norm of the parameters over the last <convergence_prev> iterations is less than <eps>. [default: 1e-8]
- Number of previous iterations (--convergence_prev <convergence_prev>): The number of previous iterations to consider when evaluating the convergence criterion. [default: 5]
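The convergence criterion can be read as a check on how much the parameter norm has moved recently. The following is a minimal sketch of one possible reading, in which every step-to-step relative change within the last convergence_prev iterations must be below epsilon. The function name has_converged and the exact way the changes are combined (maximum rather than, say, mean) are assumptions for illustration only.

```python
import numpy as np


def has_converged(norm_history, epsilon=1e-8, convergence_prev=5):
    """Hypothetical convergence check on a list of parameter norms.

    norm_history holds one (positive) parameter norm per iteration.
    Assumption: converged if all relative changes between consecutive
    iterations in the last `convergence_prev` steps are below `epsilon`.
    """
    if len(norm_history) <= convergence_prev:
        return False  # not enough iterations to evaluate the criterion
    recent = np.asarray(norm_history[-(convergence_prev + 1):], dtype=float)
    rel_changes = np.abs(np.diff(recent)) / recent[:-1]
    return bool(np.max(rel_changes) < epsilon)
```

For example, has_converged([1.0] * 6) is satisfied because the norm did not change at all across the last five steps, whereas a steadily growing norm is not.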
- PDB file (--pdb-file <pdbfile>): Reference structure for the protein family.
- True Positive Contact (--contact-threshold <contact_thr>): Distance threshold, in angstrom, between Cbeta atoms below which a residue pair is defined as forming a contact. [default: 8]
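The contact definition amounts to thresholding a pairwise distance matrix. Below is a minimal sketch assuming the Cbeta coordinates have already been extracted from the structure into an array; the function name contact_map and the use of a strict "<" comparison are assumptions, not taken from CCMpredPy.

```python
import numpy as np


def contact_map(cb_coords, contact_thr=8.0):
    """Boolean contact map from Cbeta coordinates (shape: n_residues x 3).

    Assumption: a pair is a contact if its Cbeta distance is strictly
    below `contact_thr` angstrom; the diagonal is excluded.
    """
    coords = np.asarray(cb_coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    contacts = dist < contact_thr
    np.fill_diagonal(contacts, False)
    return contacts


# Three residues on a line at 0, 5 and 20 angstrom
example_contacts = contact_map([[0, 0, 0], [5, 0, 0], [20, 0, 0]])
```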
- Average Product Correction (APC) (--apc <apc_file>): Write out a contact matrix file that has been corrected with APC.
- Entropy Correction (--entropy-correction <entropy_correction_file>): Write out a contact matrix file that has been corrected for entropy bias.
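For orientation, the average product correction subtracts from each score the product of the two residues' mean scores divided by the overall mean score. The sketch below applies this to an in-memory symmetric score matrix; the function name apc and the choice to compute means over off-diagonal entries only are assumptions, and the implementation in CCMpredPy may differ in such details.

```python
import numpy as np


def apc(mat):
    """Average product correction of a symmetric L x L score matrix.

    corrected[i, j] = mat[i, j] - mean_i * mean_j / mean_all
    Assumption: means are taken over off-diagonal entries only.
    """
    mat = np.array(mat, dtype=float)  # copy, so the input is not modified
    n = mat.shape[0]
    np.fill_diagonal(mat, 0.0)
    col_means = mat.sum(axis=0) / (n - 1)
    overall_mean = mat.sum() / (n * (n - 1))
    corrected = mat - np.outer(col_means, col_means) / overall_mean
    np.fill_diagonal(corrected, 0.0)
    return corrected


# A matrix with identical off-diagonal scores carries no pair-specific signal
flat = 2.0 * (np.ones((4, 4)) - np.eye(4))
flat_corrected = apc(flat)
```

In the flat example every corrected score is zero, which reflects the intent of the correction: background signal shared by all pairs of a position is removed.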
- 'Simple' weighting (--wt-simple): Use simple sequence weights calculated as w_n = 1/(1 + ID_n), where ID_n is the number of other sequences in the MSA with at least <wt_cutoff> * 100 percent sequence identity to sequence n. [default]
- Uniform weights (--wt-uniform): Assign w_n = 1 to all sequences.
- Sequence Identity Cutoff (--wt-cutoff <wt_cutoff>): Consider sequences as similar if they have more than <wt_cutoff> * 100 percent identical positions.
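The 'simple' weighting formula w_n = 1/(1 + ID_n) can be sketched directly on an integer-encoded alignment. This is only an illustration: the function name simple_weights, the cutoff value of 0.8, and the use of ">=" for the identity comparison are assumptions (the option descriptions use both "at least" and "more than").

```python
import numpy as np


def simple_weights(msa, wt_cutoff=0.8):
    """Sequence weights w_n = 1 / (1 + ID_n) for an integer-encoded MSA.

    msa: array of shape (n_sequences, n_columns).
    ID_n counts the OTHER sequences whose fraction of identical positions
    to sequence n is at least `wt_cutoff` (assumed value and comparison).
    Builds an n_seq x n_seq x n_col array, so only suitable for small MSAs.
    """
    msa = np.asarray(msa)
    identity = (msa[:, None, :] == msa[None, :, :]).mean(axis=2)
    id_n = (identity >= wt_cutoff).sum(axis=1) - 1  # subtract self-match
    return 1.0 / (1 + id_n)


toy_msa = [[0, 1, 2, 3],
           [0, 1, 2, 3],   # duplicate of the first sequence
           [4, 5, 6, 7]]
toy_weights = simple_weights(toy_msa)
```

The two identical sequences share their weight (0.5 each), so the sum of the weights, which serves as the effective number of sequences, is 2 rather than 3.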
- Regularization coefficient singles (--reg-lambda-single <λsingle>): Set the regularization coefficient <λsingle> for the single potentials. [default: 10]
- Regularization coefficient pairs (--reg-lambda-pair-factor <λpair_factor>): Set the regularization coefficient for couplings to <λpair> = L x <λpair_factor>, given an alignment with L columns. [default: 0.2]
- Zero-centered Prior (--v-zero): Set mu=0 in the standard Gaussian prior for single potentials and initialise single potentials at zero. [default: False]
- PWM-centered Prior (--v-center): Set mu=v* in the standard Gaussian prior for single potentials and initialise single potentials at v*. v* represents the PWM model computed from the alignment. [default: True]
- Max gaps per position (--max-gap-pos <max_gap_pos>): Ignore alignment positions with more than <max_gap_pos> percent gaps. [default: 100 --> no removal]
- Max gaps per sequence (--max-gap-seq <max_gap_seq>): Remove sequences with more than <max_gap_seq> percent gaps. [default: 100 --> no removal]
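Both gap filters reduce to computing a gap percentage per row or per column and comparing it against the threshold. The sketch below shows this on an integer-encoded alignment; the gap code 20, the function name filter_gaps, and the order of the two filters (sequences first, then positions) are assumptions made for this example.

```python
import numpy as np

GAP = 20  # assumed integer code for the gap character


def filter_gaps(msa, max_gap_pos=100, max_gap_seq=100, gap=GAP):
    """Drop sequences and columns with MORE than the given percent gaps.

    With the default of 100 nothing can exceed the threshold, so nothing
    is removed. Assumes at least one sequence survives the first filter.
    """
    msa = np.asarray(msa)
    seq_gap_pct = 100.0 * (msa == gap).mean(axis=1)
    keep_seq = seq_gap_pct <= max_gap_seq
    msa = msa[keep_seq]
    pos_gap_pct = 100.0 * (msa == gap).mean(axis=0)
    keep_pos = pos_gap_pct <= max_gap_pos
    return msa[:, keep_pos], keep_seq, keep_pos


gappy = [[0, 1, 20, 20],
         [0, 20, 20, 20],
         [1, 2, 3, 20]]
unfiltered, _, _ = filter_gaps(gappy)
filtered, kept_seq, kept_pos = filter_gaps(gappy, max_gap_pos=50, max_gap_seq=50)
```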
- Add uniform pseudocounts (--pc-uniform): Use uniform pseudocounts, e.g. 1/21. [default]
- Add substitution matrix pseudocounts (--pc-submat): Use substitution matrix pseudocounts from the Blosum62 matrix.
- Add constant pseudocounts (--pc-constant): Use constant pseudocounts.
- Add no pseudocounts (--pc-none): Do not add pseudocounts.
- Pseudocount for singles (--pc-single-count <pc_single>): Specify the number of pseudocounts for single frequencies. [default: 1]
- Pseudocount for pairs (--pc-pair-count <pc_pair>): Specify the number of pseudocounts for pairwise frequencies. [default: 1]
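To make the idea of uniform pseudocounts concrete, the sketch below mixes weighted single-column frequencies with a uniform distribution over 21 states (20 amino acids plus gap). The mixing rule tau = pc_single / (Neff + pc_single) and the function name single_freqs_uniform_pc are assumptions for illustration; the exact admixture used by CCMpredPy is not stated here.

```python
import numpy as np


def single_freqs_uniform_pc(msa, weights, pc_single=1, n_states=21):
    """Weighted single frequencies with a uniform pseudocount admixture.

    Assumed mixing rule (hypothetical):
        tau = pc_single / (Neff + pc_single)
        f = (1 - tau) * f_observed + tau / n_states
    """
    msa = np.asarray(msa)
    weights = np.asarray(weights, dtype=float)
    neff = weights.sum()
    counts = np.zeros((msa.shape[1], n_states))
    for state in range(n_states):
        # weighted count of `state` in every alignment column
        counts[:, state] = (weights[:, None] * (msa == state)).sum(axis=0)
    observed = counts / neff
    tau = pc_single / (neff + pc_single)
    return (1 - tau) * observed + tau / n_states


pc_freqs = single_freqs_uniform_pc([[0], [0], [1]], weights=[1.0, 1.0, 1.0])
```

States that never occur in a column end up with a small non-zero frequency, which keeps logarithms of frequencies finite.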
- OMES (--compute-omes): Compute the OMES score as in Kass and Horovitz (2002).
- OMES (variant) (--omes-fodoraldrich): Compute the OMES score according to Fodor & Aldrich (2004).
- Mutual information (MI) (--compute-mi): Compute mutual information (MI).
- Normalized MI (--mi-normalized): Normalize mutual information according to Martin et al. (2005).
- MI with pseudocounts (--mi-pseudocounts): Add pseudocounts when computing mutual information.
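As a reference point for the simplest of these scores, mutual information between two columns is the sum over state pairs of f_ij(a,b) * log(f_ij(a,b) / (f_i(a) * f_j(b))). The sketch below computes it for all column pairs of an integer-encoded alignment. It uses unweighted frequencies without pseudocounts and treats the gap as an ordinary state; these choices and the function name mutual_information are assumptions for this example.

```python
import numpy as np


def mutual_information(msa, n_states=21):
    """Pairwise mutual information (natural log) for all column pairs.

    msa: integer-encoded alignment of shape (n_sequences, n_columns).
    Unweighted, no pseudocounts; terms with zero pair frequency count as 0.
    """
    msa = np.asarray(msa)
    n_seq = msa.shape[0]
    onehot = np.eye(n_states)[msa]            # (n_seq, n_col, n_states)
    single = onehot.mean(axis=0)              # (n_col, n_states)
    pair = np.einsum("nia,njb->ijab", onehot, onehot) / n_seq
    expected = single[:, None, :, None] * single[None, :, None, :]
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(pair > 0, pair * np.log(pair / expected), 0.0)
    return terms.sum(axis=(2, 3))


# Columns 0 and 1 co-vary perfectly; column 2 is fully conserved
mi = mutual_information([[0, 0, 0],
                         [1, 1, 0],
                         [0, 0, 0],
                         [1, 1, 0]])
```

The perfectly co-varying pair reaches log(2), the entropy of a balanced two-state column, while any pair involving the conserved column scores zero.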
Recover couplings for data/1atzA.fas, write the summed score matrix to data/1atzA.noapc.mat, the APC-corrected score matrix to data/1atzA.apc.mat, and the raw coupling potentials to data/1atzA.braw.gz:
ccmpred data/1atzA.fas \
-m data/1atzA.noapc.mat \
--apc data/1atzA.apc.mat \
-b data/1atzA.braw.gz
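Once a score matrix is available, a common next step is to rank residue pairs by score. The following minimal sketch does this for a symmetric matrix that is already in memory; how the matrix file is loaded is deliberately left out, and the function name top_pairs and the min_sep parameter (minimum sequence separation) are hypothetical helpers, not part of CCMpredPy.

```python
import numpy as np


def top_pairs(mat, n_top=3, min_sep=1):
    """Return the n_top highest-scoring pairs (i, j, score) with j - i >= min_sep.

    mat: symmetric L x L score matrix; indices are 0-based.
    """
    mat = np.asarray(mat, dtype=float)
    i, j = np.triu_indices(mat.shape[0], k=min_sep)
    order = np.argsort(-mat[i, j], kind="stable")[:n_top]
    return [(int(i[k]), int(j[k]), float(mat[i[k], j[k]])) for k in order]


scores = np.array([[0.0, 0.1, 0.9, 0.5],
                   [0.1, 0.0, 0.2, 0.8],
                   [0.9, 0.2, 0.0, 0.3],
                   [0.5, 0.8, 0.3, 0.0]])
ranked = top_pairs(scores, n_top=2, min_sep=2)
```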
By default (--ofn-pll), CCMpredPy maximizes the pseudo-likelihood to obtain couplings. Results differ slightly from the C implementation of CCMpred due to the following modifications:
- the regularization prior for the single potentials is centered at the maximum-likelihood estimate of the single potentials, v*
- single potentials are initialized at v*
- the regularization strength is set to lambda_v = 10 in order to achieve results comparable to the C implementation of CCMpred
- the LBFGS optimizer is used instead of the conjugate gradient optimizer libconjugrad used in CCMpred. LBFGS yields comparable results and is faster, but requires more memory than conjugate gradients