CCMpredPy

Learn a Markov random field (MRF) model of evolutionary couplings between protein residues from a multiple sequence alignment (MSA) representative of the protein family.

ccmpred [options] alnfile

alnfile is an input multiple protein sequence alignment to learn couplings from.

General Options

MSA format (--aln-format <format>): Specify which format to parse input alignments in. Supports all BioPython Bio.SeqIO file formats plus psicov.
Number of threads (--num-threads <n>): The number of threads used to parallelize specific computations.
Printing of logo (--no-logo): Do not print the CCMpred logo.
Number of iterations (--maxit <max_it>): The maximum number of iterations to optimize the MRF model for. [default: 2000]

Output Options

Contact matrix output (-m, --mat-file <matfile>): Compute and write out contact matrix files
MessagePack binary raw output (-b, --write-binary-raw <rawfile>): Write out raw coupling potentials in a compact binary MessagePack representation
Optimization progress output (--plot-opt-progress ): Continuously plot optimization progress as an interactive HTML file

Optional Input Options

Initial potentials (-i, --init-from-raw <rawfile>): Load initial parameters from a raw coupling file and start optimizing from there
Skip Optimization (--do-not-optimize): Do not optimize model parameters. Requires providing initial model parameters with -i.

Options for Pseudo-Likelihood Optimization

Maximize Pseudo-Likelihood (--ofn-pll): Learn MRF model by maximizing the pseudo-likelihood with the LBFGS algorithm.
LBFGS Ftol (--lbfgs-ftol <ftol>): convergence criterion ftol for LBFGS algorithm. [default: 1e-4]
LBFGS Max Linesearch (--lbfgs-max-linesearch <max_ln>): maximum number of linesearch steps. [default: 5]
LBFGS Maxcor (--lbfgs-maxcor <max_cor>): maximum number of corrections for memory. [default: 5]

The LBFGS optimizer uses scipy.optimize.minimize with method='L-BFGS-B'.
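
As a rough illustration of how the LBFGS options above could map onto SciPy, the sketch below minimizes a toy quadratic with `scipy.optimize.minimize`. The objective is a stand-in, not the pseudo-likelihood, and the mapping of the command-line flags to the `maxiter`, `ftol`, `maxls` and `maxcor` solver options is an assumption based on their names and defaults, not taken from the CCMpredPy source.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical optimum of the toy problem (illustration only).
TARGET = np.array([1.0, -2.0, 0.5])


def objective(x):
    """Toy convex objective standing in for the negative pseudo-log-likelihood.

    Returns the function value and its gradient, as expected with jac=True.
    """
    diff = x - TARGET
    return 0.5 * np.dot(diff, diff), diff


result = minimize(
    objective,
    x0=np.zeros(3),
    jac=True,             # objective returns (value, gradient)
    method="L-BFGS-B",
    options={
        "maxiter": 2000,  # assumed counterpart of --maxit
        "ftol": 1e-4,     # assumed counterpart of --lbfgs-ftol
        "maxls": 5,       # assumed counterpart of --lbfgs-max-linesearch
        "maxcor": 5,      # assumed counterpart of --lbfgs-maxcor
    },
)
print(result.x, result.nit)
```

A larger `maxcor` keeps more correction pairs in memory, which usually improves the curvature approximation at the cost of memory.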

Options for Optimization with Persistent Contrastive Divergence

Maximize with Contrastive Divergence (--ofn-cd): Learn MRF model by applying the contrastive divergence algorithm with the gradient descent (GD) algorithm.
Persistent Markov Chains (--persistent): Switch to persistent contrastive divergence once the learning rate is small enough (< alpha_0 / 10) [default: False]
Number of Markov Chains (--nr-markov-chains <nr_mc>): Number of parallel Markov chains used for sampling sequences in each iteration of the optimization. [default: 500]
Number of Gibbs steps (--gibbs_steps <gibbs_steps>): Number of Gibbs steps used to evolve each Markov chain [default: 1]
GD: Initial learning rate (--alpha0 <alpha0>): Set the initial learning rate for the gradient descent optimizer. A value of 0 will determine the learning rate as a function of Neff [default: 0]
GD: Do not use decaying learning rate (--no-decay): Do not use a decaying learning rate. Otherwise, decay starts when the convergence criterion falls below <decay_start>. [default: False]
GD: Start of decay (--decay-start <decay_start>): Decay starts when the convergence criterion falls below <decay_start>. [default: 1e-1]
GD: Rate of decay (--decay-rate <decay_rate>): The rate of decay for the learning rate [default: 5e-6]
GD: Type of decay (--decay-type <decay_type>): The type of decay. One of: 'sig', 'sqrt', 'exp', 'lin' [default: 'sig']
Convergence criterion (--epsilon <eps>): Stop when the relative change of the norm of the parameters over the last <convergence_prev> iterations is less than <eps>. [default: 1e-8]
Number of previous iterations (--convergence_prev <convergence_prev>): The number of previous iterations to consider for evaluating convergence criteria [default: 5]
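
The convergence test described by --epsilon and --convergence_prev can be read in more than one way. The sketch below shows one plausible reading: average the relative change of the parameter norm over the last `convergence_prev` steps and compare it with `eps`. The exact formula used by CCMpredPy is not stated here, so the averaging and the helper name `make_convergence_check` are assumptions.

```python
from collections import deque


def make_convergence_check(eps=1e-8, convergence_prev=5):
    """Return a callable that tracks parameter norms and reports convergence.

    Assumption: convergence means that the mean relative change of the
    parameter norm over the last `convergence_prev` steps is below `eps`.
    """
    # convergence_prev changes require convergence_prev + 1 stored norms.
    norms = deque(maxlen=convergence_prev + 1)

    def has_converged(param_norm):
        norms.append(param_norm)
        if len(norms) <= convergence_prev:
            return False  # not enough history yet
        values = list(norms)
        rel_changes = [
            abs(curr - prev) / max(prev, 1e-300)  # guard against a zero norm
            for prev, curr in zip(values, values[1:])
        ]
        return sum(rel_changes) / len(rel_changes) < eps

    return has_converged


check = make_convergence_check(eps=1e-3, convergence_prev=2)
for norm in [1.0, 2.0, 2.0, 2.0]:
    print(norm, check(norm))
```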

Optimize with Constraints from Structure

PDB file (--pdb-file <pdbfile>): Reference structure for protein family
True Positive Contact (--contact-threshold <contact_thr>): Definition for residue pairs forming a contact regarding the distance of their Cbeta atoms in angstrom [default: 8]

Writing corrected Contact Matrices

Average Product Correction (APC) (--apc <apc_file>): Write out contact matrix file that has been corrected with APC
Entropy Correction (--entropy-correction <entropy_correction_file>): Write out contact matrix file that has been corrected for entropy bias
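
For orientation, the average product correction is commonly defined as S_ij - (mean_i * mean_j) / mean_all. The sketch below applies that textbook definition to a square score matrix with NumPy. Whether CCMpredPy includes the diagonal in the means is not stated here, so including it is an assumption of this sketch.

```python
import numpy as np


def apc(score_matrix):
    """Apply the textbook average product correction to a square matrix.

    Assumption: row and overall means are taken over all entries,
    including the diagonal.
    """
    mat = np.asarray(score_matrix, dtype=float)
    row_means = mat.mean(axis=1)
    correction = np.outer(row_means, row_means) / mat.mean()
    return mat - correction


# A purely multiplicative (rank-1) background is removed completely.
background = np.outer([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
print(apc(background))
```

The example matrix carries no pair-specific signal, only per-position background, which is why the corrected matrix is zero everywhere.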

Sequence Weighting Options

'Simple' weighting (--wt-simple): Use simple sequence weights calculated as w_n = 1/(1 + ID_n), where ID_n is the number of other sequences in the MSA with at least <wt_cutoff> * 100 percent sequence identity to sequence n. [default]
Uniform weights (--wt-uniform): Assign w_n = 1 to all sequences.
Sequence Identity Cutoff (--wt-cutoff <wt_cutoff>): Consider sequences as similar that have more than <wt_cutoff> * 100 percent identical positions
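
The 'simple' weighting formula w_n = 1/(1 + ID_n) can be sketched in plain Python as follows. The cutoff value of 0.8 is only an illustrative choice, since no default is given above, and the quadratic loop is written for readability rather than speed.

```python
def simple_weights(msa, wt_cutoff=0.8):
    """Compute w_n = 1 / (1 + ID_n) for each sequence of an aligned MSA.

    ID_n counts the *other* sequences whose fraction of identical
    positions with sequence n is at least wt_cutoff.
    wt_cutoff=0.8 is an illustrative value, not a documented default.
    """
    n_cols = len(msa[0])
    weights = []
    for n, seq_n in enumerate(msa):
        similar = 0
        for m, seq_m in enumerate(msa):
            if m == n:
                continue
            identical = sum(a == b for a, b in zip(seq_n, seq_m))
            if identical / n_cols >= wt_cutoff:
                similar += 1
        weights.append(1.0 / (1 + similar))
    return weights


msa = ["ACDE", "ACDE", "WYWY"]
weights = simple_weights(msa)
print(weights, sum(weights))  # the sum of weights is often reported as Neff
```

Two identical sequences share their weight, so redundant parts of the alignment do not dominate the statistics.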

L2 Regularization Options

regularization coefficient singles (--reg-lambda-single <λsingle>): Set the regularization coefficient <λsingle> for the single potentials [default: 10]
regularization coefficient pairs (--reg-lambda-pair-factor <λpair_factor>): Set the regularization coefficient for couplings to <λpair> = L x <λpair_factor> given an alignment with L columns [default: 0.2]
Zero-centered Prior (--v-zero): Set mu=0 in standard Gaussian prior for single potentials and initialise single potentials at zero. [default: False]
PWM-centered Prior (--v-center): Set mu=v* in standard Gaussian prior for single potentials and initialise single potentials at v*. v* represents the PWM model computed from the alignment. [default: True]
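
To make the regularization options concrete, here is a minimal sketch of an L2 penalty with a Gaussian prior on the single potentials centered at `v_center` (zero for --v-zero, v* for --v-center) and a pair coefficient of L times the pair factor. The array shapes and the absence of a 1/2 factor are assumptions of this sketch, not details taken from CCMpredPy.

```python
import numpy as np


def l2_penalty(v, w, v_center, lambda_single=10.0, lambda_pair_factor=0.2):
    """L2 penalty on single potentials v and couplings w.

    v, v_center: arrays of shape (L, q); w: array of shape (L, L, q, q).
    These shapes are hypothetical and chosen for illustration only.
    """
    n_cols = v.shape[0]                        # L, the number of columns
    lambda_pair = n_cols * lambda_pair_factor  # lambda_pair = L x factor
    single_term = lambda_single * np.sum((v - v_center) ** 2)
    pair_term = lambda_pair * np.sum(w ** 2)
    return single_term + pair_term


L, q = 4, 20
v = np.ones((L, q))
w = np.zeros((L, L, q, q))
print(l2_penalty(v, w, v_center=np.zeros_like(v)))  # zero-centered prior
print(l2_penalty(v, w, v_center=v))                 # prior centered at v itself
```

Centering the prior at v* means single potentials are only penalized for deviating from the PWM estimate, not for being non-zero.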

Gap treatment

max gaps per position (--max-gap-pos <max_gap_pos>): Ignore alignment positions with more than <max_gap_pos> percent gaps. [default: 100 --> no removal]
max gaps per sequence (--max-gap-seq <max_gap_seq>): Remove sequences with more than <max_gap_seq> percent gaps. [default: 100 --> no removal]
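
The two gap filters can be pictured as below. This is a sketch only: the gap character '-', the use of "more than" as a strict comparison, and treating the two filters as independent helper functions are assumptions, and the order in which CCMpredPy applies them is not stated here.

```python
def remove_gappy_sequences(msa, max_gap_seq=100.0, gap="-"):
    """Drop sequences with more than max_gap_seq percent gaps."""
    return [s for s in msa if 100.0 * s.count(gap) / len(s) <= max_gap_seq]


def remove_gappy_positions(msa, max_gap_pos=100.0, gap="-"):
    """Drop columns with more than max_gap_pos percent gaps.

    Returns the filtered alignment and the indices of the kept columns.
    """
    n_seqs = len(msa)
    keep = [
        j for j in range(len(msa[0]))
        if 100.0 * sum(s[j] == gap for s in msa) / n_seqs <= max_gap_pos
    ]
    return ["".join(s[j] for j in keep) for s in msa], keep


msa = ["AC-E", "A--E", "----"]
filtered = remove_gappy_sequences(msa, max_gap_seq=50.0)
print(filtered)
print(remove_gappy_positions(filtered, max_gap_pos=50.0))
```

With the default of 100 percent, neither filter removes anything, which matches the "no removal" note above.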

Adding Pseudocounts

Add uniform pseudocounts (--pc-uniform): Use uniform pseudocounts, e.g. 1/21 [default]
Add substitution matrix pseudocounts (--pc-submat): Use substitution matrix pseudocounts from Blosum62 matrix.
Add constant pseudocounts (--pc-constant): Use constant pseudocounts
Add no pseudocounts (--pc-none): Do not add pseudocounts
Pseudocount for singles (--pc-single-count <pc_single>): Specify number of pseudocounts for single frequencies [default: 1]
Pseudocount for pairs (--pc-pair-count <pc_pair>): Specify number of pseudocounts for pairwise frequencies [default: 1]
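
One common way to add uniform pseudocounts is to mix the observed frequencies with the uniform distribution over 21 states (20 amino acids plus gap). The sketch below uses the admixture tau = pc_count / (Neff + pc_count); this particular formula is an assumption for illustration and may differ from what CCMpredPy does internally.

```python
def add_uniform_pseudocounts(observed_freqs, neff, pc_count=1.0, n_states=21):
    """Mix observed single-column frequencies with a uniform distribution.

    Assumed admixture: tau = pc_count / (neff + pc_count), so the
    pseudocounts matter less as the effective number of sequences grows.
    """
    tau = pc_count / (neff + pc_count)
    uniform = 1.0 / n_states
    return [(1.0 - tau) * f + tau * uniform for f in observed_freqs]


# A fully conserved column: one state observed with frequency 1.
column_freqs = [1.0] + [0.0] * 20
smoothed = add_uniform_pseudocounts(column_freqs, neff=9.0, pc_count=1.0)
print(smoothed[0], smoothed[1], sum(smoothed))
```

After smoothing, no state has zero frequency, which avoids taking the logarithm of zero in later steps.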

Compute Coevolution Scores from Alternative Models

OMES (--compute-omes): Compute OMES score as in Kass and Horovitz (2002)
OMES (variant) (--omes-fodoraldrich): Compute OMES score according to Fodor & Aldrich (2004)
Mutual information (MI) (--compute-mi): Compute mutual information (MI)
normalized MI (--mi-normalized): Normalize mutual information according to Martin et al. (2005)
MI with pseudo-counts (--mi-pseudocounts): Add pseudocounts when computing mutual information
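
As a reminder of what the plain mutual information score measures, the sketch below computes MI between two alignment columns from raw counts. It uses natural logarithms and no sequence weights, pseudocounts or normalization, so it illustrates the basic definition rather than reproducing the scores written by CCMpredPy.

```python
import math
from collections import Counter


def mutual_information(col_i, col_j):
    """Mutual information (in nats) between two alignment columns.

    col_i and col_j are equal-length strings, one character per sequence.
    """
    n = len(col_i)
    counts_i = Counter(col_i)
    counts_j = Counter(col_j)
    counts_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), c_ab in counts_ij.items():
        p_ab = c_ab / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        mi += p_ab * math.log(p_ab / (p_a * p_b))
    return mi


print(mutual_information("AACC", "DDEE"))  # perfectly covarying columns
print(mutual_information("AACC", "DEDE"))  # independent columns
```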

Examples

Simple example

Recover couplings for data/1atzA.fas, writing the summed score matrix to data/1atzA.noapc.mat, the APC-corrected score matrix to data/1atzA.apc.mat, and the raw coupling potentials to data/1atzA.braw.gz:

ccmpred data/1atzA.fas \
	-m data/1atzA.noapc.mat \
	--apc data/1atzA.apc.mat \
	-b data/1atzA.braw.gz
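
Once a score matrix has been written, a typical next step is to rank residue pairs by score. The sketch below assumes the matrix file is a plain-text, whitespace-delimited square matrix that `numpy.loadtxt` can read. The helper name `top_contacts` and the minimum sequence separation of 6 are hypothetical choices for illustration.

```python
import numpy as np


def top_contacts(mat, n_top=10, min_seq_sep=6):
    """Return the n_top highest-scoring pairs (i, j, score) with j - i >= min_seq_sep.

    Indices are 0-based positions in the alignment.
    """
    mat = np.asarray(mat, dtype=float)
    i_idx, j_idx = np.triu_indices(mat.shape[0], k=min_seq_sep)
    scores = mat[i_idx, j_idx]
    order = np.argsort(scores)[::-1][:n_top]
    return [(int(i_idx[k]), int(j_idx[k]), float(scores[k])) for k in order]


# With a real output file (assumed plain-text format):
# mat = np.loadtxt("data/1atzA.apc.mat")

# Small in-memory stand-in so that the sketch runs on its own.
mat = np.zeros((8, 8))
mat[0, 7] = mat[7, 0] = 0.9
mat[1, 7] = mat[7, 1] = 0.5
mat[0, 1] = mat[1, 0] = 5.0  # ignored: positions are too close in sequence
print(top_contacts(mat, n_top=2))
```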

Comparison with CCMpred

By default (--ofn-pll), CCMpredPy maximizes the pseudo-likelihood to obtain couplings. Results differ slightly from the C implementation of CCMpred due to the following modifications:

single potential regularization prior is centered at maximum-likelihood estimate of single potentials v*
single potentials are initialized at v*
regularization strength lambda_v = 10 in order to achieve comparable results to C implementation of CCMpred
using the LBFGS optimizer instead of the conjugate gradient optimizer libconjugrad used in CCMpred. LBFGS yields comparable results while being faster but requiring more memory than conjugate gradients
