CCMpredPy

Susann Vorberg edited this page Jun 11, 2018 · 7 revisions

Learn a Markov random field (MRF) model of evolutionary couplings between protein residues from a multiple sequence alignment (MSA) representative of the protein family.

ccmpred [options] alnfile

alnfile is an input multiple protein sequence alignment to learn couplings from.

General Options

  • MSA format (--aln-format <format>): Specify which format to parse input alignments in. Supports all BioPython Bio.SeqIO file formats plus psicov.
  • Number of threads (--num-threads <n>): The number of threads used to parallelize specific computations.
  • Printing of logo (--no-logo): Disable showing the CCMpred logo.
  • Number of iterations (--maxit <max_it>): The maximum number of iterations to optimize the MRF model for. [default: 2000]

Output Options

  • Contact matrix output (-m, --mat-file <matfile>): Compute and write out contact matrix files
  • MessagePack binary raw output (-b, --write-binary-raw <rawfile>): Write out raw coupling potentials in a compact binary MessagePack representation
  • Optimization progress output (--plot-opt-progress): Continuously plot optimization progress as an interactive HTML file

Optional Input Options

  • Initial potentials (-i, --init-from-raw <rawfile>): Load initial parameters from a raw coupling file and start optimizing from there
  • Skip Optimization (--do-not-optimize): Do not optimize model parameters. Requires providing initial model parameters with -i.

Options for Pseudo-Likelihood Optimization

  • Maximize Pseudo-Likelihood (--ofn-pll): Learn MRF model by maximizing the pseudo-likelihood with the LBFGS algorithm.
  • LBFGS Ftol (--lbfgs-ftol <ftol>): convergence criterion ftol for LBFGS algorithm. [default: 1e-4]
  • LBFGS Max Linesearch (--lbfgs-max-linesearch <max_ln>): maximum number of linesearch steps. [default: 5]
  • LBFGS Maxcor (--lbfgs-maxcor <max_cor>): maximum number of corrections for memory. [default: 5]

The LBFGS optimizer uses the scipy.optimize.minimize function with method='L-BFGS-B'.

Options for Optimization with Persistent Contrastive Divergence

  • Maximize with Contrastive Divergence (--ofn-cd): Learn MRF model by applying the contrastive divergence algorithm with the gradient descent (GD) algorithm.
  • Persistent Markov Chains (--persistent): Switch to persistent contrastive divergence once the learning rate is small enough (< alpha_0 / 10) [default: False]
  • Number of Markov Chains (--nr-markov-chains <nr_mc>): Number of parallel Markov chains used for sampling sequences in each iteration of the optimization. [default: 500]
  • Number of Gibbs steps (--gibbs_steps <gibbs_steps>): Number of Gibbs steps used to evolve each Markov chain [default: 1]
  • GD: Initial learning rate (--alpha0 <alpha0>): Set the initial learning rate for the gradient descent optimizer. A value of 0 will determine the learning rate as a function of Neff [default: 0]
  • GD: Do not use decaying learning rate (--no-decay): Do not decay the learning rate. Without this flag, decay starts once the convergence criterion falls below <decay_start> [default: False]
  • GD: Start of decay (--decay-start <decay_start>): Decay starts once the convergence criterion falls below <decay_start> [default: 1e-1]
  • GD: Rate of decay (--decay-rate <decay_rate>): The rate of decay for the learning rate [default: 5e-6]
  • GD: Type of decay (--decay-type <decay_type>): The type of decay. One of: 'sig', 'sqrt', 'exp', 'lin' [default: 'sig']
  • Convergence criterion (--epsilon <eps>): Converged when the relative change of the parameter norm over the last <convergence_prev> iterations is less than <eps>. [default: 1e-8]
  • Number of previous iterations (--convergence_prev <convergence_prev>): The number of previous iterations to consider when evaluating the convergence criterion [default: 5]
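
The convergence test controlled by --epsilon and --convergence_prev can be sketched as follows. This is a minimal illustration rather than CCMpredPy's own code: the class name ConvergenceCheck is hypothetical, and averaging the per-iteration relative changes over the window is an assumption about how the changes are combined.

```python
from collections import deque


class ConvergenceCheck:
    """Hypothetical helper: track parameter norms and test for convergence."""

    def __init__(self, epsilon=1e-8, convergence_prev=5):
        self.epsilon = epsilon
        # convergence_prev changes require convergence_prev + 1 stored norms
        self.norms = deque(maxlen=convergence_prev + 1)

    def update(self, param_norm):
        """Record the current parameter norm; return True once converged."""
        self.norms.append(param_norm)
        if len(self.norms) < self.norms.maxlen:
            return False  # not enough history yet
        norms = list(self.norms)
        # Relative change between consecutive iterations (guarded against 0)
        changes = [
            abs(b - a) / max(abs(a), 1e-300) for a, b in zip(norms, norms[1:])
        ]
        # Assumption: the window is summarised by the mean relative change
        return sum(changes) / len(changes) < self.epsilon


check = ConvergenceCheck(epsilon=1e-3, convergence_prev=3)
trace = [10.0, 12.0, 12.5, 12.51, 12.511, 12.5111]
flags = [check.update(norm) for norm in trace]
```

With this trace the check only reports convergence at the last step, once all three changes in the window have become small.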

Optimize with Constraints from Structure

  • PDB file (--pdb-file <pdbfile>): Reference structure for protein family
  • True Positive Contact (--contact-threshold <contact_thr>): Distance threshold between the Cbeta atoms of two residues, in angstroms, that defines a contact [default: 8]
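
To illustrate the contact definition, the sketch below marks residue pairs as contacts from a list of Cbeta coordinates. It is an assumption-laden example: the function name contact_pairs is hypothetical, the coordinates are made up, and using a strict "less than" comparison against the threshold is a guess at the boundary behaviour.

```python
import math


def contact_pairs(cb_coords, contact_thr=8.0):
    """Return index pairs (i, j), i < j, whose Cbeta distance is below the threshold."""
    contacts = []
    for i in range(len(cb_coords)):
        for j in range(i + 1, len(cb_coords)):
            # Euclidean distance in angstroms between the two Cbeta atoms
            if math.dist(cb_coords[i], cb_coords[j]) < contact_thr:
                contacts.append((i, j))
    return contacts


# Made-up coordinates: three residues close together, one far away
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
contacts = contact_pairs(coords, contact_thr=8.0)
```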

Writing corrected Contact Matrices

  • Average Product Correction (APC) (--apc <apc_file>): Write out contact matrix file that has been corrected with APC
  • Entropy Correction (--entropy-correction <entropy_correction_file>): Write out contact matrix file that has been corrected for entropy bias
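
As an illustration of what the APC does to a score matrix, here is a minimal sketch of the standard formula, in which each score is reduced by the product of its row and column means divided by the overall mean. The function name apc and the toy matrix are hypothetical, and including the zero diagonal in the means is an assumption that may differ from CCMpredPy's implementation.

```python
def apc(mat):
    """Average product correction for a symmetric L x L score matrix."""
    n = len(mat)
    row_mean = [sum(row) / n for row in mat]
    total_mean = sum(row_mean) / n
    # c_ij - (mean_i * mean_j) / mean_all
    return [
        [mat[i][j] - row_mean[i] * row_mean[j] / total_mean for j in range(n)]
        for i in range(n)
    ]


# Toy symmetric score matrix for three alignment columns
scores = [
    [0.0, 0.9, 0.3],
    [0.9, 0.0, 0.6],
    [0.3, 0.6, 0.0],
]
corrected = apc(scores)
```

In this toy case the pair (0, 1) keeps a clearly positive score after correction, while the pair (0, 2) drops to roughly zero because its raw score is fully explained by the column averages.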

Sequence Weighting Options

  • 'Simple' weighting (--wt-simple): Use simple sequence weights calculated as w_n = 1/(1 + ID_n), where ID_n is the number of other sequences in the MSA with at least <wt_cutoff> * 100 percent sequence identity to sequence n. [default]
  • Uniform weights (--wt-uniform): Assign w_n = 1 to all sequences.
  • Sequence Identity Cutoff (--wt-cutoff <wt_cutoff>): Consider sequences as similar that have more than <wt_cutoff> * 100 percent identical positions
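
The 'simple' weighting formula above can be sketched in a few lines. This is an illustration under stated assumptions, not CCMpredPy's code: the function name simple_weights is hypothetical, the cutoff of 0.8 is only an example value, identity is computed over all alignment columns including gaps, and the comparison uses "at least" as in the formula.

```python
def simple_weights(msa, cutoff=0.8):
    """Compute w_n = 1 / (1 + ID_n) for every sequence n in the alignment."""
    n_cols = len(msa[0])
    weights = []
    for n, seq_n in enumerate(msa):
        similar = 0  # ID_n: number of other sequences similar to sequence n
        for m, seq_m in enumerate(msa):
            if m == n:
                continue
            identical = sum(a == b for a, b in zip(seq_n, seq_m))
            if identical / n_cols >= cutoff:
                similar += 1
        weights.append(1.0 / (1 + similar))
    return weights


# Toy alignment: three near-identical sequences and one outlier
msa = ["ACDEF", "ACDEF", "ACDEG", "WYWYW"]
weights = simple_weights(msa, cutoff=0.8)
neff = sum(weights)  # effective number of sequences
```

The three similar sequences share their weight (1/3 each) while the outlier keeps weight 1, so the effective number of sequences Neff is 2 rather than 4.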

L2 Regularization Options

  • regularization coefficient singles (--reg-lambda-single <λsingle>): Set the regularization coefficient <λsingle> for the single potentials [default: 10]
  • regularization coefficient pairs (--reg-lambda-pair-factor <λpair_factor>): Set the regularization coefficient for couplings to <λpair> = L x <λpair_factor> given an alignment with L columns [default: 0.2]
  • Zero-centered Prior (--v-zero): Set mu=0 in standard Gaussian prior for single potentials and initialise single potentials at zero. [default: False]
  • PWM-centered Prior (--v-center): Set mu=v* in standard Gaussian prior for single potentials and initialise single potentials at v*. v* represents the PWM model computed from the alignment. [default: True]
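
A small sketch may help to see how the regularization options interact. It assumes a plain L2 penalty of the form lambda_single * sum((v - mu)^2) + lambda_pair * sum(w^2); the function names and the absence of any additional scaling factor are assumptions, not taken from CCMpredPy.

```python
def pair_lambda(n_cols, pair_factor=0.2):
    """lambda_pair = L x lambda_pair_factor for an alignment with L columns."""
    return n_cols * pair_factor


def l2_penalty(v, w, mu, lambda_single=10.0, lambda_pair=1.0):
    """Gaussian-prior penalty: singles are pulled towards mu, couplings towards 0."""
    single_term = lambda_single * sum((vi - mi) ** 2 for vi, mi in zip(v, mu))
    pair_term = lambda_pair * sum(wi ** 2 for wi in w)
    return single_term + pair_term


lam_pair = pair_lambda(100)  # alignment with 100 columns, default factor 0.2

# With --v-center, mu is the PWM estimate v*; with --v-zero, mu is all zeros
v = [1.0, 2.0]
mu_center = [1.0, 1.0]
w = [0.5, -0.5]
penalty = l2_penalty(v, w, mu_center, lambda_single=10.0, lambda_pair=lam_pair)
```

For a 100-column alignment the default factor gives a pair coefficient of 20, so longer alignments regularize their couplings more strongly.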

Gap treatment

  • max gaps per position (--max-gap-pos <max_gap_pos>): Ignore alignment positions with more than <max_gap_pos> percent gaps. [default: 100 --> no removal]
  • max gaps per sequence (--max-gap-seq <max_gap_seq>): Remove sequences with more than <max_gap_seq> percent gaps. [default: 100 --> no removal]
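
The two gap filters can be sketched as below. The function name filter_gaps is hypothetical, and filtering sequences before positions is an assumption about the order of operations; the thresholds are percentages, and only entries with more than the given percentage of gaps are removed.

```python
def filter_gaps(msa, max_gap_pos=100, max_gap_seq=100, gap="-"):
    """Drop gappy sequences, then gappy columns; return (msa, kept column indices)."""
    # Keep sequences with at most max_gap_seq percent gaps
    kept = [s for s in msa if 100.0 * s.count(gap) / len(s) <= max_gap_seq]
    if not kept:
        return [], []
    # Keep columns with at most max_gap_pos percent gaps among the kept sequences
    cols = [
        j
        for j in range(len(kept[0]))
        if 100.0 * sum(s[j] == gap for s in kept) / len(kept) <= max_gap_pos
    ]
    return ["".join(s[j] for j in cols) for s in kept], cols


msa = ["ACD-", "A-D-", "----", "ACDE"]
filtered, kept_cols = filter_gaps(msa, max_gap_pos=50, max_gap_seq=50)
unchanged, all_cols = filter_gaps(msa)  # defaults of 100 remove nothing
```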

Adding Pseudocounts

  • Add uniform pseudocounts (--pc-uniform): Use uniform pseudocounts, e.g. 1/21 [default]
  • Add substitution matrix pseudocounts (--pc-submat): Use substitution matrix pseudocounts from Blosum62 matrix.
  • Add constant pseudocounts (--pc-constant): Use constant pseudocounts
  • Add no pseudocounts (--pc-none): Do not add pseudocounts
  • Pseudocount for singles (--pc-single-count <pc_single>): Specify number of pseudocounts for single frequencies [default: 1]
  • Pseudocount for pairs (--pc-pair-count <pc_pair>): Specify number of pseudocounts for pairwise frequencies [default: 1]
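
One common way to apply uniform pseudocounts is to mix the observed frequencies with a uniform distribution. The sketch below follows that scheme; the mixing weight tau = pc / (Neff + pc) and the function name are assumptions made for illustration and are not confirmed by this page.

```python
def add_uniform_pseudocounts(freqs, neff, pc_count=1.0):
    """Mix observed single-column frequencies with a uniform distribution."""
    n_states = len(freqs)  # e.g. 21: 20 amino acids plus gap
    # Assumption: the admixture shrinks as the effective sequence count grows
    tau = pc_count / (neff + pc_count)
    return [(1.0 - tau) * f + tau / n_states for f in freqs]


# A fully conserved column over 21 states, with Neff = 9 and one pseudocount
observed = [1.0] + [0.0] * 20
smoothed = add_uniform_pseudocounts(observed, neff=9.0, pc_count=1.0)
```

After smoothing, no state has zero frequency and the frequencies still sum to one.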

Compute Coevolution Scores from Alternative Models

  • OMES (--compute-omes): Compute OMES score as in Kass and Horovitz (2002)
  • OMES (variant) (--omes-fodoraldrich): Compute OMES score according to Fodor & Aldrich (2004)
  • Mutual information (MI) (--compute-mi): Compute mutual information (MI)
  • normalized MI (--mi-normalized): Normalize mutual information according to Martin et al (2005)
  • MI with pseudo-counts (--mi-pseudocounts): Add pseudocounts when computing mutual information
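
For orientation, plain mutual information between two alignment columns can be computed as in the following sketch. It uses unweighted sequences, no pseudocounts, the natural logarithm, and treats gaps like any other symbol; all of these are simplifying assumptions, and the function name is hypothetical.

```python
import math
from collections import Counter


def mutual_information(msa, i, j):
    """MI between columns i and j from raw (unweighted) frequencies."""
    n = len(msa)
    count_i = Counter(s[i] for s in msa)
    count_j = Counter(s[j] for s in msa)
    count_ij = Counter((s[i], s[j]) for s in msa)
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        # p(a, b) * log( p(a, b) / (p(a) * p(b)) )
        mi += p_ab * math.log(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi


msa = ["AKC", "AKD", "GEC", "GED"]
mi_coupled = mutual_information(msa, 0, 1)      # columns vary together
mi_independent = mutual_information(msa, 0, 2)  # columns vary independently
```

Perfectly covarying columns with two equally frequent states reach the maximum of log(2), while independent columns give zero.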

Examples

Simple example

Recover couplings for data/1atzA.fas, writing the summed score matrix to data/1atzA.noapc.mat, the APC-corrected score matrix to data/1atzA.apc.mat, and the raw coupling potentials to data/1atzA.braw.gz:

ccmpred data/1atzA.fas \
    -m data/1atzA.noapc.mat \
    --apc data/1atzA.apc.mat \
    -b data/1atzA.braw.gz

Comparison with CCMpred

By default (--ofn-pll), CCMpredPy maximizes the pseudo-likelihood to obtain couplings. Results differ slightly from the C implementation of CCMpred due to the following modifications:

  • single potential regularization prior is centered at the maximum-likelihood estimate of single potentials v*
  • single potentials are initialized at v*
  • regularization strength lambda_v = 10 in order to achieve comparable results to C implementation of CCMpred
  • the LBFGS optimizer is used instead of libconjugrad, the conjugate gradient optimizer used in CCMpred. LBFGS yields comparable results and is faster, but requires more memory than conjugate gradients