Generate Realistic Protein Alignments
With CCMgen it is possible either to generate an MCMC sample of a specified size from an MRF model or to generate a sequence sample along a predefined phylogeny (an idealized binary or star tree, or a topology read from a Newick file).
To generate a simple MCMC sample of 1000 sequences, run the command below. Each sequence is drawn after 500 steps of Gibbs sampling; the Markov chain is initialized randomly, but with the gap structure taken from the original alignment:
ccmgen data/1atzA.braw.gz data/1atzA.mcmc.fas \
--mcmc-sampling \
--alnfile data/1atzA.fas \
--mcmc-sample-random-gapped --mcmc-burn-in 500 --num-sequences 1000
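To make the sampling step concrete, here is a minimal Python sketch of Gibbs sampling from a Potts-style MRF with a fixed gap structure. It is an illustration under assumptions, not CCMgen's implementation: the names `gibbs_sweep` and `sample_random_gapped`, the nested-list layout of the potentials `v` and `w`, and the choice to count one full pass over all positions as one step are all hypothetical.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 amino acids; "-" marks a gap


def gibbs_sweep(seq, v, w, rng):
    """Resample each non-gap position once from its conditional distribution.

    seq is a list of characters, v[i][a] a single-site potential and
    w[i][j][a][b] a pair potential (toy stand-ins for the MRF parameters).
    """
    for i in range(len(seq)):
        if seq[i] == "-":
            continue  # gap positions stay fixed
        log_weights = []
        for a in range(len(ALPHABET)):
            energy = v[i][a]
            for j, other in enumerate(seq):
                if j != i and other != "-":
                    energy += w[i][j][a][ALPHABET.index(other)]
            log_weights.append(energy)
        top = max(log_weights)  # subtract the maximum for numerical stability
        weights = [math.exp(x - top) for x in log_weights]
        seq[i] = rng.choices(ALPHABET, weights=weights)[0]
    return seq


def sample_random_gapped(template, v, w, burn_in, rng):
    """Start from random residues with the template's gaps, then run burn_in sweeps."""
    seq = [c if c == "-" else rng.choice(ALPHABET) for c in template]
    for _ in range(burn_in):
        gibbs_sweep(seq, v, w, rng)
    return "".join(seq)


# Toy model (assumption): 4 positions, no couplings, one strongly preferred residue each.
length = 4
preferred = "KDGS"
v = [[30.0 if aa == preferred[i] else 0.0 for aa in ALPHABET] for i in range(length)]
w = [[[[0.0] * 20 for _ in range(20)] for _ in range(length)] for _ in range(length)]
toy_sample = sample_random_gapped("A-AA", v, w, burn_in=5, rng=random.Random(0))
print(toy_sample)
```

In this toy model the gap at the second position is preserved, and the other positions are drawn almost surely as their preferred residues because the single-site potentials dominate.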
The output generated by CCMgen will look as follows:
┏━╸┏━╸┏┳┓┏━╸┏━╸┏┓╻ version 1.0.0
┃ ┃ ┃┃┃┃╺┓┣╸ ┃┗┫ Vorberg, Seemayer and Soeding (2018)
┗━╸┗━╸╹ ╹┗━┛┗━╸╹ ╹ https://github.com/soedinglab/ccmgen
Using 1 threads for OMP parallelization.
1atzA is of length L=75 and there are 3068 sequences in the alignment.
Alignment has diversity [sqrt(N)/L]=0.739 and Neff(HHsuite-like)=5.492.
Successfully loaded model parameters from data/1atzA.braw.gz.
Start sampling 1000 sequences according to model starting with
random-gapped sequences using burn-in=500.
sampled alignment has 1000 sequences...
Sampled alignment has Neff 2.59408
Writing sampled alignment to data/1atzA.mcmc.fas
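The diversity value in the log follows directly from the formula printed next to it, sqrt(N)/L, with N = 3068 sequences and L = 75 columns. A quick check in Python (the function name `alignment_diversity` is made up for this illustration):

```python
import math


def alignment_diversity(num_sequences, length):
    """Diversity as printed in the log: sqrt(N) / L."""
    return math.sqrt(num_sequences) / length


diversity = alignment_diversity(3068, 75)
print(round(diversity, 3))  # 0.739
```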
One way to qualitatively evaluate the precision of the learned MRF model is to compare alignment statistics (single-site amino acid frequencies, pairwise amino acid frequencies, and covariances) between the original alignment that was used to learn the MRF model and an alignment generated as an MCMC sample from the model:
ccm_plot aln-stats \
-o data/1atzA.pcd_mcmc_vs_original_alignment.html \
-a data/1atzA.fas \
-s data/1atzA.mcmc.fas \
--aln-format fasta
This command generates a plot showing good correlation between the observed and generated alignment statistics.
To generate an alignment along a given phylogenetic topology while obeying the constraints of an MRF model, run:
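As a rough idea of what such a comparison involves, the sketch below computes single-site amino acid frequencies for two alignments and their Pearson correlation. It is a simplified stand-in, not what `ccm_plot aln-stats` does internally: the helper names are hypothetical, and it uses plain counts with no sequence weighting or pseudocounts, which the real tool may apply.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def read_fasta(lines):
    """Collect sequences from FASTA-formatted lines (headers start with '>')."""
    seqs, current = [], []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current:
                seqs.append("".join(current))
            current = []
        elif line:
            current.append(line)
    if current:
        seqs.append("".join(current))
    return seqs


def single_site_frequencies(seqs):
    """Flat list of per-column residue frequencies (gaps count in the denominator)."""
    n = len(seqs)
    freqs = []
    for column in zip(*seqs):
        for aa in ALPHABET:
            freqs.append(column.count(aa) / n)
    return freqs


def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy alignments (assumption); in practice read e.g. data/1atzA.fas and data/1atzA.mcmc.fas.
original = read_fasta([">s1", "AC", ">s2", "AC", ">s3", "AD"])
sampled = read_fasta([">t1", "AC", ">t2", "AD", ">t3", "AC"])
print(pearson(single_site_frequencies(original), single_site_frequencies(sampled)))
```

With real files, `read_fasta(open(path))` would supply the sequences; a correlation close to 1 indicates that the sampled alignment reproduces the single-site statistics of the original.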
ccmgen data/1atzA.braw.gz data/1atzA.binary.fas \
--tree-binary \
--mutation-rate 3 \
--seq0-mrf 10 \
--num-sequences 1024
This command generates an alignment of 1024 sequences sampled along a binary tree topology. The ancestor (root) sequence is obtained by evolving a poly-A sequence for 10 Gibbs steps according to the MRF model. With a mutation rate of 3, each leaf sequence accumulates on average 3 mutations per position between root and leaf, i.e. 3 * 75 (length of protein) = 225 mutations in total. The output generated by CCMgen will look as follows:
┏━╸┏━╸┏┳┓┏━╸┏━╸┏┓╻ version 1.0.0
┃ ┃ ┃┃┃┃╺┓┣╸ ┃┗┫ Vorberg, Seemayer and Soeding (2018)
┗━╸┗━╸╹ ╹┗━┛┗━╸╹ ╹ https://github.com/soedinglab/ccmgen
Using 1 threads for OMP parallelization.
Successfully loaded model parameters from data/1atzA.braw.gz.
Created binary tree with 1024 leaves, 2048 nodes, avg branch length=0.1, depth_min=1.0000e+00, depth_max=1.0000e+00
Ancestor sequence (polyA --> 10 gibbs steps --> seq0) :
KKADIYFLLDGSESVGTDNFETMRHFISRVAEMFDIGFDKVRIGVVQYSRVIHLEFSLNAFSTKEALIAAIDNIQ
avg number of amino acid substitutions (parent -> child): 23.0
avg number of amino acid substitutions (root -> leave): 225.0
Alignment with 1024 sequences was sampled with mutation rate 3.0 and has Neff 2.60924
Writing sampled alignment to data/1atzA.binary.fas
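The numbers in this log can be checked with a little arithmetic. The sketch below assumes a perfectly balanced binary tree whose total depth is normalized to 1 (as the log's depth_min and depth_max suggest) and mutations spread evenly over the tree levels. The function name and the returned keys are made up for illustration.

```python
def binary_tree_expectations(num_leaves, seq_length, mutation_rate):
    """Back-of-the-envelope expectations for a balanced binary tree of depth 1."""
    levels = num_leaves.bit_length() - 1  # log2 for powers of two
    if 2 ** levels != num_leaves:
        raise ValueError("num_leaves must be a power of two")
    root_to_leaf = mutation_rate * seq_length  # substitutions along a root-to-leaf path
    return {
        "levels": levels,
        "branch_length": 1.0 / levels,
        "root_to_leaf": root_to_leaf,
        "parent_to_child": root_to_leaf / levels,
    }


print(binary_tree_expectations(1024, 75, 3))
```

For 1024 leaves this gives 10 levels, a branch length of 0.1, and 225 substitutions from root to leaf, matching the log. The per-branch estimate is 22.5, close to the reported average of 23.0.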
It is also possible to generate an alignment with the same diversity as a specified input alignment (diversity is measured as Neff, as defined in the HHsuite package). Note that it is helpful to use an MRF with only a small number of coupling constraints, e.g. by learning an MRF while setting the couplings for non-contacts to zero (see Learning MRF with small number of constraints).
ccmgen data/1atzA.constrained.braw.gz data/1atzA.binary.fixedNeff.fas \
--tree-binary \
--mutation-rate-neff \
--alnfile data/1atzA.fas \
--seq0-mrf 10 \
--num-sequences 1024
Starting with a mutation rate of 1, CCMgen adapts the mutation rate until an alignment with the desired diversity has been generated (the algorithm is restarted after reaching a maximal mutation rate of 50).
The output will look similar to this:
┏━╸┏━╸┏┳┓┏━╸┏━╸┏┓╻ version 1.0.0
┃ ┃ ┃┃┃┃╺┓┣╸ ┃┗┫ Vorberg, Seemayer and Soeding (2018)
┗━╸┗━╸╹ ╹┗━┛┗━╸╹ ╹ https://github.com/soedinglab/ccmgen
Using 1 threads for OMP parallelization.
1atzA is of length L=75 and there are 3068 sequences in the alignment.
Alignment has diversity [sqrt(N)/L]=0.739 and Neff(HHsuite-like)=5.492.
Successfully loaded model parameters from data/1atzA.constrained.braw.gz.
Created binary tree with 4096 leaves, 8192 nodes, avg branch length=0.083, depth_min=1.000e+00, depth_max=1.000e+00
Sample alignment of 3068 protein sequences with target Neff~5.49181...
Ancestor sequence (polyA --> 10 gibbs steps --> seq0) :
KEFDIVFLVDGSTSISQTKWEVMKPFLKKLAGGMNVSSSSYHVGLIQYSRTNQIHFNLDTHPYAKLVLVAIKDMQ
avg number of amino acid substitutions (parent -> child): 6.0
avg number of amino acid substitutions (root -> leave): 75.0
Alignment with 3068 sequences was sampled with mutation rate 1 and has Neff 3.67 (ΔNeff [%] = 33.172)
Ancestor sequence (polyA --> 10 gibbs steps --> seq0) :
QPMDIAFVIDGSSNTAFDGFRQIRPFLVSFVSQIELRPGSIRVGVVQYSREPQLELPLNAHDDLSGVLNAIRDIQ
avg number of amino acid substitutions (parent -> child): 9.0
avg number of amino acid substitutions (root -> leave): 107.0
Alignment with 3068 sequences was sampled with mutation rate 1.43 and has Neff 4.2376 (ΔNeff [%] = 22.838)
[ removed the sampling output for brevity ]
Ancestor sequence (polyA --> 10 gibbs steps --> seq0) :
QKADLVFLIDASSSITNVDYSKTIDFIESVVRRFDIGPDGVQVGLITYSDVPELLIPLNKFRTKTKVLNAVRRVQ
avg number of amino acid substitutions (parent -> child): 19.0
avg number of amino acid substitutions (root -> leave): 232.0
Alignment with 3068 sequences was sampled with mutation rate 3.1 and has Neff 5.531 (ΔNeff [%] = -0.71374)
Writing sampled alignment to data/1atzA.binary.fixedNeff.fas
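The ΔNeff values in the log are consistent with the relative deviation of the sampled Neff from the target Neff, expressed as a percentage. The sketch below reproduces them from the logged values; the sign convention (target minus sampled) is inferred from the log, and the function name is hypothetical. How CCMgen chooses the next mutation rate is not shown here.

```python
def neff_deviation_percent(target_neff, sampled_neff):
    """Relative deviation of the sampled Neff from the target, in percent."""
    return (target_neff - sampled_neff) / target_neff * 100.0


target = 5.49181  # target Neff from the log
logged = [(1.0, 3.67), (1.43, 4.2376), (3.1, 5.531)]  # (mutation rate, sampled Neff)
for rate, neff in logged:
    print(rate, neff, round(neff_deviation_percent(target, neff), 2))
```

The results agree with the logged ΔNeff values up to rounding of the printed Neff: the deviation shrinks from about 33 % at a mutation rate of 1 to below 1 % at a mutation rate of 3.1, at which point the alignment is written.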