Conservation Calculation Methods

Implementation of several known methods used for calculation of conservation score of protein residues from MSA, and evaluation of their impact on P2Rank pocket prediction.

This repo contains a program conservscore written in Go that can calculate conservation scores using a specified method, and supplementary Bash scripts for easier usage and evaluation.

This project is for Matfyz Summer of Code 2020 (cz) organized by Faculty of Mathematics and Physics, Charles University, Czech Republic.

Background

The function of proteins depends on their interactions with other molecules. Really important is an interaction between proteins and small molecules (=ligands). Lots of today's drugs are small molecules, which inhibit the function of some protein by binding to its active site.

Detection of these active sites is experimentally difficult, so there are various methods for active sites prediction. One of the best ones is P2Rank developed at MFF UK. It is known that residues of active sites are more evolutionary conserved than others. This feature enhanced the P2Rank prediction, but only one conservation calculation method was used.

The main goal of this project is to implement various known conservation calculation methods, and evaluate their impact on P2Rank prediction.

How it works

Each P2Rank dataset should have files containing conservation scores in <dataset>/conservation/e5i1/scores. The files look like this:

# /tmp/file4i81hc -- js_divergence - window_size: 3 - background: blosum62 - seq. weighting: True - gap penalty: 1 - normalized: False
# align_column_number	score	column

0	-1000.00000	-------------------M----M--M-------------M-------------M--
1	0.57284	-M-----------MMMMMMAMMMMVMMTMMMMMMMMMMMM-TMMMMMMMMMMMMMPMM
2	0.52533	-TMMMMM--MMMMQQQQQQQLLFFQYLFFLLLLFYLLLLL-LLLLLLLLLLLLLLLLL
3	0.62155	DVSSSSSSSSSSSDDDDDDDDDDDDDDDDDDDDDDDDDDD-DDDDDDDDDDDDDDDDD
4	0.54686	ASIIIIVIIIILFAAAAAAAAAIAAAAAVAAAAAAAAAAA-VAAAAAAAAAAAAAAAA

In this project, these files are used as input files for conservscore. This program extracts the MSA columns, calculates their conservation scores using a specified method and window size (both are passed as arguments to conservscore), and saves the scores to an output file that looks similar to the input file:

0	0.00000	-------------------M----M--M-------------M-------------M--
1	0.51554	-M-----------MMMMMMAMMMMVMMTMMMMMMMMMMMM-TMMMMMMMMMMMMMPMM
2	0.46502	-TMMMMM--MMMMQQQQQQQLLFFQYLFFLLLLFYLLLLL-LLLLLLLLLLLLLLLLL
3	0.71309	DVSSSSSSSSSSSDDDDDDDDDDDDDDDDDDDDDDDDDDD-DDDDDDDDDDDDDDDDD
4	0.59215	ASIIIIVIIIILFAAAAAAAAAIAAAAAVAAAAAAAAAAA-VAAAAAAAAAAAAAAAA

Next, P2Rank loads these files together with its regular input files, and they are used for training of new models and their evaluation.

Setup

You should have P2Rank installed and relevant datasets downloaded (chen11 and coach420 are used by default).

Go has to be installed in order to build the conservscore program.

Since the scripts are in Bash, it should run on Linux without a problem. On Windows, you may need to either build conservscore and execute commands manually, or have, for example, Git Bash or Windows Subsystem for Linux installed in order to run the Bash scripts.

You might have to change some paths of commands and directories in p2rank-eval-method script.

Usage

The p2rank-eval-method script helps with the whole process:

It prepares the files with conservation score: builds conservscore, recalculates them using the specified method and collects them.
It runs P2Rank that trains and evaluates 10 new models.

./p2rank-eval-method.sh -m METHOD [-w WINDOW]

Available methods:

Shannon entropy of residues: shannon-entropy
Shannon entropy of residue properties: property-entropy
relative entropy (Kullback–Leibler divergence): relative-entropy
Jensen-Shannon divergence: jensen-shannon-divergence
sum-of-pair measure: sum-of-pairs

If the window is greater than 0, a score of a column is affected by nearby column scores. (the value is a number of residues on either side included in the window)

By default, the window size is 0; dataset chen11 is used for training and coach420(mlig) for evaluation. You can change some options in the scripts or in the conservscore program.

Usage for conservscore program

Tip: to measure the duration of the execution of a command/script/program and to print its output to both the terminal and a log file, use:

time SOME_COMMAND 2>&1 | tee log/test.log

Evaluation

To evaluate the impact of different methods (or configurations) on P2Rank prediction, the following tests were made:

without conservation files
with default conservation files
with custom conservation files recalculated using a specified method and the window size of 0 and 3.
methods:
- Shannon entropy of residues: The most simple and commonly used method.
- Shannon entropy of residue properties: Amino acids are grouped by biochemical similarities.
- relative entropy (Kullback–Leibler divergence): Takes into account background distribution (how rare an amino acid is).
- Jensen-Shannon divergence: Based on the relative entropy, symmetric and has finite value.
- sum-of-pairs measure: Uses a similarity matrix (takes into account how similar amino acids are).

This configuration was used:

P2Rank:
- config file: config/conservation (if with conservation files)
- traineval training dataset: chen11.ds
- traineval evaluation dataset: coach420(mlig).ds
- loop: 10 (trains 10 models, each with a different seed)
- seed: 42 (first seed)
- rf_trees: 200
conservscore: (only for the custom conservation files)
- pseudocount: 10e-6 (0.000001)
- use sequence weights: true
- max. gap percentage: 30%
- use gap penalty: true
- similarity matrix and background distribution: BLOSUM62
- window lam: 0.5
- replace negative scores with 0: true

The only difference between the configurations of default and custom conservation files is that the default conservation files were probably calculated using Jensen-Shannon divergence, window size of 3 and pseudocount of 10e-7.

Results

The following table shows the results of the tests.

DCA(4.0) metric is used: the pocket prediction is considered successful if the distance between the pocket center and any ligand atom is less then (or equal to) 4 Å.
[0] denotes Top-(n+0) and [2] Top-(n+2) success rate where n is the number of ligands of a protein. In order to get Top-(n+0) and Top-(n+2) success rate of 100%, the predicted pockets have to be within the first n and n+2 positions, respectively.
These numbers are always an average score of 10 runs (each with a different seed), and were produced by P2Rank.

Shannon entropy means Shannon entropy of residues, and Property entropy means Shannon entropy of residue properties.
By default, the window size is 0 (no window), the w3 denotes window size of 3 residues on each side.

Imp. is the improvement over using the default conservation files.

Model / Method	DCA(4.0) [0]	Imp.	DCA(4.0) [2]	Imp.
Without conservation	68.3		72.6
Default conservation files	72.0	0.0%	75.4	0.0%
Shannon entropy	72.2	+0.3%	76.5	+1.5%
Shannon entropy (w3)	71.6	-0.6%	75.1	-0.4%
property entropy	72.7	+1.0%	76.3	+1.2%
property entropy (w3)	71.9	-0.1%	75.3	-0.1%
relative entropy	73.7	+2.4%	77.8	+3.2%
relative entropy (w3)	72.0	0.0%	76.0	+0.8%
Jensen-Shannon divergence	73.4	+1.9%	77.5	+2.8%
Jensen-Shannon divergence (w3)	72.2	+0.3%	75.6	+0.3%
sum-of-pairs measure	72.4	+0.6%	76.4	+1.3%
sum-of-pairs measure (w3)	71.9	-0.1%	75.2	-0.3%

The results show that P2Rank prediction is generally improved by including the conservation scores.

Both relative entropy and Jensen-Shannon divergence performed better than other tested methods, but most importantly better than using the default conservation files. Relative entropy has a slightly better score than Jensen-Shannon divergence.

Using the window definitely lowers the score (at least window size of 3). In my opinion, the reason might be that this window is sequential, however P2Rank uses its own spacial window that is more important for the pocket prediction.

In conclusion, relative entropy (or Jensen-Shannon divergence) without a window can be used to improve P2Rank prediction.

Sources

Capra JA and Singh M.
Predicting functionally important residues from sequence conservation.
Bioinformatics. 23(15): 1875-1882, 2007.

Radoslav Krivák and David Hoksza.
P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure.
Journal of Cheminformatics. Aug 2018

Lukáš Jendele and Radoslav Krivák and Petr Škoda and Marian Novotný and David Hoksza.
PrankWeb: a web server for ligand binding site prediction and visualization.
Nucleic Acids Research. May 2019

Name		Name	Last commit message	Last commit date
Latest commit History 47 Commits
conservscore		conservscore
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
p2rank-eval-method.sh		p2rank-eval-method.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Conservation Calculation Methods

Table of Contents

Background

How it works

Setup

Usage

Evaluation

Results

Sources

About

Releases

Packages

Languages

License

sykoram/conservation-calculation-methods

Folders and files

Latest commit

History

Repository files navigation

Conservation Calculation Methods

Table of Contents

Background

How it works

Setup

Usage

Evaluation

Results

Sources

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages