AlphaFold-baseline Video
This package provides an basic implementation of the contact prediction network used in AlphaFold 1 for beginner, associated model weights and CASP13 dataset as used for CASP13 (2018) and published in Nature. This is completely different code from that used in AlphaFold 2 which was used in CASP14 (2020). You can find AlphaFold 2 at https://github.com/deepmind/alphafold also as seen below AlphaFold 2.0.
Any publication that discloses findings arising from using this source code must cite Improved protein structure prediction using potentials from deep learning by Andrew W. Senior, Richard Evans, John Jumper, James Kirkpatrick, Laurent Sifre, Tim Green, Chongli Qin, Augustin Žídek, Alexander W. R. Nelson, Alex Bridgland, Hugo Penedones, Stig Petersen, Karen Simonyan, Steve Crossan, Pushmeet Kohli, David T. Jones, David Silver, Koray Kavukcuoglu, Demis Hassabis.
The paper abstract can be found on Nature's site 10.1038/s41586-019-1923-7 and the full text can be accessed directly at https://rdcu.be/b0mtx.
====
- Setup
- Data
- Distogram prediction
- Data splits
- Features
- AlphaFold 2.0
- AlphaFold Protein Structure Database
- Disclaimer
This code can't be used to predict structure of an arbitrary protein sequence. It can be used to predict structure only on the CASP13 dataset (links below). The feature generation code is tightly coupled to our internal infrastructure as well as external tools, hence we are unable to open-source it. We give guide as to the features used for those accustomed to computing them below. See also issue #18 for more details.
This code works on Linux, we don't support other operating systems.
- Python 3.6+.
- Abseil 0.8.0
- Numpy 1.16
- Six 1.12
- Setuptools 41.0.0
- Sonnet 1.35
- TensorFlow 1.14. Not compatible with TensorFlow 2.0+.
- TensorFlow Probability 0.7.0
You can set up Python virtual environment (you might need to install the
python3-venv
package first) with all needed dependencies inside the forked
deepmind_research
repository using:
python3 -m venv alphafold_venv
source alphafold_venv/bin/activate
pip install wheel
pip install -r requirements.txt
Alternatively, you can just use the run_eval.sh
script provided which will run
these commands for you. See the section on running the system below for more
details.
AlphaFold-baseline
├── alphafold_casp13
| ├── asa_output.py
| ├── config_dict.py
| ├── contacts.py
| ├── distogram_io.py
| ├── parsers.py
| ├── secstruct.py
| ├── two_dim_convnet.py
| └── two_dim_resnet.py
├── alphafold_venv --created in #Dependencies section
├── alphafold_casp13_weights --download form #Data section
├── casp13_data --download form #Data section
├── requirements.txt
├── LICENSE
├── test_domains.txt
├── train_domains.txt
└── run_eval.sh
Figure 1. file structure in Unbuntu 18.04
While the code is licensed under the Apache 2.0 License, the AlphaFold weights and data are made available for non-commercial use only under the terms of the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. You can find details at: https://creativecommons.org/licenses/by-nc/4.0/legalcode
You can download the data from:
- http://bit.ly/alphafold-casp13-data-license - The data license file.
- http://bit.ly/alphafold-casp13-data - The dataset to reproduce AlphaFold's CASP13 results.
- http://bit.ly/alphafold-casp13-weights - The model checkpoints.
The dataset to reproduce AlphaFold's CASP13 results can be downloaded from
http://bit.ly/alphafold-casp13-data. The dataset is in a single zip file called
casp13_data.zip
which has about 43.5 GB.
The zip file contains 1 directory for each CASP13 target and a LICENSE.txt
file. Each target directory contains the following files:
TARGET.tfrec
file. This is a TFRecord file with serialized tf.train.Example protocol buffers that contain the features needed to run the model.contacts/TARGET.pickle
file(s) with the predicted distogram. These pickles were pickled using Python 2, so to unpickle them in Python 3 you will need to set theencoding='latin1'
optional argument forpickle.load()
.contacts/TARGET.rr
file(s) with the contact map derived from the predicted distogram. The RR format is described on the CASP website.
Note that for T0999 the target was manually split based on hits in HHSearch
into 5 sub-targets, hence there are 5 distograms
(contacts/T0999s{1,2,3,4,5}.pickle
) and 5 RR files
(contacts/T0999s{1,2,3,4,5}.rr
).
The contacts/
folder is not needed to run the model, these files are included
only for convenience so that you don't need to run the inference for CASP13
targets to get the contact map.
The model checkpoints can be downloaded from
http://bit.ly/alphafold-casp13-weights. The model checkpoints are in a zip file
called alphafold_casp13_weights.zip
which has about 210 MB.
The zip file contains:
- A directory
873731
. This contains the weights for the distogram model. - A directory
916425
. This contains the weights for the background distogram model. - A directory
941521
. This contains the weights for the torsion model. LICENSE.txt
. The model checkpoints have a non-commercial license which is defined in this file.
Each directory with model weights contains a number of different model configurations. Each model has a config file and associated weights. There is only one torsion model. Each model directory also contains a stats file that is used for feature normalization specific to that model.
You can use the run_eval.sh
script to run the entire Distogram prediction
system. There are a few steps you need to start with:
- Download the input data as described above. Unpack the data in the directory with the code.
- Download the model checkpoints as described above. Unpack the data.
- In
run_eval.sh
set the following:DISTOGRAM_MODEL
to the path to the directory with the distogram model.BACKGROUND_MODEL
to the path to the directory with the background model.TORSION_MODEL
to the path to the directory with the torsion model.TARGET
to the name of the target.TARGET_PATH
to the path to the directory with the target input data.OUTPUT_DIR
is by default set to a new directory with a timestamp within your home directory.
QUICK START
Then run
bash run_eval.sh
in the python virtual enviromentalphafold-venv
created.
The contact prediction works in the following way:
- 4 replicas (by replica we mean a configuration file describing the network architecture and a snapshot with the network weights), each with slightly different model configuration, are launched to predict the distogram.
- 4 replicas, each with slightly different model configuration are launched to predict the background distogram.
- 1 replica is launched to predict the torsions.
- The predictions from the different replicas are averaged together using
ensemble_contact_maps.py
. - The predictions for the 64 × 64, 128 × 128 and 256 × 256 distogram crops are
pasted together using
paste_contact_maps.py
.
When running run_eval.sh
the output has the following directory structure:
- distogram/: Contains 4 subfolders, one for each replica. Each of these
contain the predicted ASA, secondary structure and a pickle file with the
distogram for each crop (see below for more details). It also contains an
ensemble
directory with the ensembled distograms. - background_distogram/: Contains 4 subfolders, one for each replica. Each
of these contain a pickle file with the background distogram for each crop.
It also contains an
ensemble
directory with the ensembled background distograms. - torsion/: Contains 1 subfolder as there was only a single replica. This
folder contains contains the predicted ASA, secondary structure, backbone
torsions and a pickle file with the distogram for each crop. It also
contains an
ensemble
directory, which contains a copy of the predicted output as there is only a single replica in this case. - pasted/: Contains distograms obtained from the ensembled distograms by pasting. An RR contact map file is computed from this pasted distogram. This is the final distogram that was used in the subsequent AlphaFold folding pipeline in CASP13.
The distogram is a Python pickle file with a dictionary containing the following fields:
min_range
: The minimum range in Angstroms to consider in distograms.max_range
: The range in Angstroms to consider in distograms, seenum_bins
below for clarification. The upper end of the distogram ismin_range + max_range
.num_bins
: The number of bins in the distance histogram being predicted. We divide the interval frommin_range
tomin_range + max_range
into this many bins. The distograms were trained so that distances lower thanmin_range
were counted in the lowest bin and distances higher thanmin_range + max_range
were added to the final bin. Thenum_bins - 1
boundaries between bins are thusnp.linspace(0, max_range, num_bins + 1, endpoint=True)[1:-1] + min_range
.sequence
: The target sequence of amino acids of lengthL
.target
: The name of the target.domain
: The name of the target including the domain name.probs
: The distogram as a Numpy array of shape[L, L, num_bins]
.
We used a version of PDB downloaded on 2018-03-15. The
train/test split can be found in the train_domains.txt
and test_domains.txt
files in this repository. The split is based on the
CATH 2018-03-16 database.
There is currently no plan to open source the feature generation code as it is tightly coupled to our internal infrastructure as well as external tools which we cannot open source.
Some features are needed only as placeholders to construct the model. These can be set to all zeros when running the inference. Such features are marked in the table below as not needed and you can just fill them with zeros when running inference.
The table below provides an overview of the features we used to make it possible
to reconstruct our feature generation code. Some features that require more
thorough explanation are explained in the section below the table. Note that
NR
stands for number of residues, i.e. the length of the amino acid sequence:
📑 Click to Expand
Name | Needed | TF DType | Shape | Description |
---|---|---|---|---|
aatype |
✔️ | float32 | (NR, 21) |
One hot encoding of amino acid types. The mapping is ARNDCQEGHILKMFPSTWYVX -> range(21) . See below. |
alpha_mask |
❌ | int64 | (NR, 1) |
Mask for alpha_positions . |
alpha_positions |
❌ | float32 | (NR, 3) |
(x, y, z) Carbon Alpha coordinates. |
beta_mask |
❌ | int64 | (NR, 1) |
Mask for beta_positions . |
beta_positions |
❌ | float32 | (NR, 3) |
(x, y, z) Carbon Beta coordinates. |
between_segment_residues |
❌ | int64 | (NR, 1) |
The number of between segment residues (BSR) at the next position. E.g. ABCXXD (XX is BSR) would be [0,0,2,0] . |
chain_name |
❌ | string | (1) |
The chain name. E.g. 'A', 'B', ... |
deletion_probability |
✔️ | float32 | (NR, 1) |
The fraction of sequences that had a deletion at this position. See below. |
domain_name |
❌ | string | (1) |
The domain name. |
gap_matrix |
✔️ | float32 | (NR, NR, 1) |
Covariation signal from the gapped states, this gives an indication of the variance induced due to gapped states. See below. |
hhblits_profile |
❌ | float32 | (NR, 22) |
A profile (probability distribution over amino acid types) computed using HHBlits MSA. Encoding: 20 amino acids + 'X' + '-'. |
hmm_profile |
✔️ | float32 | (NR, 30) |
The HHBlits HHM profile (from the -ohhm HHBlits output file). Asterisks in the output are replaced by 0.0. See below. |
key |
❌ | string | (1) |
The unique id of the protein. |
mutual_information |
❌ | float32 | (NR, NR, 1) |
The average product corrected mutual information. See https://doi.org/10.1093/bioinformatics/btm604. |
non_gapped_profile |
✔️ | float32 | (NR, 21) |
A profile from amino acids only (discounting gaps). See below. |
num_alignments |
✔️ | int64 | (NR, 1) |
The number of HHBlits multiple sequence alignments. Has to be repeated NR times. See below. |
num_effective_alignments |
❌ | float32 | (1) |
The number of effective alignments (neff at 62 % sequence similarity). |
phi_angles |
❌ | float32 | (NR, 1) |
The phi angles. |
phi_mask |
❌ | int64 | (NR, 1) |
Mask for phi_angles . |
profile |
❌ | float32 | (NR, 21) |
A profile (probability distribution over amino acid types) computed using PSI-BLAST. Equivalent to the output of ChkParse. |
profile_with_prior |
✔️ | float32 | (NR, 22) |
A profile computed using HHBlits which takes into account priors and Blosum matrix. See equation 5 in https://doi.org/10.1093/nar/25.17.3389. |
profile_with_prior_without_gaps |
✔️ | float32 | (NR, 21) |
Same as profile_with_prior but without gaps included. |
pseudo_bias |
✔️ | float32 | (NR, 22) |
The bias computed in the MSA pseudolikelihood computation. |
pseudo_frob |
✔️ | float32 | (NR, NR, 1) |
Frobenius norm of pseudolikelihood (gaps not included). Similar to the output of CCMPred. |
pseudolikelihood |
✔️ | float32 | (NR, NR, 484) |
The weights computed in the MSA pseudolikelihood computation. |
psi_angles |
❌ | float32 | (NR, 1) |
The psi angles. |
psi_mask |
❌ | int64 | (NR, 1) |
Mask for psi_angles . |
residue_index |
✔️ | int64 | (NR, 1) |
Index of each residue giong from 0 to NR - 1 . See below. |
resolution |
❌ | float32 | (1) |
The protein structure resolution. |
reweighted_profile |
✔️ | float32 | (NR, 22) |
Profile where sequences are reweighted to weight rarer sequences higher. See below. |
sec_structure |
❌ | int64 | (NR, 8) |
Secondary structure generated by DSSP and one-hot encoded by the mapping -HETSGBI -> range(8) . |
sec_structure_mask |
❌ | int64 | (NR, 1) |
Mask for sec_structure_mask . |
seq_length |
✔️ | int64 | (NR, 1) |
The length of the amino acid sequence. Has to be repeated NR times. See below. |
sequence |
✔️ | string | (1) |
The amino acid sequence (1-letter amino acid encoding). See below. |
solv_surf |
❌ | float32 | (NR, 1) |
Relative solvent accessible area computed using DSSP and then normalized by amino acid maximum accessibility. |
solv_surf_mask |
❌ | int64 | (NR, 1) |
Mask for solv_surf . |
superfamily |
❌ | string | (1) |
The superfamily CATH code. |
💡 More details on needed features
One hot encoding of amino acid types. The following code converts an amino acid string into the one-hot encoding:
def sequence_to_onehot(sequence):
"""Maps the given sequence into a one-hot encoded matrix."""
mapping = {aa: i for i, aa in enumerate('ARNDCQEGHILKMFPSTWYVX')}
num_entries = max(mapping.values()) + 1
one_hot_arr = np.zeros((len(sequence), num_entries), dtype=np.int32)
for aa_index, aa_type in enumerate(sequence):
aa_id = mapping[aa_type]
one_hot_arr[aa_index, aa_id] = 1
return one_hot_arr
The fraction of sequences that had an insert state (denoted by a lowercase letter in the A3M format) at this position. We used the following code to compute it from the HHBlits MSA in the A3M format:
deletion_matrix = []
for msa_sequence in hhblits_a3m_sequences:
deletion_vec = []
deletion_count = 0
for j in msa_sequence:
if j.islower():
deletion_count += 1
else:
deletion_vec.append(deletion_count)
deletion_count = 0
deletion_matrix.append(deletion_vec)
deletion_matrix = np.array(deletion_matrix)
deletion_matrix[deletion_matrix != 0] = 1.0
deletion_probability = deletion_matrix.sum(axis=0) / len(deletion_matrix)
Covariation signal from the gapped states, this gives an indication of the variance induced due to gapped states. Example:
MSA = A A C D B D F J G B M A
- - C D B D F J G B M A
A A C D B - - J G B M A
gap_count = [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]]
gap_matrix = np.matmul(gap_count.T, gap_count)
The HHBlits HHM profile (from the -ohhm
HHBlits output file). Asterisks in the
output are replaced by 0.0. The following code parses the HHM file:
def extract_hmm_profile(hhm_file, sequence, asterisks_replace=0.0):
"""Extracts information from the hmm file and replaces asterisks."""
profile_part = hhm_file.split('#')[-1]
profile_part = profile_part.split('\n')
whole_profile = [i.split() for i in profile_part]
# This part strips away the header and the footer.
whole_profile = whole_profile[5:-2]
gap_profile = np.zeros((len(sequence), 10))
aa_profile = np.zeros((len(sequence), 20))
count_aa = 0
count_gap = 0
for line_values in whole_profile:
if len(line_values) == 23:
# The first and the last values in line_values are metadata, skip them.
for j, t in enumerate(line_values[2:-1]):
aa_profile[count_aa, j] = (
2**(-float(t) / 1000.0) if t != '*' else asterisks_replace)
count_aa += 1
elif len(line_values) == 10:
for j, t in enumerate(line_values):
if j <= 6:
gap_profile[count_gap, j] = (
2**(-float(t) / 1000.0) if t != '*' else asterisks_replace)
else:
# Neff_M, Neff_I, and Neff_D are given in units of 0.001.
gap_profile[count_gap, j] = float(t) / 1000.0
count_gap += 1
elif not line_values:
pass
else:
raise ValueError('Wrong length of line %s hhm file. Expected 0, 10 or 23'
'got %d'%(line_values, len(line_values)))
hmm_profile = np.hstack([aa_profile, gap_profile])
assert len(hmm_profile) == len(sequence)
return hmm_profile
A profile from amino acids only (discounting gaps).
def non_gapped_profile(amino_acids):
"""Computes a profile from only amino acids and discounting gaps."""
profile = np.zeros(21)
for aa in amino_acids:
if aa != 21: # Ignore gaps.
profile[aa] += 1.
return profile / np.sum(profile)
The number of HHBlits multiple sequence alignments. Has to be repeated NR
times. For example, if there are 10 alignments for a sequence of length 8, then
num_alignments = [[10], [10], [10], [10], [10], [10], [10], [10]]
.
This feature collapses the 484 channels of pseudolikelihood into one by taking the Frobenius norm of the 484 channels and then subtracting the Average Product Correction of the computed Frobenius norm. The Frobenius norm does not take into account the 22nd gap state.
Parameters of a Potts Model coupling the amino acid types of particular residues estimated by pseudolikelihood. See https://doi.org/10.1103/PhysRevE.87.012707 for more details.
Index of each residue giong from 0 to NR - 1
. For example, the sequence AACR
has residue_index = [[0], [1], [2], [3]]
.
Profile where sequences are reweighted to weight rarer sequences higher. The sequence weights are calculated like this:
def sequence_weights(sequence_matrix):
"""Compute sequence reweighting to weight rarer sequences higher."""
num_rows, num_res = sequence_matrix.shape
cutoff = 0.62 * num_res
weights = np.ones(num_rows, dtype=np.float32)
for i in range(num_rows):
for j in range(i + 1, num_rows):
similarity = (sequence_matrix[i] == sequence_matrix[j]).sum()
if similarity > cutoff:
weights[i] += 1
weights[j] += 1
return 1.0 / weights
The length of the amino acid sequence. Has to be repeated NR
times. For
example, the sequence AACR
would have seq_length = [[4], [4], [4], [4]]
.
The amino acid sequence (1-letter amino acid encoding). For example, a protein
with Alanine, Lysine, Arginine has sequence = 'AKR'
.
https://github.com/deepmind/alphafold/
You can use a slightly simplified version of AlphaFold with this Colab notebook or community-supported versions (see below).
Figure 2. CASP14 predictions
Figure 3. probable disease resistance protein At1g58602
This is not an official Google product.
Special thanks for DeepMind !