Aligning proteins
First, download the pretrained DeepBLAST model:
wget https://users.flatironinstitute.org/jmorton/public_www/deepblast-public-data/checkpoints/deepblast-l8.ckpt
We recommend downloading the ProTrans model from Hugging Face so that you have a local copy of it:
git lfs install
git clone https://huggingface.co/Rostlab/prot_t5_xl_uniref50
If you use the command-line version, this step is not necessary, since the ProTrans model is downloaded automatically by default.
Once those two models are downloaded, you can load the DeepBLAST model.
from deepblast.utils import load_model
# Load the model on a GPU
model = load_model("deepblast-l8.ckpt", "prot_t5_xl_uniref50").cuda()
# Alternatively, load the model on the CPU
model = load_model("deepblast-l8.ckpt", "prot_t5_xl_uniref50", device='cpu')
Note that the load_model function also has an alignment_mode option that selects the type of alignment used for inference. You can specify either needleman-wunch for global alignment or smith-waterman for local alignment.
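For intuition about how global and local alignment differ, the sketch below scores two sequences with the classical dynamic-programming recurrences. It is not DeepBLAST's implementation: the function name alignment_score and the match, mismatch, and gap scores are illustrative assumptions.

```python
def alignment_score(x, y, mode="global", match=1, mismatch=-1, gap=-1):
    """Score two sequences with a classical DP recurrence (illustrative only).

    mode="global": every residue of both sequences must be accounted for.
    mode="local":  only the best-scoring pair of subsequences is scored.
    """
    if mode not in ("global", "local"):
        raise ValueError(f"unknown mode: {mode!r}")
    local = mode == "local"
    m = len(y)
    # prev[j] is the best score for the previous row (prefix of x) against y[:j].
    prev = [0 if local else j * gap for j in range(m + 1)]
    best = 0
    for i in range(1, len(x) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            score = max(prev[j - 1] + sub,  # align x[i-1] with y[j-1]
                        prev[j] + gap,      # x[i-1] aligned to a gap
                        curr[j - 1] + gap)  # y[j-1] aligned to a gap
            if local:
                score = max(score, 0)       # a local alignment may restart anywhere
                best = max(best, score)
            curr.append(score)
        prev = curr
    return best if local else prev[m]


# The two sequences share only the short stretch "CGT".
print(alignment_score("AAACGTAAA", "CCCCGTCCC", mode="global"))  # -3
print(alignment_score("AAACGTAAA", "CCCCGTCCC", mode="local"))   # 3
```

Global alignment is penalized for the unrelated flanking residues, whereas local alignment scores only the shared region.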
Once the model is loaded, we can test DeepBLAST by structurally aligning two proteins using only their sequences:
x = 'IGKEEIQQRLAQFVDHWKELKQLAAARGQRLEESLEYQQFVANVEEEEAWINEKMTLVASED'
y = 'QQNKELNFKLREKQNEIFELKKIAETLRSKLEKYVDITKKLEDQNLNLQIKISDLEKKLSDA'
# obtains alignment string specifying structural superposition
pred_alignment = model.align(x, y)
The resulting alignment string specifies how the residues are aligned, with one character per alignment column. A : indicates a match, a 1 indicates a residue from sequence 1 aligned to a gap in sequence 2 (an insertion), and a 2 indicates a residue from sequence 2 aligned to a gap in sequence 1 (a deletion). To make this more human-readable, we can directly visualize the alignment.
from deepblast.dataset.utils import states2alignment
x_aligned, y_aligned = states2alignment(pred_alignment, x, y)
print(x_aligned)
print(pred_alignment)
print(y_aligned)
Output
-IGKEEIQQRLAQFVDHWKELKQLAAARGQRLEESLEYQ-QFVANVEEEEAWINEKMTLVASED
21:::::::::::::::::::::::::::::::::::::2::::::::::::::::::::::1:
Q-QNKELNFKLREKQNEIFELKKIAETLRSKLEKYVDITKKLEDQNLNLQIKISDLEKKLSD-A
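To make the state encoding concrete, here is a minimal pure-Python sketch of how a state string can be expanded into two gapped sequences. It is not the library's states2alignment implementation; decode_states is a hypothetical helper that only follows the :, 1, and 2 conventions described above.

```python
def decode_states(states, x, y, gap="-"):
    """Expand a state string into gapped copies of x and y (illustrative only).

    ':' consumes one residue from each sequence,
    '1' consumes a residue from x and puts a gap in y,
    '2' consumes a residue from y and puts a gap in x.
    """
    x_out, y_out = [], []
    i = j = 0  # positions of the next unconsumed residue in x and y
    for s in states:
        if s == ":":
            x_out.append(x[i])
            y_out.append(y[j])
            i += 1
            j += 1
        elif s == "1":
            x_out.append(x[i])
            y_out.append(gap)
            i += 1
        elif s == "2":
            x_out.append(gap)
            y_out.append(y[j])
            j += 1
        else:
            raise ValueError(f"unexpected state: {s!r}")
    if i != len(x) or j != len(y):
        raise ValueError("state string does not cover both sequences")
    return "".join(x_out), "".join(y_out)


# Toy example (made-up sequences, not model output).
x_toy, y_toy = decode_states("2:1:", "ACD", "GAD")
print(x_toy)  # -ACD
print(y_toy)  # GA-D
```

Applied to the sequences and state string from the output above, the same rules reproduce the two gapped lines shown there.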