Code and datasets for UniPMT, the model proposed in the paper "UniPMT: A Unified Deep Framework for Peptide, MHC, and TCR Binding Prediction".
Operating system: Linux Ubuntu 20.04.
Software dependencies and versions: see `./code/requirements.txt`.
Hardware: CPU: Intel Xeon Platinum 8360Y @ 2.40GHz × 144; GPU: Nvidia A100 (for training); Nvidia A100 or Nvidia RTX 3090 (for evaluation).
- Install the required packages listed in `./code/requirements.txt` with `pip install -r requirements.txt`. Normal install time: within 1 hour.
- Download the datasets (see the `./data/` folder) and place them in that folder, e.g., `./data/pmt_pmt/`.
- Download the trained model file (see the `./model/` folder) and place it in that folder, e.g., `./model/model_pmt_pmt.pt`.
- To reproduce the PMT results:
  - Modify the `code/config/config.py` file: `data_folder = pmt_pmt`.
  - Run the evaluation with `python main.py`. Expected running time: within 1 min on an Nvidia RTX 3090, or about 5 seconds on an Nvidia A100.
- To reproduce the PM results (a config sketch follows this list):
  - Modify the `code/config/config.py` file: `data_folder = pm_iedbsame`.
  - Run the evaluation with `python main.py`. Expected running time: within 1 min on an Nvidia RTX 3090, or about 5 seconds on an Nvidia A100.
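The two runs differ only in the `data_folder` setting. A minimal sketch of the edit in `code/config/config.py` (the exact assignment form in the repo may differ; treat this as illustrative):

```python
# code/config/config.py -- illustrative excerpt; only this line changes per task
data_folder = "pmt_pmt"  # PMT task; set to "pm_iedbsame" to reproduce the PM task
```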
- The test results (AUC, PRAUC) on the test set are printed to the terminal.
- The predicted score for each sample in the test set is stored in `./output/predictions/`.
- To make the scores easier to interpret, we also provide score files with the corresponding sequences in `./output/predictions/`, ending with `_with_name.csv` (e.g., `result_pm_iedbsame_with_name.csv`, `result_pmt_pmt_with_name.csv`).
UniPMT Training Process
- Data Processing and Graph Construction (a sketch follows this list)
  - Load and preprocess datasets for P-M, P-T, and P-M-T bindings.
  - Remove duplicates and anomalies from the data.
  - Create edge sets E for P-M, P-T, and P-M-T bindings.
  - Represent peptides (P), MHCs (M), and TCRs (T) as nodes, forming a heterogeneous graph G(V, E).
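A minimal sketch of this step, assuming pandas edge lists with `peptide`/`mhc`/`tcr` columns (the column names and file layout are assumptions, not the repo's actual schema):

```python
import pandas as pd

def build_graph_inputs(pm_df, pt_df, pmt_df):
    """Deduplicate binding records and collect the node and edge sets of G(V, E)."""
    # Remove duplicates and anomalies (here: rows with missing values)
    pm_df, pt_df, pmt_df = (df.drop_duplicates().dropna() for df in (pm_df, pt_df, pmt_df))

    # Nodes V: unique peptides (P), MHCs (M), and TCRs (T)
    peptides = set(pm_df["peptide"]) | set(pt_df["peptide"]) | set(pmt_df["peptide"])
    mhcs = set(pm_df["mhc"]) | set(pmt_df["mhc"])
    tcrs = set(pt_df["tcr"]) | set(pmt_df["tcr"])

    # Edge sets E for P-M, P-T, and P-M-T bindings
    e_pm = set(zip(pm_df["peptide"], pm_df["mhc"]))
    e_pt = set(zip(pt_df["peptide"], pt_df["tcr"]))
    e_pmt = set(zip(pmt_df["peptide"], pmt_df["mhc"], pmt_df["tcr"]))
    return (peptides, mhcs, tcrs), (e_pm, e_pt, e_pmt)
```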
- Initial Embedding Representation (an embedding sketch follows this list)
  - Generate initial embeddings for P and T nodes using the ESM method: hp, ht <- ESM(P, T)
  - Generate initial embeddings for M nodes using pseudo sequences: hm <- Pseudo(M)
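A sketch of this step using the public fair-esm package and an ESM-2 checkpoint. The checkpoint choice, the mean-pooling, and feeding the MHC pseudo sequence through the same encoder are assumptions, and the example sequences are illustrative:

```python
import torch
import esm  # pip install fair-esm

# Load a pretrained ESM-2 model (checkpoint choice is an assumption)
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

def embed_sequences(seqs):
    """Mean-pooled per-sequence embeddings: h <- ESM(seq)."""
    data = [(f"seq{i}", s) for i, s in enumerate(seqs)]
    _, _, tokens = batch_converter(data)
    with torch.no_grad():
        out = model(tokens, repr_layers=[33])
    reps = out["representations"][33]
    # Average over residue positions, skipping the BOS/EOS tokens
    return torch.stack([reps[i, 1:len(s) + 1].mean(0) for i, (_, s) in enumerate(data)])

h_p = embed_sequences(["SIINFEKL"])        # peptide nodes: hp <- ESM(P)
h_t = embed_sequences(["CASSIRSSYEQYF"])   # TCR nodes: ht <- ESM(T)
h_m = embed_sequences(["YFAMYQENMAHTDANTLYIIYRDYTWVARVYRGY"])  # MHC pseudo sequence: hm <- Pseudo(M)
```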
- Graph Neural Network Learning
  - def GraphSAGE (a PyTorch sketch follows this list):
    - For each node ni at layer l+1: h_ni^(l+1) = ReLU(W^(l) * MEAN({h_nj^(l) | nj in Neighbors(ni)}))
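A minimal PyTorch sketch of the mean-aggregator update above, written as a hand-rolled layer (the repo may instead use a library implementation such as PyTorch Geometric's SAGEConv):

```python
import torch
import torch.nn as nn

class SAGEMeanLayer(nn.Module):
    """h_ni^(l+1) = ReLU(W^(l) * MEAN({h_nj^(l) | nj in Neighbors(ni)}))."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # W^(l)

    def forward(self, h, adj):
        # h: (N, in_dim) node features; adj: (N, N) binary adjacency matrix
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)  # avoid division by zero
        mean_neigh = (adj @ h) / deg                     # mean over neighbor features
        return torch.relu(self.W(mean_neigh))
```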
- Multi-task Learning
  - def P-M Task Learning (see the P-M sketch after this list):
    - Generate vector representation for P-M binding: v_pm = f_pm(hp, hm)
    - Calculate P-M binding probability: P_pm = sigmoid(w_pm * v_pm)
    - Compute cross-entropy loss: L_pm = -(1/N_pm) * sum(y_pm^(i) * log(P_pm^(i)) + (1 - y_pm^(i)) * log(1 - P_pm^(i)))
  - def P-M-T Task Learning (see the P-M-T sketch after this list):
    - Reuse the P-M representation v_pm.
    - Generate vector representation for M-T binding: v_mt = f_mt(hm, ht)
    - Calculate the P-M-T binding score and probability: P_pmt = sigmoid(f_DMF(v_pm * v_mt))
    - Optimize with the Info-NCE contrastive loss: L_pmt = -(1/N_pmt) * sum_i log( exp(P_pmt^(i)/tau) / (exp(P_pmt^(i)/tau) + sum_j exp(P_pmt^(i,j)/tau)) )
  - def P-T Task Learning (see the P-T sketch after this list):
    - Aggregate the P-M-T binding probabilities over the M candidate MHCs: P_pt = (1/M) * sum_{j=1..M} P_pm_j,t
    - Compute cross-entropy loss: L_pt = -(1/N_pt) * sum(y_pt^(i) * log(P_pt^(i)) + (1 - y_pt^(i)) * log(1 - P_pt^(i)))
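P-M sketch. A minimal implementation of the P-M head under the equations above; `f_pm` is shown as a hypothetical MLP over concatenated embeddings, and the actual fusion function in UniPMT may differ:

```python
import torch
import torch.nn as nn

class PMHead(nn.Module):
    """P-M task: v_pm = f_pm(hp, hm); P_pm = sigmoid(w_pm * v_pm)."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.f_pm = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU())  # assumed form of f_pm
        self.w_pm = nn.Linear(hidden, 1)                                  # scoring vector w_pm

    def forward(self, hp, hm):
        v_pm = self.f_pm(torch.cat([hp, hm], dim=-1))
        p_pm = torch.sigmoid(self.w_pm(v_pm)).squeeze(-1)
        return v_pm, p_pm

# L_pm is standard binary cross-entropy over the N_pm labeled pairs:
# nn.BCELoss()(p_pm, y_pm) == -(1/N_pm) * sum(y*log p + (1-y)*log(1-p))
```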
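P-M-T sketch. The score and the Info-NCE loss above; the form of `f_DMF` and the negative-sampling scheme are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# P_pmt = sigmoid(f_DMF(v_pm * v_mt)); f_DMF sketched here as a small MLP (assumed form)
f_dmf = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))

def pmt_score(v_pm, v_mt):
    return torch.sigmoid(f_dmf(v_pm * v_mt)).squeeze(-1)

def info_nce_loss(pos_scores, neg_scores, tau=0.1):
    """L_pmt = -(1/N) * sum_i log( exp(s_i/tau) / (exp(s_i/tau) + sum_j exp(s_ij/tau)) ).
    pos_scores: (N,) scores of observed P-M-T triples; neg_scores: (N, K) sampled negatives."""
    logits = torch.cat([pos_scores.unsqueeze(1), neg_scores], dim=1) / tau
    targets = torch.zeros(len(pos_scores), dtype=torch.long)  # positive sits in column 0
    return F.cross_entropy(logits, targets)
```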
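P-T sketch. The P-T prediction averages P-M-T probabilities over the M candidate MHCs; the indexing (one column per candidate MHC) is an assumption:

```python
import torch.nn.functional as F

def pt_probability(p_pmt_per_mhc):
    """P_pt = (1/M) * sum_j P_pm_j,t: mean P-M-T probability over the M candidate MHCs.
    p_pmt_per_mhc: (N_pt, M) tensor of P-M-T probabilities for each P-T pair."""
    return p_pmt_per_mhc.mean(dim=1)

def pt_loss(p_pt, y_pt):
    # L_pt: binary cross-entropy over the N_pt labeled P-T pairs
    return F.binary_cross_entropy(p_pt, y_pt.float())
```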
- Training Process (a schematic loop follows this list)
  - For each epoch:
    - For each batch in the dataset:
      - Update node embeddings using GraphSAGE.
      - Perform P-M task learning and compute L_pm.
      - Perform P-M-T task learning and compute L_pmt.
      - Perform P-T task learning and compute L_pt.
      - Combine the task losses: L = lambda_pm * L_pm + lambda_pmt * L_pmt + lambda_pt * L_pt
      - Update the model parameters by minimizing L.
    - Check for convergence or stopping criteria.
  - Continue training until the model converges or meets the predefined stopping criteria.
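Putting the steps together, a schematic of the training loop. The module, losses, and weights below are runnable placeholders so the example is self-contained, not the repo's actual components; in UniPMT the three terms are L_pm, L_pmt (Info-NCE), and L_pt from the sketches above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder stand-ins: one linear layer for the GNN + task heads, random batches
model = nn.Linear(32, 3)                                  # emits one logit per task
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
lambda_pm, lambda_pmt, lambda_pt = 1.0, 1.0, 1.0          # task weights (assumed values)

for epoch in range(10):                                   # "for each epoch"
    for _ in range(5):                                    # "for each batch in the dataset"
        x = torch.randn(64, 32)                           # stand-in for GraphSAGE embeddings
        y = torch.randint(0, 2, (64, 3)).float()          # stand-in labels per task
        p = torch.sigmoid(model(x))
        # Surrogate BCE per task so the loop runs end to end
        L_pm, L_pmt, L_pt = (F.binary_cross_entropy(p[:, i], y[:, i]) for i in range(3))
        L = lambda_pm * L_pm + lambda_pmt * L_pmt + lambda_pt * L_pt
        optimizer.zero_grad()
        L.backward()
        optimizer.step()
    # Convergence / early-stopping checks would go here.
```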