Cleavage Prediction, Extended

This repository is outdated. See the new and extended repository for the current paper version: https://github.com/ziegler-ingo/cleavage_benchmark.

This repository contains the code and dataset for the article "Proteasomal cleavage prediction: state-of-the-art and future directions".

Abstract

Epitope vaccines are a promising approach for precision treatment of pathogens, cancer, autoimmune diseases, and allergies. Effectively designing such vaccines requires accurate proteasomal cleavage prediction to ensure that the epitopes included in the vaccine trigger an immune response. The performance of proteasomal cleavage predictors has been steadily improving over the past decades owing to increasing data availability and methodological advances. In this review, we summarize the current proteasomal cleavage prediction landscape and, in light of recent progress in the field of deep learning, develop and compare a wide range of recent architectures and techniques, including long short-term memory (LSTM), transformers, and convolutional neural networks (CNN), as well as four different denoising techniques. All open-source cleavage predictors re-trained on our dataset performed within two AUC percentage points. Our comprehensive deep learning architecture benchmark improved performance by 0.4 AUC percentage points, while closed-source predictors performed considerably worse. We found that a wide range of architectures and training regimes all result in very similar performance, suggesting that the specific modeling approach employed has a limited impact on predictive performance compared to the specifics of the dataset employed. We speculate that the noise and implicit nature of data acquisition techniques used for training proteasomal cleavage prediction models and the complexity of biological processes of the antigen processing pathway are the major limiting factors. While biological complexity can be tackled by more data and, to a lesser extent, better models, noise and randomness inherently limit the maximum achievable predictive performance.

Repository Structure

  • preprocessing includes the notebooks that shuffle and split the data into C- and N-terminal train, evaluation, and test splits, and perform 3-mer and other basic preprocessing steps, such as tokenization (a minimal 3-mer sketch follows this list)
  • data holds the .csv and .tsv train, evaluation, and test files, as well as a vocabulary file
  • denoise/divide_mix holds our adjusted implementation of the DivideMix algorithm
    • try-out runs (e.g. testing the impact of varying epochs, the weighting of the unlabeled loss, and the sample distributions after Gaussian Mixture Model separation) can be found under denoise/dividemix_tryout_debug
    • all other tested denoising methods are implemented directly in the notebooks
  • models/hyperparam_search holds the training and hyperparameter search implementation of the asynchronous hyperband algorithm using Ray Tune
  • models/final holds the final training and evaluation structure for model architectures paired with all denoising approaches. Ablation experiments are also included there.
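
For readers unfamiliar with the 3-mer step mentioned in the preprocessing item, here is a minimal sketch of how overlapping 3-mers and an integer vocabulary could be built. The function names and the <pad>/<unk> conventions are illustrative assumptions, not the repository's actual preprocessing code.

```python
# Minimal, illustrative sketch of 3-mer preprocessing (not the repository's actual code).
# Assumes cleavage windows are given as plain amino-acid strings, e.g. "ACDEFGHIK".

def to_3mers(sequence: str) -> list[str]:
    """Split a residue string into overlapping 3-mers, e.g. 'ACDEF' -> ['ACD', 'CDE', 'DEF']."""
    return [sequence[i:i + 3] for i in range(len(sequence) - 2)]

def build_vocab(sequences: list[str]) -> dict[str, int]:
    """Map every observed 3-mer to an integer id; ids 0 and 1 are reserved for padding/unknown."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for seq in sequences:
        for kmer in to_3mers(seq):
            vocab.setdefault(kmer, len(vocab))
    return vocab

def encode(sequence: str, vocab: dict[str, int]) -> list[int]:
    """Tokenize a single sequence into integer ids for model input."""
    return [vocab.get(kmer, vocab["<unk>"]) for kmer in to_3mers(sequence)]
```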

Naming structure of final notebooks

  • All notebooks are named as follows: the applicable terminal, i.e. c or n, followed by the model architecture, e.g. bilstm, followed by the denoising method, e.g. dividemix
  • Example: c_bilstm_dividemix.ipynb
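
As a small convenience, the naming scheme above can also be generated programmatically, as sketched below; the helper and the name sets are illustrative and not part of the repository.

```python
# Illustrative helper for the notebook naming scheme described above; not part of the repository.
TERMINALS = {"c", "n"}
DENOISERS = {"coteaching", "coteaching_plus", "jocor", "nad", "dividemix"}

def notebook_name(terminal: str, model: str, denoise: str) -> str:
    """Build a final-notebook filename, e.g. ('c', 'bilstm', 'dividemix') -> 'c_bilstm_dividemix.ipynb'."""
    if terminal not in TERMINALS or denoise not in DENOISERS:
        raise ValueError(f"unknown terminal or denoising method: {terminal}, {denoise}")
    return f"{terminal}_{model}_{denoise}.ipynb"
```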

Available model architectures

  • BiLSTM, called bilstm
  • BiLSTM with Attention, called bilstm_att
  • BiLSTM with pre-trained Prot2Vec embeddings, called bilstm_prot2vec
  • Attention enhanced CNN, called cnn
  • BiLSTM with ESM2 representations as embeddings, called bilstm_esm2
  • Fine-tuning of ESM2, called esm2
  • BiLSTM with T5 representations as embeddings, called bilstm_t5
  • Base BiLSTM with various trained tokenizers
    • Byte-level byte-pair encoding tokenizers with vocabulary sizes 1,000 and 50,000, called bilstm_bbpe1 and bilstm_bbpe50
    • WordPiece tokenizer with vocabulary size 50,000, called bilstm_wp50
  • BiLSTM with forward-backward representations as embeddings, called bilstm_fwbw
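
For orientation, here is a minimal PyTorch sketch of a plain BiLSTM classifier of the kind listed above; the class name, layer sizes, and single-layer setup are illustrative assumptions and do not reflect the hyperparameters used in the notebooks.

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Minimal single-layer bidirectional LSTM over tokenized cleavage windows (illustrative sizes)."""

    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, 1)  # binary: cleaved vs. not cleaved

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(token_ids)               # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)               # hidden: (2, batch, hidden_dim)
        final = torch.cat([hidden[0], hidden[1]], dim=-1)  # concatenate forward/backward final states
        return self.classifier(final).squeeze(-1)          # raw logits, e.g. for BCEWithLogitsLoss
```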

Available denoising architectures

  • Co-Teaching, called coteaching
  • Co-Teaching+, called coteaching_plus
  • JoCoR, called jocor
  • Noise Adaptation Layer, called nad
  • DivideMix, called dividemix
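
To illustrate one of the listed methods, here is a minimal sketch of a noise adaptation layer, i.e. a learnable label-noise transition matrix stacked on top of a base classifier; the class name and initialization are illustrative assumptions, not the notebooks' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoiseAdaptationLayer(nn.Module):
    """Learnable label-noise transition matrix on top of a base classifier (illustrative two-class version)."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Initialize near the identity so training starts from the "no noise" assumption.
        self.transition_logits = nn.Parameter(torch.eye(num_classes) * 5.0)

    def forward(self, base_logits: torch.Tensor) -> torch.Tensor:
        clean_probs = F.softmax(base_logits, dim=1)            # p(true label | x) from the base model
        transition = F.softmax(self.transition_logits, dim=1)  # T[i, j] = p(observed label j | true label i)
        noisy_probs = clean_probs @ transition                 # p(observed label | x)
        return noisy_probs.clamp_min(1e-8).log()               # log-probabilities, e.g. for nn.NLLLoss
```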

Achieved performances

Results of our new architectures benchmarked against one another, including denoising methods

Performance comparison of all models and denoising architectures for C- and N-terminal

Ablation analysis results of our best method, the BiLSTM

Ablation study results

Comparison of our best method, the BiLSTM, to other published methods (in % AUC)

| Method | C-Terminal | N-Terminal |
| --- | --- | --- |
| PCPS | 51.3 | 50.0 |
| PCM | 64.5 | 52.4 |
| NetChop 3.1 (20S) | 66.1 | 52.7 |
| NetChop 3.1 (C-term) | 81.5 | 51.0 |
| SVM* | 84.8 | 73.2 |
| PCM* | 85.3 | 75.5 |
| Logistic Regression* | 86.2 | 76.2 |
| NetCleave* | 86.9 | 76.4 |
| PUUPL* | 87.2 | 78.0 |
| Pepsickle* | 88.1 | 78.9 |
| Our BiLSTM (6+4) | 88.5 | 79.5 |
| Our BiLSTM (28+28) | 92.3 | 89.3 |

* Method has been re-trained from scratch on our dataset.

For other methods, we used published pre-trained models (NetChop, PCM), or web-server functionality (PCPS).
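
The AUC values in the table are given in percent; for reference, such a score could be computed from test-set labels and model scores roughly as follows (the arrays below are placeholders, not our data).

```python
from sklearn.metrics import roc_auc_score

# Placeholder labels and scores; in practice these come from the test split and the model outputs.
y_true = [0, 1, 1, 0, 1]
y_score = [0.2, 0.8, 0.6, 0.4, 0.9]

# roc_auc_score is rank-based, so probabilities and raw logits give the same AUC.
auc_percent = 100.0 * roc_auc_score(y_true, y_score)
print(f"AUC: {auc_percent:.1f}%")
```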

Sources for model architectures and denoising approaches

LSTM Architecture

LSTM Attention Architecture

CNN Architecture

Prot2Vec Embeddings

FwBw Architecture

MLP Architecture

T5 Architecture

ESM2 Architecture

Noise Adaptation Layer

Co-teaching

  • Co-teaching loss function and training process adaptations are based on Han et al., 2018, and the official implementation on GitHub
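
A minimal sketch of the small-loss sample exchange at the core of co-teaching as described by Han et al., 2018; the function signature and forget-rate handling are illustrative assumptions, not the repository's implementation.

```python
import torch
import torch.nn.functional as F

def coteaching_losses(logits_a: torch.Tensor, logits_b: torch.Tensor,
                      labels: torch.Tensor, forget_rate: float):
    """Each network is updated on the samples its peer considers 'clean' (smallest loss).

    labels is expected as a float tensor of 0./1. cleavage labels.
    """
    per_sample_a = F.binary_cross_entropy_with_logits(logits_a, labels, reduction="none")
    per_sample_b = F.binary_cross_entropy_with_logits(logits_b, labels, reduction="none")

    num_keep = int((1.0 - forget_rate) * labels.numel())
    keep_a = torch.argsort(per_sample_a)[:num_keep]  # samples net A treats as clean
    keep_b = torch.argsort(per_sample_b)[:num_keep]  # samples net B treats as clean

    # Cross update: A learns from B's selection and vice versa.
    loss_a = F.binary_cross_entropy_with_logits(logits_a[keep_b], labels[keep_b])
    loss_b = F.binary_cross_entropy_with_logits(logits_b[keep_a], labels[keep_a])
    return loss_a, loss_b
```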

Co-teaching+

  • Co-teaching+ loss function and training process adaptations are based on Yu et al., 2019, and the official implementation on GitHub

JoCoR

  • JoCoR loss function and training process adaptations are based on Wei et al., 2020, and the official implementation on GitHub

DivideMix

  • DivideMix structure is based on Li et al., 2020, and the official implementation on GitHub
  • As DivideMix was originally implemented for image data, we adjusted the MixMatch and mixup parts for sequential data, based on Guo et al., 2019 (a minimal sketch of the interpolation follows below)
    • This part is directly implemented in the respective forward pass in the notebooks, and thus cannot be found in the denoise/divide_mix directory
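
A minimal sketch of the kind of embedding-level interpolation meant here, in the spirit of Guo et al., 2019; the function below only illustrates the idea and is not the forward-pass code from the notebooks.

```python
import torch

def mixup_embeddings(emb_a: torch.Tensor, emb_b: torch.Tensor,
                     labels_a: torch.Tensor, labels_b: torch.Tensor, alpha: float = 0.75):
    """Interpolate embedded sequences and their (soft) labels, since discrete amino-acid
    tokens cannot be mixed directly."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    lam = max(lam, 1.0 - lam)  # as in MixMatch, keep the mix biased towards the first sample
    mixed_emb = lam * emb_a + (1.0 - lam) * emb_b           # (batch, seq_len, embed_dim)
    mixed_labels = lam * labels_a + (1.0 - lam) * labels_b  # soft labels for the mixed batch
    return mixed_emb, mixed_labels
```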

Sources for other published methods included in the benchmark

PCPS

PCM

NetChop 3.1 (20S and C-term)

SVM

Logistic Regression

NetCleave

PUUPL

Pepsickle
