Repository for "Fine-tuning protein language models boosts predictions across diverse tasks"

This repo contains all data used and generated during this work (Preprint). We also provide notebooks to reproduce our work, including examples.

Embedding contains notebooks to generate embeddings and train embedding-based predictors.
Finetuning contains notebooks to fine-tune all protein language models used in our work.
data contains all data for the figures in the main manuscript.
SOM data contains all data for the figures and tables in the Supplementary Online Material.
training_logs.zip contains the raw training history log files on which our analysis is based.
training data.zip contains all training datasets used for this work. Each dataset consists of a training, validation, and test set. When using these data, please cite and consult the authors of the original datasets. We recommend using the original data sources (linked below), because the data available here will not be kept up to date and mainly serve reproduction purposes.
- GFP and stability: https://github.com/songlab-cal/tape
- AAV, GB1, Meltome and secondary structure: https://github.com/J-SNACKKB/FLIP
- Subcellular Location: https://github.com/HannesStark/protein-localization
- Disorder: https://github.com/DagmarIlz/SETH
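
The internal layout of training data.zip is not described here, so before loading anything it can help to list what the archive contains. The following is a minimal sketch using only the Python standard library. The helper name `list_archive_members` is hypothetical, and the script assumes the archive has been downloaded into the current working directory.

```python
import zipfile
from pathlib import Path


def list_archive_members(archive_path):
    """Return the sorted names of all files stored in a zip archive."""
    with zipfile.ZipFile(archive_path) as archive:
        # Skip directory entries and keep only actual files.
        return sorted(
            info.filename for info in archive.infolist() if not info.is_dir()
        )


if __name__ == "__main__":
    # Assumption: the archive sits in the current working directory.
    archive_path = Path("training data.zip")
    if archive_path.exists():
        for name in list_archive_members(archive_path):
            print(name)
    else:
        print(f"{archive_path} not found; download it from this repository first.")
```

The printed file names should show how the training, validation, and test sets of each dataset are stored. The same helper works for training_logs.zip.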

License

The data in this repository is released under the terms of the CC-BY-4.0 license.

The source code in this repository is licensed under the MIT license, which you can find in the MIT-LICENSE.txt file.
