This repo contains all data used and generated during this work (Preprint). We also provide notebooks to reproduce our work, including examples.
- Embedding contains notebooks to generate embeddings and train embedding-based predictors
- Finetuning contains notebooks to finetune all protein language models used in our work
- data contains all data for figures in the main manuscript
- SOM data contains all data for figures and tables in the Supplementary Online Material
- training_logs.zip contains the raw training history logging files our analysis is based on.
- training data.zip contains all training datasets used for this work. Each dataset consists of a training, validation, and test set. When using these data, please cite and consult the authors of the original datasets. We recommend using the original data sources (linked below), as the data available here will not be kept up to date and mainly serve reproduction purposes.
- GFP and stability: https://github.com/songlab-cal/tape
- AAV, GB1, Meltome and secondary structure: https://github.com/J-SNACKKB/FLIP
- Subcellular Location: https://github.com/HannesStark/protein-localization
- Disorder: https://github.com/DagmarIlz/SETH
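To check which datasets and splits an archive contains before running the notebooks, a short helper like the one below may be useful. This is only a minimal sketch using the Python standard library: the function names `list_archive_members` and `extract_archive` are hypothetical and not part of this repo, and the internal layout of the archives is not assumed here.

```python
import zipfile
from pathlib import Path


def list_archive_members(archive_path):
    """Return the sorted names of all files stored in a zip archive."""
    with zipfile.ZipFile(archive_path) as archive:
        # Skip directory entries; keep only regular files.
        return sorted(
            info.filename for info in archive.infolist() if not info.is_dir()
        )


def extract_archive(archive_path, target_dir):
    """Extract a zip archive into target_dir and return that directory as a Path."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
    return target
```

For example, assuming the archive sits in the current working directory, `list_archive_members("training data.zip")` would list its files, and `extract_archive("training data.zip", "training_data")` would unpack them into a hypothetical `training_data` folder.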
License
The data in this repository is released under the terms of the CC-BY-4.0 license.
The source code in this repository is licensed under the MIT license, which you can find in the MIT-LICENSE.txt file.