Predicting gene expression using millions of yeast promoters reveals cis-regulatory logic
Problem: Let
Data: We use data from the DREAM Challenge, consisting of 7 million random promoter sequences, each paired with its measured yellow fluorescent protein (YFP) expression level. We then evaluate our trained model(s) on the official test set from the challenge.
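As a rough illustration of how a promoter sequence can be turned into numeric model input, here is a minimal one-hot encoding sketch. The encoding scheme, the helper name `one_hot_encode`, and the padding behaviour are assumptions made for illustration; they are not taken from this repository.

```python
# Hypothetical helper: the encoding scheme is an assumption for illustration,
# not the repository's actual preprocessing.
BASES = "ACGT"


def one_hot_encode(sequence, length):
    """Return a `length` x 4 matrix (list of lists) for a DNA sequence.

    The sequence is truncated or right-padded with all-zero rows to `length`.
    Bases outside A/C/G/T (for example N) become all-zero rows.
    """
    rows = []
    for base in sequence.upper()[:length]:
        rows.append([1.0 if base == b else 0.0 for b in BASES])
    while len(rows) < length:
        rows.append([0.0] * len(BASES))
    return rows


# Example: five bases padded to a fixed length of six positions.
encoded = one_hot_encode("ACGTN", length=6)
for row in encoded:
    print(row)
```

Fixing the length up front keeps every input the same shape, which is what a convolutional model typically expects.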
Model: A residual convolutional neural network, strategically optimised using automated hyperparameter tuning.
The figure above shows the structure of the original (large variant) model (16M parameters). A smaller variant with about 90% fewer parameters (1.4M) performs almost as well. Please see the associated manuscript (preprint) for more details.
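To make the idea of a residual convolutional block concrete, the following is a minimal single-channel sketch in plain Python. It only illustrates the general pattern (output = input + activation(convolution(input))); the function names, kernel, and layer arrangement are assumptions and do not reproduce the actual Camformer architecture, which is described in the manuscript.

```python
# Illustrative sketch only: names and layer layout are assumptions,
# not the Camformer implementation.

def conv1d_same(x, kernel):
    """Cross-correlate a 1D signal with an odd-length kernel.

    Zero padding keeps the output the same length as the input.
    """
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")
    half = len(kernel) // 2
    padded = [0.0] * half + list(x) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(x))
    ]


def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]


def residual_block(x, kernel):
    """Add the input back onto the activated convolution output (skip connection)."""
    transformed = relu(conv1d_same(x, kernel))
    return [xi + ti for xi, ti in zip(x, transformed)]


signal = [1.0, 2.0, 3.0]
out = residual_block(signal, kernel=[0.0, 1.0, 0.0])
print(out)
```

With an all-zero kernel the block returns its input unchanged, which is the property that makes deep stacks of such blocks easier to train.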
Assessment: Predictive, comparative
Assessment: Explanatory, Scientific discovery
The purpose of each file is described below:
File | Purpose |
---|---|
gen_figs.ipynb | A notebook to show (re-generate) some figures in the manuscript. |
train_rep.py | Program to train several replicates of a Camformer model using training data. |
score_rep.py | Program to test several replicates of a trained Camformer model on test data. |
Directory | Contents |
---|---|
base | Contains core codebase, utility functions, auxiliary helper files etc. |
manuscript_figures | Contains data, script and figures present in the manuscript. |
readme_figs | Images used to prepare this nice-looking README file. |
analysis | Contains some basic analysis of results. Contents may be updated. |
Relevant resources and previous Camformer repositories.
- Camformer repository (2022 version): DREAM2022 Submission
- DREAM 2022 Challenge Wiki Page
- Rafi et al., 2023: Paper
- Rafi et al., 2023: Official Evaluation