The full training script for Enformer (TensorFlow Sonnet) on TPU clusters, written as part of an effort to migrate the model to PyTorch.
It was pieced together from the DeepMind Enformer repository, the Colab training notebook, and the Basenji sequence augmentation code.
It accounts for:
- distributed TPU training
- distributed datasets
- distributed validation
- gradient clipping
- cross-replica batch normalization
- dataset augmentation
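As a rough illustration of the dataset augmentation item, the sketch below applies a random shift and a random reverse complement to a DNA string, in the spirit of the Basenji augmentation code. It is not the script's implementation: the function names, the `max_shift` default of 3, and the use of `N` as padding are assumptions chosen for the example, and the real pipeline operates on one-hot tensors rather than strings.

```python
import random

# Complement lookup for DNA bases; N (unknown) maps to itself.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def shift_sequence(seq: str, shift: int, pad: str = "N") -> str:
    """Shift a sequence by `shift` positions, padding so the length is unchanged.

    A positive shift moves bases to the right (padding on the left);
    a negative shift moves bases to the left (padding on the right).
    """
    if shift == 0:
        return seq
    if abs(shift) >= len(seq):
        return pad * len(seq)
    if shift > 0:
        return pad * shift + seq[:-shift]
    return seq[-shift:] + pad * (-shift)


def augment(seq, targets, rng, max_shift=3):
    """Apply a random shift, then a reverse complement with probability 0.5.

    `targets` is a list of per-bin values ordered along the sequence. It is
    reversed together with the sequence so bins stay aligned with the bases.
    """
    shift = rng.randint(-max_shift, max_shift)
    seq = shift_sequence(seq, shift)
    if rng.random() < 0.5:
        seq = reverse_complement(seq)
        targets = targets[::-1]
    return seq, targets
```

Passing an explicit `random.Random(seed)` as `rng` keeps the augmentation reproducible, which helps when comparing runs across frameworks.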
Training takes about 3 days on a TPU v3-64.
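Download the human (hg38) and mouse (mm10) reference genomes and the corresponding sequence intervals from the Basenji bucket:
$ gsutil cp gs://basenji_barnyard/hg38.ml.fa.gz ./ && gunzip hg38.ml.fa.gz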
$ gsutil cp gs://basenji_barnyard/mm10.ml.fa.gz ./ && gunzip mm10.ml.fa.gz
$ gsutil cp gs://basenji_barnyard/data/human/sequences.bed ./human-sequences.bed
$ gsutil cp gs://basenji_barnyard/data/mouse/sequences.bed ./mouse-sequences.bed
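One way these files could be used to build longer inputs is to widen each BED interval around its midpoint and read the wider window from the matching genome. The sketch below shows only that arithmetic and is not taken from the training script. It assumes the BED lines are tab-separated as chromosome, start, end, split label, and it assumes a target length of 196,608 bp; both are assumptions to check against the actual data. `Interval`, `resize_interval`, and `fetch_padded` are hypothetical helper names, and `fetch_padded` takes a chromosome already loaded as a plain string, where a real pipeline would use an indexed FASTA reader.

```python
from typing import NamedTuple

# Assumed target input length in bp; adjust to the length actually required.
TARGET_LENGTH = 196_608


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    split: str  # e.g. "train", "valid", "test" (assumed 4th BED column)


def parse_bed_line(line: str) -> Interval:
    """Parse the first four tab-separated fields of a BED line."""
    chrom, start, end, split = line.rstrip("\n").split("\t")[:4]
    return Interval(chrom, int(start), int(end), split)


def resize_interval(iv: Interval, target_length: int = TARGET_LENGTH) -> Interval:
    """Return an interval of `target_length` centered on the original midpoint.

    The result may start below 0 or run past the chromosome end;
    `fetch_padded` handles that by padding.
    """
    mid = (iv.start + iv.end) // 2
    start = mid - target_length // 2
    return iv._replace(start=start, end=start + target_length)


def fetch_padded(chrom_seq: str, start: int, end: int, pad: str = "N") -> str:
    """Slice `chrom_seq[start:end]`, padding out-of-range positions with `pad`."""
    n = len(chrom_seq)
    left = pad * max(0, min(end, 0) - start)
    right = pad * max(0, end - max(start, n))
    return left + chrom_seq[max(start, 0):max(end, 0)] + right
```

For example, a 131,072 bp interval starting at position 1,000,000 would be widened by 32,768 bp on each side under these assumptions.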
- fix the script to handle the difference in sequence length: the Basenji training data uses ~130k bp sequences, versus ~190k bp in the paper (training in progress)
@article {Avsec2021.04.07.438649,
author = {Avsec, {\v Z}iga and Agarwal, Vikram and Visentin, Daniel and Ledsam, Joseph R. and Grabska-Barwinska, Agnieszka and Taylor, Kyle R. and Assael, Yannis and Jumper, John and Kohli, Pushmeet and Kelley, David R.},
title = {Effective gene expression prediction from sequence by integrating long-range interactions},
elocation-id = {2021.04.07.438649},
year = {2021},
doi = {10.1101/2021.04.07.438649},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2021/04/08/2021.04.07.438649},
eprint = {https://www.biorxiv.org/content/early/2021/04/08/2021.04.07.438649.full.pdf},
journal = {bioRxiv}
}