Intro to ML for neurogenomics
We recommend that you become familiar with all of the skill sets below:
Coding:
- While the primary language used within the lab is R, you will need to use Python for ML work.
- You will need to become familiar with deep learning frameworks (PyTorch and possibly TensorFlow/Keras).
- Some important models in the field (e.g. Enformer) were developed using TensorFlow.
- However, it is best to focus on learning PyTorch. Most new ML methods now appear in PyTorch first, as this is where most of the field is working. PyTorch gives more control and is a bit nicer to use (although it has a slightly steeper learning curve).
- Know how to create and efficiently use data loaders, and how to define model architectures.
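As a starting point for the data loader and model architecture item above, here is a minimal PyTorch sketch. The names `one_hot`, `SequenceDataset` and `TinyCNN`, the toy sequences and the target values are all hypothetical and chosen only for illustration; this is not code from any lab project.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset

BASES = "ACGT"


def one_hot(seq: str) -> torch.Tensor:
    """Encode a DNA string as a (4, length) tensor; unknown bases (e.g. N) stay all-zero."""
    out = torch.zeros(len(BASES), len(seq))
    for pos, base in enumerate(seq.upper()):
        idx = BASES.find(base)
        if idx >= 0:
            out[idx, pos] = 1.0
    return out


class SequenceDataset(Dataset):
    """Toy dataset pairing equal-length DNA sequences with one scalar target each."""

    def __init__(self, seqs, targets):
        if len(seqs) != len(targets):
            raise ValueError("seqs and targets must have the same length")
        self.seqs = seqs
        self.targets = targets

    def __len__(self):
        return len(self.seqs)

    def __getitem__(self, i):
        target = torch.tensor(self.targets[i], dtype=torch.float32)
        return one_hot(self.seqs[i]), target


class TinyCNN(nn.Module):
    """One convolution, a global max pool over positions, and a linear head."""

    def __init__(self, n_filters: int = 8, kernel_size: int = 5):
        super().__init__()
        self.conv = nn.Conv1d(len(BASES), n_filters, kernel_size, padding=kernel_size // 2)
        self.head = nn.Linear(n_filters, 1)

    def forward(self, x):
        h = torch.relu(self.conv(x))     # (batch, n_filters, length)
        h = h.max(dim=2).values          # (batch, n_filters)
        return self.head(h).squeeze(-1)  # (batch,)


# Made-up example data: four sequences of length 10 with arbitrary targets.
seqs = ["ACGTACGTAC", "TTTTACGTGG", "GGGCCCAATT", "ACGTNNACGT"]
targets = [0.1, 0.7, 0.3, 0.5]

loader = DataLoader(SequenceDataset(seqs, targets), batch_size=2, shuffle=True)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyCNN().to(device)

for x, y in loader:
    pred = model(x.to(device))  # one prediction per sequence in the batch
    loss = nn.functional.mse_loss(pred, y.to(device))
```

On real data, the `num_workers` and `pin_memory` arguments of `DataLoader` are worth experimenting with, since data loading rather than the GPU is often the bottleneck.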
How to efficiently use GPUs for model training:
- Linux programming skills
- Distributed training with multiple GPUs & physical data location in relation to the GPUs
- Learn to use the GPUs on the Imperial HPC
- Learn to use the GPUs on our private cloud
Model training monitoring:
- Weights & Biases is an important tool for monitoring how your training runs are progressing. It works with both PyTorch and TensorFlow, and you should become familiar with its usage.
Once you have progressed with this:
- Get used to using Enformer Celltyping
- Clone the repo into your personal folder on the HPC, then work through the install instructions, including creating a conda environment. Run the "using Enformer Celltyping" tutorial first in an interactive Jupyter notebook on the HPC, then take part of the code and submit it as a job to the cluster so that you get used to the workflow.
- Try running the BXD code
Learn how to frame the problem and correctly create training, validation and test sets to show appropriate performance and avoid data leakage. Jacob Schreiber has a good paper on this: Navigating the pitfalls of applying machine learning in genomics
The Enformer paper is one of the key papers in the field, important for having shown that attention can be used to extend the range of input data.
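One common way to reduce leakage between splits in genomics is to hold out whole chromosomes, so that nearby or overlapping regions never end up on both sides of a split. The sketch below illustrates that idea only; the function name `split_by_chromosome`, the example regions and the choice of held-out chromosomes are assumptions made for illustration, not a recommendation taken from the paper.

```python
def split_by_chromosome(regions, val_chroms, test_chroms):
    """Assign (chrom, start, end) regions to train/val/test by chromosome.

    Every region on a held-out chromosome goes to that split; everything
    else goes to training, so no chromosome is shared between splits.
    """
    val_chroms, test_chroms = set(val_chroms), set(test_chroms)
    overlap = val_chroms & test_chroms
    if overlap:
        raise ValueError(f"chromosomes in both val and test: {sorted(overlap)}")

    splits = {"train": [], "val": [], "test": []}
    for region in regions:
        chrom = region[0]
        if chrom in test_chroms:
            splits["test"].append(region)
        elif chrom in val_chroms:
            splits["val"].append(region)
        else:
            splits["train"].append(region)
    return splits


def chromosomes(regions):
    """Return the set of chromosomes present in a list of regions."""
    return {region[0] for region in regions}


# Made-up example regions; the held-out chromosomes are an arbitrary choice.
regions = [
    ("chr1", 0, 1000),
    ("chr1", 1000, 2000),
    ("chr8", 0, 1000),
    ("chr9", 500, 1500),
    ("chr2", 0, 1000),
]
splits = split_by_chromosome(regions, val_chroms={"chr8"}, test_chroms={"chr9"})

# Sanity check: no chromosome appears in more than one split.
assert not chromosomes(splits["train"]) & chromosomes(splits["val"])
assert not chromosomes(splits["train"]) & chromosomes(splits["test"])
assert not chromosomes(splits["val"]) & chromosomes(splits["test"])
```

Splitting by chromosome does not remove every source of leakage (for example, homologous sequence on different chromosomes, or the same cell types appearing in both training and test data), so the held-out sets should still be designed around the question being asked.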
The Avocado model can be used to impute missing genomic tracks from the ENCODE project: this paper first introduced the model, while this paper showed how it can be extended.
The Kipoi repository collects trained models in genomics
It’s worth reading this recent review paper but note Anshul Kundaje’s comments on it here. An older but well written review is here: https://www.nature.com/articles/s41576-019-0122-6
This is a brief set of notes thrown together but some of those who work in the area include:
We have access to three main sets of GPUs:
- Our private cloud with 1x A100 GPU and an NVIDIA RTX 2080 Ti
- The Imperial HPC
- Payam's cluster of 20 A100 GPUs and over 3000 CPU cores
Our recommended approach is:
- Test changes and small runs on the private cluster GPU so that you get instant results
- If multiple GPUs are needed, switch to submitting the 'full' job on the HPC.
- The caveat is that the HPC has a fairly short time limit (72 hours), so if you need to train for longer, use the GPU on the private cluster
While the 20 GPU cluster is powerful, it runs on a Kubernetes system and currently requires that you use Jupyter notebooks, which have quite a large overhead compared with running a Python file. Importantly, the system also resets every night, so the longest you could run for is 24 hours. It may be worth revisiting, though, and we encourage you to check whether this is still the case.