
Intro to ML for neurogenomics

General machine learning skills

We recommend that you get familiar with all the below skill sets:

Coding:

  • While the primary language used within the lab is R, for ML work you will need to use Python
  • You will need to become familiar with deep learning frameworks (PyTorch and possibly TensorFlow/Keras)
    • Some important models in the field (e.g. Enformer) were developed using TensorFlow
    • However, it is best to focus on learning PyTorch. Most new ML methods now appear in PyTorch first, as this is where most of the field is working. PyTorch gives more control and is a bit nicer to use (although it has a slightly steeper learning curve).
  • Know how to create and efficiently use data loaders and model architectures (a minimal sketch follows this list)
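
A minimal sketch of the data loader and model pattern in PyTorch is shown below. The dataset, class names and the toy one-hot DNA encoding task are all made up for illustration:

```python
# Minimal sketch of a PyTorch Dataset, DataLoader and small model.
# All class/variable names and the toy task are illustrative only.
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

class OneHotSequenceDataset(Dataset):
    """Toy dataset: one-hot encoded DNA sequences with scalar labels."""
    def __init__(self, sequences, labels):
        self.sequences = sequences  # list of strings, e.g. "ACGTACGTAC"
        self.labels = labels        # list of floats
        self.alphabet = {"A": 0, "C": 1, "G": 2, "T": 3}

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        seq = self.sequences[idx]
        x = torch.zeros(4, len(seq))
        for i, base in enumerate(seq):
            x[self.alphabet[base], i] = 1.0
        return x, torch.tensor(self.labels[idx], dtype=torch.float32)

# A very small convolutional model over the one-hot sequence
model = nn.Sequential(
    nn.Conv1d(4, 16, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(16, 1),
)

dataset = OneHotSequenceDataset(["ACGTACGTAC"] * 8, [0.5] * 8)
# num_workers > 0 loads batches in background processes; pin_memory speeds up
# host-to-GPU transfer (wrap in `if __name__ == "__main__":` on Windows/macOS)
loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=2, pin_memory=True)

for x, y in loader:
    pred = model(x).squeeze(-1)
    loss = nn.functional.mse_loss(pred, y)
```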

How to efficiently use GPUs for model training:

  • Linux programming skills
  • Distributed training with multiple GPUs & physical data location in relation to the GPUs (a minimal multi-GPU sketch follows this list)
  • Learn to use the GPUs on the Imperial HPC
  • Learn to use the GPUs on our private cloud
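
For distributed training across several GPUs, the standard PyTorch approach is DistributedDataParallel launched with torchrun. The sketch below uses a placeholder model and random data, and the script name in the launch command is hypothetical:

```python
# Minimal DistributedDataParallel (DDP) sketch; launch with e.g.:
#   torchrun --nproc_per_node=<num_gpus> train_ddp.py
# The model, dataset and script name are placeholders.
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # torchrun sets the LOCAL_RANK/RANK/WORLD_SIZE environment variables
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy data and model; replace with your own
    dataset = TensorDataset(torch.randn(1024, 100), torch.randn(1024, 1))
    sampler = DistributedSampler(dataset)  # each GPU sees a different shard
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    model = nn.Linear(100, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for epoch in range(5):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            loss = nn.functional.mse_loss(model(x), y)
            optimizer.zero_grad()
            loss.backward()  # gradients are averaged across GPUs here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The "physical data location" point above is about keeping training data on fast storage close to the GPUs (e.g. local scratch rather than a networked home directory), otherwise data loading becomes the bottleneck regardless of how many GPUs you use.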

Model training monitoring:

  • Weights & Biases (wandb) is an important tool for monitoring how your training runs are progressing. It works with both PyTorch and TensorFlow and you should get familiar with its usage (a minimal example follows below).
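
A minimal sketch of logging a run to Weights & Biases; the project name and metric values are placeholders, and you need an account plus `wandb login` before this will run:

```python
# Minimal Weights & Biases logging sketch; project name and metrics are placeholders.
# Requires `pip install wandb` and `wandb login` beforehand.
import wandb

run = wandb.init(project="neurogenomics-demo", config={"lr": 1e-3, "batch_size": 32})

for epoch in range(10):
    train_loss = 1.0 / (epoch + 1)  # stand-in for your real training loss
    val_loss = 1.2 / (epoch + 1)    # stand-in for your real validation loss
    wandb.log({"epoch": epoch, "train/loss": train_loss, "val/loss": val_loss})

run.finish()
```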

Once you have progressed with these general skills, move on to the genomics-specific material below.

Machine learning for genomics

How to pose the problem and correctly create training, validation and test sets to show appropriate performance and avoid data leakage (a sketch of a chromosome-held-out split is shown below). Jacob Schreiber has a good paper on this: Navigating the pitfalls of applying machine learning in genomics
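
A common way to avoid leakage in genomics is to split by chromosome rather than by random example, so that nearby or overlapping regions never end up in both training and test data. A minimal sketch, with the chromosome assignments below chosen arbitrarily for illustration:

```python
# Sketch of a chromosome-held-out train/validation/test split to avoid data leakage.
# The chromosome assignments are arbitrary examples, not a recommendation.
VALIDATION_CHROMS = {"chr8", "chr9"}
TEST_CHROMS = {"chr10", "chr11"}

def assign_split(chrom: str) -> str:
    """Assign a genomic example to a split based on its chromosome."""
    if chrom in TEST_CHROMS:
        return "test"
    if chrom in VALIDATION_CHROMS:
        return "validation"
    return "train"

# Examples would normally come from a BED-like file: (chrom, start, end, label)
examples = [("chr1", 100, 600, 1), ("chr8", 200, 700, 0), ("chr10", 50, 550, 1)]
splits = {"train": [], "validation": [], "test": []}
for chrom, start, end, label in examples:
    splits[assign_split(chrom)].append((chrom, start, end, label))

print({name: len(rows) for name, rows in splits.items()})
```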

The Enformer model is one of the key papers in the field, important for having shown that attention can be used to extend the range of input sequence.
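
A pretrained Enformer is published on TensorFlow Hub, which is one reason TensorFlow is still worth knowing. The sketch below follows the published usage example, but the hub path, input length (~393,216 bp, one-hot encoded) and output format are from memory, so verify them against the official enformer-usage notebook:

```python
# Hedged sketch of running the pretrained Enformer model from TensorFlow Hub.
# The hub path, input length and output keys are assumptions; verify against
# the official enformer-usage notebook before relying on them.
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

enformer = hub.load("https://tfhub.dev/deepmind/enformer/1").model

# Enformer expects a one-hot encoded sequence of ~393,216 bp: shape (batch, length, 4)
sequence = np.zeros((1, 393_216, 4), dtype=np.float32)
sequence[..., 0] = 1.0  # a dummy all-"A" sequence, just to have a valid input

predictions = enformer.predict_on_batch(tf.constant(sequence))
# predictions should be a dict of track predictions, e.g. predictions["human"]
print({k: v.shape for k, v in predictions.items()})
```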

The Avocado model can be used to impute missing genomic tracks from the ENCODE project: this paper first introduced the model, while this paper showed how it can be extended.

The Kipoi repository collects trained models in genomics.
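
Kipoi models can be pulled down and run directly from Python. A minimal sketch, where "Basset" is just one example model name and the exact API should be checked against the Kipoi documentation:

```python
# Hedged sketch of loading a trained model from the Kipoi model zoo.
# "Basset" is just an example; check https://kipoi.org for available models
# and the inputs each one expects.
import kipoi

model = kipoi.get_model("Basset")  # downloads the model weights on first use
print(model.info)                  # model description and authors

# Many models bundle example data; the output format depends on the model
predictions = model.pipeline.predict_example()
```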

It’s worth reading this recent review paper, but note Anshul Kundaje’s comments on it here. An older but well-written review is here: https://www.nature.com/articles/s41576-019-0122-6

Labs in the field

This is a brief set of notes thrown together, but some of those who work in the area include:

Access to GPUs

We have access to three main sets of GPUs:

  • Our private cloud with one NVIDIA A100 GPU and an NVIDIA RTX 2080 Ti
  • The Imperial HPC
  • Payam's cluster of 20 A100 GPUs and over 3000 CPU cores

Our recommended approach is:

  • Test changes and small runs on the private cloud GPUs so that you get instant results
  • If multiple GPUs are needed, switch to submitting the 'full' job on the HPC
  • The caveat is that the HPC has a fairly short time limit (72 hours), so if you need to train for longer, use the GPU on the private cloud

While the 20-GPU cluster is powerful, it runs on a Kubernetes system and currently requires that you use Jupyter notebooks, which adds quite a large overhead compared with running a Python script. Importantly, the system also resets every night, so the maximum you can run for is 24 hours. It is still worth keeping an eye on, and we encourage you to check whether these limitations still apply.
