This repository has been archived by the owner on Sep 25, 2024. It is now read-only.

cody-mar10/esm_embed


esm_embed

Generates ESM2 protein embeddings. The ESM team has since provided a utility script for this (esm-extract), which was not available when we created this repository.

Note: We have integrated this into the overall Protein Set Transformer (PST) workflow, which provides an end-to-end pipeline from protein FASTA files to protein embeddings to genome embeddings. This repository is archived to preserve the history of the PST paper for reproducibility.

Installation

Without GPUs

# in my experience, conda always handles pytorch installation better than pip
mamba create -n esm -c pytorch -c conda-forge 'pytorch>=2.0' 'python<3.12' cpuonly

mamba activate esm

pip install git+https://github.com/cody-mar10/esm_embed.git

With GPUs

# in my experience, conda always handles pytorch installation better than pip
mamba create -n esm -c pytorch -c nvidia -c conda-forge 'pytorch>=2.0' 'python<3.12' pytorch-cuda=11.8

mamba activate esm

pip install git+https://github.com/cody-mar10/esm_embed.git

Installing this repository will create an executable called esm-embed to use for embedding protein FASTA files with ESM2 models.

Usage

The bare minimum arguments to the esm-embed executable are:

esm-embed \
    --input FASTAFILE \
    --outdir OUTDIR \
    --esm ESM_MODEL

These specify the location of the input FASTA file, the output directory, and the ESM2 model to use. See the -h help page for the allowed values. Information about each model can be found in the ESM repository.

Other arguments are for controlling the computational resources used:

  • --devices: number of GPUs or CPU threads
  • --accelerator: GPU or CPU; autodetected by default
  • --precision: floating point precision of the output embeddings. 64-bit precision is not recommended because of the storage and memory it requires.
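To make the storage cost of each precision concrete, here is a rough back-of-the-envelope sketch. It assumes the output is a single array with one fixed-width vector per protein, and the protein count and embedding width (1280, the width of the 650M-parameter ESM2 model) are illustrative numbers, not values taken from this repository. The function name is hypothetical.

```python
# Bytes used by one value at each floating point width.
BYTES_PER_VALUE = {16: 2, 32: 4, 64: 8}


def embedding_storage_bytes(n_proteins: int, embed_dim: int, precision_bits: int) -> int:
    """Uncompressed size in bytes of an (n_proteins, embed_dim) array.

    Assumes one embedding vector per protein (an assumption about the output layout).
    """
    return n_proteins * embed_dim * BYTES_PER_VALUE[precision_bits]


# Illustrative numbers: 1 million proteins, 1280-dimensional embeddings.
for bits in (16, 32, 64):
    size_gb = embedding_storage_bytes(1_000_000, 1280, bits) / 1e9
    print(f"{bits}-bit: {size_gb:.2f} GB")
```

Under these assumptions, going from 32-bit to 64-bit doubles the uncompressed size from about 5 GB to about 10 GB.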

You can also specify where the ESM model will be downloaded (or where it was downloaded) using the --torch-hub argument.
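If you want to launch esm-embed from a Python script, for example over many FASTA files, the command line can be assembled as in the sketch below. The helper `build_esm_embed_command` is hypothetical and not part of this repository. It uses only flags named in this README, and flags whose allowed values should be checked with -h (such as --accelerator and --precision) are passed through unchanged via `extra_args`.

```python
import shlex
from typing import List, Optional, Sequence


def build_esm_embed_command(
    input_fasta: str,
    outdir: str,
    esm_model: str,
    devices: Optional[int] = None,
    torch_hub: Optional[str] = None,
    extra_args: Sequence[str] = (),
) -> List[str]:
    """Assemble an esm-embed argument list (hypothetical helper)."""
    # The three required arguments.
    cmd = ["esm-embed", "--input", input_fasta, "--outdir", outdir, "--esm", esm_model]
    # Optional arguments are added only when given, so the tool's defaults apply otherwise.
    if devices is not None:
        cmd += ["--devices", str(devices)]
    if torch_hub is not None:
        cmd += ["--torch-hub", torch_hub]
    # Anything else is passed through as-is; check `esm-embed -h` for valid values.
    cmd += list(extra_args)
    return cmd


cmd = build_esm_embed_command(
    "test/test.faa", "test/test_output", "esm2_t6_8M", torch_hub="test"
)
print(shlex.join(cmd))

# To actually run it (requires esm-embed to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.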

Output format

The output of esm-embed is an .h5 file with a field named data that stores the embedding for each protein in the input FASTA file, IN THE SAME ORDER as the FASTA file.

If you would like to install additional libraries to work with .h5 files, you can also install this repository using:

pip install "esm_embed[h5] @ git+https://github.com/cody-mar10/esm_embed.git"

which will install the pytables package.
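Because the embeddings are stored in FASTA order, the protein names have to be recovered from the FASTA file itself. The following is a minimal sketch of how that pairing might look with pytables. It assumes that data is an array node at the root of the .h5 file with one row per protein. The function names and the file paths in the usage comment are hypothetical.

```python
from typing import Dict, Iterable, List, Sequence


def fasta_names(lines: Iterable[str]) -> List[str]:
    """Return sequence names (first word of each '>' header) in file order."""
    names = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            names.append(fields[0] if fields else "")
    return names


def load_embeddings(h5_path: str):
    """Read the whole `data` field into memory as a NumPy array.

    Assumes `data` is an array node at the root of the file.
    """
    import tables  # pytables, installed by the [h5] extra

    with tables.open_file(h5_path, mode="r") as fp:
        return fp.root.data[:]


def pair_names_with_rows(names: Sequence[str], embeddings: Sequence) -> Dict[str, object]:
    """Map each protein name to its embedding row, relying on the shared order."""
    if len(names) != len(embeddings):
        raise ValueError(
            f"{len(names)} FASTA records but {len(embeddings)} embedding rows"
        )
    return dict(zip(names, embeddings))


# Usage (hypothetical paths):
# with open("proteins.faa") as fp:
#     names = fasta_names(fp)
# lookup = pair_names_with_rows(names, load_embeddings("proteins.h5"))
```

The length check is a cheap guard against pairing an embedding file with the wrong FASTA file. Note that duplicate protein names would collapse into a single dictionary entry.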

Test run

We have provided a 10-sequence protein FASTA file for a test run:

esm-embed --input test/test.faa --outdir test/test_output --esm esm2_t6_8M --torch-hub test

The expected output embeddings are provided so that you can compare them with your results.
