esm_embed

Generates ESM2 protein embeddings. The ESM team now provides a utility script for this (esm-extract), which was not available when we created this repository.

Note: We have integrated this into the overall Protein Set Transformer workflow, which provides an end-to-end pipeline that converts protein FASTA files to protein embeddings and then to genome embeddings. This repository is archived to keep the history of the PST paper reproducible.

Installation

Without GPUs

# in my experience, conda always handles pytorch installation better than pip
mamba create -n esm -c pytorch -c conda-forge 'pytorch>=2.0' 'python<3.12' cpuonly

mamba activate esm

pip install git+https://github.com/cody-mar10/esm_embed.git

With GPUs

# in my experience, conda always handles pytorch installation better than pip
mamba create -n esm -c pytorch -c nvidia -c conda-forge 'pytorch>=2.0' 'python<3.12' pytorch-cuda=11.8

mamba activate esm

pip install git+https://github.com/cody-mar10/esm_embed.git

Installing this repository will create an executable called esm-embed to use for embedding protein FASTA files with ESM2 models.

Usage

The bare minimum arguments to the esm-embed executable are:

esm-embed \
    --input FASTAFILE \
    --outdir OUTDIR \
    --esm ESM_MODEL

These specify the location of the input FASTA file, the output directory, and which ESM2 model to use. See the -h help page for the allowed arguments. Information about each model can be found in the ESM repository.

Other arguments are for controlling the computational resources used:

--devices number of GPUs or CPU threads
--accelerator GPU or CPU, defaults to autodetecting
--precision floating point precision for output embeddings. It is not recommended to use 64-bit due to the storage and memory required.

You can also specify where the ESM model will be downloaded (or where it was downloaded) using the --torch-hub argument.

Output format

The output of esm-embed is a .h5 file with a field named data, which stores the embedding for each protein in the input FASTA file IN THE SAME ORDER as the FASTA file.

If you would like to install additional libraries to work with .h5 files, you can also install this repository using:

pip install "esm_embed[h5] @ git+https://github.com/cody-mar10/esm_embed.git"

which will install the pytables package.
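Because the embeddings are stored in FASTA order, row i of data belongs to the i-th sequence in the input file. The sketch below shows one way to recover that mapping. The helper names read_fasta_names and load_embeddings are hypothetical and not part of esm_embed, and the code assumes that data is an array node at the root of the .h5 file, which you should confirm against your own output.

```python
def read_fasta_names(fasta_path):
    """Return sequence names (first token of each header) in file order."""
    names = []
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):
                parts = line[1:].split()
                # Keep a placeholder for empty headers so row indices stay aligned.
                names.append(parts[0] if parts else "")
    return names


def load_embeddings(h5_path, field="data"):
    """Read the embedding matrix from the .h5 file using pytables.

    Assumption: `field` is an array node stored at the root of the file.
    """
    import tables  # imported lazily; installed by the [h5] extra

    with tables.open_file(h5_path, mode="r") as fp:
        return fp.get_node("/", field).read()
```

To use it, call read_fasta_names on the input FASTA file and load_embeddings on the .h5 file that esm-embed wrote to OUTDIR, then pair them with zip(names, embeddings). The number of names should equal the number of rows; a mismatch suggests the wrong FASTA file or a different file layout than assumed here.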

Test run

We have provided a 10-sequence protein FASTA file for a test run:

esm-embed --input test/test.faa --outdir test/test_output --esm esm2_t6_8M --torch-hub test

The expected output embeddings are provided for comparison.
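When comparing your test output against the provided embeddings, an exact equality check may fail, since floating-point results can differ slightly across hardware and precision settings. A minimal sketch of a tolerance-based comparison follows. The function name embeddings_match and the default tolerances are assumptions chosen for illustration, not values defined by esm_embed.

```python
import math


def embeddings_match(expected, observed, rel_tol=1e-4, abs_tol=1e-6):
    """Return True if two row-aligned embedding matrices agree within tolerance.

    Accepts any nested sequences of numbers (lists of lists, or 2-D arrays).
    """
    if len(expected) != len(observed):
        return False  # different number of proteins
    for row_e, row_o in zip(expected, observed):
        if len(row_e) != len(row_o):
            return False  # different embedding dimension
        for e, o in zip(row_e, row_o):
            if not math.isclose(e, o, rel_tol=rel_tol, abs_tol=abs_tol):
                return False
    return True
```

Both matrices must list proteins in the same order, which holds here because both come from the same FASTA file.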
