This repository contains the codebase and results for GOLF:
- Multiple Sequence Alignment (MSA) generation from protein sequence data.
- Phylogenetic analysis to understand evolutionary relationships.
- Training and fine-tuning protein language models:
  - Evolutionary model of Variant Effect (EVE) ensemble.
  - ESM-1b model.
- Interpreting variant effects using Sparse Autoencoders (SAEs) on ESM2 embeddings.
The overall process is detailed below:
- Sequence Collection: Homologous sequences are collected using `jackhmmer` against the UniRef50 database.
- MSA Creation: After removing duplicate sequences, `MMSeqs2` (specifically the `easy-cluster` command) is employed to generate a Multiple Sequence Alignment (MSA) with a sequence identity threshold of 95% and coverage of 80%.
- This MSA serves as the primary input for training the EVE models and fine-tuning the ESM-1b model.
- A phylogenetic tree is constructed to analyze the evolutionary context of the sequences, using an MSA clustered at 80% sequence identity.
- The `phylo/` directory houses all relevant scripts and data for this stage.
- Tree Construction: IQ-TREE is used for building the phylogenetic tree. The specific command can be found in `phylo/README.md`.
  - Input: `phylo/MSA_80cluster.fas`
  - Output: `phylo/MSA_80cluster.treefile` (raw tree), `phylo/MSA_80cluster_cleaned.treefile` (processed tree).
- Processing and Visualization: The script `phylo/clean_tree.py` standardizes leaf names in the tree and generates annotation files (e.g., `phylo/dataset_colorstrip.txt`) for enhanced visualization with iTOL (Interactive Tree Of Life).
- The Evolutionary model of Variant Effect (EVE) is utilized to predict variant effects.
- Model Training:
  - The core Variational Autoencoder (VAE) for EVE models is trained using a script analogous to `train_VAE.py`, taking the MSA as input.
  - Evolutionary indices, a key output from EVE models indicating variant impact, are computed from the VAE's latent space using a script like `compute_evol_indices.py`.
- Ensemble Creation: An ensemble of EVE models is created by training multiple models, typically with different random seeds (see the seed-loop sketch after this list). These models, along with their configurations (`model_params.json`), checkpoints, logs, and computed `evol_indices`, are stored in subdirectories within `EVE Ensemble/` (e.g., `OLF-40_seed100_seed100_theta0.25_ld40_lr0.0001/`).
- Ensemble Performance Assessment: The `Ensemble Analysis/` directory contains scripts for evaluating the EVE ensemble:
  - `ensemble_evol_indices_analysis.py`: Aggregates `evol_indices` from the individual models in the ensemble, then applies a Gaussian Mixture Model (GMM) analysis via `train_GMM_and_compute_EVE_scores.py` to convert evolutionary indices into pathogenicity scores, and calculates prediction accuracy against a ground-truth set of mutations.
  - `visualize_ensemble_results.py`: Generates plots illustrating the ensemble's accuracy, its improvement over individual models, and other performance metrics.
  - Analysis results, including plots and summary CSV files, are saved in this directory.
- Relevant data sources for running the aforementioned scripts are found in `EVE Data/`.
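As a hedged illustration of ensemble creation, members could be trained in a simple loop over seeds. The `--seed` flag below is an assumption (the `--msa_file`/`--output_dir` flags mirror the usage shown later in this README); check `train_VAE.py --help` for the actual interface:

```bash
# Sketch only: train one EVE ensemble member per random seed.
# --seed is an assumed flag; consult train_VAE.py --help for the real interface.
for seed in 100 200 300; do
  python train_VAE.py \
      --msa_file data/your_msa.a3m \
      --output_dir "EVE Ensemble/OLF-40_seed${seed}" \
      --seed "${seed}"
done
```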
- The ESM-1b protein language model is fine-tuned on the OLF MSA to adapt it for variant effect prediction.
- Fine-tuning Process: The script `fine_tune_esm1b.py` manages this process. It loads the MSA, freezes the initial layers of the pre-trained ESM-1b model, and fine-tunes the subsequent layers.
- Outputs: The fine-tuned model checkpoints (e.g., `esm1b_finetuned.pt`, `best_model.pt`), training logs, and related plots are stored in the `ESM/ESM1b/` directory.
- To understand the features learned by large protein models and how they relate to variant effects, a Sparse Autoencoder (SAE) is applied to ESM2 embeddings.
- The `SAE/` directory contains all scripts, configuration files, and detailed instructions for this analysis (see `SAE/README.md`).
- Probing SAE Latents:
  - `probe_sae.py`: Computes SAE activations from mean-pooled ESM2 embeddings (specifically from layer 24) for a set of variants, then trains a linear regression model (Ridge regression) to predict EVE scores from these SAE activations. The weights of this linear model indicate which SAE latent dimensions are most predictive of pathogenicity.
- Visualization:
  - `visualize_sae.py`: Identifies the top SAE latent dimensions (both pathogenic- and benign-associated) based on the probe weights. It generates:
    - A PyMOL script (`highlight_units_layer24.pml`) to visualize these latents and their associated residues on the protein structure.
    - A text file (`highlighted_latents_layer24.txt`) summarizing residue-level associations for the top latents.
- Input data for this step typically includes a list of mutated sequences and their corresponding EVE scores (e.g., `SAE/mutated_sequences_with_scores.csv`); a hedged peek at a possible layout follows this list.
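The column names and values below are purely illustrative assumptions; consult `SAE/README.md` for the actual schema:

```bash
# Hypothetical peek at the probe input (illustrative columns/values only)
head -n 3 SAE/mutated_sequences_with_scores.csv
# mutation,sequence,EVE_score
# A25T,MGSN...T...K,0.87
# G112R,MGSN...R...K,0.12
```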
Key directories and files in this repository:
- `README.md`: This file.
- `fine_tune_esm1b.py`: Script for fine-tuning the ESM-1b model.
- `train_VAE.py`: Script for training the VAE component of EVE models.
- `compute_evol_indices.py`: Script to compute evolutionary indices from trained EVE models.
- `phylo/`: Contains scripts, data, and README for phylogenetic analysis.
  - `clean_tree.py`: Processes phylogenetic trees and generates iTOL annotations.
- `EVE Ensemble/`: Stores trained EVE models from multiple runs/seeds.
  - Each subdirectory contains model parameters, checkpoints, logs, and evolutionary indices.
- `Ensemble Analysis/`: Scripts and results for EVE ensemble performance analysis.
  - `ensemble_evol_indices_analysis.py`: Core script for ensemble evaluation.
  - `visualize_ensemble_results.py`: Generates plots for ensemble performance.
- `ESM/`: Contains fine-tuned ESM model artifacts.
  - `ESM1b/`: Fine-tuned ESM-1b model checkpoints, logs, and plots.
- `SAE/`: Scripts, data, and detailed README for Sparse Autoencoder analysis.
  - `sae.yml`: Conda environment definition for SAE tasks.
  - `probe_sae.py`: Script for training a linear probe on SAE embeddings.
  - `visualize_sae.py`: Script for visualizing predictive SAE latents.
- `utils/`: Helper functions for EVE.
- `data/`: Data used to construct the MSA.
- `examples/`: Example script runs for EVE.
General setup guidelines. For specific modules like SAE, refer to their dedicated README files (e.g., `SAE/README.md`).
- Core Tools:
  - Ensure `jackhmmer` (from the HMMER suite) and `MMSeqs2` are installed and available on your system's PATH (a quick verification sketch follows this list).
- Phylogenetic Analysis:
  - IQ-TREE: Required for phylogenetic tree construction. Download from the IQ-TREE website.
  - Python: A Python environment with `pandas` is needed for `phylo/clean_tree.py`.
- EVE Model Training & Analysis:
  - The EVE framework relies on Python with standard scientific libraries (NumPy, Pandas) and PyTorch. Ensure these are installed.
  - Refer to the EVE documentation for specific version requirements if available.
- ESM-1b Fine-tuning:
  - Python Environment: Requires PyTorch and the `esm` library by Facebook Research: `pip install torch fair-esm matplotlib pandas tqdm`
  - Hardware: A GPU is highly recommended for efficient fine-tuning.
- SAE Analysis:
  - A dedicated Conda environment is specified in `SAE/sae.yml`. Create and activate it with `conda env create -f SAE/sae.yml`, then `conda activate sae`.
  - Follow the instructions in `SAE/README.md` to clone the `InterProt` repository and download the necessary ESM2 and SAE models, placing them in `SAE/models/`.
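Once everything is installed, a quick, hedged sanity check (binary names vary by install, e.g. IQ-TREE ships as `iqtree` or `iqtree2` depending on version):

```bash
# Verify the command-line tools are on PATH
jackhmmer -h | head -n 3    # HMMER suite
mmseqs version              # MMSeqs2
iqtree2 --version           # IQ-TREE (binary may be named 'iqtree' instead)

# Verify the Python stack used for EVE training and ESM-1b fine-tuning
python -c "import torch, esm, numpy, pandas; print('Python deps OK')"
```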
This section provides guidance on executing the different stages of the analysis pipeline.
- Use `jackhmmer` to search against UniRef50 and gather sequences.
- Process the output to remove duplicates.
- Use `MMSeqs2 easy-cluster` (with 95% identity, 80% coverage) to generate the MSA. This MSA (e.g., `your_msa.a3m` or `your_msa.fasta`) will be used in subsequent steps.
- Prepare a version of the MSA for phylogenetic analysis (e.g., clustered at 80% identity, `phylo/MSA_80cluster.fas`).
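A minimal sketch of the collection and clustering steps, assuming a query FASTA and a local UniRef50 FASTA; the file names and the jackhmmer iteration count are placeholders, while the clustering thresholds mirror the settings above:

```bash
# 1. Collect homologs with jackhmmer; -A writes the hit alignment (Stockholm format)
jackhmmer -N 5 -A homologs.sto query.fasta uniref50.fasta

# 2. After deduplicating the hits (step elided here), cluster at 95% identity / 80% coverage
mmseqs easy-cluster dedup_homologs.fasta cluster_out tmp_dir --min-seq-id 0.95 -c 0.8
```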
- Navigate to the `phylo/` directory.
- Ensure your MSA for phylogeny (e.g., `MSA_80cluster.fas`) is present.
- Run IQ-TREE using the command specified in `phylo/README.md`.
- Execute `python clean_tree.py` to process the output tree and generate annotation files.
- Upload the cleaned treefile (e.g., `MSA_80cluster_cleaned.treefile`) and annotation files (e.g., `dataset_colorstrip.txt`) to iTOL for visualization.
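The authoritative IQ-TREE command is the one in `phylo/README.md`; purely as a hedged illustration, a typical run plus the cleanup step might look like:

```bash
cd phylo/
# Illustrative only: ModelFinder model selection with ultrafast bootstrap
iqtree2 -s MSA_80cluster.fas -m MFP -B 1000
# Standardize leaf names and emit the iTOL annotation files
python clean_tree.py
```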
- Training an EVE Model:

  ```bash
  # Train the VAE component
  python train_VAE.py --msa_file path/to/your_msa.a3m \
      --output_dir path/to/your_eve_model_dir \
      ...

  # Compute evolutionary indices
  python compute_evol_indices.py --model_checkpoint path/to/your_eve_model_dir/checkpoints/best_model.pt \
      --msa_file path/to/your_msa.a3m \
      --output_file path/to/your_eve_model_dir/evol_indices/evol_indices.csv \
      ...
  ```

  Repeat with different seeds/configurations for ensemble members, storing outputs in `EVE Ensemble/`.
. -
Ensemble Analysis:
- Navigate to
Ensemble Analysis/
. - Ensure paths to individual model
evol_indices
files (withinEVE Ensemble/
) are correctly referenced or configured withinensemble_evol_indices_analysis.py
. - Provide ground truth mutation data if required by the script.
- Run the analysis:
python ensemble_evol_indices_analysis.py
- Generate plots using the output from the analysis:
python visualize_ensemble_results.py --results_file path/to/ensemble_results.csv
- Navigate to
- Prepare your MSA file (e.g., `your_msa.a3m` or `your_msa.fasta`).
- Run the fine-tuning script:

  ```bash
  python fine_tune_esm1b.py --msa_file path/to/your_msa.a3m_or_fasta \
      --output_dir ESM/ESM1b/my_finetuned_model \
      --model_name facebook/esm1b_t33_650M_UR50S \
      --epochs 5 \
      --batch_size 1 \
      --learning_rate 1e-5 \
      --num_frozen_layers 30 ...
  ```

  The fine-tuned model and logs will be saved in the specified output directory.
- Activate the `sae` conda environment: `conda activate sae`.
- Ensure the ESM2 and SAE models are downloaded and correctly placed in `SAE/models/` as per `SAE/README.md`.
- Navigate to the `SAE/` directory.
- Prepare your input data: `mutated_sequences_with_scores.csv` (containing sequence variants and their EVE scores or other pathogenicity labels).
- Run the SAE probing script: `python probe_sae.py`. This will generate files like `results/weights/sae_raw_layer24.csv`.
- Run the visualization script: `python visualize_sae.py`. This generates `highlight_units_layer24.pml` for PyMOL and `highlighted_latents_layer24.txt`.
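To apply the highlighting in PyMOL, the generated `.pml` script can be passed on the command line after a structure file; the structure name below is a placeholder:

```bash
# Load a structure (placeholder file name), then run the generated highlighting script
pymol your_structure.pdb highlight_units_layer24.pml
```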