Abstract
Cancer is a heterogeneous disease requiring costly genetic profiling for better understanding and management. Recent advances in deep learning have enabled cost-effective predictions of genetic alterations from whole slide images (WSIs). While transformers have driven significant progress in non-medical domains, their application to WSIs lags behind due to high model complexity and limited dataset sizes. Here, we introduce SEQUOIA, a linearized transformer model that predicts cancer transcriptomic profiles from WSIs. SEQUOIA is developed using 7,584 tumor samples across 16 cancer types, with its generalization capacity validated on two independent cohorts comprising 1,368 tumors. Accurately predicted genes are associated with key cancer processes, including inflammatory response, cell cycles and metabolism. Further, we demonstrate the value of SEQUOIA in stratifying the risk of breast cancer recurrence and in resolving spatial gene expression at loco-regional levels. SEQUOIA hence deciphers clinically relevant information from WSIs, opening avenues for personalized cancer management.
Overview
scripts
: example bash (driver) scripts to run the pre-processing, training and evaluation.examples
: example input files.pre-processing
: pre-processing scripts.evaluation
: evaluation scripts and output gene list ordered by index.spatial_vis
: scripts for generating spatial predictions of gene expression values.src
: main files for models and training.
Software dependencies and versions are listed in requirements.txt
First, clone this git repository: git clone https://github.com/gevaertlab/sequoia-pub.git
Then, create a conda environment: conda create -n sequoia python=3.9
and activate: conda activate sequoia
Install the openslide library: conda install -c conda-forge openslide==4.0.0
Install the required package dependencies: pip install -r requirements.txt
Finally, install Openslide (>v3.4.0)
Expected installation time in normal Linux environment: 15 mins
Scripts for pre-processing are located in the pre-processing
folder. All computational processes requires a reference.csv file, which has one row per WSI and their corresponding gene expression values. The RNA columns are named with the following format 'rna_{GENENAME}'. An optional 'tcga_project' column indicates the TCGA project that data belongs to. See examples/ref_file.csv
for an example.
To extract patches from whole-slide images (WSIs), please use the script patch_gen_hdf5.py
.
An example script to run the patch extraction: scripts/extract_patch.sh
Note, the --start
and --end
parameters indicate the rows (WSIs) in the reference.csv file that need to be extracted. This is useful to execute the script in parallel.
To obtain resnet/uni features from patches, please use the script compute_features_hdf5.py
. The script converts each patch into a linear feature vector.
Note: if you use the UNI model, you need to follow the installation procedure in the original github and install the necessary required packages.
An example script to run the patch extraction: scripts/extract_resnet_features.sh
The next step once the resnet/uni features have been obtained is to compute the 100 clusters used as input for the model. They are computed per slide, so it is pretty straightforward, and it is pretty fast.
An example script to run the patch extraction: scripts/extract_kmean_features.sh
- Outputs from Step 2 and Step 3: features folder, this contains for each WSI a HDF5 file that stores both the features obtained using the resnet/uni (inside the resnet_features or uni_features dataset) as well as the output from the K-means algorithm (inside cluster_features dataset).
Expected run time: depend on the hardware (CPU/GPU) and the number of slides
To pretrain the weights of the model on normal tissues, please use the script pretrain_gtex.py
. The process requires an input reference.csv file, indicating the gene expression values for each WSI. See examples/ref_file.csv
for an example.
Our pre-trained checkpoint weights for SEQUOIA are available on HuggingFace. We release the weights for each cancer type, from each of the five folds. Patients that were present in the test set in each fold can be found in src/folds
. To load the SEQUOIA model weights (UNI features and linearized transformer aggregation), use:
Make sure to login to HuggingFace (See HuggingFace docs for more information):
from huggingface_hub import login
login()
Then:
from src.tformer_lin import ViS
cancer = 'brca'
i = 0 ## fold number
model = ViS.from_pretrained(f"gevaertlab/sequoia-{cancer}-{i}")
The gene names corresponding to the output can be found in the evaluation/gene_list.csv
file.
Now we can train the model from scratch or fine-tune it on the TCGA data. Here is an example bash script to run the process: scripts/run_train.sh
The parameters are explained within the main.py
file.
Some points that we want to emphasize:
- If you pre-trained on a dataset that contains a different number of genes than the finetuning dataset, you need to set the
--change_num_genes
parameter to 1 and specify in the--num_genes
parameter how many genes were used for pretraining. To indicate the path to the pretrained weights, use the--checkpoint
parameters. --model_type
is used to define the aggregation type. For the SEQUOIA model (linearized transformer aggregation) use 'vis'.
For running the benchmarked variations of the architecture:
- MLP aggregation: for this part we made use of the implementation from HE2RNA, which can be found in
he2rna.py
. An example run script is provided inscripts/run_he2rna.sh
- transformer aggregation: this model type is implemented in the
main.py
. use--model_type
'vit'.
Pearson correlation and RMSE values are calculated to compare the predicted gene expression values to the ground truth. The significantly well predicted genes are selected using correlation coefficient, p value, rmse, and by statistical comparisons to an untrained model with the same architecture.
Evaluation script: evaluation/evaluate_model.py
. Output: three dataframes all_genes.csv
: contains evaluation metrics for all genes, sig_genes.csv
: metrics for only the significant genes and num_sig_genes.csv
contains the number of significant genes per cancer type with this model.
Scripts for predicting spatial gene expression levels within the same tissue slide are wrapped in: spatial_vis
visualize.py
is the file to generate spatial predictions made with a saved SEQUOIA model.- the arguments are explained in the file. an example run file is provided in
scripts/run_visualize.sh
- output: the output is a dataframe that contains the following columns:
- xcoord: the x coordinate of a tile (absolute position of tile in the WSI -- note that adjacent tiles will have coordinates that are tile_width apart!) - ycoord: same as xcoord for the y - xcoord_tf: the x coordinate of a tile when transforming the original coordinates to start in the left upper corner at position x=0,y=0 and with distance 1 between tiles (i.e. next tile has coordinate x=1,y=0) - ycoord_tf: same as xcoord_tf for the y - gene_{x}: for each gene, there will be a column 'gene_{x}' that contains the spatial prediction for that gene of the model from fold {x}, with x = 1..num_folds - gene: for each gene there will also be a column without the _{x} part, which represents the average across the used folds
- the arguments are explained in the file. an example run file is provided in
get_emd.py
contains code to calculate the two dimensional Earth Mover's Distance between a prediction map (generated withvisualize.py
script) and ground truth spatial transcriptomics.gbm_celltype_analysis.py
contains (1) code to examine spatial co-expression of genes for the four meta-modules described in the paper; (2) code to visualize spatial organization of meta-modules on the considered slides.
© Gevaert's Lab MIT License