This project jointly analyzes blood RNA-seq gene expression and CT scan images from 1,223 subjects in the COPDGene study to identify shared aspects of inflammation and lung structural changes, which we refer to as Image-Expression Axes (IEAs). We extracted CT image features using context-aware self-supervised representation learning (CSRL). These features were then tested for association with gene expression levels to select genes for further analysis. For the selected genes, we trained a deep learning model to identify IEAs that capture distinct patterns of association between CSRL features and blood gene expression. We then related these axes to cross-sectional COPD-related features and prospective health outcomes through regression and Cox proportional hazards models.
We identified two distinct IEAs that capture most of the relationship between CT images and blood gene expression: IEAemph captures an emphysema-predominant process with a strong positive correlation to CT emphysema and a negative correlation to FEV1 and Body Mass Index (BMI); IEAairway captures an airway-predominant process with a positive correlation to BMI and airway wall thickness and a negative correlation to emphysema. Pathway enrichment analysis identified 29 and 13 pathways significantly associated with IEAemph and IEAairway, respectively (adjusted p<0.001).
The data analysis consists of four steps: 1) register and patchify the CT scans; 2) extract features from the processed CT scans; 3) select genes based on the association between gene expression and the extracted image features; 4) train the deep learning model that identifies IEAs from the image features and the expression levels of the selected genes. This repository provides the code for steps 3 and 4. The code for steps 1 and 2 is available at https://github.com/batmanlab/Context_Aware_SSL.
The IEAs can be generated with the following steps:
git clone https://github.com/batmanlab/IEA.git
cd IEA
conda env create -f environment.yml -n IEA
conda activate IEA
python ./src/gene_selection.py
python ./src/train_IEA.py
To skip model training, download the pre-trained models by running the following code:
curl -L "https://docs.google.com/uc?export=download&id=1qeMC8y2jRU7iI0raWoT1YJZktNqT0S-y" --output primary_models.zip
unzip -o primary_models.zip
python ./src/summarize_cv.py
Generating Supplemental Table E1 requires varying the adjusted p-value threshold used for gene selection and training an IEA model for each resulting set of selected genes. To train these models, run the following script:
chmod +x ./src/gene_thresholds.sh
./src/gene_thresholds.sh
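As a rough illustration of how the adjusted p-value threshold changes the selected gene set, here is a minimal sketch in pure Python. It assumes Benjamini-Hochberg adjustment, which is an assumption and not necessarily the adjustment used in `gene_selection.py`. The functions `bh_adjust` and `select_genes`, the gene names, and the p-values are all hypothetical.

```python
from typing import Dict, List, Sequence


def bh_adjust(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    Assumption: BH is used here only for illustration; the repository's
    gene selection script may use a different adjustment.
    """
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_genes(gene_pvalues: Dict[str, float], threshold: float) -> List[str]:
    """Keep genes whose adjusted p-value is below the threshold."""
    genes = list(gene_pvalues)
    adjusted = bh_adjust([gene_pvalues[g] for g in genes])
    return [g for g, q in zip(genes, adjusted) if q < threshold]


# Hypothetical per-gene association p-values (illustrative only).
example = {"GENE_A": 0.0001, "GENE_B": 0.004, "GENE_C": 0.03, "GENE_D": 0.2}
for threshold in (0.001, 0.01, 0.05):
    print(threshold, select_genes(example, threshold))
```

A looser threshold keeps more genes, so each threshold yields a different input gene set for IEA training.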
Generating Figure 2 requires varying the number of IEAs when training the IEA models. To train these models, run the following script:
chmod +x ./src/num_IEAs.sh
./src/num_IEAs.sh
These two scripts may take a long time to run, so we recommend parallelizing them.
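One simple way to parallelize is to launch the two scripts as separate processes at the same time. The sketch below uses only the Python standard library. The helper `run_in_parallel` is hypothetical and not part of this repository, and it assumes the scripts do not write to the same output files or compete for the same GPU memory.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Sequence


def run_in_parallel(commands: Sequence[Sequence[str]], max_workers: int = 2) -> Dict[str, int]:
    """Run each command as its own process and return its exit code."""

    def run(cmd: Sequence[str]) -> int:
        # Each thread only waits on a child process, so threads are sufficient.
        return subprocess.run(list(cmd), check=False).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        codes = list(pool.map(run, commands))
    # Map each command line to the exit code of its process.
    return {" ".join(cmd): code for cmd, code in zip(commands, codes)}


# Example call (not executed here), using the two scripts named above:
# run_in_parallel([["./src/gene_thresholds.sh"], ["./src/num_IEAs.sh"]])
```

This runs the two scripts side by side. The individual training runs inside each script remain sequential unless the scripts themselves are split up.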
The user can skip model training and download the pre-trained models with the following code:
curl -L "https://docs.google.com/uc?export=download&id=10-JQ3R4hJmC1nXhzucedr2hMAFkOHoHn" --output models.zip
unzip -o models.zip
The primary results in the tables and figures can be regenerated with the following Python notebooks:
Table 1 Subject characteristics in training and test data.
Table 2 Pearson correlation coefficients between image-expression axes (IEAs) and COPD-related characteristics and health outcomes.
Table 3 Multivariable associations of image-expression axes (IEAs) to continuous COPD-related characteristics and health outcomes.
Table 4 Multivariable associations of image-expression axes (IEAs) to Frequent Exacerbations and Mortality.
Figure 2 Variance of gene expression explained by IEAs as the number of IEAs varies.
Figure 4 Distribution of IEAemph and IEAairway values grouped by previously published COPD K-means clustering subtypes.
Figure 5 Histograms for the variance of gene expression explained by the IEAs and PCA-Is.
Table E1 Cross-validation performance in IEA training.
Table E2 Linear Regression with image-expression axes (IEAs) and COPD measurements with COPDGene visit 2 data.
Table E3 Logistic regression and Cox proportional hazards models with image-expression axes (IEAs) and COPD measurements with COPDGene visit 2 data.
Table E4 Pearson’s correlation between image-expression axes (IEAs) and COPD-related characteristics and health outcomes, measured on 1,527 subjects from another subset of the COPDGene dataset that had not been used for model training.
Table E5 Linear Regression with image-expression axes (IEAs) and COPD measurements with 1,527 subjects from another subset of the COPDGene dataset that had not been used for model training.
Table E6 Logistic regression and Cox proportional hazards models with image-expression axes (IEAs) and COPD measurements with 1,527 subjects from another subset of the COPDGene dataset that had not been used for model training.
Table E7 Characteristics of subgroups defined by dividing the Image-Expression Axes (IEAs) into quadrants.
Table E8 Covariances between Image-expression Axes (IEAs), factor analysis axes (FAs), and PCA image-only axes (PCA-Is) on COPDGene visit 1 data.
Table E9 Pearson correlation coefficients among image-expression axes (IEAs), factor analysis axes (FAs) and PCA image only axes (PCA-Is), COPD-related characteristics and health outcomes.
Table E10 Linear regression analysis with image-expression axes (IEAs) and factor analysis axes (FAs) on COPDGene visit 1 data.
Table E11 Logistic regression and Cox model with image-expression axes (IEAs) and factor analysis axes (FAs) on COPDGene visit 1 data.
Table E12 Linear regression analysis with image-expression axes (IEAs) and PCA Image Only Axes (PCA-Is) on COPDGene visit 1 data.
Table E13 Logistic regression and Cox model with image-expression axes (IEAs) and PCA Image Only Axes (PCA-Is) on COPDGene visit 1 data.
Table E18 The correlation between perc15 ratio and IEAs.
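Figure 2 and Figure 5 report the variance of gene expression explained by the axes. As a minimal sketch of one common definition of that quantity (the coefficient of determination, R^2), assuming the predicted expression values come from a model fitted on the axes, it could be computed as follows. The function `explained_variance` is hypothetical, and the notebooks may compute this quantity differently.

```python
from typing import Sequence


def explained_variance(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Fraction of the variance in `observed` explained by `predicted` (R^2)."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    mean_obs = sum(observed) / len(observed)
    # Total sum of squares around the mean of the observed values.
    ss_total = sum((y - mean_obs) ** 2 for y in observed)
    # Residual sum of squares between observed and predicted values.
    ss_resid = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1.0 - ss_resid / ss_total


# Toy values for one gene (illustrative only, not study data).
observed = [1.0, 2.0, 3.0, 4.0]
print(explained_variance(observed, observed))    # perfect prediction
print(explained_variance(observed, [2.5] * 4))   # predicting the mean only
```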