Implementation of our paper, "Prognostically Relevant Subtypes and Survival Prediction for Breast Cancer Based on Multimodal Genomics Data", submitted to IEEE Access in August 2019. This implementation uses a multimodal autoencoder (MAE) to predict the clinical status of breast cancer patients from multiplatform genomics data. The MAE is trained on DNA methylation, gene expression, and miRNA expression data, together with clinical outcomes, from The Cancer Genome Atlas (TCGA). It predicts two clinical outcomes:
- Breast cancer subtype, which is determined by the estrogen receptor (ER), progesterone receptor (PGR), and HER2/neu status.
- Survival rate (0 to 1, with 1 being the best chance of survival).

Requirements:

- Python 3
- TensorFlow
- Keras

DATASET_IDX | Data type | Data size (GB) |
---|---|---|
1 | DNA Methylation | 148 |
2 | Gene Expression | 9 |
3 | miRNA Expression | 0.24 |
4 | Gene Expression + miRNA Expression | 10 |
5 | DNA Methylation + Gene Expression + miRNA Expression | 162 |

Run the program with `python3 main_run.py <options>`, using the options below:

Option | Values | Details | Required |
---|---|---|---|
-p PLATFORM, --platform PLATFORM | int [1-2] | [1] TensorFlow, [2] Theano | yes |
-t TYPE, --type TYPE | int [1-2] | [1] Breast cancer type classification, [2] Survival rate regression | yes |
-d DATASET, --dataset DATASET | int [1-15] | [1] DNA Methylation GPL8490, [2] DNA Methylation GPL16304, [3] Gene Expression Count, [4] Gene Expression FPKM, [5] Gene Expression FPKM-UQ, [6] miRNA Expression, [7] Gene Expression Count + miRNA Expression, [8] Gene Expression FPKM + miRNA Expression, [9] Gene Expression FPKM-UQ + miRNA Expression, [10] DNA Met GPL8490 + Gene Count + miRNA, [11] DNA Met GPL16304 + Gene Count + miRNA, [12] DNA Met GPL8490 + Gene FPKM + miRNA, [13] DNA Met GPL16304 + Gene FPKM + miRNA, [14] DNA Met GPL8490 + Gene FPKM-UQ + miRNA, [15] DNA Met GPL16304 + Gene FPKM-UQ + miRNA | yes |
--pretrain_epoch PRE_EPOCH | int | Number of pre-training epochs. Default = 100 | no |
--train_epoch TRAIN_EPOCH | int | Number of training epochs. Default = 100 | no |
--batch BATCH | int | Batch size for pre-training and training. Default = 10 | no |
--pre_lr PRE_LR | float | Pre-training learning rate. Default = 0.01 | no |
--train_lr TRAIN_LR | float | Training learning rate. Default = 0.1 | no |
--dropout DROPOUT | float | Dropout rate. Default = 0.2 | no |
--pca PCA | int [1-2] | [1] Use PCA, [2] Don't use PCA. Default = [2] Don't use PCA | no |
--optimizer OPTIMIZER | int [1-3] | [1] Stochastic gradient descent, [2] RMSProp, [3] Adam. Default = [1] Stochastic gradient descent | no |
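
For readers who want to see how an option table like this maps onto code, the following is a minimal sketch of an equivalent `argparse` parser. It is an assumption-based reconstruction from the table only: the function name `build_parser` is hypothetical, and the actual parser inside `main_run.py` may use different types, validation, or help text. The learning rates and dropout are parsed as floats here because their defaults are fractional.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the README option table.

    This is NOT taken from main_run.py; it only restates the documented
    flags, value ranges, and defaults.
    """
    parser = argparse.ArgumentParser(prog="main_run.py")
    # Required options
    parser.add_argument("-p", "--platform", type=int, choices=[1, 2],
                        required=True, help="[1] TensorFlow, [2] Theano")
    parser.add_argument("-t", "--type", type=int, choices=[1, 2],
                        required=True,
                        help="[1] subtype classification, [2] survival regression")
    parser.add_argument("-d", "--dataset", type=int, choices=range(1, 16),
                        required=True, metavar="DATASET",
                        help="dataset index, 1 to 15")
    # Optional options with the documented defaults
    parser.add_argument("--pretrain_epoch", type=int, default=100)
    parser.add_argument("--train_epoch", type=int, default=100)
    parser.add_argument("--batch", type=int, default=10)
    parser.add_argument("--pre_lr", type=float, default=0.01)
    parser.add_argument("--train_lr", type=float, default=0.1)
    parser.add_argument("--dropout", type=float, default=0.2)
    parser.add_argument("--pca", type=int, choices=[1, 2], default=2)
    parser.add_argument("--optimizer", type=int, choices=[1, 2, 3], default=1)
    return parser


# Parse a sample argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--platform", "1", "--type", "1", "--dataset", "1", "--pca", "1"]
)
print(args.platform, args.type, args.dataset, args.pca, args.optimizer)
```

Using `choices` lets `argparse` reject out-of-range values (for example `--dataset 16`) before any data is loaded.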
To perform breast cancer subtype classification on the PCA-reduced DNA methylation dataset using the TensorFlow platform, issue the following command from the terminal:
python3 main_run.py --platform 1 --type 1 --dataset 1 --batch 10 --pretrain_epoch 5 --train_epoch 5 --pca 1 --optimizer 3
The preceding command sets:

- 10 as the batch size
- 5 as the number of pre-training epochs
- 5 as the number of fine-tuning (training) epochs
- 3 as the index of the Adam optimizer
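
When running several configurations, it can help to assemble the command programmatically. The sketch below is one possible way to do so; `build_command` is a hypothetical helper that is not part of this repository, and it only uses flags listed in the option table.

```python
import shlex


def build_command(platform, task_type, dataset, **optional):
    """Return the argument list for one main_run.py run (hypothetical helper).

    Keyword arguments are passed through as --name value pairs, so their
    names must match the documented long options (e.g. batch, pca).
    """
    cmd = ["python3", "main_run.py",
           "--platform", str(platform),
           "--type", str(task_type),
           "--dataset", str(dataset)]
    for name, value in optional.items():
        cmd += [f"--{name}", str(value)]
    return cmd


# Rebuild the example command from this README.
cmd = build_command(1, 1, 1, batch=10, pretrain_epoch=5,
                    train_epoch=5, pca=1, optimizer=3)
print(shlex.join(cmd))
# To launch the run, pass the list to subprocess.run(cmd, check=True).
```

Passing the list form to `subprocess.run` avoids shell quoting issues; `shlex.join` (Python 3.8+) is used here only to display the command.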