Pytorch implementation of DietNetwork (https://arxiv.org/abs/1611.09340)
- create_dataset.py : Create dataset and partition data into folds. The script takes snps.txt and labels.txt files as input to create dataset.npz and folds_indexes.npz
- generate_embedding.py : Takes dataset.npz and folds_indexes.npz files created in the previous step and computes the embedding (genotypic frequency) of every fold. Embedding of each fold is saved in embedding.npz
- Missing values are -1 and are not included in the computation of genotypic frequencies embedding
- Embedding values are computed on train and valid sets
- train.py : Whole training process. The data is divided in train/valid and test sets. Performance is reported on the test set.
- Data preprocessing of auxiliary net : Square Euclidean distance normalization
- Data preprocessing of discrim net: Missing values are replaced by the mean of the feature computed on training set. Data normalization (standardization) using mean and sd computed on training set.
- test_external_dataset.py : Test model on an external set, ie on individuals that are not part of dataset.npz
- evaluate.py : Utilities to visualize the model performance such as confusion matrix
- dataset_utils.py : Data related functions (shuffle, partition, split, get_fold_data, replace_missing_values, normalize, ...)
- model.py : Model definition of feature embedding (auxiliary) and discriminative (main) networks.
- mainloop_utils.py : Function used in the training loop (get_predictions, compute_accuracy, eval_step, ...)
- log_utils.py : Utilities to save data (model summary and parameters, experiment parameters, predictions, etc.)
- test_utils.py : Utilities related to testing a trained model on an external set
- snps.txt : File of genotypes in additive encoding format and tab-separated.
- labels.txt : File of samples and their label.
- dataset.npz : Dataset created from the parsed snps.txt and labels.txt files.
- folds_indexes.npz : Array index (arrays are in dataset.npz) for each fold. The indexes are those of the data points to use as test.
- embedding.npz : Computed embeddings of each fold.
- exp_params.log : Experiment parameters (fixed seed, learning rate, number of epochs, etc.)
- model_summary.log : Model information (number of hidden layers, number of neurons in each layers, activation functions, etc.)
- model_params.pt : Model parameters of final trained model
- model_predictions.npz: Scores and predictions returned by the trained model for test samples
- additional_data.npz : Some more information used at training time (mus and sigmas values used for normalization, feature names, label names, training samples ids, validation samples ids, etc.)
- Embedding
- Data preprocessing : Missing values
- Data preprocessing : Data normalization
- Dataset class (for dataloader)
- Auxiliary and Main networks models
- Training loop
- Loss/Accuracy monitoring of train and valid
- Early stopping
- Test for in-sample data
- Test in-sample with missing values rates
- Test for out-of-sample data
- Save model params, results
- Python 3.6
- torch 1.5.0+cu101
- numpy 1.19.1
- pandas 1.1.0
- matplotlib 3.2.1
- captum 0.2.0 (https://captum.ai/)
- h5py 2.10.0