This project was done under the guidance of Dr. Sundaresan Raman, BITS Pilani, as part of the course CS F376 (Design Oriented Project) in the Second Semester of AY 21-22.
We explored three methods to classify risk factors for Diabetic Retinopathy (DR) as being of Primary or Secondary importance. Of the three approaches, we found the most success with the Autoencoder on the SN-DREAMS dataset for DR.
The SN-DREAMS dataset (Dataset Link) contains 13 risk factors (columns) plus a 14th column indicating DR. Of these 13 factors, 4 are categorical and 9 are continuous, and the data is available for 1555 patients (rows). However, the data is imbalanced, since rows with DR = 1 are sparse; to combat this, SMOTE-ENN and standardization are used, as sketched below.
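A minimal sketch of this preprocessing step, assuming the dataset is available as a CSV whose last column is the DR indicator; the file name `sn_dreams.csv` and the use of imbalanced-learn's `SMOTEENN` with default settings are assumptions, not details confirmed by the project.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from imblearn.combine import SMOTEENN

# Hypothetical file name; the first 13 columns are risk factors, the 14th is DR.
df = pd.read_csv("sn_dreams.csv")
X = df.iloc[:, :13].values   # 13 risk factors
y = df.iloc[:, 13].values    # DR indicator (0/1)

# Standardize so the continuous features are on comparable scales.
X_scaled = StandardScaler().fit_transform(X)

# SMOTE-ENN oversamples the sparse DR = 1 class with SMOTE and then cleans
# noisy samples with Edited Nearest Neighbours.
X_res, y_res = SMOTEENN(random_state=42).fit_resample(X_scaled, y)
print(X_res.shape, int(y_res.sum()))   # resampled size and positive count
```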
Furthermore, we use the expert-labelled Primary/Secondary clusters (File Link) as the ground truth to evaluate the results of each approach; an illustrative comparison is sketched below.
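The snippet below illustrates how any approach's per-factor assignments can be scored against the expert labels. The two label arrays are purely illustrative placeholders, not the contents of the expert file.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# One entry per risk factor (13 in total); 1 = Primary, 0 = Secondary.
expert = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0])      # illustrative
predicted = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0])   # illustrative

print(confusion_matrix(expert, predicted))
print("agreement with expert labels:", accuracy_score(expert, predicted))
```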
The first approach was simply K-Means clustering (k = 2, k-means++ initialization) from the sklearn library, with t-SNE used for visualization (see below). However, nearly half of the predictions were wrong, and the clustering did not match the true labels.
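A sketch of this approach, reusing the resampled matrix `X_res` from the preprocessing sketch above; the `random_state` values and plotting details are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Two clusters with k-means++ initialization, as in the project.
kmeans = KMeans(n_clusters=2, init="k-means++", random_state=42)
cluster_ids = kmeans.fit_predict(X_res)

# t-SNE is used only to project the data to 2-D for visualization;
# it does not influence the clustering itself.
X_2d = TSNE(n_components=2, random_state=42).fit_transform(X_res)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=cluster_ids, s=5, cmap="coolwarm")
plt.title("t-SNE projection coloured by K-Means cluster")
plt.show()
```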
Figure: K-Means Clustering and the resulting Confusion Matrix

For the second approach, we used a 70:30 train-test split and KNN classification (k = 5, Minkowski distance). The ROC-AUC score of each of the 13 features served as the metric: if a feature's score lay above a threshold, that factor was classified as primary; otherwise it was classified as secondary. However, 8 of the 13 predictions were incorrect. A sketch of this approach follows.
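A sketch of this approach, again reusing `X_res` and `y_res` from the preprocessing sketch. Pairing each feature's values with the DR label for the per-feature ROC-AUC, and the 0.55 cut-off, are assumptions; the project's exact threshold is not stated here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.3, random_state=42)

# k = 5 with the Minkowski metric (p = 2, i.e. Euclidean, is sklearn's default).
knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski")
knn.fit(X_train, y_train)
print("KNN test accuracy:", knn.score(X_test, y_test))

# Univariate ROC-AUC of each of the 13 risk factors against the DR label.
feature_auc = np.array(
    [roc_auc_score(y_test, X_test[:, j]) for j in range(X_test.shape[1])])

threshold = 0.55   # illustrative threshold
assignment = np.where(feature_auc > threshold, "Primary", "Secondary")
print(dict(enumerate(assignment)))
```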
Figure: KNN Classification and the resulting Confusion Matrix

For the third approach, we again used a 70:30 train-test split and a Standard Scaler. The autoencoder had 2 fully connected (Dense) layers: the code layer with 7 neurons and the output layer with 14 neurons, giving 217 parameters in total. The autoencoder was trained with the Adam optimizer for 15 epochs, using Mean Absolute Error as the metric. A sketch of the architecture is given below.
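A sketch of this architecture, assuming a Keras implementation. The ReLU/linear activations, batch size, and the random stand-in data are assumptions; the layer sizes, however, reproduce the 217-parameter count (14×7 + 7 weights and biases into the code layer, 7×14 + 14 out of it).

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# 14 inputs -> 7-neuron code layer -> 14-neuron output layer (217 parameters).
autoencoder = keras.Sequential([
    keras.Input(shape=(14,)),
    layers.Dense(7, activation="relu", name="code"),       # bottleneck / code layer
    layers.Dense(14, activation="linear", name="output"),
])
autoencoder.compile(optimizer="adam", loss="mae")          # Adam + Mean Absolute Error
autoencoder.summary()                                      # should report 217 params

# Random stand-in for the scaled 14-column matrix (13 risk factors + DR);
# in the project the real, standardized data is used with a 70:30 split.
X_demo = np.random.rand(1555, 14).astype("float32")
autoencoder.fit(X_demo, X_demo, epochs=15, batch_size=32,
                validation_split=0.3, verbose=0)
```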
Figure: Autoencoder Structure and Parameters

The weights learnt by the hidden (code) layer were used to assign a score to each of the 13 risk factors. The median of these scores was used as a threshold: factors with scores above it were assigned "Primary" and the others "Secondary", as shown below. Since the training process is non-deterministic, the results vary between runs; on a good run, 9 or 10 of the 13 risk factors are classified correctly. The learnt weights can also be saved and reloaded, rather than retraining the neural network each time.
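A sketch of this scoring step, continuing from the autoencoder above. Using the row-wise L1 norm of the code-layer weight matrix as the per-factor score is an assumption about how the weights are aggregated, and the weight file name is hypothetical.

```python
import numpy as np

# Kernel W has shape (14, 7): one row per input column, one column per code neuron.
W, b = autoencoder.get_layer("code").get_weights()
scores = np.abs(W[:13, :]).sum(axis=1)   # one score per risk factor (first 13 rows)

median = np.median(scores)
assignment = np.where(scores > median, "Primary", "Secondary")
for i, (s, a) in enumerate(zip(scores, assignment)):
    print(f"risk factor {i}: score = {s:.3f} -> {a}")

# Persist the learnt weights so the network need not be retrained every run.
autoencoder.save_weights("autoencoder.weights.h5")    # hypothetical file name
# autoencoder.load_weights("autoencoder.weights.h5")
```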
- Please refer to the PPTs for a more detailed analysis of the three Feature Selection methods and their results.