---
title: "Supervised ML in clinical applications"
subtitle: "EXMD 601 McGill University"
author: "Alton Russell"
date: "27 March 2024"
format: revealjs
editor: visual
---
## R packages used
```{r}
#| echo: true
# install.packages(c("tidyverse", "tidymodels", "ranger", "caret", "vip"))
library(tidyverse)
library(tidymodels)
library(ranger)
library(caret) #for calibration plot
library(vip) #for variable importance plot
theme_set(theme_bw())
```
## Agenda
- **Types of supervised learning**
- Model development and selection
- Clinically useful models
## Supervised learning in a nutshell
- **Training:** Learn to predict on **labeled examples**
- Model maps features (covariates) to outcome (label)
- Can be complex, non-linear functions
- **Deployment:** Generate prediction for new examples
![](stroke_neural_net_example.png)
## ML vs. statistical models
::: columns
::: {.column width="60%"}
- Learn relationship between variables from data (not pre-specified)
- Allow complex non-linear relationships
- Less interpretable
- Fewer theoretical guarantees
- Generally, not suited for answering causal questions
:::
::: {.column width="40%"}
![](XKCD_ml.png)
:::
:::
## Regression
::: columns
::: {.column width="50%"}
- Predict a **continuous** value (hemoglobin level, length of stay)
- Statistics analog: linear regression
:::
::: {.column width="50%"}
![](regression_example.png)
:::
:::
## "Classification" or risk prediction
::: columns
::: {.column width="55%"}
- Predict a **categorical** outcome/event (death, recurrence, rehospitalization)
- Statistics analog: logistic regression
- Can be binary (cancer/no cancer) or multiclass (which bacteria is causing the urinary tract infection)
:::
::: {.column width="45%"}
![](classification_plot.png)
:::
:::
## Classifying vs. predicting risk
- Strict classification returns the most likely outcome
- "Patient will get pneumonia"
- Ignores uncertainty 😔
- Estimated risk \>\> strict classification
- 51% vs. 99% chance of pneumonia can have very different clinical implications
- [**Always**]{.underline} question classification models that don't provide ***probabilistic*** estimates
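## Sketch: class vs. probability output
A minimal sketch of the difference, using a simple logistic regression on R's built-in `mtcars` data rather than a clinical dataset; the 0.5 cutoff is an arbitrary illustrative choice.
```{r}
#| echo: true
# Logistic regression on built-in data (illustration only, not the CTG example)
fit <- glm(am ~ wt, data = mtcars, family = binomial)
new_car <- data.frame(wt = 2.8)
p_hat <- predict(fit, new_car, type = "response") # estimated probability
p_hat
ifelse(p_hat > 0.5, "manual", "automatic") # strict class label discards the uncertainty
```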
## Visualizing estimated risk
![](pdp-cervical-2d.jpeg)
[Interpretable Machine Learning by Molnar](https://christophm.github.io/interpretable-ml-book/) (partial dependence plot showing probability of cervical cancer given the interaction between age and number of pregnancies)
## Computer vision
::: columns
::: {.column width="50%"}
- Analyze pixel data
- Often, the goal is to outline objects and assign a label
- Person, car or tree
- Abnormality, tumor
- Used for X-ray, ultrasound, microscopy images
:::
::: {.column width="50%"}
![](computer_vision_xray.PNG)
[Kundu et al. 2021](https://doi.org/10.1371/journal.pone.0256630)
:::
:::
## Generative AI/Large language models
::: columns
::: {.column width="70%"}
![](LLM_model_tuning.png)
:::
::: {.column width="30%"}
- Trained to mimic human text.
- Prone to bias and can 'hallucinate'!
- Can save time when outputs are vetted by an expert
:::
:::
[Thirunavukarasu et al. Nature Medicine 2023](https://doi.org/10.1038/s41591-023-02448-8)
## Agenda
- Types of supervised learning
- **Model development and selection**
- Clinically useful models
## Our example data[^1]
[^1]: Dataset: [Fetal health classification on Kaggle](https://www.kaggle.com/datasets/andrewmvd/fetal-health-classification). Images: [Thinkstock](https://www.babycenter.in/x1045384/what-is-cardiotocography-ctg-and-why-do-i-need-it); [geekymedics.com](https://geekymedics.com/how-to-read-a-ctg/)
Tabular data derived from **cardiotocograms (CTGs)** from 2126 pregnant patients.
Outcome: fetal health classified as normal vs. suspect or pathological.
::: columns
::: {.column width="50%"}
![](CTG_device.jpeg)
:::
::: {.column width="50%"}
![](CTG_output.jpeg)
:::
:::
## Our example data
```{r}
#| echo: true
dt <- read_csv("fetal_health.csv") |>
mutate(fetal_health = as.factor(ifelse(fetal_health==1,"Normal", "Abnormal")))
str(dt)
```
## Model development must avoid over-/under-fitting
![](overfitting_classifier.jpeg)
[raaicode.com](https://www.raaicode.com/what-is-overfitting/)
## Holding out a test set
- At start, set aside some data as test set
- Use the remaining data to train model
- At the [**very end**]{.underline}, run [**only one**]{.underline} [**final model**]{.underline} on the test set for an unbiased estimate of performance
![](train_test_split.png)
## Splitting with rsample
```{r}
#| echo: true
set.seed(125) #for reproducibility
ctg_split <- initial_split(dt, prop = 0.8)
ctg_split
# Distribution of outcome in the training set
table(training(ctg_split)$fetal_health)/nrow(training(ctg_split))
# Distribution of outcome in the test set
table(testing(ctg_split)$fetal_health)/nrow(testing(ctg_split))
```
## Stratifying data split
Stratifying is easy and ensures data in each split are representative
```{r}
#| echo: true
# Split while stratifying on outcome
set.seed(456)
ctg_split_strat <- initial_split(dt, prop = 0.8, strata = fetal_health)
# Distribution of outcome in the training set
table(training(ctg_split_strat)$fetal_health)/nrow(training(ctg_split_strat))
# Distribution of outcome in the test set
table(testing(ctg_split_strat)$fetal_health)/nrow(testing(ctg_split_strat))
```
## Model selection
- If trying just one model configuration, use full training data
- Usually, want to select best model from several **configurations** (algorithm + hyperparameters)
- Algorithm: type of model (e.g., random forest)
- Hyperparameter: 'setting' of that model (e.g., minimum node size)
- Further steps needed to avoid biased model selection in training data
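## Sketch: fitting a single configuration
If we only wanted one configuration, a minimal sketch (reusing the stratified split from earlier) could look like this; the 500 trees and default `mtry`/`min_n` are arbitrary illustrative choices, not a recommended setting.
```{r}
#| echo: true
# One random forest with fixed hyperparameters, trained on the full training set
# (no tuning, so no cross validation is needed)
set.seed(456)
rf_single <- workflow() |>
  add_recipe(recipe(fetal_health ~ ., data = training(ctg_split_strat))) |>
  add_model(rand_forest(trees = 500, mode = "classification") |> set_engine("ranger")) |>
  fit(data = training(ctg_split_strat))
rf_single
```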
## Cross validation
![](cross_validation_overview.png)
[Statology](https://www.statology.org/validation-set-vs-test-set/)
## Cross validation
Randomly assign each example to a 'fold'
![](three-CV.svg)
[workshops.tidymodels.org](https://workshops.tidymodels.org)
## Cross validation
::: columns
::: {.column width="75%"}
![](three-CV-iter.svg)
:::
::: {.column width="25%"}
Select the model with the best **average** performance across the CV folds
:::
:::
[workshops.tidymodels.org](https://workshops.tidymodels.org)
## Split training set for cross validation
Multiple repeats of CV can further protect against choosing a model that 'got lucky'.
```{r}
#| echo: true
set.seed(456)
ctg_folds <- vfold_cv(training(ctg_split_strat),
                      v = 3,                  # three folds
                      strata = fetal_health,  # stratified on outcome
                      repeats = 2)            # two repeats
ctg_folds
```
## Evaluating models
- Performance metrics quantify how 'good' the model is for a given dataset
- Perfectly predicting outcome = perfect performance
- Error metrics: smaller is better (0 error = perfect)
- Positive metrics: bigger is better
- Metrics differ for regression vs. classification
## Regression metrics
::: columns
::: {.column width="50%"}
For each example, the error is the difference between the actual outcome $y$ and the predicted outcome $\hat{y}$
- Mean absolute error: $\frac{1}{n} \sum \mid y - \hat{y} \mid$
- Mean squared error: $\frac{1}{n} \sum [y - \hat{y}]^2$
- Root mean squared error: $\sqrt{\frac{1}{n} \sum [y - \hat{y}]^2}$
:::
::: {.column width="50%"}
![](rmse_vs_mae.webp)
:::
:::
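## Sketch: regression metrics in R
A minimal sketch on made-up numbers (the CTG outcome is categorical, so this is illustration only); `mae()` and `rmse()` are yardstick functions loaded with tidymodels.
```{r}
#| echo: true
# Toy actual vs. predicted values (made-up numbers)
toy_reg <- tibble(y = c(10, 12, 9, 14), y_hat = c(11, 11, 10, 17))
mean(abs(toy_reg$y - toy_reg$y_hat))       # mean absolute error
sqrt(mean((toy_reg$y - toy_reg$y_hat)^2))  # root mean squared error
mae(toy_reg, truth = y, estimate = y_hat)  # same via yardstick
rmse(toy_reg, truth = y, estimate = y_hat)
```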
## Classification metrics: confusion matrix-based
::: columns
::: {.column width="65%"}
**After dichotomizing predicted risk**, you can create a confusion matrix
Many metrics are derived from it
- Accuracy: $\frac{TP+TN}{TP+TN+FP+FN}$
- Precision: $\frac{TP}{TP+FP}$
- Recall: $\frac{TP}{TP+FN}$
:::
::: {.column width="35%"}
![](confusion_matrix.webp)
[Harikrishnan N B Medium](https://medium.com/analytics-vidhya/confusion-matrix-accuracy-precision-recall-f1-score-ade299cf63cd#:~:text=What%20is%20the%20accuracy%20of,the%20accuracy%20will%20be%2085%25.)
:::
:::
::: callout-caution
## Often, we do not want to dichotomize before selecting a model
:::
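## Sketch: confusion matrix metrics in R
A minimal toy sketch with made-up labels; the `yardstick::` prefixes are there because caret (loaded above) also exports `precision()` and `recall()`.
```{r}
#| echo: true
# Made-up dichotomized predictions and true outcomes
toy_cls <- tibble(
  truth = factor(c("Abnormal", "Abnormal", "Normal", "Normal", "Normal", "Abnormal")),
  pred  = factor(c("Abnormal", "Normal",   "Normal", "Normal", "Abnormal", "Abnormal"))
)
conf_mat(toy_cls, truth = truth, estimate = pred)
yardstick::accuracy(toy_cls, truth = truth, estimate = pred)
yardstick::precision(toy_cls, truth = truth, estimate = pred) # first level ("Abnormal") is the event
yardstick::recall(toy_cls, truth = truth, estimate = pred)
```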
## Area under the ROC curve
::: columns
::: {.column width="60%"}
Measure of **discrimination**: higher when examples with a positive outcome are assigned higher risk scores
- -,-,-,-,-,+,+,+,+ $\rightarrow$ AUC = 1 (perfectly discriminates +'s from -'s)
- -,+,-,+,-,+,-,+,-,+ $\rightarrow$ AUC $\approx$ 0.5 (little better than chance)
:::
::: {.column width="40%"}
![](ROC_curve.png)
:::
:::
Does not measure **calibration** (how closely the predicted probabilities match actual risk)
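## Sketch: AUC on toy orderings
A minimal toy sketch of how orderings like those above translate into AUC, using yardstick's `roc_auc_vec()`; the labels and risk scores are made up.
```{r}
#| echo: true
# Eight made-up examples ordered by increasing risk score
risk <- (1:8) / 10
separated   <- factor(c("-", "-", "-", "-", "+", "+", "+", "+"), levels = c("+", "-"))
interleaved <- factor(c("+", "-", "-", "+", "+", "-", "-", "+"), levels = c("+", "-"))
roc_auc_vec(separated, risk)   # 1: positives always scored above negatives
roc_auc_vec(interleaved, risk) # 0.5: no discrimination
```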
## Performance in train, test, validate data
- Performance on [**training data**]{.underline} optimistic due to overfitting
- Use for nothing
- Top model performance on **validation data** (e.g., cross validation folds) can still be optimistic if many configurations are assessed
- Use for selection only
- Performance on [**test data**]{.underline} only unbiased measure of performance
## [tidymodels](https://www.tidymodels.org/find/parsnip/)
![](parsnip_models.png)
## Random forest model
- Ensembles (combines) many decision trees
- To develop each tree, model randomly selects:
- Which training examples to use (bootstrapping)
- Random subset of covariates to consider for each branch
- Tree 'votes' are counted to estimate probability of outcome
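## Sketch: the same knobs in ranger
Before the tidymodels interface on the following slides, a minimal sketch of those ingredients called directly through `ranger()` on R's built-in `iris` data; the settings are arbitrary illustrative choices.
```{r}
#| echo: true
# Each tree is grown on a bootstrap sample of rows, and a random subset of
# covariates (mtry) is considered at each split; probability = TRUE returns
# the share of tree 'votes' as class probabilities
set.seed(456)
rf_sketch <- ranger(Species ~ ., data = iris,
                    num.trees = 500,    # trees in the ensemble
                    mtry = 2,           # covariates considered at each split
                    min.node.size = 5,  # stop splitting small nodes
                    probability = TRUE)
rf_sketch$prediction.error  # out-of-bag error (Brier score for probability forests)
```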
## Decision tree example
![](tree-example.svg)
Outcome: hockey shot on goal. [workshops.tidymodels.org](https://workshops.tidymodels.org)
## Random forest in tidymodels
![](rand_forest_parsnip.png)
## Random forest arguments
![](parsnip_RF_arguments.png)
## Tuning random forest model
```{r}
#| echo: true
# Specify a recipe (prediction task as formula;
# can also include preprocessing)
ctg_recipe <- recipe(fetal_health ~ ., data = dt)
#specify random forest model for tuning
rf_tune_spec <- rand_forest(mtry = tune(),
                            trees = 1000,
                            min_n = tune(),
                            mode = "classification")
rf_tune_spec
```
## Grid of hyperparameter settings
```{r}
#| echo: true
# Create grid of hyperparameters for tuning
rf_grid <- grid_regular(
  mtry(range = c(5, 15)),  # number of predictors sampled at each split of a tree
  min_n(range = c(2, 8)),  # minimum datapoints in a node for a further split
  levels = 3
)
rf_grid
```
## Develop & select random forest model
```{r}
#| echo: true
# Tune the model
set.seed(456)
rf_tune_results <- tune_grid(
  rf_tune_spec,
  ctg_recipe,
  resamples = ctg_folds,
  grid = rf_grid
)
```
- 2 repeats of 3-fold cross validation
- Grid search over \[5, 10, 15\] randomly-selected predictors at each branch (`mtry`) and \[2, 5, 8\] as minimum node size (`min_n`)
How many random forest models will we train before selecting the top model?
## Compare AUC by hyperparameter setting
```{r}
#| echo: true
autoplot(rf_tune_results)
```
## Evaluate top configuration in test set
```{r}
#| echo: true
show_best(rf_tune_results, metric="roc_auc", n=3)
best_auc <- select_best(rf_tune_results, metric = "roc_auc")
# Specify a model with the best hyperparameters
rf_best_spec <- rand_forest(mtry = best_auc$mtry,
                            trees = 1000,
                            min_n = best_auc$min_n,
                            mode = "classification") |>
  set_engine("ranger", importance = "impurity")
# Train the top configuration on the full training set; predict on the test set
rf_test_results <- last_fit(
  rf_best_spec,
  ctg_recipe,
  split = ctg_split_strat)
```
## Predict on test set
```{r}
#| echo: true
# Estimate unbiased performance on test set
rf_test_results %>% collect_metrics()
# Compare predicted risk to actual outcome
preds <- predict(extract_workflow(rf_test_results), testing(ctg_split_strat), type = "prob")
dt_pred_outcome <- cbind(preds,
                         truth = testing(ctg_split_strat)$fetal_health)
head(dt_pred_outcome, 5)
```
## Plot the ROC curve
::: columns
::: {.column width="50%"}
```{r}
#| echo: true
roc <- roc_curve(dt_pred_outcome,
                 truth,
                 .pred_Abnormal)
head(roc, 9)
#autoplot(roc)
```
:::
::: {.column width="50%"}
```{r}
#| echo: false
#| fig-width: 5
#| fig-height: 5
autoplot(roc)
```
:::
:::
## Assess calibration
```{r}
#| echo: true
calibration_obj <- caret::calibration(truth ~ .pred_Abnormal,
                                      data = dt_pred_outcome,
                                      cuts = 6)
ggplot(calibration_obj)
```
::: columns
::: {.column width="60%"}
```{r}
#| echo: false
ggplot(calibration_obj)
```
:::
::: {.column width="40%"}
- For a perfectly calibrated model, the bin midpoints fall on the diagonal line
- Interpretation: risk predictions above \~12% may be overestimates
:::
:::
## Plot variable importance
```{r}
#| echo: true
rf_test_results |>
extract_fit_parsnip() |>
vip(num_features = 10)
```
## Model development takeaways
- Separate data for training, validation/comparison/selection, and testing to avoid over-/under-fitting
- An unbiased performance estimate comes from test data not used for training or selection
- In most applied projects, rigorous model development \>\>\> in-depth understanding of algorithms
## Agenda
- Types of supervised learning
- Model development and selection
- **Clinically useful models**
## Switching to Powerpoint
<https://github.com/altonrus/EMD601_ML_guest_lecture/blob/master/clinically_useful_ML_prediction.pptx>