---
title: 'Lab 7: Challenge'
author: "Nathan Diekema"
date: "11/15/2021"
output:
  rmdformats::downcute:
    lightbox: true
    self_contained: true
    gallery: true
    toc_depth: 3
    highlight: github
    df_print: paged #kable
    code_folding: hide
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  fig.align = "center",
  fig.height = 5,
  fig.width = 10,
  message = FALSE,
  warning = FALSE
)
```
**Load Libraries & Data**
```{r}
library(tidyverse)
library(tidymodels)
library(kknn)
library(ISLR)
library(DT)
ha <- read_csv("https://www.dropbox.com/s/aohbr6yb9ifmc8w/heart_attack.csv?dl=1")
ha
```
**Data Exploration/Cleaning**
```{r}
set.seed(2012)

# Convert categorical columns to factors
ha <- ha %>%
  mutate(
    sex = as.factor(sex),
    cp = as.factor(cp),
    restecg = as.factor(restecg),
    output = as.factor(output)
  )
```
**Setting up the model**\
```{r}
# Logistic Classification Model
log_model <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm")

log_rec <- recipe(output ~ age + chol + sex + cp + thalach + trtbps, data = ha) %>%
  step_dummy(sex) %>%
  step_dummy(cp)

log_wflow <- workflow() %>%
  add_recipe(log_rec) %>%
  add_model(log_model)

# Fit the model to training data
log_fit <- fit(log_wflow, data = ha)
# extract_fit_parsnip() replaces the deprecated pull_workflow_fit()
log_fit %>% extract_fit_parsnip()

# KNN Model
knn_model <- nearest_neighbor(neighbors = 75) %>%
  set_mode("classification") %>%
  set_engine("kknn")

knn_rec <- recipe(output ~ age + sex + cp + thalach, data = ha) %>%
  step_dummy(sex) %>%
  step_dummy(cp) %>%
  step_normalize(all_numeric())

knn_wflow <- workflow() %>%
  add_recipe(knn_rec) %>%
  add_model(knn_model)

# Fit the model to training data
knn_fit <- fit(knn_wflow, data = ha)
knn_fit %>% extract_fit_parsnip()
```
**Load in validation data**\
```{r}
ha_validation <- read_csv("https://www.dropbox.com/s/jkwqdiyx6o6oad0/heart_attack_validation.csv?dl=1")
# Apply the same factor conversions as for the training data
ha_validation <- ha_validation %>%
  mutate(
    sex = as.factor(sex),
    cp = as.factor(cp),
    restecg = as.factor(restecg),
    output = as.factor(output)
  )
ha_validation
```
**Get predictions and organize into dataframes**\
```{r}
# Logistic model: true labels, predicted class, and class probabilities
pred_log_val <- ha_validation %>%
  select(output) %>%
  bind_cols(
    predict(log_fit, ha_validation),
    predict(log_fit, ha_validation, type = "prob")
  ) %>%
  rename(truth = output, predicted = .pred_class)

# KNN model: true labels, predicted class, and class probabilities
pred_knn_val <- ha_validation %>%
  select(output) %>%
  bind_cols(
    predict(knn_fit, ha_validation),
    predict(knn_fit, ha_validation, type = "prob")
  ) %>%
  rename(truth = output, predicted = .pred_class)
```
# Challenge: Cohen’s Kappa
Use online resources to research this measurement. Calculate it for the models from Part One, Q1-2, and discuss reasons or scenarios that would make us prefer to use this metric as our measure of model success. Do your conclusions from above change if you judge your models using Cohen’s Kappa instead? Does this make sense?
```{r}
# Cohen's Kappa on the validation set for both models
df <- data.frame(rbind(
  kap(pred_log_val, truth = truth, estimate = predicted),
  kap(pred_knn_val, truth = truth, estimate = predicted)
))
rownames(df) <- c("Logistic Model", "KNN Model")
df
```
Cohen's Kappa measures how much better the model's predictions agree with the true labels than would be expected from chance agreement alone. This makes it a useful metric when classes are imbalanced, because overall accuracy can look high for a model that simply favors the majority class. The value ranges from -1 to 1: values at or below 0 indicate a classifier that does no better than chance, and values closer to 1 indicate a progressively more effective classifier. The table above shows the Cohen's Kappa values for each of our classifiers. The logistic model performed better, with a value of 0.524, which is typically interpreted as moderate agreement. The KNN model achieved a value of 0.43, which indicates it is a somewhat weaker classifier than the logistic model. This aligns with what we determined earlier: the logistic model has better overall accuracy, precision, and specificity. I'm guessing that if I were to reduce the number of neighbors k in the KNN model, this metric would increase. The only reason k is so high is that I was trying to maximize roc_auc when building the models.
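
To make the chance correction concrete, here is a minimal base R sketch of the unweighted Cohen's Kappa, computed as (p_o - p_e) / (1 - p_e), where p_o is the observed accuracy and p_e is the agreement expected by chance from the row and column totals of the confusion matrix. The helper `cohen_kappa()` and the toy vectors are hypothetical and only for illustration; they are not part of the lab data. The result should agree with `kap()` in the unweighted case.

```{r}
# Hypothetical helper (not from yardstick): unweighted Cohen's Kappa
cohen_kappa <- function(truth, predicted) {
  # Use a shared set of levels so the confusion matrix is square
  lv  <- union(levels(factor(truth)), levels(factor(predicted)))
  tab <- table(factor(truth, levels = lv), factor(predicted, levels = lv))
  n   <- sum(tab)
  p_o <- sum(diag(tab)) / n                        # observed agreement (accuracy)
  p_e <- sum(rowSums(tab) * colSums(tab)) / n^2    # agreement expected by chance
  (p_o - p_e) / (1 - p_e)
}

# Toy data (made up): 6 positives, 4 negatives, 8 of 10 predicted correctly
toy_truth <- factor(c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0))
toy_pred  <- factor(c(1, 1, 1, 1, 1, 0, 0, 0, 0, 1))
toy_kappa <- cohen_kappa(toy_truth, toy_pred)      # accuracy 0.8, kappa about 0.583

# Imbalanced toy data (made up): always predicting the majority class
imb_truth <- factor(c(rep(1, 9), 0))
imb_pred  <- factor(rep(1, 10))
imb_acc   <- mean(as.character(imb_truth) == as.character(imb_pred))  # 0.9
imb_kappa <- cohen_kappa(imb_truth, imb_pred)      # 0: no better than chance
```

The second case is where Kappa and accuracy disagree: a model that always predicts the majority class scores 90% accuracy but a Kappa of 0.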