---
title: "Classification in R"
author: "Dani Cosme"
date: "February 7, 2018"
output:
  md_document:
    preserve_yaml: true
    toc: true
    toc_depth: 2
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache = TRUE)
```
# Classification
Last week we focused on using machine learning to predict a continuous outcome. This week we'll concentrate on predicting, or classifying, a dichotomous variable. We'll compare two different classification algorithms:
* Support vector machine (SVM) classifier
* LASSO logistic regression
## Example using SVM classification in caret
Today we'll use a [dataset from Kaggle](https://www.kaggle.com/jboysen/us-perm-visas) with lots of information about US permanent visa applications. The goal is to predict the outcome of an application from a bunch of variables.
#### Getting set up
First, let's load some packages.
```{r}
# load up our packages
installif <- function(packages){
  # give me a character vector of packages
  for(p in packages){
    # and I will check if we have them
    if(!require(p, character.only = TRUE)){
      # install them if we don't
      install.packages(p)
    }
    # and load them if all goes well.
    require(p, character.only = TRUE)
  }
}
package_list <- c("tidyverse", "caret", "ROCR", "pROC", "knitr", "glmnet", "e1071")
installif(package_list)
```
Next, let's load the data.
```{r}
# clear the environment, just to be safe
rm(list=ls())
# load data
data = read.csv('files/us_perm_visas_10K.csv', stringsAsFactors = FALSE)
```
Time to tidy the data.
```{r}
# remove columns with no useful data (i.e. all observations are NA or empty strings)
data <- data[,(colSums(is.na(data)) < nrow(data)) & (colSums(data == '') < nrow(data) | is.na(colSums(data == '')))]
# change text to upper case, select unique rows, recode dollar amounts, and move the outcome variable to the first column
data = data %>%
  as_tibble() %>%
  mutate(pw_amount_9089 = gsub(",", "", pw_amount_9089)) %>%
  extract(pw_amount_9089, "pw_amount_9089", regex = "([0-9]{1,}).00") %>%
  mutate(pw_amount_9089 = as.integer(pw_amount_9089)) %>%
  mutate_if(is.character, funs(toupper)) %>%
  mutate_if(is.character, as.factor) %>%
  unique(.) %>%
  select(case_status, everything())
# convert the date columns to Date class
data$case_received_date <- as.Date(data$case_received_date)
data$decision_date <- as.Date(data$decision_date)
```
For simplicity, let's limit ourselves to cases that are either certified or denied.
```{r}
# check levels
levels(data$case_status)
# remove withdrawn cases and relevel
data.relevel = data %>%
  filter(!case_status %in% "WITHDRAWN") %>%
  mutate(case_status = as.factor(as.character(case_status)))
# double check that it worked
levels(data.relevel$case_status)
```
Let's take a look at the predictor variables.
```{r}
head(data.relevel)
```
For the sake of time, we're going to reduce the data in a couple of ways. There are more sophisticated ways to reduce the dimensionality while retaining the information in the predictors (e.g. using PCA; a sketch follows the next chunk), but here I'm simply going to select 10 relatively complete, potentially useful predictors.
```{r}
data.chop <- data.relevel[,c("case_status","case_received_date", "decision_date", "job_info_training",
"job_info_foreign_ed", "employer_num_employees", "job_info_foreign_lang_req",
"foreign_worker_info_education", "fw_info_rel_occup_exp",
"fw_info_req_experience", "job_info_experience")]
data.chop$case_received_date <- as.Date(data.chop$case_received_date)
data.chop$decision_date <- as.Date(data.chop$decision_date)
data.chop <- data.chop[complete.cases(data.chop),]
```
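As an aside, here's roughly what a PCA-based alternative could look like using caret's `preProcess()`. This is only a sketch (not run), and it assumes you'd restrict yourself to the numeric predictors; `thresh = 0.95` keeps enough components to explain 95% of the variance.
```{r, eval=FALSE}
# sketch only: derive principal components from the numeric predictors
num.cols = sapply(data.relevel, is.numeric)
data.num = na.omit(data.relevel[, num.cols])
pca.pre = preProcess(data.num, method = c("center", "scale", "pca"), thresh = 0.95)
data.pca = predict(pca.pre, data.num)
```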
Next, let's reduce the number of observations we have while still making sure that we have enough observations in each outcome category.
Let's check the base rate of each outcome.
```{r}
round(table(data.chop$case_status)/nrow(data.chop),3)
```
There are very few denials (lucky applicants!), but such a skewed base rate will cause problems for us later. We're going to rebalance the data by keeping all of the denied cases and sampling a subset of the certified cases, so that we end up with a 10% denial rate.
```{r}
set.seed(6523)
n_denied <- nrow(data.chop[data.chop$case_status == "DENIED",])
# sample separately within each level
cert = data.chop %>%
  filter(case_status %in% "CERTIFIED") %>%
  sample_n(n_denied*9)
# keep all of the denied (minority class) cases
den = data.chop %>%
  filter(case_status %in% "DENIED")
# join samples
data.ml = bind_rows(cert, den)
# check proportions
round(table(data.ml$case_status)/nrow(data.ml),3)
```
#### Overview of steps
1. Split the data into training and test samples
2. Set training parameters (e.g. number of k-folds)
3. Run model
4. Inspect fit indices and accuracy
5. Adjust model (if necessary)
6. Apply model to test data to assess out-of-sample accuracy
#### Splitting the data
We want to both develop a model and assess how well it can predict application status in a separate sample, so we'll split our data into training and test datasets. Let's use 75% of the data in the training sample and the remaining 25% in the test sample.
To do this, we'll use the `createDataPartition()` function in caret, which samples randomly within each level of our outcome variable. This way we have the same proportion of outcomes in the training and test samples.
```{r}
# set a seed so we all get the same dataframes
set.seed(6523)
# split the data based on the outcome case_status
in.train = createDataPartition(y = data.ml$case_status,
                               p = .75,
                               list = FALSE)
# check that it's actually 75%
nrow(in.train) / nrow(data.ml)
# subset the training data
training = data.ml[in.train,]
# subset the test data (i.e. not in `in.train`)
test = data.ml[-in.train,]
# check proportions
round(table(training$case_status)/nrow(training),3)
round(table(test$case_status)/nrow(test),3)
```
Before we begin training the classifier, let's set up our training parameters using `trainControl()`. For the sake of time, we'll use 3-fold cross-validation. In practice you may want more folds and/or to repeat the k-fold cross-validation with several different random splits; to do that you'd specify `method = "repeatedcv"` and `repeats = [n repeats]` (a sketch follows the next chunk). We also ask for the class probabilities and save the predictions so that we can plot ROC curves later.
```{r}
train.control = trainControl(method = "cv",
                             number = 3,
                             classProbs = TRUE,
                             savePredictions = TRUE)
```
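For reference, a repeated cross-validation setup might look like this; it's just a sketch (not run), assuming 3 folds repeated 5 times.
```{r, eval=FALSE}
# sketch only: 3-fold CV repeated 5 times
train.control.repeated = trainControl(method = "repeatedcv",
                                      number = 3,
                                      repeats = 5,
                                      classProbs = TRUE,
                                      savePredictions = TRUE)
```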
#### Train the model
Now, we'll train a support vector machine (i.e. `method = "svmLinear"`) to predict our outcome `case_status` from all variables in the training dataset (i.e. `case_status ~ .`) using the training parameters we specified above (i.e. `train.control`). The rest of the inputs are as follows:
* `na.action = na.pass` allows NAs to pass through the model without crashing
* `preProcess = c("center", "scale")` centers and scales the predictors so that they're on the same scale
* `metric = "Accuracy"` means that we'll use accuracy to select the optimal model
```{r}
fit.svc <- train(case_status ~ .,
                 data = training,
                 method = "svmLinear",
                 trControl = train.control,
                 na.action = na.pass,
                 preProcess = c("center", "scale"),
                 metric = "Accuracy")
# fit.svc = readRDS("files/fit.svc") # if you try to run the model and it's taking too long, you can load the data with this command
# saveRDS(fit.svc, "files/fit.svc") # code to save the model
```
#### Assess the model
First, let's check the mean accuracy of the model across cross-validation folds.
```{r}
fit.svc
```
Let's check out the classification accuracy and kappa values on each fold.
```{r}
fit.svc$resample
```
Let's unpack this a little further and look at our false positive and false negative rates using the confusion matrix.
```{r}
confusionMatrix(fit.svc)
```
We can also visualize this by looking at the receiver operating characteristic (ROC) curve.
```{r}
plot(roc(predictor = fit.svc$pred$CERTIFIED, response = fit.svc$pred$obs))
```
#### Optimize the model
To try to improve our accuracy, we can adjust various parameters.
First, let's change the selection metric from accuracy. Ideally we'd select on ROC to balance sensitivity and specificity, but caret only exposes `"ROC"` as a metric when `trainControl()` includes `summaryFunction = twoClassSummary` (see the sketch after this chunk). Since we didn't set that above, we'll use Cohen's kappa instead, which at least accounts for chance agreement.
```{r}
fit.svc.roc <- train(case_status ~ .,
                     data = training,
                     method = "svmLinear",
                     trControl = train.control,
                     na.action = na.pass,
                     preProcess = c("center", "scale"),
                     metric = "Kappa") # "ROC" requires summaryFunction = twoClassSummary in trainControl(), so we use Kappa here
# fit.svc.roc = readRDS("files/fit.svc.roc") # if you try to run the model and it's taking too long, you can load the data with this command
# saveRDS(fit.svc.roc, "files/fit.svc.roc") # code to save the model
```
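If you do want to select models by ROC, you can ask `trainControl()` to compute it via `twoClassSummary()`. Here's a sketch (not run) of what that might look like with the same training data.
```{r, eval=FALSE}
# sketch only: make "ROC" available as a selection metric
train.control.auc = trainControl(method = "cv",
                                 number = 3,
                                 classProbs = TRUE,
                                 savePredictions = TRUE,
                                 summaryFunction = twoClassSummary)
fit.svc.auc = train(case_status ~ .,
                    data = training,
                    method = "svmLinear",
                    trControl = train.control.auc,
                    na.action = na.pass,
                    preProcess = c("center", "scale"),
                    metric = "ROC")
```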
Let's compare the accuracy and kappa values of the two models.
```{r}
fit.table = bind_rows(fit.svc$results, fit.svc.roc$results)
row.names(fit.table) = c("Accuracy", "ROC")
kable(fit.table, format = "pandoc", digits = 3)
```
Now let's plot the ROC curves for both models.
```{r}
roc1 = roc(predictor = fit.svc$pred$CERTIFIED, response = fit.svc$pred$obs)
roc2 = roc(predictor = fit.svc.roc$pred$CERTIFIED, response = fit.svc.roc$pred$obs)
plot(roc1, col = 1, lty = 1, main = "ROC")
plot(roc2, col = 4, lty = 1, add = TRUE)
legend("bottomright", legend = c("ACC", "KAPPA"), col = c(1, 4), lty = c(1, 1), bty = "n")
```
Using the first model we ran, let's tune the cost parameter C.
```{r, warning=FALSE}
# specify different values of the cost parameter C to try
# (note: C = 0 is not a meaningful cost and can produce warnings/failed fits, hence warning=FALSE)
grid = expand.grid(C = c(0, 0.01, 0.05, 0.25, 0.75, 1, 1.5, 2, 5))
fit.svc.tune = train(case_status ~ .,
                     data = training,
                     method = "svmLinear",
                     trControl = train.control,
                     na.action = na.pass,
                     preProcess = c("center", "scale"),
                     metric = "Accuracy",
                     tuneGrid = grid,
                     tuneLength = 10) # note: tuneLength is ignored when tuneGrid is supplied
# fit.svc.tune = readRDS("files/fit.svc.tune") # if you try to run the model and it's taking too long, you can load the data with this command
# saveRDS(fit.svc.tune, "files/fit.svc.tune") # code to save the model
```
Let's check the model results
```{r}
fit.svc.tune
```
And plot the accuracy as a function of the cost parameter C
```{r}
plot(fit.svc.tune)
```
Now let's plot the ROC curves for all three models.
```{r}
roc1 = roc(predictor = fit.svc$pred$CERTIFIED, response = fit.svc$pred$obs)
roc2 = roc(predictor = fit.svc.roc$pred$CERTIFIED, response = fit.svc.roc$pred$obs)
roc3 = roc(predictor = fit.svc.tune$pred$CERTIFIED, response = fit.svc.tune$pred$obs)
plot(roc1, col = 1, lty = 1, main = "ROC")
plot(roc2, col = 4, lty = 1, add = TRUE)
plot(roc3, col = 2, lty = 2, add = TRUE)
legend("bottomright", legend = c("ACC", "KAPPA", "TUNED"), col = c(1, 4, 2), lty = c(1, 1, 2), bty = "n")
```
And compare accuracy and kappa values across the three models.
```{r}
fit.table = bind_rows(fit.svc$results, fit.svc.roc$results, filter(fit.svc.tune$results, C == .05))
row.names(fit.table) = c("Accuracy", "ROC", "Tuned")
kable(fit.table, format = "pandoc", digits = 3)
```
#### Test in holdout sample
Let's apply the best fitting model to the test data and see how well it performs in a new sample
```{r}
# get predicted values for the test data
test.pred = predict(fit.svc, newdata = test)
```
To assess the performance, let's check out the confusion matrix
```{r}
confusionMatrix(test.pred, test$case_status)
```
So how well is this model really performing? Looking at the No Information Rate and the associated p-value, we see that while our model has good overall accuracy, it actually isn't significantly better than simply guessing based on the base rates of the classes.
More on how to interpret the metrics in the confusion matrix [here](https://www.hranalytics101.com/how-to-assess-model-accuracy-the-basics/#confusion-matrix-and-the-no-information-rate).
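To make the No Information Rate concrete, here's a quick hand computation of the same quantities `confusionMatrix()` reports, using the `test.pred` object from above.
```{r}
# the No Information Rate is the accuracy you'd get by always guessing the most common class
nir = max(table(test$case_status)) / nrow(test)
nir
# caret's "Acc > NIR" p-value is a one-sided binomial test of the model's accuracy against that rate
binom.test(sum(test.pred == test$case_status), nrow(test), p = nir, alternative = "greater")
```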
## Example using LASSO logistic regression in glmnet
I'm sure there's a way to do this using caret, but I couldn't find a straightforward answer, so I'm using the glmnet package. The basic concepts are the same, but the syntax is slightly different. We also need to do some additional tidying to get the data in the correct format for glmnet.
We also need to convert any factors to dummy coded variables in the training and test data (which caret does internally). We'll do that using `dummyVars()`.
```{r}
# dummy code training data
dummy = dummyVars(" ~ .", data = data.ml[,-1], fullRank = TRUE)
training.dummy = data.frame(predict(dummy, newdata = training))
training.dummy$case_status = training$case_status
training.dummy = training.dummy %>%
  select(case_status, everything())
# dummy code testing data
test.dummy = data.frame(predict(dummy, newdata = test))
test.dummy$case_status = test$case_status
test.dummy = test.dummy %>%
  select(case_status, everything())
# print names
names(training.dummy)
```
```{r}
# subset predictors and criterion and save as matrices
x_train = as.matrix(training.dummy[,-1])
y_train = as.matrix(training.dummy[, 1])
```
Run the logistic regression model with 3 cross-validation folds, an alpha of 1 (i.e. the LASSO penalty; 0 = the ridge penalty), standardized predictors, and area under the ROC curve as our metric. This will allow us to determine which lambda parameter to use.
```{r}
fit.log = cv.glmnet(x_train, y_train,
                    family = 'binomial',
                    alpha = 1,
                    standardize = TRUE,
                    type.measure = 'auc',
                    nfolds = 3)
```
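As an aside, alpha values between 0 and 1 give an elastic net that mixes the ridge and LASSO penalties. A sketch (not run), assuming the same training matrices:
```{r, eval=FALSE}
# sketch only: elastic net with an even mix of ridge and LASSO penalties
fit.enet = cv.glmnet(x_train, y_train,
                     family = 'binomial',
                     alpha = 0.5,
                     standardize = TRUE,
                     type.measure = 'auc',
                     nfolds = 3)
```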
Plot lambda versus the AUC fit metric and print the best lambda values. `lambda.min` is the lambda with the best cross-validated AUC; `lambda.1se` is the largest lambda whose performance is within one standard error of that best value (a more conservative, sparser model).
```{r}
# plots
plot(fit.log)
plot(fit.log$glmnet.fit, xvar="lambda", label=TRUE)
# print lambdas
fit.log$lambda.min
fit.log$lambda.1se
```
Let's check the coefficient matrix to see which variables were shrunk to zero.
```{r}
coef(fit.log, s=fit.log$lambda.min)
coef(fit.log, s=fit.log$lambda.1se)
```
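If you just want the names of the predictors the LASSO kept, one way (using the fitted `fit.log` object above) is to pull out the nonzero rows of the coefficient matrix.
```{r}
# list the predictors with nonzero coefficients at lambda.1se
coefs = as.matrix(coef(fit.log, s = fit.log$lambda.1se))
rownames(coefs)[coefs != 0]
```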
Now let's take the lambda selected by that model (we'll use `lambda.1se`) and generate predicted probabilities for the training sample.
```{r}
predicted.log = predict(fit.log, newx = x_train, s=fit.log$lambda.1se, type="response")
```
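The same call applied to the held-out test set would look like this (a sketch, not run here; it uses the `test.dummy` data frame created above).
```{r, eval=FALSE}
# sketch only: predicted probabilities for the held-out test set
x_test = as.matrix(test.dummy[, -1])
predicted.log.test = predict(fit.log, newx = x_test, s = fit.log$lambda.1se, type = "response")
```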
Let's figure out what cut point to use to determine whether a case should be classified as CERTIFIED or DENIED.
```{r}
# plot cutoff v. accuracy
predicted = prediction(predicted.log, y_train, label.ordering = NULL)
perf = performance(predicted, measure = "acc")
perf.df = data.frame(cut = perf@x.values[[1]], acc = perf@y.values[[1]])
ggplot(perf.df, aes(cut, acc)) +
  geom_line()
# plot false v. true positive rate
perf = performance(predicted, measure = "tpr", x.measure = "fpr")
perf.df = data.frame(cut = perf@alpha.values[[1]], fpr = perf@x.values[[1]], tpr = perf@y.values[[1]])
ggplot(perf.df, aes(fpr, tpr)) +
  geom_line()
# plot specificity v. sensitivity (note: x.values hold specificity, y.values hold sensitivity)
perf = performance(predicted, measure = "sens", x.measure = "spec")
perf.df = data.frame(cut = perf@alpha.values[[1]], spec = perf@x.values[[1]], sens = perf@y.values[[1]])
ggplot(perf.df, aes(spec, sens)) +
  geom_line()
ggplot(perf.df, aes(x = cut)) +
  geom_line(aes(y = sens, color = "sensitivity")) +
  geom_line(aes(y = spec, color = "specificity"))
# choose the cutoff that maximizes sensitivity + specificity (Youden's index)
cut = perf@alpha.values[[1]][which.max(perf@x.values[[1]] + perf@y.values[[1]])]
# recode the predicted probabilities based on the cutoff
predicted.cut = ifelse(predicted.log >= cut, "DENIED", "CERTIFIED")
```
Let's take a look at the confusion matrix.
```{r}
# confusionMatrix() expects factors, so convert the character vectors first
confusionMatrix(as.factor(predicted.cut), as.factor(y_train))
```
## Other stuff that should be in this document
### Classifiers
#### Support vector machine classifier
#### LASSO logistic regression
### Concepts
#### Balancing sensitivity and specificity
#### Cut points
#### Tuning
#### Receiver operating characteristic curves
#### Confusion matrices
#### Interpreting weights