-
Notifications
You must be signed in to change notification settings - Fork 5
/
05_case_study.Rmd
395 lines (288 loc) · 10.5 KB
/
05_case_study.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
---
title: "A predictive modeling case study"
output:
html_document:
toc: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
options(tibble.print_min = 5)
```
Get started with building a model in this R Markdown document that accompanies [A predictive modeling case study](https://www.tidymodels.org/start/case-study/) tidymodels start article.
If you ever get lost, you can visit the links provided next to section headers to see the accompanying section in the online article.
Take advantage of the RStudio IDE and use "Run All Chunks Above" or "Run Current Chunk" buttons to easily execute code chunks. If you have been running other tidymodels articles in this project, restart R before working on this article so you don't run out of memory on RStudio Cloud.
## [Introduction](https://www.tidymodels.org/start/models/#intro)
Let's put everything we learned from each of the previous [Get Started](https://www.tidymodels.org/start/) articles together and build a predictive model from beginning to end with data on hotel stays.
Load necessary packages:
```{r}
library(tidymodels)
# Helper packages
library(readr) # for importing data
library(vip) # for variable importance plots
```
## [The Hotel Bookings Data](https://www.tidymodels.org/start/case-study/#data)
Let's read the hotel data into R and randomly select 30% of the rows in the data set to avoid long computation times later on.
Note that your results will differ from the original article, since you are only using 30% of the data.
```{r}
# Fix the random numbers by setting the seed
# This enables the analysis to be reproducible when random numbers are used
set.seed(123)
hotels <-
read_csv('https://tidymodels.org/start/case-study/hotels.csv') %>%
mutate_if(is.character, as.factor) %>%
# randomly select rows
slice_sample(prop = 0.30)
dim(hotels)
glimpse(hotels)
```
Let's look at proportions of hotel stays that include children and/or babies:
```{r}
hotels %>%
count(children) %>%
mutate(prop = n/sum(n))
```
## [A first model: penalized logistic regression](https://www.tidymodels.org/start/case-study/#first-model)
Do you recall [Evaluate your model with resampling](/start/resampling/#data-split) article for data splitting?
Let's reserve 25% of the `hotels` data for the test set:
```{r}
set.seed(123)
splits <- initial_split(hotels, strata = children)
hotel_other <- training(splits)
hotel_test <- testing(splits)
# training set proportions by children
hotel_other %>%
count(children) %>%
mutate(prop = n/sum(n))
# test set proportions by children
hotel_test %>%
count(children) %>%
mutate(prop = n/sum(n))
```
Now let's reserve another 20% of the `hotel_other` for our validation set.
```{r}
set.seed(234)
val_set <- validation_split(hotel_other,
strata = children,
prop = 0.80)
val_set
```
### Build the model
Let's specify a penalized logistic regression model using the lasso method.
Note that we define `penalty = tune()` so we can tune it in the next steps, and since we are using lasso method, we set `mixture = 1`.
```{r}
lr_mod <-
logistic_reg(penalty = tune(), mixture = 1) %>%
set_engine("glmnet")
```
For more details try typing `?logistic_reg` on the console.
### Create the recipe
Remember the second article [Preprocess your data with recipes](https://www.tidymodels.org/start/recipes)?
Let's preprocess the data by creating a recipe:
```{r}
holidays <- c("AllSouls", "AshWednesday", "ChristmasEve", "Easter",
"ChristmasDay", "GoodFriday", "NewYearsDay", "PalmSunday")
lr_recipe <-
recipe(children ~ ., data = hotel_other) %>%
step_date(arrival_date) %>%
step_holiday(arrival_date, holidays = holidays) %>%
step_rm(arrival_date) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_zv(all_predictors()) %>%
step_normalize(all_predictors())
```
### Create the workflow
Let's bundle the model and recipe into a single `workflow()`:
```{r}
lr_workflow <-
workflow() %>%
add_model(lr_mod) %>%
add_recipe(lr_recipe)
```
### Create the grid for tuning
We can now tune our model, similar to what is shown in the previous article [Tune model parameters](https://www.tidymodels.org/start/tuning).
Let's create a grid with 30 values for the hyperparameter we would like to tune:
```{r}
lr_reg_grid <- tibble(penalty = 10^seq(-4, -1, length.out = 30))
lr_reg_grid %>% top_n(-5) # lowest penalty values
lr_reg_grid %>% top_n(5) # highest penalty values
```
### Train and tune the model
Let's train all these logistic regression models with 30 different hyperparameter values.
We provide the validation set `val_set`, so model diagnostics computed on `val_set` will be available after the fit.
```{r}
lr_res <-
lr_workflow %>%
tune_grid(val_set,
grid = lr_reg_grid,
control = control_grid(save_pred = TRUE, verbose = TRUE),
metrics = metric_set(roc_auc))
lr_res
```
Now visualize the validation set metrics by plotting the area under the ROC curve against the range of penalty values:
```{r}
lr_plot <-
lr_res %>%
collect_metrics() %>%
ggplot(aes(x = penalty, y = mean)) +
geom_point() +
geom_line() +
ylab("Area under the ROC Curve") +
scale_x_log10(labels = scales::label_number())
lr_plot
```
Get the best values for this hyperparameter:
```{r}
top_models <-
lr_res %>%
show_best("roc_auc", n = 15) %>%
arrange(penalty)
top_models
```
Let's pick candidate model 12 with a penalty value of `0.00137`:
Note that because you are using less data, your mean ROC AUC will be slightly lower than what's shown in the article.
```{r}
lr_best <-
lr_res %>%
collect_metrics() %>%
arrange(penalty) %>%
slice(12)
lr_best
```
And visualize the validation set ROC curve:
```{r}
lr_auc <-
lr_res %>%
collect_predictions(parameters = lr_best) %>%
roc_curve(children, .pred_children) %>%
mutate(model = "Logistic Regression")
autoplot(lr_auc)
```
## [A second model: tree-based ensemble](https://www.tidymodels.org/start/case-study/#second-model)
Let's try to improve our prediction performance by using a random forest model (model *type*), which we also explored in the [Evaluate your model with resampling](https://www.tidymodels.org/start/resampling/) article.
Check number of cores to work with:
```{r}
cores <- parallel::detectCores()
cores
```
Set model specification and provide number of cores for parallelization while tuning.
```{r}
rf_mod <-
rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
set_engine("ranger", num.threads = cores) %>%
set_mode("classification")
```
### Create the recipe and workflow
Let's create the recipe for the model.
```{r}
rf_recipe <-
recipe(children ~ ., data = hotel_other) %>%
step_date(arrival_date) %>%
step_holiday(arrival_date) %>%
step_rm(arrival_date)
```
Then bundle it with the model specification:
```{r}
rf_workflow <-
workflow() %>%
add_model(rf_mod) %>%
add_recipe(rf_recipe)
```
### Train and tune the model
When we set up our parsnip model, we chose two hyperparameters for tuning:
```{r}
rf_mod
# show what will be tuned
rf_mod %>%
parameters()
```
We will use a space-filling design to tune with 12 candidate models (instead of 25, to reduce computation time).
Be patient here! Computing these results will take several minutes to complete if you are using the default RStudio Cloud resources (1 GB memory, 1 CPU).
```{r}
set.seed(345)
rf_res <-
rf_workflow %>%
tune_grid(val_set,
grid = 12,
control = control_grid(save_pred = TRUE, verbose = TRUE),
metrics = metric_set(roc_auc))
```
Here are our top 5 random forest models, out of the 12 candidates:
Note that your results will be different and your accuracy will take a small hit since you are using less data (only 25% of the whole data set) and setting up a smaller grid to tune model hyperparameters.
```{r}
rf_res %>%
show_best(metric = "roc_auc")
```
But we're already getting much better results than our penalized logistic regression!
Let's plot the results:
```{r}
autoplot(rf_res)
```
Let's select the best model according to the ROC AUC metric. Our final tuning parameter values are:
```{r rf-best}
rf_best <-
rf_res %>%
select_best(metric = "roc_auc")
rf_best
```
Collect predictions for the best model:
Note that we simply provide our best model's parameter values `rf_best` to subset it from a whole list of tuned models.
```{r}
rf_res %>%
collect_predictions()
rf_auc <-
rf_res %>%
collect_predictions(parameters = rf_best) %>%
roc_curve(children, .pred_children) %>%
mutate(model = "Random Forest")
```
Now, we can compare the validation set ROC curves for our top penalized logistic regression model and random forest model:
```{r}
bind_rows(rf_auc, lr_auc) %>%
ggplot(aes(x = 1 - specificity, y = sensitivity, col = model)) +
geom_path(lwd = 1.5, alpha = 0.8) +
geom_abline(lty = 3) +
coord_equal() +
scale_color_viridis_d(option = "plasma", end = .6)
```
The random forest is uniformly better across event probability thresholds.
## [The last fit](https://www.tidymodels.org/start/case-study/#last-fit)
Let's evaluate the model performance one last time with the held-out test set.
We'll start by building our parsnip model object again from scratch with our best hyperparameter values from our random forest model:
```{r}
# the last model
last_rf_mod <-
rand_forest(mtry = 8, min_n = 7, trees = 1000) %>%
set_engine("ranger", num.threads = cores, importance = "impurity") %>%
set_mode("classification")
# the last workflow
last_rf_workflow <-
rf_workflow %>%
update_model(last_rf_mod)
# the last fit
set.seed(345)
last_rf_fit <-
last_rf_workflow %>%
last_fit(splits)
last_rf_fit
```
Note that we added a new argument `importance = "impurity"` to `set_engine` to get variable importance scores. This is an optional, engine-specific argument. To see its documentation, you need to read the documentation for the underlying `ranger()` function. To see it and other options, type `?ranger` in console.
Now let's collect the metrics:
```{r}
last_rf_fit %>%
collect_metrics()
```
Now let's `pluck` the workflow and pull out the fit and visualize the variable importance scores for the top 20 features:
```{r}
last_rf_fit %>%
pluck(".workflow", 1) %>%
pull_workflow_fit() %>%
vip(num_features = 20)
```
Let's generate our last ROC curve to visualize:
```{r}
last_rf_fit %>%
collect_predictions() %>%
roc_curve(children, .pred_children) %>%
autoplot()
```
Not bad!