---
title: "Random Forest Vignette"
format:
html:
toc: true
toc_float:
collapsed: false
smooth_scroll: true
---
## Authors' Summary
This document is a vignette on random forest algorithms. We discuss the inner workings of random forest models, going as deep as the process by which decision trees decide node splits. We also demonstrate how to implement a random forest model in code and interpret its results for both the classification and regression cases.
For students only interested in learning how to implement the code for random forest models, we recommend going straight to sections "Classification Case: Titanic Survival Prediction" and "Regression Case: Miles Per Gallon Prediction". The section named "Under the Hood of Random Forest" is for students wanting to understand how the random forest algorithm works.
## Table of Contents
1. Under the Hood of Random Forest
2. Classification Case: Titanic Survival Prediction
3. Regression Case: Miles Per Gallon Prediction
4. Random Forest Checklist
5. References
## Under the Hood of Random Forest
This section of our vignette is for those who want to understand the inner workings of random forest algorithms. Note that random forest is an ensemble method, i.e., it combines the results of multiple decision trees, so before we go over a high-level overview of the random forest algorithm, we will first learn about some intuition behind decision trees.
This section will go over the following:
1. Intuition Behind Decision Trees
2. Overview of the Random Forest Algorithm
### Intuition Behind Decision Trees
Note that we will explain decision trees in the context of solving a classification problem. A decision tree is a rule-based algorithm that systematically divides the predictor space using a set of rules, splitting the data points into groups that are increasingly homogeneous with respect to the target variable. The root and inner nodes represent points where a rule is applied to split on a predictor (a decision tree follows a binary tree structure): one branch of a node contains the data points that satisfy the node's rule, while the other contains the data points that violate it.
The goal is to split the data into subgroups that are increasingly homogeneous in the target variable compared to the parent node. This process continues until no more rules can be applied or no data points remain. The nodes at the bottom of the decision tree once the splitting process is over are called terminal or leaf nodes.
The decision tree algorithm attempts to split the data into leaf nodes containing only a single class; such nodes are referred to as pure nodes. Not all the leaf nodes of a decision tree will be completely pure, i.e., some leaf nodes will contain a mix of classes. In that case, a classification is made based on the most common class in the node.
#### How does a decision tree decide how to split?
Let us explain this using an example. Imagine we want to predict a student's exam result based on whether they are taking online classes, their student background, and their working status. To establish the first split, the decision tree algorithm iterates through candidate splits on each predictor to determine which split results in the most homogeneous (pure) nodes, evaluating each candidate with a statistical criterion. Variable selection is done using one of the following approaches:
• Entropy and information gain
• Gini index
It is left to the reader to look further into entropy and information gain; for the purpose of this vignette, we will only explain the use of the Gini index. A reference for entropy and information gain is available at: https://towardsdatascience.com/entropy-and-information-gain-b738ca8abd2a.
Let the following be our toy data set for this example.
![Toy Data set for Decision Tree Example](images/toy-dataset2.png)
This is the formula for calculating Gini index:
![Gini Index Formula](images/gini-index-formula.png){width="386"}
Keep in mind that $j$ indexes the classes, and $p_j$ signifies the proportion of data points belonging to class $j$ within the current node.
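In symbols, then, the Gini index of a node with $J$ classes is:
$$
Gini = 1 - \sum_{j=1}^{J} p_j^2
$$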
Splitting by student background, we get three possible child nodes: Maths, CS, and Other. In "Maths", 2 students pass and 5 fail; in "CS", 4 pass and 0 fail; in "Other", 2 pass and 2 fail.
![Student Background Split Condition](images/bkgrd-tree-ex.jpg)
Let us calculate the Gini index of the child nodes of Student Background.
Maths node: 2P, 5F
$$
Gini_{maths} = 1 - (\frac{2}{7})^2 - (\frac{5}{7})^2 = .4082
$$
CS node: 4P, 0F
$$
Gini_{CS} = 1 - (\frac{4}{4})^2 - (\frac{0}{4})^2 = 0
$$
Other node: 2P, 2F
$$
Gini_{other} = 1 - (\frac{2}{4})^2 - (\frac{2}{4})^2 = .5
$$
The overall Gini index of this split is calculated by taking the weighted average of the 3 nodes.
$$
Gini_{bkgrd} = \frac{7}{15}(.4082) + \frac{4}{15}(0) + \frac{4}{15}(.5) = .3238
$$
Similarly, we will calculate the Gini index for the 'Work Status' and 'Online Courses' predictors.
$$
Gini_{working} = 1 - (\frac{6}{9})^2 - (\frac{3}{9})^2 = .4444
$$
$$
Gini_{not working} = 1 - (\frac{4}{6})^2 - (\frac{2}{6})^2 = .4444
$$
$$
Gini_{workstatus} = \frac{9}{15}(.4444) + \frac{6}{15}(.4444) = .4444
$$
$$
Gini_{online} = 1 - (\frac{4}{8})^2 - (\frac{4}{8})^2 = .5
$$
$$
Gini_{notonline} = 1 - (\frac{3}{7})^2 - (\frac{4}{7})^2 = .4898
$$
$$
Gini_{onlinecourse} = \frac{7}{15}(.4898) + \frac{8}{15}(.5) = .49524
$$
Since the Gini index is lowest for 'Student Background,' this predictor becomes the basis for splitting the root node. This concludes the logic behind how the split conditions for decision nodes are created.
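As a quick check, the calculations above can be reproduced in R. The helper functions below (`gini_node` and `gini_split`) are our own illustrative names, with the class counts hard-coded from the toy data set:
```{r}
# Gini index of a single node, given the class counts it contains
gini_node <- function(counts) 1 - sum((counts / sum(counts))^2)

# weighted Gini index of a split, given one count vector per child node
gini_split <- function(children) {
  sizes <- sapply(children, sum)
  sum(sizes / sum(sizes) * sapply(children, gini_node))
}

gini_split(list(c(2, 5), c(4, 0), c(2, 2)))  # Student Background: 0.3238
gini_split(list(c(6, 3), c(4, 2)))           # Work Status: 0.4444
gini_split(list(c(4, 4), c(3, 4)))           # Online Courses: 0.4952
```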
### Overview of Random Forest Algorithm
The random forest algorithm follows these steps:
1. Take the original data set and create $N$ bootstrapped samples of size $n$ by sampling with replacement (commonly $n$ equals the size of the original data set, though some implementations subsample so that $n$ is smaller).
2. Train a decision tree on each bootstrapped sample; at each split, consider only a random subset of the predictors and determine the best split among them using an impurity measure such as the Gini index or entropy.
3. Create a prediction by aggregating the results of all the trees, as sketched below: in the classification case, take the majority vote across all trees, and in the regression case, take the average across all trees.
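To make these steps concrete, here is a minimal sketch of the procedure in base R using `rpart` trees. This is for intuition only: the function names and defaults are ours, the sketch assumes `newdata` has more than one row, and for simplicity the predictor subset is sampled once per tree rather than at every split (as implementations like ranger do). The tidymodels code later in this vignette is what we actually use in practice.
```{r}
library(rpart)

# Steps 1 and 2: train N trees, each on a bootstrapped sample, using a
# random subset of the predictors (sampled once per tree for simplicity)
simple_rf <- function(df, target, N = 100, mtry = 2) {
  predictors <- setdiff(names(df), target)
  lapply(seq_len(N), function(i) {
    boot <- df[sample(nrow(df), replace = TRUE), ]  # bootstrapped sample
    vars <- sample(predictors, mtry)                # random predictor subset
    rpart(reformulate(vars, response = target),
          data = boot, method = "class")            # fit one classification tree
  })
}

# Step 3: aggregate the trees' predictions by majority vote (classification case)
predict_rf <- function(trees, newdata) {
  votes <- sapply(trees, function(t)
    as.character(predict(t, newdata, type = "class")))
  apply(votes, 1, function(v) names(which.max(table(v))))
}
```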
#### Bias-Variance of Random Forest Algorithms
A model with low bias does well at capturing the true relationships between a data set's predictors and target variable, and a model with low variance generalizes well to new input data.
Note that the bias-variance trade-off is an important concept in machine learning. High bias is the error typical of models that underfit, and high variance is the error typical of models that overfit. Our goal is to find the best balance between the two: a model that is neither too simple nor too complex.
Decision trees tend to overfit the data, and as a result they have low bias and high variance. Note that the decision trees of a random forest are only weakly correlated with one another (since each is trained on a different bootstrapped sample and considers a random subset of predictors at each split), so by aggregating their predictions, we are able to average out this overfitting and reduce variance.
Note that increasing the number of trees in a random forest model retains low bias and decreases variance, but it also increases computational cost and run-time.
#### Random Forest Assumptions
One of the main assumptions of random forests is that the decision trees of the ensemble are uncorrelated, i.e., that the features of the data set are independent and that feature selection for creating splits is random. If the decision trees were correlated, then aggregating them would not reduce variance.
## Classification Case: Titanic Survival Prediction
### Prerequisites
Copy and paste the following block of code into a new script to load the required packages and data used for this example. If an error appears, then you likely don't have one of the libraries installed.
```{r, message=FALSE, warning=FALSE}
library(ISLR)
library(ISLR2)
library(tidyverse)
library(tidymodels)
library(forcats)
library(ggthemes)
library(naniar)
library(corrplot)
library(corrr)
library(klaR)
library(ggplot2)
library(vip)
tidymodels_prefer()
titanic_data <- read.csv('data/titanic.csv')
titanic_data_1 <- titanic_data %>%
  mutate(survived = factor(survived, levels = c("Yes", "No"))) %>%
  mutate(pclass = factor(pclass))
```
### Partition
In the process of model construction, a crucial step involves partitioning the data into training and testing sets. The model originally learns patterns from the training set, while the testing set serves as a benchmark to test the performance of the model on unseen data.
The `initial_split(titanic_data_1, strata = survived, prop = 0.7)` function allocates 70% of the Titanic data set to a training set and the other 30% to a testing set while stratifying by the `survived` variable.
Note that `training(partition)` and `testing(partition)` are used to retrieve the training and testing sets, respectively.
```{r}
set.seed(3435)
partition <- initial_split(titanic_data_1, strata = survived, prop = 0.7)
train_set <- training(partition)
test_set <- testing(partition)
```
Note that we need to stratify by `survived` because there is an uneven proportion of people who survived the Titanic versus those who did not.
```{r, echo=FALSE}
train_set %>%
  ggplot(aes(x = factor(survived), fill = factor(survived))) +
  geom_bar(color = "black", show.legend = FALSE) +
  labs(title = "Distribution of Survival in Training Set",
       x = "Survived",
       y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 1))
```
### K-Folds Cross Validation
K-fold cross validation allows us to train and evaluate the performance of our model on $k$ different partitions of the training set, reducing the risk of overfitting. The `vfold_cv(train_set, v = 5, strata = "survived")` function will create 5 training folds of our training set while stratifying by the `survived` variable. It is left to the reader to look further into k-fold cross validation; further information is available at: https://machinelearningmastery.com/k-fold-cross-validation/.
```{r}
train_folds <- vfold_cv(train_set, v = 5, strata = "survived")
```
### Data Preparation
We will preprocess our data using a recipe from tidymodels. Building a recipe will allow us to provide instructions for preparing and transforming the data before using it to train our model.
The `recipe()` function will initialize the creation of a recipe, setting `survived` as the target variable and `pclass`, `sex`, `age`, `sib_sp`, `parch`, and `fare` as predictors.
```{r}
train.recipe <- recipe(survived ~ pclass + sex + age + sib_sp + parch + fare, data = train_set) %>%
  step_impute_linear(age,
                     impute_with = imp_vars(fare)) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_interact(terms = ~ starts_with('sex'):fare + age:fare)
```
Note that there is some missing data in the `age` variable. Using `step_impute_linear(age, impute_with = imp_vars(fare))`, we will impute missing values of age with linear regression, using `fare` as a predictor.
```{r, echo=FALSE}
vis_miss(train_set)
```
Below we create a heat plot to visualize the correlation between the predictors we will be using for our recipe.
```{r, echo=FALSE}
train_set %>%
  select(where(is.numeric)) %>%
  cor(use = 'pairwise.complete.obs') %>%
  corrplot(type = 'lower', diag = FALSE,
           method = 'color')
```
As one can see, some of our predictors are strongly correlated with one another. Thus, to capture the potentially powerful predictive information in the interactions between highly correlated predictors and to avoid inferential misinterpretation, we will include interaction terms in the formula we use to construct our model.
The `step_dummy(all_nominal_predictors())` function simply turns all the nominal predictors into dummies.
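If you would like to inspect the design matrix the recipe produces, including the imputed, dummy, and interaction columns, one option is to prep the recipe and bake it on the training set:
```{r}
train.recipe %>%
  prep() %>%
  bake(new_data = NULL) %>%
  head()
```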
### Model Fitting
The first step is to specify the model's hyperparameters, engine, and mode.
```{r}
rf_class_spec <- rand_forest(mtry = tune(),
                             trees = tune(),
                             min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")
```
The `rand_forest(mtry = tune(), trees = tune(), min_n = tune())` function is used to initialize a random forest model and specify the hyperparameters being tuned.
- `mtry`: Number of predictors randomly sampled as split candidates at each node of each decision tree, e.g., if we want each split to choose among 3 randomly sampled predictors, then we would set `mtry = 3`.
- `trees`: Number of decision trees to be included in the random forest, e.g., if we only wanted our forest to contain 4 decision trees, then we would set `trees = 4`.
- `min_n`: Minimum number of observations required for further splitting, e.g., if the number of observations within a node falls below this threshold during the tree-building process, then further splitting of this node is halted and it becomes a terminal node.
The `set_engine("ranger", importance = "impurity")` function allows us to use the random forest implementation from the "ranger" package, and it set the importance method as "impurity". Note that variable importance is the measure of the contribution that each predictor makes to the predictive performance of the model. The decision trees of a random forest are split on different subsets of predictors, and the impurity method calculates importance by measuring how much each predictor is involved in reducing impurity across all trees.
The `set_mode("classification")` function simply specifies the model as a classification model.
Next, we simply need to define a model building workflow which is going to combine our random forest classification model (`rf_class_spec`) and our data preparation recipe (`train.recipe`).
```{r}
rf_class_wf <- workflow() %>%
  add_model(rf_class_spec) %>%
  add_recipe(train.recipe)
```
Afterwards, we will set up a tuning grid with the `grid_regular()` function to experiment with different combinations of the hyperparameters within their defined ranges and pass the grid to the `tune_grid()` function to find the optimal configuration.
```{r}
rf_grid <- grid_regular(mtry(range = c(1, 3)),
                        trees(range = c(200, 600)),
                        min_n(range = c(10, 20)),
                        levels = 8)

# tune the random forest model
tune_rf <- tune_grid(
  rf_class_wf,
  resamples = train_folds,
  grid = rf_grid
)
```
Finally, we will extract the best configuration of the hyperparameters using the **`select_best()`** function, create a finalized version of the random forest workflow with **`finalize_workflow()`**, and fit the finalized model on the entire training set with **`fit()`**.
```{r, warning=FALSE}
best_rf <- select_best(tune_rf)
rf_final <- finalize_workflow(rf_class_wf, best_rf)
rf_final_fit <- fit(rf_final, data = train_set)
```
### Prediction
To make predictions using the trained random forest model (`rf_final_fit`), use the `predict()` function and provide it with new input data (`new_data`). It is required that the structure of the input data aligns with the predictor variables used during the model training process. Here, we will generate predictions from all the input points in our test set.
```{r}
new_data <- test_set[c("pclass", "sex", "age", "sib_sp", "parch", "fare")]
predict(rf_final_fit, new_data)
```
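The output above contains hard class predictions (`.pred_class`). To obtain predicted class probabilities instead, which the ROC-AUC metric in the next section requires, pass `type = "prob"` to `predict()`:
```{r}
predict(rf_final_fit, new_data, type = "prob")
```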
### Accuracy Measures
Now, we check how our model performed on the testing data. We will use the following metrics:
- ROC-AUC (Receiver Operating Characteristic - Area Under Curve)
- Sensitivity (True Positive Rate)
- Specificity (True Negative Rate)
- Binary Accuracy
ROC-AUC measures the model's ability to distinguish between two classes; it is the area under the ROC curve, a plot of sensitivity (true positive rate) on the Y-axis against 1 − specificity (false positive rate) on the X-axis. An AUC of 1 indicates a model that can perfectly distinguish between the two classes, while an AUC of 0.5 indicates that the model is randomly guessing whether an observation belongs to a particular class.
Sensitivity measures the proportion of actual positive cases that are correctly identified as such. Specificity measures the proportion of real negative cases that are correctly identified. Binary Accuracy measures the proportion of correct predictions out of every prediction made.
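Before computing the metrics, we can plot the ROC curve itself by piping the augmented predictions into `roc_curve()` and `autoplot()`:
```{r}
augment(rf_final_fit, new_data = test_set) %>%
  roc_curve(survived, .pred_Yes) %>%
  autoplot()
```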
```{r}
augment(rf_final_fit, new_data = test_set) %>%
roc_auc(survived, .pred_Yes)
augment(rf_final_fit, new_data = test_set) %>%
sensitivity(survived, .pred_class)
augment(rf_final_fit, new_data = test_set) %>%
specificity(survived, .pred_class)
augment(rf_final_fit, new_data = test_set) %>%
accuracy(survived, .pred_class)
```
As we can see, the ROC-AUC was 0.901, indicating that our model has strong overall ability to distinguish between those who survived the Titanic incident and those who did not. However, the model's sensitivity, or true positive rate, was mediocre at best with a value of 0.689, indicating that the model was not very good at correctly identifying actual survivors. That being said, the model's binary accuracy and specificity were both very good (0.866 and 0.976, respectively), so our model was good at correctly identifying non-survivors, and in general, the vast majority of the model's predictions were correct.
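A confusion matrix makes these trade-offs concrete by cross-tabulating predicted against observed classes, with the off-diagonal cells showing exactly where the misclassifications occur:
```{r}
augment(rf_final_fit, new_data = test_set) %>%
  conf_mat(survived, .pred_class)
```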
### Variable Importance Scores
Variable importance scores tell us how much each variable contributes to the model's predictive performance.
The following code block produces a bar chart plotting the importance scores of each variable used in our final random forest model.
```{r}
rf_final_fit %>%
  extract_fit_parsnip() %>%
  vip() +
  theme_classic()
```
The chart demonstrates that the single `sex_male` predictor and the interaction of `sex_male` and `fare` were the most important predictors in our model. This indicates that gender and fare price had a notable impact on survival rates, with women and passengers who paid more for their tickets likely having higher survival chances.
## Regression Case: Miles Per Gallon Prediction
Please be aware that many steps in our regression case of random forests align with those from our classification case. Consequently, we will provide a more concise overview of the steps the two examples share.
### Prerequisites
Copy and paste the following block of code into a new script to load the required packages and data used for this example. If an error appears, then you likely don't have one of the libraries installed (note that some of these packages overlap with those used in the classification case). We will also convert the `origin` variable to a factor.
```{r, message=FALSE, warning=FALSE}
library(tidymodels)
library(tidyverse)
library(xgboost)
library(glmnet)
library(ISLR)
library(ISLR2)
library(ranger)
library(vip)
data(Auto)
set.seed(10)
# convert origin to a factor
auto <- tibble(ISLR::Auto) %>%
  mutate(origin = factor(origin))
```
### Partition
We will allocate 80% of the data to the training set and the remaining 20% to the testing set (while stratifying by `mpg`) using the `initial_split(auto, strata=mpg, prop=0.8)` function. We'll retrieve the training and testing sets with `training(auto_split)` and `testing(auto_split)`, respectively.
```{r}
auto_split <- initial_split(auto, strata=mpg, prop=0.8)
auto_train <- training(auto_split)
auto_test <- testing(auto_split)
```
### K-Folds Cross Validation
We will divide the training set into five folds (while stratifying by `mpg`) using the `vfold_cv(auto_train, v = 5, strata = "mpg")` function, allowing our model to train and evaluate across multiple segments of the training data and minimizing the risk of overfitting.
```{r}
auto_folds <- vfold_cv(auto_train, v = 5, strata = "mpg")
```
### Data Preparation
Note that the `recipe()` function allows us to create a set of instructions to preprocess the data before applying it to train a machine learning algorithm.
Using the `recipe(mpg ~., data=auto_train)` function, we will initialize a new recipe, assigning `mpg` as the response and every other variable in the data set as a predictor.
Using the `step_rm(name)` function, we will remove the `name` variable as it is not appropriate to use in our model. Using `step_dummy(all_nominal_predictors())` and `step_normalize(all_predictors())`, we will dummify nominal predictors and normalize all predictors, respectively.
```{r}
recipe_auto <- recipe(mpg ~., data=auto_train) %>%
  step_rm(name) %>% # remove name of vehicle
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_predictors())
```
### Model Fitting
We will initialize a new random forest model and specify the hyperparameters using the `rand_forest(mtry = tune(), trees = tune(), min_n = tune())` function. The hyperparameters of our model are `mtry`, `trees`, and `min_n`. Using `set_engine("ranger", importance = "impurity")`, we will use the random forest implementation from the "ranger" package and set our importance method to "impurity". The `set_mode("regression")` function specifies the random forest model as a regression model.
Refer to the "Model Fitting" section of the classification case for an explanation of the hyperparameters and the importance method.
```{r}
rf_auto <- rand_forest(mtry = tune(),   # number of predictors sampled at each split
                       trees = tune(),  # number of trees
                       min_n = tune()) %>% # minimum number of data points in a node
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression")
```
Next, we define a model building workflow (`rf_auto_wf`) to combine our regression model (`rf_auto`) and our data preparation recipe (`recipe_auto`).
```{r}
rf_auto_wf <- workflow() %>%
  add_model(rf_auto) %>%
  add_recipe(recipe_auto)
```
The next step is to create our tuning grid (with `grid_regular()`) and tune the model (with `tune_grid()`) to find the optimal configuration of our hyperparameters.
```{r}
rf_grid_auto <- grid_regular(mtry(range = c(1, 8)),
                             trees(range = c(200, 600)),
                             min_n(range = c(10, 20)),
                             levels = 5)

# fit random forest models across the grid
tune_auto <- tune_grid(
  rf_auto_wf,
  resamples = auto_folds,
  grid = rf_grid_auto)
```
Let us explore the hyperparameter performance metrics with the following code blocks.
```{r}
#plot of hyperparameter performance metrics
autoplot(tune_auto) + theme_minimal()
```
```{r, warning=FALSE}
#show top 5 RFs
show_best(tune_auto, n=5)
```
The model seems to perform better with a smaller minimum node size; the number of trees doesn't appear to affect performance, and performance plateaus after about 4 predictors. The best-performing model had 8 randomly sampled predictors, 300 trees, a minimum of 10 data points in a node, and a mean RMSE of 2.905 (we will explain what RMSE means in the "Accuracy Measures" section).
Lastly, we can extract the best configuration of hyperparameters with the `select_best(tune_auto)` function, create a finalized version of the random forest workflow with `finalize_workflow(rf_auto_wf, best_rf_auto)`, and fit our model on the entire training set with `fit(final_auto_model, auto_train)`.
```{r, warning=FALSE}
# save the best random forest configuration
best_rf_auto <- select_best(tune_auto)

# finalize and fit the best random forest
final_auto_model <- finalize_workflow(rf_auto_wf, best_rf_auto)
final_auto_model <- fit(final_auto_model, auto_train)
```
### Prediction
To make predictions using the trained random forest model (`final_auto_model`), use the `predict()` function and provide it with new input data (`new_data`). It is required that the structure of the input data aligns with the predictor variables used during the model training process. Here, we will generate predictions from all the input points in our test set.
```{r}
new_data <- auto_test[c("cylinders", "displacement", "horsepower", "weight",
                        "acceleration", "year", "origin", "name")]
predict(final_auto_model, new_data)
```
### Accuracy Measures
Now, we check how our model performed on the testing data using the RMSE metric.
RMSE is one of the common metrics used for measuring a model's accuracy when dealing with regression problems. RMSE is equal to the square root of the average squared differences between the predicted values and the observed values. Note that a lower value of RMSE indicates better model performance.
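In symbols, with $y_i$ the observed values, $\hat{y}_i$ the predicted values, and $n$ the number of observations:
$$
RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}
$$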
```{r}
final_auto_model_test <- augment(final_auto_model, auto_test)
rmse(final_auto_model_test, truth=mpg, .pred)
```
From the result above, the RMSE of 2.007913 indicates that, on average, the model's predictions are about 2 `mpg` units away from the observed values. Whether this counts as a good fit depends on the specifics of the data and the required precision of the problem, but against a range of 26.1 in `mpg`, this RMSE is relatively low. Therefore, our model has good predictive accuracy.
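One way to visualize this fit is to plot predicted against observed `mpg`; points near the dashed 45-degree line correspond to accurate predictions:
```{r}
final_auto_model_test %>%
  ggplot(aes(x = mpg, y = .pred)) +
  geom_point(alpha = 0.6) +
  geom_abline(linetype = "dashed") +
  labs(x = "Observed mpg", y = "Predicted mpg") +
  theme_minimal()
```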
### Variable Importance
Variable importance scores tell us how much each variable contributes to the model's predictive performance.
The following code block produces a bar chart plotting the importance scores of each variable used in our final random forest model.
```{r}
final_auto_model %>%
  extract_fit_parsnip() %>%
  vip() +
  theme_minimal()
```
Displacement and weight have the highest importance scores, indicating that they are strong predictors within the model. This implies that engine size and a vehicle's weight influence a vehicle's fuel efficiency the most. Note that number of cylinders (which is related to an engine's size) and year (which may reflect advances in technology) are also significant contributors.
## Random Forest Checklist
1. Load the necessary packages and data
2. Partition the data into training and testing sets
3. Create cross-validation folds and a preprocessing recipe
4. Fit the random forest model
    - Initialize the model with the hyperparameters to tune
    - Define a model building workflow
    - Tune and extract the best hyperparameters
    - Finalize the model workflow
    - Fit the model to the entire training set
5. Generate predictions on the test set
6. Evaluate accuracy measures on the test set
7. Interpret the importance scores of the model's predictors
## References
Shailey Dash. (2022). "Decision Trees Explained - Entropy, Information Gain, Gini Index, CCP Pruning." Towards Data Science. Available at: https://towardsdatascience.com/decision-trees-explained-entropy-information-gain-gini-index-ccp-pruning-4d78070db36c. This article provides an overview of decision trees, focusing on topics like Entropy, Information Gain, Gini Index, and CCP Pruning.
Carolina Bento. (2021). "Random Forests Algorithm Explained with a Real-Life Example and Some Python Code." Towards Data Science. Available at: https://towardsdatascience.com/random-forests-algorithm-explained-with-a-real-life-example-and-some-python-code-affbfa5a942c. This article explains the Random Forests algorithm, and includes a practical example along with Python code to demonstrate its application.
Steven Loaiza. (2020). "Entropy and Information Gain." Towards Data Science. Available at: https://towardsdatascience.com/entropy-and-information-gain-b738ca8abd2a. This source discusses concepts such as Entropy and Information gain.
Jason Brownlee. (2023). "A Gentle Introduction to k-fold Cross-Validation." Machine Learning Mastery. Available at: https://machinelearningmastery.com/k-fold-cross-validation. This article provides a comprehensive introduction to k-fold cross validation.