Chapter 5 Random Forests
In this chapter, you will learn about the Random Forest algorithm, another tree-based ensemble method. Random Forest is a modification of bagged trees that typically performs better. Here you’ll learn how to train, tune, and evaluate Random Forest models in R.
Introduction to Random Forest
5.1 Bagged trees vs. Random Forest
What is the main difference between bagged trees and the Random Forest algorithm?
- In Random Forest, the decision trees are trained on a random subset of the rows, but in bagging, they use all the rows.
- In Random Forest, only a subset of features is selected at random at each split in a decision tree. In bagging, all features are used.
- In Random Forest, there is randomness. In bagging, there is no randomness.
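In fact, bagged trees are just the special case of a Random Forest in which every predictor is considered at every split. A minimal sketch (assuming the credit_train data frame from earlier chapters is loaded, with default as the response) that makes the relationship explicit:
library(randomForest)

set.seed(1)
# Bagged trees: force mtry to the full number of predictors,
# so every split considers all features
bagged_model <- randomForest(default ~ ., data = credit_train,
                             mtry = ncol(credit_train) - 1)

set.seed(1)
# Random Forest: default mtry (roughly sqrt(p) for classification),
# so each split sees only a random subset of features
rf_model <- randomForest(default ~ ., data = credit_train)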
5.2 Train a Random Forest model
Here you will use the randomForest() function from the randomForest package to train a Random Forest classifier to predict loan default.
The credit_train and credit_test datasets (from Chapters 2 and 4) are already loaded in the workspace.
- Use the randomForest::randomForest() function to train a Random Forest model on the credit_train dataset. The formula used to define the model is the same as in previous chapters – we want to predict “default” as a function of all the other columns in the training set.
library(randomForest)
# Train a Random Forest
set.seed(1) # for reproducibility
credit_model <- randomForest(formula = default ~ .,
                             data = credit_train)
- Inspect the model output.
# Print the model output
print(credit_model)
Call:
randomForest(formula = default ~ ., data = credit_train)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 4
OOB estimate of error rate: 23.62%
Confusion matrix:
no yes class.error
no 518 48 0.08480565
yes 141 93 0.60256410
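The printed summary above is assembled from components stored on the fitted model object; a quick sketch of how to inspect a few of these standard randomForest slots directly:
# Components of the fitted randomForest object (sketch)
credit_model$ntree      # number of trees grown (500 by default)
credit_model$mtry       # number of variables tried at each split
credit_model$confusion  # OOB confusion matrix shown in the printout
head(credit_model$err.rate)  # OOB error as trees are added (used in the next exercise)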
Understanding Random Forest model output
5.3 Evaluating out-of-bag error
Here you will plot the OOB error as a function of the number of trees trained, and extract the final OOB error of the Random Forest model from the trained model object.
The credit_model trained in the previous exercise is loaded in the workspace.
- Get the OOB error rate for the Random Forest model.
# Grab OOB error matrix & take a look
err <- credit_model$err.rate
head(err)
OOB no yes
[1,] 0.3379791 0.2272727 0.5842697
[2,] 0.3497854 0.2507553 0.5925926
[3,] 0.3484087 0.2505910 0.5862069
[4,] 0.3348083 0.2463768 0.5538462
[5,] 0.3360996 0.2367906 0.5754717
[6,] 0.3253333 0.2226415 0.5727273
# Look at final OOB error rate (last row in err matrix)
oob_err <- err[nrow(err), "OOB"]
print(oob_err)
OOB
0.23625
- Plot the OOB error rate against the number of trees in the forest.
# Plot the model trained in the previous exercise
plot(credit_model, main = "OOB error rate versus number of trees")
# Add a legend since it doesn't have one by default
legend(x = "right",
legend = colnames(err),
fill = 1:ncol(err))
5.4 Evaluating model performance on a test set
Use the caret::confusionMatrix() function to compute test set accuracy and generate a confusion matrix. Compare the test set accuracy to the OOB accuracy.
- Generate class predictions for the credit_test data frame using the credit_model object.
# Generate predicted classes using the model object
class_prediction <- predict(object = credit_model,  # model object
                            newdata = credit_test,  # test dataset
                            type = "class")         # return classification labels
head(class_prediction)
1 5 8 15 16 18
no no no no no no
Levels: no yes
# This is not in DataCamp but is needed for the final chapter exercise
rf_preds <- predict(object = credit_model,
                    newdata = credit_test,
                    type = "prob")[, "yes"]
# mean(rf_preds)
- Using the caret::confusionMatrix() function, compute the confusion matrix for the test set.
# Calculate the confusion matrix for the test set
library(caret)
cm <- confusionMatrix(data = class_prediction,          # predicted classes
                      reference = credit_test$default)  # actual classes
print(cm)
Confusion Matrix and Statistics
Reference
Prediction no yes
no 130 47
yes 4 19
Accuracy : 0.745
95% CI : (0.6787, 0.8039)
No Information Rate : 0.67
P-Value [Acc > NIR] : 0.01327
Kappa : 0.3091
Mcnemar's Test P-Value : 4.074e-09
Sensitivity : 0.9701
Specificity : 0.2879
Pos Pred Value : 0.7345
Neg Pred Value : 0.8261
Prevalence : 0.6700
Detection Rate : 0.6500
Detection Prevalence : 0.8850
Balanced Accuracy : 0.6290
'Positive' Class : no
- Compare the test set accuracy reported from the confusion matrix to the OOB accuracy. The OOB error is stored in oob_err, which is already in your workspace, and so OOB accuracy is just 1 - oob_err.
# Compare test set accuracy to OOB accuracy
paste0("Test Accuracy: ", cm$overall[1])
[1] "Test Accuracy: 0.745"
paste0("OOB Accuracy: ", 1 - oob_err)
[1] "OOB Accuracy: 0.76375"
OOB error vs. test set error
5.5 Advantages of OOB error
What is the main advantage of using OOB error instead of validation or test error?
- Tuning the model hyperparameters using OOB error will lead to a better model.
- If you evaluate your model using OOB error, then you don’t need to create a separate test set.
- OOB error is more accurate than test set error.
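To see why a separate test set is not strictly required, note that the OOB estimate comes “for free” from training alone. A tiny illustration (not part of the DataCamp exercise) using the built-in iris data, where no train/test split is created at all:
library(randomForest)

set.seed(1)
fit <- randomForest(Species ~ ., data = iris)

# Final OOB error rate, obtained without holding out any data
fit$err.rate[nrow(fit$err.rate), "OOB"]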
5.6 Evaluating test set AUC
In Chapter 4, we learned about the AUC metric for evaluating binary classification models. In this exercise, you will compute test set AUC for the Random Forest model.
- Use the predict() function with type = "prob" to generate numeric predictions on the credit_test dataset.
# Generate predictions on the test set
pred <- predict(object = credit_model,
                newdata = credit_test,
                type = "prob")
# `pred` is a matrix
class(pred)
[1] "matrix" "votes"
# Look at the pred format
head(pred)
no yes
1 0.874 0.126
5 0.538 0.462
8 0.568 0.432
15 0.528 0.472
16 0.538 0.462
18 0.504 0.496
- Compute the AUC using the auc() function from the Metrics package.
library(Metrics)

# Compute the AUC (`actual` must be a binary 1/0 numeric vector)
auc(actual = ifelse(credit_test$default == "yes", 1, 0),
    predicted = pred[, "yes"])
[1] 0.7911578
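If you also want to see the ROC curve behind this AUC, one option (not part of the DataCamp exercise) is the ROCR package; a sketch, assuming ROCR is installed and pred is the probability matrix from above:
library(ROCR)

# Build a prediction object from the predicted "yes" probabilities and the actual labels
pred_obj <- prediction(predictions = pred[, "yes"],
                       labels = ifelse(credit_test$default == "yes", 1, 0))

# True positive rate vs. false positive rate, i.e. the ROC curve
perf <- performance(pred_obj, measure = "tpr", x.measure = "fpr")
plot(perf, main = "Random Forest ROC curve (test set)")
abline(a = 0, b = 1, lty = 2)  # diagonal reference line for a random classifier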
Tuning a Random Forest model
5.7 Tuning a Random Forest via mtry
In this exercise, you will use the randomForest::tuneRF() function to tune mtry (by training several models). This function is a specific utility for tuning the mtry parameter based on OOB error, which is helpful when you want a quick and easy way to tune your model. A more generic way of tuning Random Forest parameters will be presented in the following exercise.
- Use the tuneRF() function in place of the randomForest() function to train a series of models with different mtry values and examine the results. Note that (unfortunately) the tuneRF() interface does not support the typical formula input that we’ve been using, but instead uses two arguments, x (matrix or data frame of predictor variables) and y (response vector; must be a factor for classification).
- The tuneRF() function has an argument, ntreeTry, that defaults to 50 trees. Set ntreeTry = 500 to train a Random Forest model of the same size as you previously did.
# Execute the tuning process
set.seed(1)
res <- tuneRF(x = subset(credit_train, select = -default),
              y = credit_train$default,
              ntreeTry = 500)
mtry = 4 OOB error = 23.62%
Searching left ...
mtry = 2 OOB error = 22.88%
0.03174603 0.05
Searching right ...
mtry = 8 OOB error = 21.88%
0.07407407 0.05
mtry = 16 OOB error = 23.38%
-0.06857143 0.05
- After tuning the forest, tuneRF() also plots model performance (OOB error) as a function of the mtry values that were evaluated; inspect the same results as a matrix with print(). Keep in mind that if we want to evaluate the model based on AUC instead of error (accuracy), then this is not the best way to tune a model, as the selection only considers (OOB) error.
# Look at results
print(res)
mtry OOBError
2.OOB 2 0.22875
4.OOB 4 0.23625
8.OOB 8 0.21875
16.OOB 16 0.23375
# Find the mtry value that minimizes OOB Error
<- res[,"mtry"][which.min(res[,"OOBError"])]
mtry_opt print(mtry_opt)
8.OOB
8
# If you just want to return the best RF model (rather than the results matrix),
# you can set `doBest = TRUE` in `tuneRF()` to return the best RF model
# instead of the performance matrix.
set.seed(1)
tuneRF(x = subset(credit_train, select = -default),
       y = credit_train$default,
       ntreeTry = 500,
       doBest = TRUE)
mtry = 4 OOB error = 23.62%
Searching left ...
mtry = 2 OOB error = 22.88%
0.03174603 0.05
Searching right ...
mtry = 8 OOB error = 21.88%
0.07407407 0.05
mtry = 16 OOB error = 23.38%
-0.06857143 0.05
Call:
randomForest(x = x, y = y, mtry = res[which.min(res[, 2]), 1])
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 8
OOB estimate of error rate: 22.5%
Confusion matrix:
no yes class.error
no 510 56 0.09893993
yes 124 110 0.52991453
5.8 Tuning a Random Forest via tree depth
In Chapter 3, we created a manual grid of hyperparameters using the expand.grid()
function and wrote code that trained and evaluated the models of the grid in a loop. In this exercise, you will create a grid of mtry, nodesize and sampsize values. In this example, we will identify the “best model” based on OOB error. The best model is defined as the model from our grid which minimizes OOB error.
Keep in mind that there are other ways to select a best model from a grid, such as choosing the best model based on validation AUC. However, for this exercise, we will use the built-in OOB error calculations instead of using a separate validation set.
- Create a grid of mtry, nodesize and sampsize values.
# Establish a list of possible values for mtry, nodesize and sampsize
mtry <- seq(4, ncol(credit_train) * 0.8, 2)
nodesize <- seq(3, 8, 2)
sampsize <- nrow(credit_train) * c(0.7, 0.8)

# Create a data frame containing all combinations
hyper_grid <- expand.grid(mtry = mtry, nodesize = nodesize, sampsize = sampsize)
- Write a simple loop to train all the models and choose the best one based on OOB error.
# Create an empty vector to store OOB error values
oob_err <- c()

# Write a loop over the rows of hyper_grid to train the grid of models
for (i in 1:nrow(hyper_grid)) {

  # Train a Random Forest model
  model <- randomForest(formula = default ~ .,
                        data = credit_train,
                        mtry = hyper_grid$mtry[i],
                        nodesize = hyper_grid$nodesize[i],
                        sampsize = hyper_grid$sampsize[i])

  # Store OOB error for the model
  oob_err[i] <- model$err.rate[nrow(model$err.rate), "OOB"]
}
- Print the set of hyperparameters which produced the best model.
# Identify the optimal set of hyperparameters based on OOB error
opt_i <- which.min(oob_err)
print(hyper_grid[opt_i, ])
mtry nodesize sampsize
21 4 5 640
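A natural follow-up (not shown in the original exercise) is to refit a single forest with the winning hyperparameters and score it on the test set, reusing Metrics::auc() from section 5.6; a sketch, assuming the objects from the earlier exercises are still in the workspace:
# Refit the forest with the best row of hyper_grid
set.seed(1)
best_model <- randomForest(formula = default ~ .,
                           data = credit_train,
                           mtry = hyper_grid$mtry[opt_i],
                           nodesize = hyper_grid$nodesize[opt_i],
                           sampsize = hyper_grid$sampsize[opt_i])

# Evaluate the tuned model on the test set using AUC
best_preds <- predict(object = best_model,
                      newdata = credit_test,
                      type = "prob")[, "yes"]
auc(actual = ifelse(credit_test$default == "yes", 1, 0),
    predicted = best_preds)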