Chapter 5 Random Forests
In this chapter, you will learn about the Random Forest algorithm, another tree-based ensemble method. Random Forest is a modification of bagged trees that typically performs better. Here you’ll learn how to train, tune, and evaluate Random Forest models in R.
Introduction to Random Forest
5.1 Bagged trees vs. Random Forest
What is the main difference between bagged trees and the Random Forest algorithm?
- In Random Forest, the decision trees are trained on a random subset of the rows, but in bagging, they use all the rows.
- In Random Forest, only a subset of features is selected at random at each split in a decision tree. In bagging, all features are used.
- In Random Forest, there is randomness. In bagging, there is no randomness.
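In fact, bagged trees are just the special case of a Random Forest in which every predictor is considered at every split. A minimal sketch (assuming the credit_train data frame from earlier chapters is loaded, with default as the response) that makes the relationship explicit:
library(randomForest)

set.seed(1)
# Bagged trees: force mtry to the full number of predictors,
# so every split considers all features
bagged_model <- randomForest(default ~ ., data = credit_train,
                             mtry = ncol(credit_train) - 1)

set.seed(1)
# Random Forest: default mtry (roughly sqrt(p) for classification),
# so each split sees only a random subset of features
rf_model <- randomForest(default ~ ., data = credit_train)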
5.2 Train a Random Forest model
Here you will use the randomForest() function from the randomForest package to train a Random Forest classifier to predict loan default.
The credit_train and credit_test datasets (from Chapters 2 and 4) are already loaded in the workspace.
- Use the randomForest::randomForest() function to train a Random Forest model on the credit_train dataset. The formula used to define the model is the same as in previous chapters – we want to predict “default” as a function of all the other columns in the training set.
library(randomForest)
# Train a Random Forest
set.seed(1) # for reproducibility
credit_model <- randomForest(formula = default ~ .,
                             data = credit_train)
- Inspect the model output.
# Print the model output
print(credit_model)
Call:
randomForest(formula = default ~ ., data = credit_train)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 4
OOB estimate of error rate: 23.62%
Confusion matrix:
no yes class.error
no 518 48 0.08480565
yes 141 93 0.60256410
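The printed summary above is assembled from components stored on the fitted model object; a quick sketch of how to inspect a few of these standard randomForest slots directly:
# Components of the fitted randomForest object (sketch)
credit_model$ntree      # number of trees grown (500 by default)
credit_model$mtry       # number of variables tried at each split
credit_model$confusion  # OOB confusion matrix shown in the printout
head(credit_model$err.rate)  # OOB error as trees are added (used in the next exercise)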
Understanding Random Forest model output
5.3 Evaluating out-of-bag error
Here you will plot the OOB error as a function of the number of trees trained, and extract the final OOB error of the Random Forest model from the trained model object.
The credit_model trained in the previous exercise is loaded in the workspace.
- Get the OOB error rate for the Random Forest model.
# Grab OOB error matrix & take a look
err <- credit_model$err.rate
head(err)
OOB no yes
[1,] 0.3379791 0.2272727 0.5842697
[2,] 0.3497854 0.2507553 0.5925926
[3,] 0.3484087 0.2505910 0.5862069
[4,] 0.3348083 0.2463768 0.5538462
[5,] 0.3360996 0.2367906 0.5754717
[6,] 0.3253333 0.2226415 0.5727273
# Look at final OOB error rate (last row in err matrix)
oob_err <- err[nrow(err), "OOB"]
print(oob_err)
OOB
0.23625
- Plot the OOB error rate against the number of trees in the forest.
# Plot the model trained in the previous exercise
plot(credit_model, main = "OOB error rate versus number of trees")
# Add a legend since it doesn't have one by default
legend(x = "right",
legend = colnames(err),
fill = 1:ncol(err))
5.4 Evaluating model performance on a test set
Use the caret::confusionMatrix() function to compute test set accuracy and generate a confusion matrix. Compare the test set accuracy to the OOB accuracy.
- Generate class predictions for the credit_test data frame using the credit_model object.
# Generate predicted classes using the model object
class_prediction <- predict(object = credit_model,  # model object
                            newdata = credit_test,  # test dataset
                            type = "class")         # return classification labels
head(class_prediction)
1 5 8 15 16 18
no no no no no no
Levels: no yes
# This is not in DataCamp but is needed for the final chapter exercise
rf_preds <- predict(object = credit_model,
                    newdata = credit_test,
                    type = "prob")[, "yes"]
# mean(rf_preds)
- Using the caret::confusionMatrix() function, compute the confusion matrix for the test set.
# Calculate the confusion matrix for the test set
library(caret)
cm <- confusionMatrix(data = class_prediction,          # predicted classes
                      reference = credit_test$default)  # actual classes
print(cm)
Confusion Matrix and Statistics
Reference
Prediction no yes
no 130 47
yes 4 19
Accuracy : 0.745
95% CI : (0.6787, 0.8039)
No Information Rate : 0.67
P-Value [Acc > NIR] : 0.01327
Kappa : 0.3091
Mcnemar's Test P-Value : 4.074e-09
Sensitivity : 0.9701
Specificity : 0.2879
Pos Pred Value : 0.7345
Neg Pred Value : 0.8261
Prevalence : 0.6700
Detection Rate : 0.6500
Detection Prevalence : 0.8850
Balanced Accuracy : 0.6290
'Positive' Class : no
- Compare the test set accuracy reported from the confusion matrix to the OOB accuracy. The OOB error is stored in oob_err, which is already in your workspace, and so OOB accuracy is just 1 - oob_err.
# Compare test set accuracy to OOB accuracy
paste0("Test Accuracy: ", cm$overall[1])
[1] "Test Accuracy: 0.745"
paste0("OOB Accuracy: ", 1 - oob_err)
[1] "OOB Accuracy: 0.76375"
OOB error vs. test set error
5.5 Advantages of OOB error
What is the main advantage of using OOB error instead of validation or test error?
- Tuning the model hyperparameters using OOB error will lead to a better model.
- If you evaluate your model using OOB error, then you don’t need to create a separate test set.
- OOB error is more accurate than test set error.
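To see why a separate test set is not strictly required, note that the OOB estimate comes “for free” from training alone. A tiny illustration (not part of the DataCamp exercise) using the built-in iris data, where no train/test split is created at all:
library(randomForest)

set.seed(1)
fit <- randomForest(Species ~ ., data = iris)

# Final OOB error rate, obtained without holding out any data
fit$err.rate[nrow(fit$err.rate), "OOB"]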
5.6 Evaluating test set AUC
In Chapter 4, we learned about the AUC metric for evaluating binary classification models. In this exercise, you will compute test set AUC for the Random Forest model.
- Use the predict() function with type = "prob" to generate numeric predictions on the credit_test dataset.
# Generate predictions on the test set
pred <- predict(object = credit_model,
                newdata = credit_test,
                type = "prob")
# `pred` is a matrix
class(pred)
[1] "matrix" "votes"
# Look at the pred format
head(pred)
no yes
1 0.874 0.126
5 0.538 0.462
8 0.568 0.432
15 0.528 0.472
16 0.538 0.462
18 0.504 0.496
- Compute the AUC using the auc() function from the Metrics package.
library(Metrics)

# Compute the AUC (`actual` must be a binary 1/0 numeric vector)
auc(actual = ifelse(credit_test$default == "yes", 1, 0),
    predicted = pred[, "yes"])
[1] 0.7911578
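If you also want to see the ROC curve behind this AUC, one option (not part of the DataCamp exercise) is the ROCR package; a sketch, assuming ROCR is installed and pred is the probability matrix from above:
library(ROCR)

# Build a prediction object from the predicted "yes" probabilities and the actual labels
pred_obj <- prediction(predictions = pred[, "yes"],
                       labels = ifelse(credit_test$default == "yes", 1, 0))

# True positive rate vs. false positive rate, i.e. the ROC curve
perf <- performance(pred_obj, measure = "tpr", x.measure = "fpr")
plot(perf, main = "Random Forest ROC curve (test set)")
abline(a = 0, b = 1, lty = 2)  # diagonal reference line for a random classifier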
Tuning a Random Forest model
5.7 Tuning a Random Forest via mtry
In this exercise, you will use the randomForest::tuneRF() function to tune mtry (by training several models). This function is a specific utility for tuning the mtry parameter based on OOB error, which is helpful when you want a quick and easy way to tune your model. A more generic way of tuning Random Forest parameters will be presented in the following exercise.
- Use the tuneRF() function in place of the randomForest() function to train a series of models with different mtry values and examine the results. Note that (unfortunately) the tuneRF() interface does not support the typical formula input that we’ve been using, but instead uses two arguments, x (matrix or data frame of predictor variables) and y (response vector; must be a factor for classification).
- The tuneRF() function has an argument, ntreeTry, that defaults to 50 trees. Set ntreeTry = 500 to train a Random Forest model of the same size as you previously did.
# Execute the tuning process
set.seed(1)
res <- tuneRF(x = subset(credit_train, select = -default),
              y = credit_train$default,
              ntreeTry = 500)
mtry = 4 OOB error = 23.62%
Searching left ...
mtry = 2 OOB error = 22.88%
0.03174603 0.05
Searching right ...
mtry = 8 OOB error = 21.88%
0.07407407 0.05
mtry = 16 OOB error = 23.38%
-0.06857143 0.05
- After tuning the forest, tuneRF() also plots model performance (OOB error) as a function of the mtry values that were evaluated; inspect the same results as a matrix with print(). Keep in mind that if we want to evaluate the model based on AUC instead of error (accuracy), then this is not the best way to tune a model, as the selection only considers (OOB) error.
# Look at results
print(res)
mtry OOBError
2.OOB 2 0.22875
4.OOB 4 0.23625
8.OOB 8 0.21875
16.OOB 16 0.23375
# Find the mtry value that minimizes OOB Error
<- res[,"mtry"][which.min(res[,"OOBError"])]
mtry_opt print(mtry_opt)
8.OOB
8
# If you just want to return the best RF model (rather than the results matrix),
# you can set `doBest = TRUE` in `tuneRF()` to return the best RF model
# instead of the performance matrix.
set.seed(1)
tuneRF(x = subset(credit_train, select = -default),
       y = credit_train$default,
       ntreeTry = 500,
       doBest = TRUE)
mtry = 4 OOB error = 23.62%
Searching left ...
mtry = 2 OOB error = 22.88%
0.03174603 0.05
Searching right ...
mtry = 8 OOB error = 21.88%
0.07407407 0.05
mtry = 16 OOB error = 23.38%
-0.06857143 0.05
Call:
randomForest(x = x, y = y, mtry = res[which.min(res[, 2]), 1])
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 8
OOB estimate of error rate: 22.5%
Confusion matrix:
no yes class.error
no 510 56 0.09893993
yes 124 110 0.52991453
5.8 Tuning a Random Forest via tree depth
In Chapter 3, we created a manual grid of hyperparameters using the expand.grid()
function and wrote code that trained and evaluated the models of the grid in a loop. In this exercise, you will create a grid of mtry, nodesize and sampsize values. In this example, we will identify the “best model” based on OOB error. The best model is defined as the model from our grid which minimizes OOB error.
Keep in mind that there are other ways to select a best model from a grid, such as choosing the best model based on validation AUC. However, for this exercise, we will use the built-in OOB error calculations instead of using a separate validation set.
- Create a grid of mtry, nodesize and sampsize values.
# Establish a list of possible values for mtry, nodesize and sampsize
mtry <- seq(4, ncol(credit_train) * 0.8, 2)
nodesize <- seq(3, 8, 2)
sampsize <- nrow(credit_train) * c(0.7, 0.8)

# Create a data frame containing all combinations
hyper_grid <- expand.grid(mtry = mtry, nodesize = nodesize, sampsize = sampsize)
- Write a simple loop to train all the models and choose the best one based on OOB error.
# Create an empty vector to store OOB error values
oob_err <- c()

# Write a loop over the rows of hyper_grid to train the grid of models
for (i in 1:nrow(hyper_grid)) {

  # Train a Random Forest model
  model <- randomForest(formula = default ~ .,
                        data = credit_train,
                        mtry = hyper_grid$mtry[i],
                        nodesize = hyper_grid$nodesize[i],
                        sampsize = hyper_grid$sampsize[i])

  # Store OOB error for the model
  oob_err[i] <- model$err.rate[nrow(model$err.rate), "OOB"]
}
- Print the set of hyperparameters which produced the best model.
# Identify the optimal set of hyperparameters based on OOB error
opt_i <- which.min(oob_err)
print(hyper_grid[opt_i, ])
mtry nodesize sampsize
21 4 5 640
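A natural follow-up (not shown in the original exercise) is to refit a single forest with the winning hyperparameters and score it on the test set, reusing Metrics::auc() from section 5.6; a sketch, assuming the objects from the earlier exercises are still in the workspace:
# Refit the forest with the best row of hyper_grid
set.seed(1)
best_model <- randomForest(formula = default ~ .,
                           data = credit_train,
                           mtry = hyper_grid$mtry[opt_i],
                           nodesize = hyper_grid$nodesize[opt_i],
                           sampsize = hyper_grid$sampsize[opt_i])

# Evaluate the tuned model on the test set using AUC
best_preds <- predict(object = best_model,
                      newdata = credit_test,
                      type = "prob")[, "yes"]
auc(actual = ifelse(credit_test$default == "yes", 1, 0),
    predicted = best_preds)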