Chapter 4 Bagged Trees
In this chapter, you will learn about bagged trees, an ensemble method that combines many trees instead of relying on a single one.
Introduction to bagged trees
4.1 Advantages of bagged trees
What are the advantages of bagged trees compared to a single tree?

1. Increases the accuracy of the resulting predictions
2. Makes the resulting model easier to interpret
3. Reduces variance by averaging a set of observations

1 and 2 are correct
1 and 3 are correct
2 and 3 are correct
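A hint for option 3, not part of the course exercises: the variance of an average of B independent draws is 1/B times the variance of a single draw, which is why averaging many bootstrapped trees stabilizes predictions (in practice the trees are correlated, so the reduction is smaller, and the averaged ensemble is harder, not easier, to interpret). A quick base-R sketch of the idea:

# Not part of the course: a quick simulation of the variance-reduction idea.
# The variance of an average of 25 independent draws is 1/25 the variance
# of a single draw.
set.seed(1)
single_draws <- replicate(10000, rnorm(1))         # one observation at a time
averaged     <- replicate(10000, mean(rnorm(25)))  # averages of 25 observations
var(single_draws)  # close to 1
var(averaged)      # close to 1/25 = 0.04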
4.2 Train a bagged tree model
Let’s start by training a bagged tree model. You’ll be using the bagging()
function from the ipred
package. The number of bagged trees can be specified using the nbagg
parameter, but here we will use the default (25).
If we want to estimate the model’s accuracy using the “out-of-bag” (OOB) samples, we can set the coob
parameter to TRUE. The OOB samples are the training observations that were not selected into the bootstrapped sample used in training. Since these observations were not used in training, we can use them instead to evaluate the accuracy of the model (done automatically inside the bagging()
function).
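As a small illustration of the OOB idea (not part of the exercise): for a single bootstrap sample, the OOB rows are simply the training rows that were never drawn. On average, about a third (1/e ≈ 37%) of the rows are out-of-bag for each tree.

# Illustration only: identify the OOB rows for one bootstrap sample
set.seed(42)                                      # arbitrary seed for this sketch
n <- 10                                           # pretend training set of 10 rows
boot_idx <- sample(n, size = n, replace = TRUE)   # bootstrap sample (with replacement)
oob_idx  <- setdiff(seq_len(n), boot_idx)         # rows never drawn = OOB rows
boot_idx
oob_idx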
The credit_train and credit_test datasets are already loaded in the workspace.
- Use the bagging() function to train a bagged model.
library(ipred)
# Bagging is a randomized model, so let's set a seed (123) for reproducibility
set.seed(123)
# Train a bagged model
credit_model <- bagging(formula = default ~ .,
                        data = credit_train,
                        coob = TRUE)
- Inspect the model by printing it.
# Print the model
print(credit_model)
Bagging classification trees with 25 bootstrap replications
Call: bagging.data.frame(formula = default ~ ., data = credit_train,
coob = TRUE)
Out-of-bag estimate of misclassification error: 0.265
Evaluating the performance of bagged tree models
4.3 Prediction and confusion matrix
As you saw in the video, a confusion matrix is a very useful tool for examining all possible outcomes of your predictions (true positive, true negative, false positive, false negative).
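Before working with the real predictions, here is a toy sketch (made-up vectors) of how those four outcomes show up in a base-R confusion matrix:

# Toy example: cross-tabulate made-up predictions against made-up truth
actual    <- factor(c("no", "no", "yes", "yes", "no"), levels = c("no", "yes"))
predicted <- factor(c("no", "yes", "yes", "no", "no"), levels = c("no", "yes"))
table(Predicted = predicted, Actual = actual)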
In this exercise, you will predict those who will default using bagged trees. You will also create the confusion matrix using the confusionMatrix()
function from the caret
package.
It’s always good to take a look at the output using the print()
function.
The fitted model object, credit_model, is already in your workspace.
- Use the predict() function with type = "class" to generate predicted labels on the credit_test dataset.
# Generate predicted classes using the model object
class_prediction <- predict(object = credit_model,
                            newdata = credit_test,
                            type = "class") # return classification labels

# This is not in DataCamp but is needed for the last chapter exercise
bag_preds <- predict(object = credit_model,
                     newdata = credit_test,
                     type = "prob")[, "yes"]
# mean(bag_preds)
- Take a look at the predictions using the print() function.
# Print the predicted classes
print(class_prediction)
[1] no no no yes yes no no no no no no yes no no no no no yes
[19] no no no no no no no no no yes no no no no yes no no no
[37] no no yes no no no no no no no no no no no no no yes no
[55] no no no yes no yes no yes yes no no no no yes no no no no
[73] no yes no no yes no no no no yes yes no no no no no no no
[91] no no no yes no no yes no no no no no yes no no no yes no
[109] no no no no no no yes no yes no no no no yes no no no no
[127] yes yes no no no no no no no yes no no no no no no yes no
[145] no no no no no no no no no no no no no no no yes no yes
[163] yes no no yes no no no no yes no no no no no yes no no yes
[181] no yes no no no no yes yes no yes no no no no yes no no no
[199] no no
Levels: no yes
- Calculate the confusion matrix using the
confusionMatrix()
function.
# Calculate the confusion matrix for the test set
confusionMatrix(data = class_prediction,
                reference = credit_test$default)
Confusion Matrix and Statistics
Reference
Prediction no yes
no 121 39
yes 13 27
Accuracy : 0.74
95% CI : (0.6734, 0.7993)
No Information Rate : 0.67
P-Value [Acc > NIR] : 0.0196426
Kappa : 0.3467
Mcnemar's Test P-Value : 0.0005265
Sensitivity : 0.9030
Specificity : 0.4091
Pos Pred Value : 0.7563
Neg Pred Value : 0.6750
Prevalence : 0.6700
Detection Rate : 0.6050
Detection Prevalence : 0.8000
Balanced Accuracy : 0.6560
'Positive' Class : no
4.4 Predict on a test set and compute AUC
In binary classification problems, we can predict numeric values instead of class labels. In fact, class labels are created only after the model produces a raw, numeric predicted value for a test point.
The predicted label is generated by applying a threshold to that value: all test points with a predicted value greater than the threshold get a predicted label of “1”, and points at or below it get a predicted label of “0”.
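A minimal sketch of the thresholding step, using made-up predicted values and the common default cutoff of 0.5:

# Hypothetical predicted values; 0.5 is a common default threshold
scores    <- c(0.24, 0.64, 0.80, 0.32)
threshold <- 0.5
ifelse(scores > threshold, "1", "0")  # "0" "1" "1" "0"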
In this exercise, you will generate predicted values (rather than class labels) on the test set and evaluate performance based on AUC (Area Under the ROC Curve). The AUC is a common metric for evaluating the discriminatory ability of a binary classification model.
- Use the predict() function with type = "prob" to generate numeric predictions on the credit_test dataset.
# Generate predictions on the test set
pred <- predict(object = credit_model,
                newdata = credit_test,
                type = "prob")
- Compute the AUC using the auc() function from the Metrics package.
library(Metrics)
# `pred` is a matrix
class(pred)
[1] "matrix"
# Look at the pred format
head(pred)
no yes
[1,] 0.76 0.24
[2,] 0.68 0.32
[3,] 0.76 0.24
[4,] 0.36 0.64
[5,] 0.20 0.80
[6,] 0.72 0.28
# Compute the AUC (`actual` must be a binary (or 1/0 numeric) vector)
auc(actual = ifelse(credit_test$default == "yes", 1, 0),
    predicted = pred[, "yes"])
[1] 0.7509611
#### My addition ####
credit_ipred_model_test_auc <- auc(actual = ifelse(credit_test$default == "yes", 1, 0),
                                   predicted = pred[, "yes"])
credit_ipred_model_test_auc
[1] 0.7509611
Using caret for cross-validating models
4.5 Cross-validate a bagged tree model in caret
Use caret::train() with the "treebag" method to train a model and evaluate it using cross-validated AUC. The caret package allows the user to easily cross-validate any model across any relevant performance metric. In this case, we will use 5-fold cross-validation and evaluate cross-validated AUC (Area Under the ROC Curve).
The credit_train
dataset is in your workspace. You will use this data frame as the training data.
- First specify a ctrl object, which is created using the caret::trainControl() function.
# Specify the training configuration
ctrl <- trainControl(method = "cv",                     # Cross-validation
                     number = 5,                        # 5 folds
                     classProbs = TRUE,                 # For AUC
                     summaryFunction = twoClassSummary) # For AUC
- In the trainControl() function, you can specify many things. We set method = "cv" and number = 5 for 5-fold cross-validation, along with two options that are required if you want to use AUC as the metric: classProbs = TRUE and summaryFunction = twoClassSummary.
# Cross validate the credit model using "treebag" method;
# Track AUC (Area under the ROC curve)
set.seed(1) # for reproducibility
credit_caret_model <- train(default ~ .,
                            data = credit_train,
                            method = "treebag",
                            metric = "ROC",
                            trControl = ctrl)
# Look at the model object
print(credit_caret_model)
Bagged CART
800 samples
16 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 640, 640, 641, 639, 640
Resampling results:
ROC Sens Spec
0.7484483 0.8585934 0.452914
# Inspect the contents of the model list
names(credit_caret_model)
[1] "method" "modelInfo" "modelType" "results" "pred"
[6] "bestTune" "call" "dots" "metric" "control"
[11] "finalModel" "preProcess" "trainingData" "ptype" "resample"
[16] "resampledCM" "perfNames" "maximize" "yLimits" "times"
[21] "levels" "terms" "coefnames" "contrasts" "xlevels"
# Print the CV AUC
$results[,"ROC"] credit_caret_model
[1] 0.7484483
4.6 Generate predictions from the caret model
Generate predictions on a test set for the caret
model.
- First generate predictions on the
credit_test
data frame using thecredit_caret_model
object.
# Generate predictions on the test set
pred <- predict(object = credit_caret_model,
                newdata = credit_test,
                type = "prob")
- After generating test set predictions, use the auc() function from the Metrics package to compute AUC.
# Compute the AUC (`actual` must be a binary (or 1/0 numeric) vector)
auc(actual = ifelse(credit_test$default == "yes", 1, 0),
    predicted = pred[, "yes"])
[1] 0.77301
## My addition
credit_caret_model_test_auc <- auc(actual = ifelse(credit_test$default == "yes", 1, 0),
                                   predicted = pred[, "yes"])
credit_caret_model_test_auc
[1] 0.77301
4.7 Compare test set performance to CV performance
In this exercise, you will print the test set AUC estimates that you computed in the previous exercises. Under the hood, caret's "treebag" method calls ipred::bagging(), so the two estimates should be very similar.
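If you want to verify this yourself, caret exposes each method's metadata; the listing below should show ipred among the packages that the "treebag" method depends on:

# Check which packages caret's "treebag" method requires
caret::getModelInfo("treebag")$treebag$library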
- The credit_ipred_model_test_auc object stores the test set AUC from the model trained using the ipred::bagging() function.
- The credit_caret_model_test_auc object stores the test set AUC from the model trained using the caret::train() function with method = "treebag".
Lastly, we will print the 5-fold cross-validated estimate of AUC that is stored within the credit_caret_model
object. This number will be a more accurate estimate of the true model performance since we have averaged the performance over five models instead of just one.
On small datasets like this one, the difference between test set model performance estimates and cross-validated model performance estimates will tend to be more pronounced. When using small data, it’s recommended to use cross-validated estimates of performance because they are more stable.
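The per-fold values behind that average are stored in the model's resample element (note "resample" in the names() listing above), so you can inspect the spread across folds directly:

# Inspect the per-fold performance behind the CV estimate
credit_caret_model$resample             # ROC, Sens, Spec for each of the 5 folds
mean(credit_caret_model$resample$ROC)   # averages back to the printed CV AUC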
- Print the object
credit_ipred_model_test_auc
.
# Print ipred::bagging test set AUC estimate
print(credit_ipred_model_test_auc)
[1] 0.7509611
- Print the object
credit_caret_model_test_auc
.
# Print caret "treebag" test set AUC estimate
print(credit_caret_model_test_auc)
[1] 0.77301
- Compare these to the 5-fold cross-validated AUC.
# Compare to caret 5-fold cross-validated AUC
$results[, "ROC"] credit_caret_model
[1] 0.7484483