Chapter 4 Bagged Trees
In this chapter, you will learn about bagged trees, an ensemble method that combines many trees instead of relying on a single one.
Introduction to bagged trees
4.1 Advantages of bagged trees
What are the advantages of bagged trees compared to a single tree?

1. Increases the accuracy of the resulting predictions
2. Makes the resulting model easier to interpret
3. Reduces variance by averaging a set of observations

1 and 2 are correct
1 and 3 are correct
2 and 3 are correct
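A hint for option 3, not part of the course exercises: the variance of an average of B independent draws is 1/B times the variance of a single draw, which is why averaging many bootstrapped trees stabilizes predictions (in practice the trees are correlated, so the reduction is smaller, and the averaged ensemble is harder, not easier, to interpret). A quick base-R sketch of the idea:

# Not part of the course: a quick simulation of the variance-reduction idea.
# The variance of an average of 25 independent draws is 1/25 the variance
# of a single draw.
set.seed(1)
single_draws <- replicate(10000, rnorm(1))         # one observation at a time
averaged     <- replicate(10000, mean(rnorm(25)))  # averages of 25 observations
var(single_draws)  # close to 1
var(averaged)      # close to 1/25 = 0.04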
4.2 Train a bagged tree model
Let’s start by training a bagged tree model. You’ll be using the bagging()
function from the ipred
package. The number of bagged trees can be specified using the nbagg
parameter, but here we will use the default (25).
If we want to estimate the model’s accuracy using the “out-of-bag” (OOB) samples, we can set the coob
parameter to TRUE. The OOB samples are the training observations that were not selected into the bootstrapped sample used in training. Since these observations were not used in training, we can use them instead to evaluate the accuracy of the model (done automatically inside the bagging()
function).
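As a small illustration of the OOB idea (not part of the exercise): for a single bootstrap sample, the OOB rows are simply the training rows that were never drawn. On average, about a third (1/e ≈ 37%) of the rows are out-of-bag for each tree.

# Illustration only: identify the OOB rows for one bootstrap sample
set.seed(42)                                      # arbitrary seed for this sketch
n <- 10                                           # pretend training set of 10 rows
boot_idx <- sample(n, size = n, replace = TRUE)   # bootstrap sample (with replacement)
oob_idx  <- setdiff(seq_len(n), boot_idx)         # rows never drawn = OOB rows
boot_idx
oob_idx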
The credit_train and credit_test datasets are already loaded in the workspace.
- Use the bagging() function to train a bagged model.
library(ipred)
# Bagging is a randomized model, so let's set a seed (123) for reproducibility
set.seed(123)
# Train a bagged model
credit_model <- bagging(formula = default ~ .,
                        data = credit_train,
                        coob = TRUE)
- Inspect the model by printing it.
# Print the model
print(credit_model)
Bagging classification trees with 25 bootstrap replications
Call: bagging.data.frame(formula = default ~ ., data = credit_train,
coob = TRUE)
Out-of-bag estimate of misclassification error: 0.265
Evaluating the performance of bagged tree models
4.3 Prediction and confusion matrix
As you saw in the video, a confusion matrix is a very useful tool for examining all possible outcomes of your predictions (true positive, true negative, false positive, false negative).
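Before working with the real predictions, here is a toy sketch (made-up vectors) of how those four outcomes show up in a base-R confusion matrix:

# Toy example: cross-tabulate made-up predictions against made-up truth
actual    <- factor(c("no", "no", "yes", "yes", "no"), levels = c("no", "yes"))
predicted <- factor(c("no", "yes", "yes", "no", "no"), levels = c("no", "yes"))
table(Predicted = predicted, Actual = actual)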
In this exercise, you will predict those who will default using bagged trees. You will also create the confusion matrix using the confusionMatrix()
function from the caret
package.
It’s always good to take a look at the output using the print()
function.
The fitted model object, credit_model, is already in your workspace.
- Use the predict() function with type = "class" to generate predicted labels on the credit_test dataset.
# Generate predicted classes using the model object
class_prediction <- predict(object = credit_model,
                            newdata = credit_test,
                            type = "class") # return classification labels

# This is not in DataCamp but is needed for the last chapter exercise
bag_preds <- predict(object = credit_model,
                     newdata = credit_test,
                     type = "prob")[, "yes"]
# mean(bag_preds)
- Take a look at the predictions using the print() function.
# Print the predicted classes
print(class_prediction)
[1] no no no yes yes no no no no no no yes no no no no no yes
[19] no no no no no no no no no yes no no no no yes no no no
[37] no no yes no no no no no no no no no no no no no yes no
[55] no no no yes no yes no yes yes no no no no yes no no no no
[73] no yes no no yes no no no no yes yes no no no no no no no
[91] no no no yes no no yes no no no no no yes no no no yes no
[109] no no no no no no yes no yes no no no no yes no no no no
[127] yes yes no no no no no no no yes no no no no no no yes no
[145] no no no no no no no no no no no no no no no yes no yes
[163] yes no no yes no no no no yes no no no no no yes no no yes
[181] no yes no no no no yes yes no yes no no no no yes no no no
[199] no no
Levels: no yes
- Calculate the confusion matrix using the
confusionMatrix()
function.
# Calculate the confusion matrix for the test set
confusionMatrix(data = class_prediction,
                reference = credit_test$default)
Confusion Matrix and Statistics
Reference
Prediction no yes
no 121 39
yes 13 27
Accuracy : 0.74
95% CI : (0.6734, 0.7993)
No Information Rate : 0.67
P-Value [Acc > NIR] : 0.0196426
Kappa : 0.3467
Mcnemar's Test P-Value : 0.0005265
Sensitivity : 0.9030
Specificity : 0.4091
Pos Pred Value : 0.7563
Neg Pred Value : 0.6750
Prevalence : 0.6700
Detection Rate : 0.6050
Detection Prevalence : 0.8000
Balanced Accuracy : 0.6560
'Positive' Class : no
4.4 Predict on a test set and compute AUC
In binary classification problems, we can predict numeric values instead of class labels. In fact, class labels are created only after the model produces a raw, numeric predicted value for a test point.
The predicted label is generated by applying a threshold to that value: all test points with a predicted value greater than the threshold get a predicted label of “1”, and points at or below it get a predicted label of “0”.
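A minimal sketch of the thresholding step, using made-up predicted values and the common default cutoff of 0.5:

# Hypothetical predicted values; 0.5 is a common default threshold
scores    <- c(0.24, 0.64, 0.80, 0.32)
threshold <- 0.5
ifelse(scores > threshold, "1", "0")  # "0" "1" "1" "0"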
In this exercise, you will generate predicted values (rather than class labels) on the test set and evaluate performance based on AUC (Area Under the ROC Curve). The AUC is a common metric for evaluating the discriminatory ability of a binary classification model.
- Use the predict() function with type = "prob" to generate numeric predictions on the credit_test dataset.
# Generate predictions on the test set
pred <- predict(object = credit_model,
                newdata = credit_test,
                type = "prob")
- Compute the AUC using the auc() function from the Metrics package.
library(Metrics)
# `pred` is a matrix
class(pred)
[1] "matrix"
# Look at the pred format
head(pred)
no yes
[1,] 0.76 0.24
[2,] 0.68 0.32
[3,] 0.76 0.24
[4,] 0.36 0.64
[5,] 0.20 0.80
[6,] 0.72 0.28
# Compute the AUC (`actual` must be a binary (or 1/0 numeric) vector)
auc(actual = ifelse(credit_test$default == "yes", 1, 0),
    predicted = pred[, "yes"])
[1] 0.7509611
#### My addition ####
credit_ipred_model_test_auc <- auc(actual = ifelse(credit_test$default == "yes", 1, 0),
                                   predicted = pred[, "yes"])
credit_ipred_model_test_auc
[1] 0.7509611
Using caret for cross-validating models
4.5 Cross-validate a bagged tree model in caret
Use caret::train() with the "treebag" method to train a model and evaluate it using cross-validated AUC. The caret package allows the user to easily cross-validate any model across any relevant performance metric. In this case, we will use 5-fold cross-validation and evaluate cross-validated AUC (Area Under the ROC Curve).
The credit_train
dataset is in your workspace. You will use this data frame as the training data.
- First specify a ctrl object, which is created using the caret::trainControl() function.
# Specify the training configuration
ctrl <- trainControl(method = "cv",                     # Cross-validation
                     number = 5,                        # 5 folds
                     classProbs = TRUE,                 # For AUC
                     summaryFunction = twoClassSummary) # For AUC
- In the trainControl() function, you can specify many things. We set method = "cv" and number = 5 for 5-fold cross-validation, along with two options that are required if you want to use AUC as the metric: classProbs = TRUE and summaryFunction = twoClassSummary.
# Cross validate the credit model using "treebag" method;
# Track AUC (Area under the ROC curve)
set.seed(1) # for reproducibility
credit_caret_model <- train(default ~ .,
                            data = credit_train,
                            method = "treebag",
                            metric = "ROC",
                            trControl = ctrl)
# Look at the model object
print(credit_caret_model)
Bagged CART
800 samples
16 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 640, 640, 641, 639, 640
Resampling results:
ROC Sens Spec
0.7484483 0.8585934 0.452914
# Inspect the contents of the model list
names(credit_caret_model)
[1] "method" "modelInfo" "modelType" "results" "pred"
[6] "bestTune" "call" "dots" "metric" "control"
[11] "finalModel" "preProcess" "trainingData" "ptype" "resample"
[16] "resampledCM" "perfNames" "maximize" "yLimits" "times"
[21] "levels" "terms" "coefnames" "contrasts" "xlevels"
# Print the CV AUC
$results[,"ROC"] credit_caret_model
[1] 0.7484483
4.6 Generate predictions from the caret model
Generate predictions on a test set for the caret
model.
- First generate predictions on the
credit_test
data frame using thecredit_caret_model
object.
# Generate predictions on the test set
pred <- predict(object = credit_caret_model,
                newdata = credit_test,
                type = "prob")
- After generating test set predictions, use the auc() function from the Metrics package to compute AUC.
# Compute the AUC (`actual` must be a binary (or 1/0 numeric) vector)
auc(actual = ifelse(credit_test$default == "yes", 1, 0),
    predicted = pred[, "yes"])
[1] 0.77301
## My addition
credit_caret_model_test_auc <- auc(actual = ifelse(credit_test$default == "yes", 1, 0),
                                   predicted = pred[, "yes"])
credit_caret_model_test_auc
[1] 0.77301
4.7 Compare test set performance to CV performance
In this exercise, you will print the test set AUC estimates that you computed in the previous exercises. Under the hood, caret's "treebag" method calls ipred::bagging(), so the two estimates should be very similar.
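If you want to verify this yourself, caret exposes each method's metadata; the listing below should show ipred among the packages that the "treebag" method depends on:

# Check which packages caret's "treebag" method requires
caret::getModelInfo("treebag")$treebag$library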
- The credit_ipred_model_test_auc object stores the test set AUC from the model trained using the ipred::bagging() function.
- The credit_caret_model_test_auc object stores the test set AUC from the model trained using the caret::train() function with method = "treebag".
Lastly, we will print the 5-fold cross-validated estimate of AUC that is stored within the credit_caret_model
object. This number will be a more accurate estimate of the true model performance since we have averaged the performance over five models instead of just one.
On small datasets like this one, the difference between test set model performance estimates and cross-validated model performance estimates will tend to be more pronounced. When using small data, it’s recommended to use cross-validated estimates of performance because they are more stable.
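The per-fold values behind that average are stored in the model's resample element (note "resample" in the names() listing above), so you can inspect the spread across folds directly:

# Inspect the per-fold performance behind the CV estimate
credit_caret_model$resample             # ROC, Sens, Spec for each of the 5 folds
mean(credit_caret_model$resample$ROC)   # averages back to the printed CV AUC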
- Print the object
credit_ipred_model_test_auc
.
# Print ipred::bagging test set AUC estimate
print(credit_ipred_model_test_auc)
[1] 0.7509611
- Print the object
credit_caret_model_test_auc
.
# Print caret "treebag" test set AUC estimate
print(credit_caret_model_test_auc)
[1] 0.77301
- Compare these to the 5-fold cross-validated AUC.
# Compare to caret 5-fold cross-validated AUC
$results[, "ROC"] credit_caret_model
[1] 0.7484483