Chapter 6 Boosted Trees
This chapter introduces the boosting methodology, with a focus on the Gradient Boosting Machine (GBM) algorithm, another popular tree-based ensemble method. Here you'll learn how to train, tune, and evaluate GBM models in R.
Introduction to boosting
6.1 Bagged trees vs. boosted trees
What is the main difference between bagged trees and boosted trees?

- Boosted trees don't perform as well as bagged trees.
- Boosted trees have fewer hyperparameters to tune than bagged trees.
- Boosted trees improve the model fit by considering past fits and bagged trees do not. (correct)

Boosting is an iterative algorithm that considers past fits to improve performance.
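To make "considers past fits" concrete, here is a minimal sketch of the boosting idea on hypothetical toy regression data (the squared-error case, where each new tree is fit to the residuals, i.e. what the ensemble so far got wrong). This is an illustration of the idea, not the full GBM algorithm:

# A minimal boosting loop: each stump is fit to the residuals of past fits
library(rpart)
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)  # toy data (assumption)
dat <- data.frame(x = x, y = y)

pred <- rep(0, nrow(dat))  # ensemble prediction starts at zero
shrinkage <- 0.1           # each tree adds only a small correction
for (i in 1:100) {
  dat$resid <- dat$y - pred                    # what the past fits missed
  stump <- rpart(resid ~ x, data = dat, maxdepth = 1, cp = 0)
  pred <- pred + shrinkage * predict(stump, newdata = dat)
}
# By contrast, bagging fits each tree to a bootstrap sample of y itself,
# independently of the other trees.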
6.2 Train a GBM model
Here you will use the gbm() function to train a GBM classifier to predict loan default. You will train a 10,000-tree GBM on the credit_train dataset, which is pre-loaded into your workspace.

Using such a large number of trees (10,000) is probably not optimal for a GBM model, but we will build more trees than we need and then select the optimal number of trees based on performance-based early stopping. The best GBM model will likely contain fewer trees than we started with.
For binary classification, gbm() requires the response to be encoded as 0/1 (numeric), so we will have to convert from a "no/yes" factor to a 0/1 numeric response column.

Also, the gbm() function requires the user to specify a distribution argument. For a binary classification problem, you should set distribution = "bernoulli". The Bernoulli distribution models a binary response.
- Convert from a "no/yes" factor to a 0/1 numeric response column using the ifelse() function.

# Convert "yes" to 1, "no" to 0
credit_train$default <- ifelse(credit_train$default == "yes", 1, 0)
- Train a 10,000-tree GBM model.
# Train a 10000-tree GBM model
library(gbm)
set.seed(1)
credit_model <- gbm(formula = default ~ .,
                    distribution = "bernoulli",
                    data = credit_train,
                    n.trees = 10000)
# Print the model object
print(credit_model)
gbm(formula = default ~ ., distribution = "bernoulli", data = credit_train,
n.trees = 10000)
A gradient boosted model with bernoulli loss function.
10000 iterations were performed.
There were 16 predictors of which 16 had non-zero influence.
# summary() prints variable importance
summary(credit_model)
var rel.inf
amount amount 21.5769445
age age 11.7268280
credit_history credit_history 11.5590128
purpose purpose 11.4588910
employment_duration employment_duration 8.6462114
months_loan_duration months_loan_duration 6.5047765
checking_balance checking_balance 6.4725509
savings_balance savings_balance 5.4274855
other_credit other_credit 3.9776551
job job 3.6697821
percent_of_income percent_of_income 2.5358374
years_at_residence years_at_residence 2.4104585
housing housing 2.0978782
existing_loans_count existing_loans_count 0.6606161
dependents dependents 0.6518849
phone phone 0.6231871
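As an aside, if you only want the importance table and not the bar chart that summary() draws, summary.gbm() accepts a plotit argument:

# Variable importance table only, no plot
imp <- summary(credit_model, plotit = FALSE)
head(imp, 5)  # top five predictors by relative influence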
Understanding GBM model output
6.3 Predict using a GBM model
The gbm package uses a predict() function to generate predictions from a model, similar to many other machine learning packages in R. When a function like predict() works on many different types of input (a GBM model, an RF model, a GLM model, etc.), it is a generic function that dispatches to a model-specific method behind the scenes. The GBM-specific method is predict.gbm(), but for convenience's sake we can just use predict() (either works).
One thing that's particular to predict.gbm(), however, is that you need to specify the number of trees used in the prediction. There is no default, so you have to specify this manually. For now, we can use the same number of trees that we specified when training the model, which is 10,000 (though this may not be the optimal number to use).
Another argument that you can specify is type, which is only relevant to Bernoulli and Poisson distributed outcomes. When using Bernoulli loss, the returned value is on the log-odds scale by default, and for Poisson it's on the log scale. If instead you specify type = "response", then gbm converts the predicted values back to the same scale as the outcome. This will convert the predicted values into probabilities for Bernoulli and expected counts for Poisson.
- Generate predictions on the test set, using 10,000 trees.
# Since we converted the training response col, let's also convert the test response col
credit_test$default <- ifelse(credit_test$default == "yes", 1, 0)

# Generate predictions on the test set
preds1 <- predict(object = credit_model,
                  newdata = credit_test,
                  n.trees = 10000)
- Generate predictions on the test set using type = "response" and 10,000 trees.
# Generate predictions on the test set (scale to response)
preds2 <- predict(object = credit_model,
                  newdata = credit_test,
                  n.trees = 10000,
                  type = "response")
- Compare the ranges of the two sets of predictions.
# Compare the range of the two sets of predictions
range(preds1)
[1] -8.153800 3.151688
range(preds2)
[1] 0.0002875575 0.9589751924
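The two vectors are related by the inverse logit (logistic) function, so a quick sanity check, using the preds1 and preds2 vectors from above, is:

# preds2 should be the inverse-logit transform of the log-odds in preds1
all.equal(plogis(preds1), preds2)  # expected: TRUE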
6.4 Evaluate test set AUC
Compute the test set AUC of the GBM model for the two sets of predictions. We will notice that they are the same value. That's because AUC is a rank-based metric, and the inverse logit is a monotone transformation: it changes the predicted values but not their rank order, so the AUC is unchanged.
However, if we were to use a scale-aware metric like RMSE to evaluate performance, we would want to make sure we converted the predictions back to the original scale of the response.
The preds1 and preds2 prediction vectors from the previous exercise are pre-loaded into the workspace.
- Compute AUC of the predictions.
library(Metrics)  # provides auc()
auc(actual = credit_test$default, predicted = preds1) # log-odds scale (default)
[1] 0.6944821
- Compute AUC of the predictions (scaled to response).
auc(actual = credit_test$default, predicted = preds2) # probability scale
[1] 0.6944821
- Notice that the AUC is the same!
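To see how a scale-aware metric behaves differently, you could compute RMSE on both vectors; only the probability-scale predictions are meaningfully comparable to the 0/1 response (a quick illustration, assuming the Metrics package is loaded as above):

# RMSE changes with the prediction scale, unlike AUC
rmse(actual = credit_test$default, predicted = preds1) # log-odds scale: inflated
rmse(actual = credit_test$default, predicted = preds2) # probability scale: meaningful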
GBM hyperparameters
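Besides n.trees, the main gbm() hyperparameters are shrinkage (the learning rate), interaction.depth (tree depth), n.minobsinnode (minimum observations per terminal node), and bag.fraction (the subsampling rate). A sketch of a call that sets them explicitly (the values here are illustrative starting points, not tuned choices):

# Illustrative hyperparameter settings (example values, not tuned)
credit_model_tuned <- gbm(formula = default ~ .,
                          distribution = "bernoulli",
                          data = credit_train,
                          n.trees = 10000,
                          shrinkage = 0.01,        # learning rate
                          interaction.depth = 3,   # max depth of each tree
                          n.minobsinnode = 10,     # min obs in a terminal node
                          bag.fraction = 0.5)      # fraction of data per tree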
6.5 Early stopping in GBMs
Use the gbm.perf() function to estimate the optimal number of boosting iterations (aka n.trees) for a GBM model object using both OOB and CV error. When you set out to train a large number of trees in a GBM (such as 10,000) and you use a validation method to determine an earlier (smaller) number of trees, that's called "early stopping." The term "early stopping" is not unique to GBMs; it can describe auto-tuning the number of iterations in any iterative learning algorithm.
- The credit_model object is loaded in the workspace. Use the gbm.perf() function with the "OOB" method to get the optimal number of trees based on the OOB error and store that number as ntree_opt_oob.
# Optimal ntree estimate based on OOB
ntree_opt_oob <- gbm.perf(object = credit_model,
                          method = "OOB",
                          oobag.curve = TRUE)
- Train a new GBM model, this time with cross-validation, so we can get a cross-validated estimate of the optimal number of trees.
# Train a CV GBM model
set.seed(1)
credit_model_cv <- gbm(formula = default ~ .,
                       distribution = "bernoulli",
                       data = credit_train,
                       n.trees = 10000,
                       cv.folds = 10,
                       n.cores = 12)
- Lastly, use the gbm.perf() function with the "cv" method to get the optimal number of trees based on the CV error and store that number as ntree_opt_cv.
# Optimal ntree estimate based on CV
ntree_opt_cv <- gbm.perf(object = credit_model_cv,
                         method = "cv")
- Compare the two numbers.
# Compare the estimates
print(paste0("Optimal n.trees (OOB Estimate): ", ntree_opt_oob))
[1] "Optimal n.trees (OOB Estimate): 82"
print(paste0("Optimal n.trees (CV Estimate): ", ntree_opt_cv))
[1] "Optimal n.trees (CV Estimate): 130"
6.6 OOB vs CV-based early stopping
In the previous exercise, we used OOB error and cross-validated error to estimate the optimal number of trees in the GBM. These are two different ways to estimate the optimal number of trees, so in this exercise we will compare the performance of the models on a test set. (The gbm package itself warns that the OOB method generally underestimates the optimal number of iterations, consistent with the smaller estimate we got: 82 vs. 130.) We can use the same model object to make both of these estimates, since the predict.gbm() function allows you to use any subset of the total number of trees (in our case, the total number is 10,000).
The ntree_opt_oob and ntree_opt_cv objects from the previous exercise (each storing an "optimal" value for n.trees) are loaded in the workspace.
Using the credit_model loaded in the workspace, generate two sets of predictions:
- One using the OOB estimate of n.trees: 82 (stored in ntree_opt_oob)
# Generate predictions on the test set using ntree_opt_oob number of trees
preds1 <- predict(object = credit_model,
                  newdata = credit_test,
                  n.trees = ntree_opt_oob)
- And the other using the CV estimate of n.trees: 130 (stored in ntree_opt_cv)
# Generate predictions on the test set using ntree_opt_cv number of trees
preds2 <- predict(object = credit_model,
                  newdata = credit_test,
                  n.trees = ntree_opt_cv)
# This is not in DataCamp but is needed for the last exercise in this chapter
gbm_preds <- predict(object = credit_model,
                     newdata = credit_test,
                     n.trees = ntree_opt_cv)
# Generate the test set AUCs using the two sets of predictions & compare
auc1 <- auc(actual = credit_test$default, predicted = preds1) # OOB
auc2 <- auc(actual = credit_test$default, predicted = preds2) # CV
# Compare AUC
print(paste0("Test set AUC (OOB): ", auc1))
[1] "Test set AUC (OOB): 0.752261420171868"
print(paste0("Test set AUC (CV): ", auc2))
[1] "Test set AUC (CV): 0.746268656716418"
Model comparison via ROC Curve & AUC
6.7 Compare all models based on AUC
In this final exercise, we will perform a model comparison across all types of models that we've learned about so far: Decision Trees, Bagged Trees, Random Forest and Gradient Boosting Machine (GBM). The models were all trained on the same training set, credit_train, and predictions were made for the credit_test dataset.
We have pre-loaded four sets of test set predictions, generated using the models we trained in previous chapters (one for each model type). The numbers stored in the prediction vectors are the raw predicted values themselves – not the predicted class labels. Using the raw predicted values, we can calculate test set AUC for each model and compare the results.
Loaded in your workspace are four numeric vectors:

- dt_preds
- bag_preds
- rf_preds
- gbm_preds

These predictions were made on credit_test, which is also loaded into the workspace.
- Apply the Metrics::auc() function to each of these vectors to calculate test set AUC. Recall that the higher the AUC, the better the model.
# Generate the test set AUCs using the four sets of predictions & compare
actual <- credit_test$default
dt_auc  <- auc(actual = actual, predicted = dt_preds)
bag_auc <- auc(actual = actual, predicted = bag_preds)
rf_auc  <- auc(actual = actual, predicted = rf_preds)
gbm_auc <- auc(actual = actual, predicted = gbm_preds)
# Print results
sprintf("Decision Tree Test AUC: %.3f", dt_auc)
[1] "Decision Tree Test AUC: 0.661"
sprintf("Bagged Trees Test AUC: %.3f", bag_auc)
[1] "Bagged Trees Test AUC: 0.751"
sprintf("Random Forest Test AUC: %.3f", rf_auc)
[1] "Random Forest Test AUC: 0.791"
sprintf("GBM Test AUC: %.3f", gbm_auc)
[1] "GBM Test AUC: 0.746"
In this case, we see that the Random Forest performed the best on the test set.
6.8 Plot & compare ROC curves
We conclude this course by plotting the ROC curves for all the models (one from each chapter) on the same graph. The ROCR package provides the prediction() and performance() functions, which generate the data required for plotting the ROC curve, given a set of predictions and actual (true) values.
The more “up and to the left” the ROC curve of a model is, the better the model. The AUC performance metric is literally the “Area Under the ROC Curve,” so the greater the area under this curve, the higher the AUC, and the better-performing the model is.
The ROCR package can plot multiple ROC curves on the same plot if you pass several sets of predictions as a list.
- The prediction() function takes as input a list of prediction vectors (one per model) and a corresponding list of true values (one per model, though in our case the models were all evaluated on the same test set, so they all have the same set of true values). The prediction() function returns a "prediction" object, which is then passed to the performance() function.
The performance() function generates the data necessary to plot the curve from the "prediction" object. For the ROC curve, you will also pass along two measures, "tpr" and "fpr".
- Once you have the "performance" object, you can plot the ROC curves using the plot() method. We will add some color to the curves and a legend so we can tell which curves belong to which algorithm.
# List of predictions
preds_list <- list(dt_preds, bag_preds, rf_preds, gbm_preds)

# List of actual values (same for all)
m <- length(preds_list)
actuals_list <- rep(list(credit_test$default), m)

library(ROCR)

# Plot the ROC curves
pred <- prediction(preds_list, actuals_list)
rocs <- performance(pred, "tpr", "fpr")
plot(rocs, col = as.list(1:m), main = "Test Set ROC Curves")
legend(x = "bottomright",
legend = c("Decision Tree", "Bagged Trees", "Random Forest", "GBM"),
fill = 1:m)
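As a cross-check, ROCR can also compute AUC from the same "prediction" object; the values should match the Metrics::auc() results above:

# AUC from ROCR (one value per model, in preds_list order)
auc_perf <- performance(pred, measure = "auc")
unlist(auc_perf@y.values)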