Type complete sentences to answer all questions inside the answer
tags provided in the R Markdown document. Round all numeric answers you report inside the answer tags to four decimal places. Use inline R
code to report numeric answers inside the answer
tags (i.e. do not hard code your numeric answers).
In the article Fitting Percentage of Body Fat to Simple Body Measurements, Johnson (1996) uses the data at http://jse.amstat.org/datasets/fat.dat.txt provided to him by Dr. A. Garth Fischer in a personal communication on October 5, 1994, as a multiple linear regression activity with his students. A subset of the variables at http://jse.amstat.org/datasets/fat.dat.txt is available in the R package mfp by Ambler and Benner (2015) and the data set is used frequently in the text Statistical Regression and Classification by Matloff (2017).
The purpose of this activity is to have the reader create several nonlinear regression models to predict the body fat of males. Load a cleaned version of the data available from https://raw.githubusercontent.com/alanarnholt/MISCD/master/bodyfatClean.csv into your R
session using the read.csv()
function. Use the head()
function to view the first six rows of the data frame bodyfatClean
.
# Type your code and comments inside the code chunk
library(readr)
<- read.csv("https://raw.githubusercontent.com/alanarnholt/MISCD/master/bodyfatClean.csv")
bodyfatClean head(bodyfatClean)
age weight_lbs height_in neck_cm chest_cm abdomen_cm hip_cm thigh_cm knee_cm
1 23 154.25 67.75 36.2 93.1 85.2 94.5 59.0 37.3
2 22 173.25 72.25 38.5 93.6 83.0 98.7 58.7 37.3
3 22 154.00 66.25 34.0 95.8 87.9 99.2 59.6 38.9
4 26 184.75 72.25 37.4 101.8 86.4 101.2 60.1 37.3
5 24 184.25 71.25 34.4 97.3 100.0 101.9 63.2 42.2
6 24 210.25 74.75 39.0 104.5 94.4 107.8 66.0 42.0
ankle_cm biceps_cm forearm_cm wrist_cm brozek_C bmi_C age_sq abdomen_wrist
1 21.9 32.0 27.4 17.1 12.6 23.6 529 68.1
2 23.4 30.5 28.9 18.2 6.9 23.3 484 64.8
3 24.0 28.8 25.2 16.6 24.6 24.7 484 71.3
4 22.8 32.4 29.4 18.2 10.9 24.9 676 68.2
5 24.0 32.2 27.7 17.7 27.8 25.5 576 82.3
6 25.6 35.7 30.6 18.8 20.5 26.5 576 75.6
am
1 181.9365
2 169.1583
3 195.5067
4 182.7203
5 190.6993
6 190.7289
Use the glimpse() function from the dplyr package written by Wickham et al. (2020) to view the structure of bodyfatClean.
# Type your code and comments inside the code chunk
library(dplyr)
glimpse(bodyfatClean)
Rows: 251
Columns: 18
$ age <int> 23, 22, 22, 26, 24, 24, 26, 25, 25, 23, 26, 27, 32, 30, …
$ weight_lbs <dbl> 154.25, 173.25, 154.00, 184.75, 184.25, 210.25, 181.00, …
$ height_in <dbl> 67.75, 72.25, 66.25, 72.25, 71.25, 74.75, 69.75, 72.50, …
$ neck_cm <dbl> 36.2, 38.5, 34.0, 37.4, 34.4, 39.0, 36.4, 37.8, 38.1, 42…
$ chest_cm <dbl> 93.1, 93.6, 95.8, 101.8, 97.3, 104.5, 105.1, 99.6, 100.9…
$ abdomen_cm <dbl> 85.2, 83.0, 87.9, 86.4, 100.0, 94.4, 90.7, 88.5, 82.5, 8…
$ hip_cm <dbl> 94.5, 98.7, 99.2, 101.2, 101.9, 107.8, 100.3, 97.1, 99.9…
$ thigh_cm <dbl> 59.0, 58.7, 59.6, 60.1, 63.2, 66.0, 58.4, 60.0, 62.9, 63…
$ knee_cm <dbl> 37.3, 37.3, 38.9, 37.3, 42.2, 42.0, 38.3, 39.4, 38.3, 41…
$ ankle_cm <dbl> 21.9, 23.4, 24.0, 22.8, 24.0, 25.6, 22.9, 23.2, 23.8, 25…
$ biceps_cm <dbl> 32.0, 30.5, 28.8, 32.4, 32.2, 35.7, 31.9, 30.5, 35.9, 35…
$ forearm_cm <dbl> 27.4, 28.9, 25.2, 29.4, 27.7, 30.6, 27.8, 29.0, 31.1, 30…
$ wrist_cm <dbl> 17.1, 18.2, 16.6, 18.2, 17.7, 18.8, 17.7, 18.8, 18.2, 19…
$ brozek_C <dbl> 12.6, 6.9, 24.6, 10.9, 27.8, 20.5, 19.0, 12.7, 5.1, 12.0…
$ bmi_C <dbl> 23.6, 23.3, 24.7, 24.9, 25.5, 26.5, 26.2, 23.5, 24.5, 25…
$ age_sq <int> 529, 484, 484, 676, 576, 576, 676, 625, 625, 529, 676, 7…
$ abdomen_wrist <dbl> 68.1, 64.8, 71.3, 68.2, 82.3, 75.6, 73.0, 69.7, 64.3, 69…
$ am <dbl> 181.9365, 169.1583, 195.5067, 182.7203, 190.6993, 190.72…
Now that you have seen the structure of the data and have studied the research question, answer the following questions.
How many observations and variables are in bodyfatClean?
# Type your code and comments inside the code chunk
dim(bodyfatClean)
[1] 251 18
There are 251 observations and 18 variables in the bodyfatClean
data set.
The name of the response variable is brozek_C
.
There are 17 predictor variables available to use in this data set.
All of the predictor variables are numeric, which means that none of them are categorical.
When building a predictive model with a sufficiently large data set, it is common practice to hold out some fraction (usually less than 50%) of the data as a test set. It is difficult to provide a general rule for the size of the training
and testing
sets as the ideal split depends on the signal to noise ratio in the data (Hastie, Tibshirani, and Friedman 2009).
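To make the idea of a holdout split concrete, the short sketch below (not part of the assignment) performs a simple random 80/20 split with base R, using the built-in mtcars data purely as a stand-in; the createDataPartition() function introduced next additionally balances the split on the response.
# A minimal sketch of a random 80/20 holdout split (illustration only)
set.seed(1)                                              # for reproducibility
n <- nrow(mtcars)                                        # mtcars is only a stand-in data set
train_rows <- sample(seq_len(n), size = floor(0.8 * n))  # indices of the training rows
train_set <- mtcars[train_rows, ]                        # roughly 80% of the data
test_set  <- mtcars[-train_rows, ]                       # the held-out 20%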
Use the createDataPartition() function from the caret package written by Kuhn (2019) to partition the data into training and testing sets.
For illustration purposes, the Boston data set from the MASS package written by Ripley (2019) is used to demonstrate various steps in predictive model building. The Boston
help file indicates the data set consists of 506 observations on 14 different variables for houses in Boston collected in 1978. To open the Boston
help file, type ?Boston
at the R prompt once the MASS package has been loaded. The goal in this example is to predict the median house price (medv
) in Boston. The Boston data set is divided into a training set containing roughly 80% of the observations and a testing set containing roughly 20% of the observations. Before calling the createDataPartition()
function, it is important to set a seed to ensure the data partition is reproducible.
The arguments y, p, list, and times can be used with the createDataPartition() function. These arguments represent a vector of outcomes (Boston$medv), the percentage of data that goes to training (0.80), whether the results should be returned in a list (FALSE), and the number of partitions to create (1), respectively. The result from using createDataPartition() is a vector of indices one can use to create the training and testing sets.
library(caret) # load the caret package
library(MASS) # load MASS package
set.seed(3178) # set seed for reproducibility
trainIndexB <- createDataPartition(y = Boston$medv,
                                   p = 0.80,
                                   list = FALSE,
                                   times = 1)
trainingB <- Boston[trainIndexB, ]
testingB <- Boston[-trainIndexB, ]
dim(trainingB) # Check the dimension of the training set
[1] 407 14
dim(testingB) # Check the dimension of the testing set
[1] 99 14
Use createDataPartition() to partition bodyfatClean into training and testing sets where roughly 80% of the data is used for training and roughly 20% of the data is used for testing. To ensure reproducibility of the partition, use set.seed(314). The response variable should be brozek_C (the computed Brozek body fat based on the reported density).
# Type your code and comments inside the code chunk
library(caret)
library(MASS)
set.seed(314)
bodyfatCleanIndex <- createDataPartition(y = bodyfatClean$brozek_C,
                                         p = 0.80,
                                         list = FALSE,
                                         times = 1)
trainingBodyfatClean <- bodyfatClean[bodyfatCleanIndex, ]
testingBodyfatClean <- bodyfatClean[-bodyfatCleanIndex, ]
Use the dim() function to verify the sizes of the training and testing data sets.
# Type your code and comments inside the code chunk
dim(trainingBodyfatClean)
[1] 203 18
dim(testingBodyfatClean)
[1] 48 18
The training data set has 203 observations and the testing data set has 48 observations.
Some algorithms work better when the predictors are on the same scale. This section considers the preProcess() function from the caret package to find potentially helpful transformations of the training predictors. Three different transformations are considered: center, scale, and BoxCox. A center transform computes the mean of a variable and subtracts the computed mean from each value of the variable. A scale transform computes the standard deviation of a variable and divides each value of the variable by the computed standard deviation. Using both a center and a scale transform standardizes a variable. That is, using both center and scale on a variable creates a variable with a mean of 0 and a standard deviation of 1. When all values of a variable are positive, a BoxCox transform will reduce the skew of a variable, making it more Gaussian.
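As a quick check of what center and scale do (not part of the assignment), the sketch below standardizes a small toy vector by hand and compares the result with base R's scale(); for reference, the BoxCox transform estimates a power \(\lambda\) and uses \((x^\lambda - 1)/\lambda\) when \(\lambda \ne 0\) and \(\log(x)\) when \(\lambda = 0\).
# Center and scale by hand on a toy vector (illustration only)
x <- c(2, 4, 6, 8, 10)
x_centered <- x - mean(x)                         # center: subtract the mean
x_standardized <- x_centered / sd(x)              # scale: divide by the standard deviation
all.equal(x_standardized, as.numeric(scale(x)))   # TRUE: matches base R's scale()
c(mean(x_standardized), sd(x_standardized))       # mean 0 and standard deviation 1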
The R
code below applies a center
, scale
, and BoxCox
transform to all the predictors in trainingB
(the training set for the Boston data) and stores the results in pp_trainingB
. The computed transformations are applied to both the trainingB
and the testingB
data sets using the predict()
function with the results stored in the objects trainingTransB
and testingTransB
, respectively. Note that in the Boston data set the response (medv
) is the last column (\(14^{\text{th}}\)) of the training data frame and is removed before pre-processing with trainingB[ , -14]
.
# pp_trainingB --- pre-processed Training Boston
pp_trainingB <- preProcess(trainingB[ , -14],
                           method = c("center", "scale", "BoxCox"))
pp_trainingB
Created from 407 samples and 13 variables
Pre-processing:
- Box-Cox transformation (11)
- centered (13)
- ignored (0)
- scaled (13)
Lambda estimates for Box-Cox transformation:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.00 -0.15 0.20 0.40 0.90 2.00
trainingTransB <- predict(pp_trainingB, trainingB)
testingTransB <- predict(pp_trainingB, testingB)
Your turn now to work with the bodyfatClean
data frame.
Determine the column number of bodyfatClean where brozek_C is stored.
# Type your code and comments inside the code chunk
which(colnames(bodyfatClean) == "brozek_C")
[1] 14
The column number of bodyfatClean
where brozek_C
is stored is 14.
Use the preProcess() function to transform the predictors that are in the training data set created in Problem 6. Specifically, pass a vector with "center", "scale", and "BoxCox" to the method = argument of preProcess(). Make sure not to transform the response (brozek_C). Store the results in an object named pp_training.
# Type your code and comments inside the code chunk
pp_training <- preProcess(trainingBodyfatClean[, -14],
                          method = c("center", "scale", "BoxCox"))
pp_training
Created from 203 samples and 17 variables
Pre-processing:
- Box-Cox transformation (17)
- centered (17)
- ignored (0)
- scaled (17)
Lambda estimates for Box-Cox transformation:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.6000 -1.0000 -0.4000 -0.2706 0.3000 2.0000
Use the predict() function to construct a transformed training set and a transformed testing set. Name the new transformed data sets trainingTrans and testingTrans, respectively.
# Type your code and comments inside the code chunk
trainingTrans <- predict(pp_training, trainingBodyfatClean)
testingTrans <- predict(pp_training, testingBodyfatClean)
\(k\)-fold cross validation divides the data into \(k\) subsets and performs the holdout method \(k\) times. Specifically, one of the \(k\) subsets is used as the test set and the other \(k - 1\) subsets are put together to form a training set. The average MSE across all \(k\) trials is computed and is denoted \(CV_{(k)}\).
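The sketch below (illustration only, using a linear model on the built-in mtcars data) shows how \(CV_{(k)}\) could be computed by hand; the caret functions introduced next automate this bookkeeping.
# 5-fold cross-validation "by hand" (illustration only)
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))           # randomly assign each row to a fold
cv_mse <- numeric(k)
for (j in 1:k) {
  fit <- lm(mpg ~ wt + hp, data = mtcars[folds != j, ])        # train on the other k - 1 folds
  pred <- predict(fit, newdata = mtcars[folds == j, ])         # predict the held-out fold
  cv_mse[j] <- mean((mtcars$mpg[folds == j] - pred)^2)         # test MSE for this fold
}
mean(cv_mse)  # CV_(k): the average MSE across the k folds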
The trainControl()
function from caret specifies the resampling procedure to be used inside the train()
function. Resampling procedures include \(k\)-fold cross-validation (once or repeated), leave-one-out cross-validation, and bootstrapping. The following R code creates a myControlB object that will signal a 10-fold cross-validation scheme repeated five times (50 resamples in total) to the train() function for the Boston data set. Note that the argument savePredictions = "final" saves the hold-out predictions for the optimal tuning parameters.
<- trainControl(method = "repeatedcv",
myControlB number = 10,
repeats = 5,
savePredictions = "final")
Use the trainControl() function to define the resampling method (repeated cross-validation), the number of resampling iterations/folds (10), and the number of repeats or complete sets to generate (5), storing the results in the object myControl.
# Type your code and comments inside the code chunk
# Define the type of resampling
<- trainControl(method = "repeatedcv",
myControl number = 10,
repeats = 5,
savePredictions = "final")
Tree-based methods for regression problems partition the predictor space into \(J\) distinct and non-overlapping regions, \(R_1, R_2, \ldots, R_J\). To make a prediction for a particular observation, the mean of the training observations in the region to which the observation belongs is computed. Tree-based methods are easy to understand and interpret; however, they also tend to overfit the training data and are not as competitive as random forests. Basic trees are introduced as a building block for random forests. The goal is to find regions \(R_1, R_2, \ldots, R_J\) that minimize the RSS, denoted by
\[\begin{equation} \text{RSS} = \sum_{j = 1}^J\sum_{i \in R_j}(y_i - \hat{y}_{R_j})^2, \tag{5.1} \end{equation}\]
where \(\hat{y}_{R_j}\) is the mean response for the training observations within the \(j^{\text{th}}\) region. Since it is not possible to consider every possible partition of the predictor space, a recursive binary splitting algorithm is generally employed. Recursive binary splitting first selects the predictor \(X_j\) and the cutpoint \(s\) such that splitting the predictor space into the regions \(\{X|X_j < s\}\) and \(\{X|X_j \ge s\}\) leads to the greatest reduction in RSS (James et al. 2017). The same process is used to find the optimal cutpoint within each region that minimizes the RSS. This process continues until a predefined stopping criterion is met. A common stopping criterion is to refuse a split when the count in the proposed region drops below a threshold such as 5 or 10. The smaller the count for a region, the more likely the tree will be overfit to the training data and perform poorly on testing data.
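A minimal sketch of the first step of recursive binary splitting is given below (illustration only, again using mtcars as a toy data set): every candidate cutpoint for a single predictor is scanned and the one that minimizes the RSS in (5.1) is kept.
# RSS of splitting predictor x at cutpoint s (illustration only)
split_rss <- function(x, y, s) {
  left  <- y[x < s]
  right <- y[x >= s]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}
x <- mtcars$wt                                      # toy predictor
y <- mtcars$mpg                                     # toy response
cuts <- sort(unique(x))[-1]                         # candidate cutpoints (skip the smallest value)
rss <- sapply(cuts, function(s) split_rss(x, y, s))
cuts[which.min(rss)]                                # the cutpoint giving the greatest RSS reduction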
To fit a model with a particular algorithm, the name of the algorithm is given to the method
argument of the train()
function. The train()
function accepts a formula interface provided the data is also specified in the function. The R
code below fits a Regression Tree model by regressing medv
on the predictor age
in the training
data set. The preferred way to train a model is by passing the response vector to the y
argument and a data frame of the predictors or a matrix of the predictors to the x
argument of train()
. Below the train()
function is used with method = "rpart"
and the object myControlB
is provided to the trControl
argument.
# Regression Tree with one predictor (age)
set.seed(31)
mod_TRB <- train(y = trainingTransB$medv,
                 x = data.frame(age = trainingTransB[ , 7]),
                 trControl = myControlB,
                 method = "rpart",
                 tuneLength = 10)
mod_TRB
CART
407 samples
1 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 366, 367, 367, 365, 367, 366, ...
Resampling results across tuning parameters:
cp RMSE Rsquared MAE
0.002934463 9.299940 0.07891837 6.727907
0.002983166 9.295840 0.07939126 6.722688
0.004098588 9.200440 0.08487121 6.651356
0.004613882 9.111639 0.09263128 6.569656
0.004830949 9.062123 0.09769315 6.530259
0.005080248 9.027292 0.10173868 6.504152
0.005456624 8.947554 0.10882363 6.432420
0.006964946 8.792057 0.12656586 6.301637
0.034187558 8.804734 0.11621264 6.311815
0.135909724 9.136989 0.06740295 6.603690
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.006964946.
Here, the train()
function from caret computes the complexity parameter (\(c_p\)) that results in the smallest RMSE when using only age
from trainingTransB
. As the complexity parameter increases from zero, branches are pruned from the tree, reducing the complexity of the model.
Note that mod_TRB contains 10 candidate models; the final model was chosen with the \(c_p\) value of 0.0069649, corresponding to the model with the minimum RMSE (8.7920571).
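To see the pruning effect of \(c_p\) directly, the sketch below (not part of the assignment) grows a deliberately deep tree on the same transformed training data and prunes it back with rpart's prune() at increasing values of the complexity parameter.
# How increasing cp prunes a tree (illustration only)
library(rpart)
deep_tree <- rpart(medv ~ age, data = trainingTransB, cp = 0.001)       # small cp: many splits
pruned_01 <- prune(deep_tree, cp = 0.01)                                # larger cp: fewer splits
pruned_10 <- prune(deep_tree, cp = 0.10)                                # even larger cp: fewer still
c(nrow(deep_tree$frame), nrow(pruned_01$frame), nrow(pruned_10$frame))  # node counts shrink as cp grows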
Once the optimal complexity parameter is determined from cross-validation, a regression tree is grown using the transformed age
in trainingTransB
with the R
code below.
To get the final model, use the rpart()
function from the rpart package as shown below. Notice that cp = mod_TRB$bestTune
picks the final model from mod_TRB
.
library(rpart)
set.seed(31)
mod_TRBG <- rpart(medv ~ age,
                  data = trainingTransB, cp = mod_TRB$bestTune)
mod_TRBG
n= 407
node), split, n, deviance, yval
* denotes terminal node
1) root 407 35195.280 22.58968
2) age>=0.8629704 118 9411.559 17.22458 *
3) age< 0.8629704 289 21000.340 24.78028
6) age>=-1.04758 195 13941.470 23.36359 *
7) age< -1.04758 94 5855.626 27.71915 *
Finally, to visualize the final model use the rpart.plot()
function from the rpart.plot package.
library(rpart.plot)
rpart.plot(mod_TRBG)
Consider the mod_TRBG output and the resulting plot from using rpart.plot(). Note that age has negative values; this is due to the use of transformed variables. There are n = 118 homes with a value of the (transformed) age greater than or equal to 0.86. The average medv for these 118 homes is 17.22 thousand dollars. There are n = 195 homes with age under 0.86 but greater than or equal to -1.05. These 195 homes have an average medv of 23.36 thousand dollars. Finally, there are n = 94 homes with age under -1.05, and the average medv for these 94 homes is 27.72 thousand dollars.
Even though the transformed variables generally produce better models, the interpretations can be rather difficult. In such situations, it may be advisable to use untransformed variables.
# Regression Tree with untransformed variables.
set.seed(31)
mod_TRBU <- train(y = trainingB$medv,
                  x = data.frame(age = trainingB[ , 7]),
                  trControl = myControlB,
                  method = "rpart",
                  tuneLength = 10)
mod_TRBU
CART
407 samples
1 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 366, 367, 367, 365, 367, 366, ...
Resampling results across tuning parameters:
cp RMSE Rsquared MAE
0.002934463 9.297298 0.08016767 6.717037
0.002983166 9.293197 0.08063547 6.711817
0.004098588 9.203189 0.08532768 6.649924
0.004613882 9.114668 0.09304447 6.566351
0.004830949 9.066763 0.09778405 6.530999
0.005080248 9.031949 0.10180704 6.504892
0.005456624 8.952166 0.10893979 6.433160
0.006964946 8.795055 0.12691561 6.300708
0.034187558 8.809281 0.11588785 6.315299
0.135909724 9.140502 0.06718872 6.605706
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.006964946.
library(rpart)
set.seed(31)
mod_TRGU <- rpart(medv ~ age, data = trainingB)
mod_TRGU
n= 407
node), split, n, deviance, yval
* denotes terminal node
1) root 407 35195.280 22.58968
2) age>=92.3 118 9411.559 17.22458 *
3) age< 92.3 289 21000.340 24.78028
6) age>=41.3 195 13941.470 23.36359 *
7) age< 41.3 94 5855.626 27.71915 *
library(rpart.plot)
rpart.plot(mod_TRGU)
Using the untransformed data, the explanation is simple and practical. Specifically, there are n = 118 homes with an age value of at least 92.3. The average medv for these 118 homes is 17.22 thousand dollars. There are n = 195 homes with age under 92.3 but at least 41.3. These 195 homes have an average medv of 23.36 thousand dollars. Finally, there are n = 94 homes with age under 41.3, and the average medv for these 94 homes is 27.72 thousand dollars.
Note: From here forward we will use the untransformed predictors (data) in the examples.
To get a Tree model with all the predictor variables, modify the previous R
code to use the untransformed predictors as follows:
# Regression Tree with all predictors
set.seed(31)
mod_TRBall <- train(y = trainingB$medv,
                    x = trainingB[ , 1:13], # equivalently trainingB[ , -14]
                    trControl = myControlB,
                    method = "rpart",
                    tuneLength = 10)
mod_TRBall
CART
407 samples
13 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 366, 367, 367, 365, 367, 366, ...
Resampling results across tuning parameters:
cp RMSE Rsquared MAE
0.008491945 4.651223 0.7465801 3.208089
0.010367693 4.739356 0.7382265 3.271009
0.013264480 4.810732 0.7306569 3.359105
0.018454204 4.899144 0.7185652 3.437962
0.024923514 5.070968 0.6983436 3.603513
0.025916903 5.082143 0.6965285 3.622541
0.032639407 5.104272 0.6906480 3.678620
0.067254001 5.652936 0.6309641 4.064486
0.163979782 6.433538 0.5225446 4.721590
0.479978887 8.571615 0.3139804 6.236646
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.008491945.
library(rpart)
set.seed(31)
mod_TRBallG <- rpart(medv ~ .,
                     data = trainingB, cp = mod_TRBall$bestTune)
mod_TRBallG
n= 407
node), split, n, deviance, yval
* denotes terminal node
1) root 407 35195.2800 22.58968
2) rm< 6.941 345 13806.7200 19.85855
4) lstat>=14.4 140 2642.5380 14.90929
8) nox>=0.607 83 1059.8090 12.83494
16) lstat>=19.645 45 389.0711 10.65556 *
17) lstat< 19.645 38 203.8905 15.41579 *
9) nox< 0.607 57 705.5393 17.92982 *
5) lstat< 14.4 205 5392.8660 23.23854
10) lstat>=9.95 91 590.5690 20.58901 *
11) lstat< 9.95 114 3653.5440 25.35351
22) age< 88.8 106 1688.1310 24.57642
44) rm< 6.543 69 655.7194 22.76377 *
45) rm>=6.543 37 382.9108 27.95676 *
23) age>=88.8 8 1053.2600 35.65000 *
3) rm>=6.941 62 4495.5700 37.78710
6) rm< 7.437 36 1048.7230 32.53611 *
7) rm>=7.437 26 1079.8230 45.05769
14) ptratio>=17.6 7 465.9686 38.88571 *
15) ptratio< 17.6 19 248.9611 47.33158 *
library(rpart.plot)
rpart.plot(mod_TRBallG)
The predict()
function is used to obtain the fitted/predicted values of medv
using the testing data (testingB
).
mod_TRBallG_pred <- predict(mod_TRBallG, newdata = testingB)
Next, the RMSE()
function returns the root mean square error for the regression tree model using the testing
data.
RMSE(pred = mod_TRBallG_pred, obs =testingB$medv)
[1] 4.472291
NOTE: There is a difference between the training RMSE (4.6512229) and the testing RMSE (4.4722907). What does this suggest?
Your turn now to work with the bodyfatClean
data frame.
Note: Many statistical algorithms work better on transformed variables; however, the user, whether a nurse, physical therapist, or physician, should be able to use your proposed model without resorting to a spreadsheet or calculator.
Use the train() function with method = "rpart", tuneLength = 10, along with myControl as the trControl to fit a regression tree named mod_TR. Use set.seed(42) for reproducibility. The goal is to predict body fat using the training data created in Problem 6. Use brozek_C as the response and use all the predictors. (Do not use the transformed predictors due to interpretation issues.)
# Type your code and comments inside the code chunk
# Regression Tree with all predictors
set.seed(42)
mod_TR <- train(y = trainingBodyfatClean$brozek_C,
                x = trainingBodyfatClean[ , -14],
                trControl = myControl,
                method = "rpart",
                tuneLength = 10)
mod_TR
CART
203 samples
17 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 183, 183, 183, 182, 183, 183, ...
Resampling results across tuning parameters:
cp RMSE Rsquared MAE
0.008671039 4.765842 0.6339009 3.898039
0.008964455 4.778061 0.6310500 3.905037
0.011322788 4.821892 0.6208741 3.942897
0.016328028 4.868693 0.6157095 4.007299
0.019278376 4.822923 0.6201090 3.953596
0.019447925 4.821531 0.6202860 3.952612
0.029918113 4.946138 0.5990657 3.964034
0.036733177 4.987373 0.5916788 3.989885
0.152240810 5.893586 0.4332022 4.740904
0.478289933 6.719350 0.3665078 5.461712
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.008671039.
RMSE was used to pick the best submodel. The optimal complexity parameter is cp = 0.008671039, which corresponds to an RMSE of 4.765842.
Use the rpart() function from the rpart package written by Therneau and Atkinson (2019) to build the regression tree using the complexity parameter (\(c_p\)) value from mod_TR above. Name this tree mod_TRG.
# Type your code and comments inside the code chunk
mod_TRG <- rpart(brozek_C ~ . - age - abdomen_cm,
                 data = trainingBodyfatClean,
                 control = rpart.control(cp = mod_TR$bestTune))
Use the rpart.plot() function from the rpart.plot package by Milborrow (2019) to graph mod_TRG.
# Type your code and comments inside the code chunk
rpart.plot(mod_TRG)
Which predictors are used in the graph of mod_TRG? The predictors that are used in the graph of mod_TRG are abdomen_wrist, hip_cm, weight_lbs, and height_in.
The tree has 11 terminal nodes.
Compute the RMSE for mod_TRG using the testing data set.
# Type your code and comments inside the code chunk
RMSE(pred = predict(mod_TRG, newdata = testingBodyfatClean), obs = testingBodyfatClean$brozek_C)
[1] 4.521677
A random forest is a collection of decorrelated trees. A tree in a random forest is constructed by considering only a random sample of the possible predictors at each split. The number of predictors considered at each split is generally the square root of the total number of predictors in the data set (\(\sqrt{p}\)). The data used to grow a particular tree is not the training data but a bootstrap sample of the training data. The resulting trees are decorrelated since the predictors are selected at random; individual trees have high variance and low bias. By averaging the results from many trees, the results from a random forest are less variable than those of a single tree, yet remain relatively unbiased.
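The sketch below (illustration only; not how caret fits method = "rf") mimics the two sources of randomness described above by growing a handful of rpart trees, each on a bootstrap sample of trainingB and a random subset of roughly \(\sqrt{p}\) predictors, then averaging their test-set predictions. For simplicity, the predictors are sampled once per tree rather than at every split as a true random forest would do.
# A hand-rolled "forest" of decorrelated trees (illustration only)
library(rpart)
set.seed(1)
p <- 13                                            # number of Boston predictors
mtry <- floor(sqrt(p))                             # predictors considered per tree here
n_tree <- 25
preds <- sapply(1:n_tree, function(b) {
  rows <- sample(nrow(trainingB), replace = TRUE)  # bootstrap sample of the rows
  vars <- sample(names(trainingB)[1:13], mtry)     # random subset of the predictors
  fit <- rpart(reformulate(vars, response = "medv"), data = trainingB[rows, ])
  predict(fit, newdata = testingB)                 # test-set predictions for this tree
})
forest_pred <- rowMeans(preds)                     # averaging across trees reduces variance
RMSE(pred = forest_pred, obs = testingB$medv)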
To create a random forest, which is a collection of decorrelated trees, the method argument in the train() function is set to either rf or ranger. The following example shows the R code for a random forest model to predict medv in the Boston data.
set.seed(31)
mod_RF <- train(y = trainingB$medv,
                x = trainingB[, 1:13],
                trControl = myControlB,
                tuneLength = 4,
                method = "rf")
mod_RF
Random Forest
407 samples
13 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 366, 367, 367, 365, 367, 366, ...
Resampling results across tuning parameters:
mtry RMSE Rsquared MAE
2 3.664810 0.8550449 2.458839
5 3.281327 0.8749608 2.248277
9 3.293472 0.8710413 2.251443
13 3.363650 0.8647896 2.299084
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 5.
Using the default arguments for rf
returns an average RMSE value of 3.2813268 when the parameter mtry = 5
.
Next, the predict() function is used to predict the median house value (medv) using the testing data (testingB).
RF_pred <- predict(mod_RF, newdata = testingB)
Next, the RMSE()
function is used to determine the root mean square error for the random forest model using the testing data.
RMSE(pred = RF_pred, obs = testingB$medv)
[1] 2.889238
Use the train() function with method = "rf", tuneLength = 4, along with myControl as the trControl to fit a random forest named mod_RF2. Use set.seed(42) for reproducibility.
# Type your code and comments inside the code chunk
set.seed(42)
mod_RF2 <- train(y = trainingBodyfatClean$brozek_C,
                 x = trainingBodyfatClean[, -14],
                 trControl = myControl,
                 tuneLength = 4,
                 method = "rf")
mod_RF2
Random Forest
203 samples
17 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 183, 183, 183, 182, 183, 183, ...
Resampling results across tuning parameters:
mtry RMSE Rsquared MAE
2 4.428938 0.6839297 3.609938
7 4.318894 0.6934054 3.583383
12 4.337575 0.6906621 3.597050
17 4.347381 0.6887710 3.601198
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 7.
Use RMSE() in conjunction with the predict() function to find the root mean square error for the testing data.
# Type your code and comments inside the code chunk
RF2_pred <- predict(mod_RF2, newdata = testingBodyfatClean)
RMSE(pred = RF2_pred, obs = testingBodyfatClean$brozek_C)
[1] 3.973305
One of the simplest methods to provide a nonparametric estimate of f in a regression setting is using \(k\)-nearest neighbors (\(k\)-NN). \(k\)-NN is an algorithm that computes a weighted average of the \(k\) nearest neighbors. Commonly used distance measures for continuous variables include: Euclidean, Manhattan, and Minkowski. The Euclidean, Manhattan, and Minkowski distances between two points \(X = (x_1, x_2, \dots, x_n)\) and \(Y = (y_1, y_2, \dots, y_n)\) are defined in (5.2), (5.3), and (5.4), respectively. Note that using p = 1 in (5.4) results in (5.3), and using p = 2 in (5.4) results in (5.2).
\[\begin{equation} D_{Euclidean}(X,Y) = \sqrt{\sum_{i = 1}^n (x_i - y_i)^2}, \tag{5.2} \end{equation}\]
\[\begin{equation} D_{Manhattan}(X,Y) = \sum_{i = 1}^n |(x_i - y_i)|, \tag{5.3} \end{equation}\]
\[\begin{equation} D_{Minkowski}(X,Y) = \left[\sum_{i = 1}^n |(x_i - y_i)|^p\right]^{1/p}, \tag{5.4} \end{equation}\]
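The short sketch below (not part of the assignment) evaluates (5.2), (5.3), and (5.4) for two small vectors and confirms that the Minkowski distance reduces to the Manhattan distance when p = 1 and to the Euclidean distance when p = 2.
# Distance measures on two toy vectors (illustration only)
X <- c(1, 2, 3)
Y <- c(4, 0, 3)
euclidean <- sqrt(sum((X - Y)^2))                          # (5.2)
manhattan <- sum(abs(X - Y))                               # (5.3)
minkowski <- function(X, Y, p) sum(abs(X - Y)^p)^(1 / p)   # (5.4)
c(euclidean, minkowski(X, Y, 2))                           # both equal 3.6056
c(manhattan, minkowski(X, Y, 1))                           # both equal 5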
The following R
code creates a \(k\)-nearest neighbors model using the default arguments for method = "knn"
.
set.seed(31)
mod_KNN <- train(y = trainingB$medv,
                 x = trainingB[, -14],
                 trControl = myControlB,
                 tuneLength = 10,
                 method = "knn")
mod_KNN
k-Nearest Neighbors
407 samples
13 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 366, 367, 367, 365, 367, 366, ...
Resampling results across tuning parameters:
k RMSE Rsquared MAE
5 6.322566 0.5557056 4.436682
7 6.433252 0.5337296 4.545029
9 6.548103 0.5116143 4.633764
11 6.683594 0.4886915 4.721113
13 6.799718 0.4703037 4.832851
15 6.978860 0.4422369 4.976440
17 7.094002 0.4247272 5.084296
19 7.184806 0.4107144 5.159167
21 7.290436 0.3928045 5.246419
23 7.396043 0.3750593 5.343585
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 5.
plot(mod_KNN)
Using the default arguments for knn
returns an average RMSE value of 6.3225663 when the parameter k = 5
(number of neighbors).
The RMSE() function is used in conjunction with the predict() function to find the root mean square error for the testing data.
mod_KNN_pred <- predict(mod_KNN, newdata = testingB)
RMSE(pred = mod_KNN_pred, obs = testingB$medv)
[1] 6.588785
Use the train() function with method = "knn", tuneLength = 10, along with myControl as the trControl to fit a \(k\)-nearest neighbors model named mod_KNN2. Use set.seed(42) for reproducibility.
# Type your code and comments inside the code chunk
set.seed(42)
mod_KNN2 <- train(y = trainingBodyfatClean$brozek_C,
                  x = trainingBodyfatClean[, -14],
                  trControl = myControl,
                  tuneLength = 10,
                  method = "knn")
mod_KNN2
k-Nearest Neighbors
203 samples
17 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 183, 183, 183, 182, 183, 183, ...
Resampling results across tuning parameters:
k RMSE Rsquared MAE
5 6.779020 0.2543281 5.328671
7 6.834818 0.2383574 5.429607
9 6.871380 0.2340215 5.453872
11 6.864729 0.2334062 5.497645
13 6.963452 0.2134425 5.577315
15 7.035926 0.1947022 5.633690
17 7.060784 0.1899469 5.667929
19 7.059981 0.1953088 5.677141
21 7.057629 0.1973002 5.680731
23 7.076955 0.1951491 5.696571
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 5.
For the final model, the value of k = 5.
Use RMSE() in conjunction with the predict() function to find the root mean square error for the testing data.
# Type your code and comments inside the code chunk
mod_KNN2_pred <- predict(mod_KNN2, newdata = testingBodyfatClean)
RMSE(pred = mod_KNN2_pred, obs = testingBodyfatClean$brozek_C)
[1] 5.703321
Side by side boxplots are used to compare the MAE, RMSE, and \(R^2\) values for the models mod_TRBall
, mod_RF
and mod_KNN
.
ANS <- resamples(list(TREE = mod_TRBall, RF = mod_RF, KNN = mod_KNN))
summary(ANS)
Call:
summary.resamples(object = ANS)
Models: TREE, RF, KNN
Number of resamples: 50
MAE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
TREE 2.318344 2.898470 3.142674 3.208089 3.378693 4.544512 0
RF 1.528162 1.970816 2.234513 2.248277 2.475228 3.143329 0
KNN 3.031000 3.883723 4.375000 4.436682 4.995071 5.760000 0
RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
TREE 3.011744 3.802089 4.454844 4.651223 5.401067 7.704795 0
RF 1.965239 2.662944 3.038454 3.281327 3.999440 6.548174 0
KNN 4.219073 5.389275 6.246657 6.322566 7.332001 8.562755 0
Rsquared
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
TREE 0.4035974 0.6743430 0.7851845 0.7465801 0.8405811 0.9066279 0
RF 0.5272393 0.8480704 0.8993478 0.8749608 0.9240706 0.9706998 0
KNN 0.2595814 0.4670864 0.5745526 0.5557056 0.6668533 0.7451750 0
bwplot(ANS, scales = "free")
dotplot(ANS, scales = "free")
The boxplots suggest the random forest model performs better on the training data and is more consistent (less variability) than either the KNN or the tree model.
Use the resamples() function to compare the MAE, RMSE, and \(R^2\) values for the models mod_TR, mod_RF2, and mod_KNN2, and display side by side boxplots of the results with bwplot().
# Type your code and comments inside the code chunk
ANS2 <- resamples(list(TREE = mod_TR, RF = mod_RF2, KNN = mod_KNN2))
summary(ANS2)
Call:
summary.resamples(object = ANS2)
Models: TREE, RF, KNN
Number of resamples: 50
MAE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
TREE 2.062019 3.438930 3.831872 3.898039 4.330114 5.613670 0
RF 2.559770 3.277153 3.543786 3.583383 3.833447 4.901329 0
KNN 3.967000 4.768125 5.303500 5.328671 5.812143 6.761905 0
RMSE
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
TREE 2.687392 4.311763 4.811417 4.765842 5.243041 6.360536 0
RF 3.297409 3.973091 4.279290 4.318894 4.674236 5.384186 0
KNN 5.208984 6.030683 6.748072 6.779020 7.326349 9.213040 0
Rsquared
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
TREE 0.30693943 0.5620572 0.6502255 0.6339009 0.7128226 0.8620405 0
RF 0.43343717 0.6384383 0.7088795 0.6934054 0.7578380 0.8414202 0
KNN 0.01639896 0.1370380 0.2071063 0.2543281 0.3362243 0.6216157 0
bwplot(ANS2, scales = "free")
This material is released under an Attribution-NonCommercial-ShareAlike 3.0 United States license. Original author: Alan T. Arnholt