Type complete sentences to answer all questions inside the answer tags provided in the R Markdown document. Round all numeric answers you report inside the answer tags to four decimal places. Use inline R code to report numeric answers inside the answer tags (i.e. do not hard code your numeric answers).
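For example, if an object named ans (a hypothetical name) holds a numeric result computed in a code chunk, it can be reported inside the answer tags with inline code such as `r round(ans, 4)` rather than by typing the number.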


In the article Fitting Percentage of Body Fat to Simple Body Measurements, Johnson (1996) uses the data at http://jse.amstat.org/datasets/fat.dat.txt provided to him by Dr. A. Garth Fischer in a personal communication on October 5, 1994, as a multiple linear regression activity with his students. A subset of the variables at http://jse.amstat.org/datasets/fat.dat.txt is available in the R package mfp by Ambler and Benner (2015) and the data set is used frequently in the text Statistical Regression and Classification by Matloff (2017).

The purpose of this activity is to have the reader create several non-linear regression models to predict the body fat of males. Load a cleaned version of the data, available from https://raw.githubusercontent.com/alanarnholt/MISCD/master/bodyfatClean.csv, into your R session using the read.csv() function. Use the head() function to view the first six rows of the data frame bodyfatClean.


# Type your code and comments inside the code chunk

library(readr) # note: read.csv() used below is from base R, so readr is not strictly required
bodyfatClean <- read.csv("https://raw.githubusercontent.com/alanarnholt/MISCD/master/bodyfatClean.csv")
head(bodyfatClean)
  age weight_lbs height_in neck_cm chest_cm abdomen_cm hip_cm thigh_cm knee_cm
1  23     154.25     67.75    36.2     93.1       85.2   94.5     59.0    37.3
2  22     173.25     72.25    38.5     93.6       83.0   98.7     58.7    37.3
3  22     154.00     66.25    34.0     95.8       87.9   99.2     59.6    38.9
4  26     184.75     72.25    37.4    101.8       86.4  101.2     60.1    37.3
5  24     184.25     71.25    34.4     97.3      100.0  101.9     63.2    42.2
6  24     210.25     74.75    39.0    104.5       94.4  107.8     66.0    42.0
  ankle_cm biceps_cm forearm_cm wrist_cm brozek_C bmi_C age_sq abdomen_wrist
1     21.9      32.0       27.4     17.1     12.6  23.6    529          68.1
2     23.4      30.5       28.9     18.2      6.9  23.3    484          64.8
3     24.0      28.8       25.2     16.6     24.6  24.7    484          71.3
4     22.8      32.4       29.4     18.2     10.9  24.9    676          68.2
5     24.0      32.2       27.7     17.7     27.8  25.5    576          82.3
6     25.6      35.7       30.6     18.8     20.5  26.5    576          75.6
        am
1 181.9365
2 169.1583
3 195.5067
4 182.7203
5 190.6993
6 190.7289

  1. Use the glimpse() function from the dplyr package written by Wickham et al. (2020) to view the structure of bodyfatClean.
# Type your code and comments inside the code chunk
library(dplyr)
glimpse(bodyfatClean)
Rows: 251
Columns: 18
$ age           <int> 23, 22, 22, 26, 24, 24, 26, 25, 25, 23, 26, 27, 32, 30, …
$ weight_lbs    <dbl> 154.25, 173.25, 154.00, 184.75, 184.25, 210.25, 181.00, …
$ height_in     <dbl> 67.75, 72.25, 66.25, 72.25, 71.25, 74.75, 69.75, 72.50, …
$ neck_cm       <dbl> 36.2, 38.5, 34.0, 37.4, 34.4, 39.0, 36.4, 37.8, 38.1, 42…
$ chest_cm      <dbl> 93.1, 93.6, 95.8, 101.8, 97.3, 104.5, 105.1, 99.6, 100.9…
$ abdomen_cm    <dbl> 85.2, 83.0, 87.9, 86.4, 100.0, 94.4, 90.7, 88.5, 82.5, 8…
$ hip_cm        <dbl> 94.5, 98.7, 99.2, 101.2, 101.9, 107.8, 100.3, 97.1, 99.9…
$ thigh_cm      <dbl> 59.0, 58.7, 59.6, 60.1, 63.2, 66.0, 58.4, 60.0, 62.9, 63…
$ knee_cm       <dbl> 37.3, 37.3, 38.9, 37.3, 42.2, 42.0, 38.3, 39.4, 38.3, 41…
$ ankle_cm      <dbl> 21.9, 23.4, 24.0, 22.8, 24.0, 25.6, 22.9, 23.2, 23.8, 25…
$ biceps_cm     <dbl> 32.0, 30.5, 28.8, 32.4, 32.2, 35.7, 31.9, 30.5, 35.9, 35…
$ forearm_cm    <dbl> 27.4, 28.9, 25.2, 29.4, 27.7, 30.6, 27.8, 29.0, 31.1, 30…
$ wrist_cm      <dbl> 17.1, 18.2, 16.6, 18.2, 17.7, 18.8, 17.7, 18.8, 18.2, 19…
$ brozek_C      <dbl> 12.6, 6.9, 24.6, 10.9, 27.8, 20.5, 19.0, 12.7, 5.1, 12.0…
$ bmi_C         <dbl> 23.6, 23.3, 24.7, 24.9, 25.5, 26.5, 26.2, 23.5, 24.5, 25…
$ age_sq        <int> 529, 484, 484, 676, 576, 576, 676, 625, 625, 529, 676, 7…
$ abdomen_wrist <dbl> 68.1, 64.8, 71.3, 68.2, 82.3, 75.6, 73.0, 69.7, 64.3, 69…
$ am            <dbl> 181.9365, 169.1583, 195.5067, 182.7203, 190.6993, 190.72…

Now that you have seen the structure of the data and have studied the research question, answer the following questions.


  1. How many observations and variables are in bodyfatClean?
# Type your code and comments inside the code chunk

dim(bodyfatClean)
[1] 251  18

There are 251 observations and 18 variables in the bodyfatClean data set.


  1. In the regression setting, the variable that we want to predict is called the response variable. What is the name of the response variable in your case?

The name of the response variable is brozek_C.


  1. In the regression setting, the variable(s) that we use to predict the response variable is(are) called the explanatory or predictor variable(s). How many predictor variable(s) are available to use in this data set?

There are 17 predictor variables available to use in this data set.


  1. How many of the predictor variables are numerical and how many of them are categorical?

All 17 of the predictor variables are numerical; none of them are categorical.


1 Partitioning the Data

When building a predictive model with a sufficiently large data set, it is common practice to hold out some fraction (usually less than 50%) of the data as a test set. It is difficult to provide a general rule for the size of the training and testing sets, as the ideal split depends on the signal-to-noise ratio in the data (Hastie, Tibshirani, and Friedman 2009).

Use the createDataPartition() function from the caret package written by Kuhn (2019) to partition the data into training and testing sets.

For illustration purposes, the Boston data set from the MASS package written by Ripley (2019) is used to illustrate various steps in predictive model building. The Boston help file indicates the data set consists of 506 observations on 14 different variables for houses in Boston collected in 1978. To open the Boston help file, type ?Boston at the R prompt once the MASS package has been loaded. The goal in this example is to predict the median house price (medv) in Boston. The Boston data set is divided into a training set containing roughly 80% of the observations and a testing set containing roughly 20% of the observations. Before calling the createDataPartition() function, it is important to set a seed to ensure the data partition is reproducible.

The arguments y, p, list, and times can be used with the createDataPartition() function. These arguments specify, respectively, a vector of outcomes (Boston$medv), the proportion of the data that goes to training (0.80), whether the results should be returned as a list (FALSE), and the number of partitions to create (1). The result from using createDataPartition() is a vector of indices one can use to create the training and testing sets.

library(caret) # load the caret package
library(MASS)  # load MASS package
set.seed(3178) # set seed for reproducibility

trainIndexB <- createDataPartition(y = Boston$medv,
                                   p = 0.80,
                                   list = FALSE,
                                   times = 1)

trainingB <- Boston[trainIndexB, ]
testingB <- Boston[-trainIndexB, ]

dim(trainingB) # Check the dimension of the  training set
[1] 407  14
dim(testingB) # Check the dimension of the testing set
[1] 99 14

  1. Partition the data frame bodyfatClean into training and testing partitions where roughly 80% of the data is used for training and roughly 20% of the data is used for testing. To ensure reproducibility of the partition, use set.seed(314). The response variable should be brozek_C (the computed brozek based on the reported density).
# Type your code and comments inside the code chunk


library(caret)
library(MASS)
set.seed(314)

bodyfatCleanIndex <- createDataPartition(y = bodyfatClean$brozek_C,
                                         p = 0.80,
                                         list = FALSE,
                                         times = 1)
trainingBodyfatClean <- bodyfatClean[bodyfatCleanIndex, ]
testingBodyfatClean <- bodyfatClean[-bodyfatCleanIndex, ]

  1. Use the dim() function to verify the sizes of the training and testing data sets.
# Type your code and comments inside the code chunk

dim(trainingBodyfatClean)
[1] 203  18
dim(testingBodyfatClean)
[1] 48 18

The training data set has 203 observations and the testing data set has 48 observations, each with 18 variables.


2 Pre-Processing the Data

Some algorithms work better when the predictors are on the same scale. This section considers the preProcess() function from the caret package to find potentially helpful transformations of the training predictors. Three different transformations are considered: center, scale, and BoxCox. A center transform computes the mean of a variable and subtracts the computed mean from each value of the variable. A scale transform computes the standard deviation of a variable and divides each value of the variable by the computed standard deviation. Using both a center and a scale transform standardizes a variable; that is, the resulting variable has a mean of 0 and a standard deviation of 1. When all values of a variable are positive, a BoxCox transform will reduce the skew of the variable, making it more Gaussian.
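As a small illustration (not part of the assignment, using a toy vector x), the effect of a center followed by a scale transform can be verified by hand:

# Standardizing a toy vector: center (subtract the mean), then scale (divide by
# the standard deviation)
x <- c(2, 4, 6, 8)
x_std <- (x - mean(x)) / sd(x)
mean(x_std) # essentially 0
sd(x_std)   # exactly 1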

The R code below applies a center, scale, and BoxCox transform to all the predictors in trainingB (the training set for the Boston data) and stores the results in pp_trainingB. The computed transformations are applied to both the trainingB and the testingB data sets using the predict() function with the results stored in the objects trainingTransB and testingTransB, respectively. Note that in the Boston data set the response (medv) is the last column (\(14^{\text{th}}\)) of the training data frame and is removed before pre-processing with trainingB[ , -14].

# pp_trainingB --- pre-processed Training Boston
pp_trainingB <- preProcess(trainingB[ , -14],
                          method = c("center", "scale", "BoxCox"))
pp_trainingB
Created from 407 samples and 13 variables

Pre-processing:
  - Box-Cox transformation (11)
  - centered (13)
  - ignored (0)
  - scaled (13)

Lambda estimates for Box-Cox transformation:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  -1.00   -0.15    0.20    0.40    0.90    2.00 
trainingTransB <- predict(pp_trainingB, trainingB)
testingTransB <- predict(pp_trainingB, testingB)

Your turn now to work with the bodyfatClean data frame.


  1. Provide the column number of bodyfatClean where brozek_C is stored.
# Type your code and comments inside the code chunk

which(colnames(bodyfatClean) == "brozek_C")
[1] 14

The column number of bodyfatClean where brozek_C is stored is 14.


  1. Use the preProcess() function to transform the predictors that are in the training data set created in Problem 6. Specifically, pass a vector with “center”, “scale”, and “BoxCox” to the method = argument of preProcess(). Make sure not to transform the response (brozek_C). Store the results in an object named pp_training.
# Type your code and comments inside the code chunk

pp_training <- preProcess(trainingBodyfatClean[, -14],
                         method = c("center", "scale", "BoxCox"))

pp_training
Created from 203 samples and 17 variables

Pre-processing:
  - Box-Cox transformation (17)
  - centered (17)
  - ignored (0)
  - scaled (17)

Lambda estimates for Box-Cox transformation:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-1.6000 -1.0000 -0.4000 -0.2706  0.3000  2.0000 

  1. Use the predict() function to construct a transformed training set and a transformed testing set. Name the new transformed data sets trainingTrans and testingTrans, respectively.
# Type your code and comments inside the code chunk

trainingTrans <- predict(pp_training, trainingBodyfatClean)
testingTrans <- predict(pp_training, testingBodyfatClean)

3 \(k\)-Fold Cross Validation

\(k\)-fold cross validation divides the data into \(k\) subsets and performs the holdout method \(k\) times. Specifically, one of the \(k\) subsets is used as the test set and the other \(k - 1\) subsets are put together to form a training set. The average MSE across all \(k\) trials is computed and is denoted \(CV_{(k)}\).
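In symbols, if \(\text{MSE}_i\) denotes the mean squared error obtained when the \(i^{\text{th}}\) subset is held out, then

\[CV_{(k)} = \frac{1}{k}\sum_{i = 1}^{k} \text{MSE}_i.\]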

4 Resampling with caret

The trainControl() function from caret specifies the resampling procedure to be used inside the train() function. Resampling procedures include \(k\)-fold cross-validation (once or repeated), leave-one-out cross-validation, and bootstrapping. The following R code creates a myControlB object that signals a 10-fold cross-validation scheme repeated five times (50 resamples in total) to the train() function for the Boston data set. Note that the argument savePredictions = "final" saves the hold-out predictions for the optimal tuning parameters.

myControlB <- trainControl(method = "repeatedcv",
                          number = 10,
                          repeats = 5,
                          savePredictions = "final")

  1. Use the trainControl() function to define the resampling method (repeated cross-validation), the number of resampling iterations/folds (10), and the number of repeats or complete sets to generate (5), storing the results in the object myControl.
# Type your code and comments inside the code chunk
# Define the type of resampling

myControl <- trainControl(method = "repeatedcv",
                          number = 10,
                          repeats = 5,
                          savePredictions = "final")

5 Building Non-Linear Models

5.1 Regression tree model

Tree-based methods for regression problems partition the predictor space into \(J\) distinct and non-overlapping regions, \(R_1, R_2, \ldots, R_J\). To make a prediction for a particular observation, the mean of the training observations in the region to which that observation belongs is computed. Tree-based methods are easy to understand and interpret; however, they also tend to overfit the training data and are not as competitive as random forests. Basic trees are introduced here as a building block for random forests. The goal is to find regions \(R_1, R_2, \ldots, R_J\) that minimize the RSS, denoted by

\[\begin{equation} \text{RSS} = \sum_{j = 1}^J\sum_{i \in R_j}(y_i - \hat{y}_{R_j})^2, \tag{5.1} \end{equation}\]

where \(\hat{y}_{R_j}\) is the mean response for the training observations within the \(j^{\text{th}}\) region. Since it is not possible to consider every possible partition of the predictor space, a recursive binary splitting algorithm is generally employed. Recursive binary splitting first selects the predictor \(X_j\) and the cutpoint \(s\) such that splitting the predictor space into the regions \(\{X|X_j < s\}\) and \(\{X|X_j \ge s\}\) leads to the greatest reduction in RSS (James et al. 2017). The same process is then used to find the optimal cutpoint within each region. This process continues until a predefined stopping criterion is met. A common stopping criterion is to refuse a split when the count in the proposed region drops below a threshold such as 5 or 10. The smaller the count for a region, the more likely the tree will be overfit to the training data and perform poorly on testing data.
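As a rough sketch of the first step of recursive binary splitting (an illustration only, using a hypothetical helper best_split(); this is not how rpart is implemented), the cutpoint \(s\) for a single predictor that minimizes the RSS in (5.1) can be found by brute force:

# Illustration only: find the cutpoint s for a single predictor x that minimizes
# the RSS of the two resulting regions {x < s} and {x >= s}
best_split <- function(x, y) {
  cuts <- sort(unique(x))[-1]  # skip the minimum so both regions are non-empty
  rss <- sapply(cuts, function(s) {
    left  <- y[x < s]
    right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  cuts[which.min(rss)]
}
# For example, best_split(trainingB$age, trainingB$medv) would return the first
# cutpoint for age when predicting medv.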

5.2 Tree model using caret

To fit a model with a particular algorithm, the name of the algorithm is given to the method argument of the train() function. The train() function accepts a formula interface provided the data is also specified in the function. The R code below fits a regression tree model by regressing medv on the predictor age in the training data set. The preferred way to train a model is by passing the response vector to the y argument and a data frame of the predictors or a matrix of the predictors to the x argument of train(). Below, the train() function is used with method = "rpart" and the object myControlB is provided to the trControl argument.

5.2.1 Regression Tree with one predictor (age)

# Regression Tree with one predictor (age)
set.seed(31)
mod_TRB <- train(y = trainingTransB$medv,
                 x = data.frame(age = trainingTransB[ , 7]),
                 trControl = myControlB,
                 method = "rpart",
                 tuneLength = 10)

mod_TRB
CART 

407 samples
  1 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 366, 367, 367, 365, 367, 366, ... 
Resampling results across tuning parameters:

  cp           RMSE      Rsquared    MAE     
  0.002934463  9.299940  0.07891837  6.727907
  0.002983166  9.295840  0.07939126  6.722688
  0.004098588  9.200440  0.08487121  6.651356
  0.004613882  9.111639  0.09263128  6.569656
  0.004830949  9.062123  0.09769315  6.530259
  0.005080248  9.027292  0.10173868  6.504152
  0.005456624  8.947554  0.10882363  6.432420
  0.006964946  8.792057  0.12656586  6.301637
  0.034187558  8.804734  0.11621264  6.311815
  0.135909724  9.136989  0.06740295  6.603690

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.006964946.

Here, the train() function from caret computes the complexity parameter (\(c_p\)) that results in the smallest RMSE when using only age from trainingTransB. As the complexity parameter increases from zero, branches are pruned from the tree, reducing the complexity of the model.

Note that mod_TRB contains 10 models and the final model was chosen by the \(c_p\) value of 0.0069649 corresponding to the model with minimum RMSE (8.7920571).
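These quantities can be pulled from the caret object directly rather than read off the printed output; for example:

mod_TRB$bestTune          # one-row data frame containing the selected cp
min(mod_TRB$results$RMSE) # smallest cross-validated RMSE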

Once the optimal complexity parameter is determined from cross-validation, a regression tree is grown using the transformed age in trainingTransB with the R code below.

To get the final model, use the rpart() function from the rpart package as shown below. Notice that cp = mod_TRB$bestTune picks the final model from mod_TRB.

library(rpart)
set.seed(31)
mod_TRBG <- rpart(medv ~ age,
                 data = trainingTransB, cp = mod_TRB$bestTune)
mod_TRBG
n= 407 

node), split, n, deviance, yval
      * denotes terminal node

1) root 407 35195.280 22.58968  
  2) age>=0.8629704 118  9411.559 17.22458 *
  3) age< 0.8629704 289 21000.340 24.78028  
    6) age>=-1.04758 195 13941.470 23.36359 *
    7) age< -1.04758 94  5855.626 27.71915 *

Finally, to visualize the final model use the rpart.plot() function from the rpart.plot package.

library(rpart.plot)
rpart.plot(mod_TRBG)

Consider the mod_TRBG output and the resulting plot from using rpart.plot(). Note that age takes negative values because the transformed variables are used. There are n = 118 homes with a (transformed) age of 0.86 or more; the average medv for these 118 homes is 17.22 thousand dollars. There are n = 195 homes with age below 0.86 but at least -1.05; these 195 homes have an average medv of 23.36 thousand dollars. Finally, the average medv of the n = 94 homes with age below -1.05 is 27.72 thousand dollars.

Even though the transformed variables generally produce better models, the interpretations can be rather difficult. In such situations, it may be advisable to use untransformed variables.

# Regression Tree with untransformed variables.
set.seed(31)
mod_TRBU <- train(y = trainingB$medv,
                x = data.frame(age = trainingB[ ,7]),
                trControl = myControlB,
                method = "rpart",
                tuneLength = 10)

mod_TRBU
CART 

407 samples
  1 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 366, 367, 367, 365, 367, 366, ... 
Resampling results across tuning parameters:

  cp           RMSE      Rsquared    MAE     
  0.002934463  9.297298  0.08016767  6.717037
  0.002983166  9.293197  0.08063547  6.711817
  0.004098588  9.203189  0.08532768  6.649924
  0.004613882  9.114668  0.09304447  6.566351
  0.004830949  9.066763  0.09778405  6.530999
  0.005080248  9.031949  0.10180704  6.504892
  0.005456624  8.952166  0.10893979  6.433160
  0.006964946  8.795055  0.12691561  6.300708
  0.034187558  8.809281  0.11588785  6.315299
  0.135909724  9.140502  0.06718872  6.605706

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.006964946.
library(rpart)
set.seed(31)
mod_TRGU <- rpart(medv ~ age, data = trainingB)
mod_TRGU
n= 407 

node), split, n, deviance, yval
      * denotes terminal node

1) root 407 35195.280 22.58968  
  2) age>=92.3 118  9411.559 17.22458 *
  3) age< 92.3 289 21000.340 24.78028  
    6) age>=41.3 195 13941.470 23.36359 *
    7) age< 41.3 94  5855.626 27.71915 *
library(rpart.plot)
rpart.plot(mod_TRGU)

Using the untransformed data, the explanation is simple and practical. Specifically, there are n = 118 homes that are 92.3 or more years of age; the average medv for these 118 homes is 17.22 thousand dollars. There are n = 195 homes under 92.3 years of age but at least 41.3 years of age; these 195 homes have an average medv of 23.36 thousand dollars. Finally, there are n = 94 homes under 41.3 years of age, and their average medv is 27.72 thousand dollars.

Note: From here forward we will use the untransformed predictors (data) in the examples.


5.2.2 Tree model with all predictors

To get a Tree model with all the predictor variables, modify the previous R code to use the untransformed predictors as follows:

# Regression Tree with all predictors
set.seed(31)
mod_TRBall <- train(y = trainingB$medv,
                    x = trainingB[ , 1:13], # [ , -14]
                    trControl = myControlB,
                    method = "rpart",
                    tuneLength = 10)

mod_TRBall
CART 

407 samples
 13 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 366, 367, 367, 365, 367, 366, ... 
Resampling results across tuning parameters:

  cp           RMSE      Rsquared   MAE     
  0.008491945  4.651223  0.7465801  3.208089
  0.010367693  4.739356  0.7382265  3.271009
  0.013264480  4.810732  0.7306569  3.359105
  0.018454204  4.899144  0.7185652  3.437962
  0.024923514  5.070968  0.6983436  3.603513
  0.025916903  5.082143  0.6965285  3.622541
  0.032639407  5.104272  0.6906480  3.678620
  0.067254001  5.652936  0.6309641  4.064486
  0.163979782  6.433538  0.5225446  4.721590
  0.479978887  8.571615  0.3139804  6.236646

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.008491945.
library(rpart)
set.seed(31)
mod_TRBallG <- rpart(medv ~. ,
                 data = trainingB, cp = mod_TRBall$bestTune)
mod_TRBallG
n= 407 

node), split, n, deviance, yval
      * denotes terminal node

 1) root 407 35195.2800 22.58968  
   2) rm< 6.941 345 13806.7200 19.85855  
     4) lstat>=14.4 140  2642.5380 14.90929  
       8) nox>=0.607 83  1059.8090 12.83494  
        16) lstat>=19.645 45   389.0711 10.65556 *
        17) lstat< 19.645 38   203.8905 15.41579 *
       9) nox< 0.607 57   705.5393 17.92982 *
     5) lstat< 14.4 205  5392.8660 23.23854  
      10) lstat>=9.95 91   590.5690 20.58901 *
      11) lstat< 9.95 114  3653.5440 25.35351  
        22) age< 88.8 106  1688.1310 24.57642  
          44) rm< 6.543 69   655.7194 22.76377 *
          45) rm>=6.543 37   382.9108 27.95676 *
        23) age>=88.8 8  1053.2600 35.65000 *
   3) rm>=6.941 62  4495.5700 37.78710  
     6) rm< 7.437 36  1048.7230 32.53611 *
     7) rm>=7.437 26  1079.8230 45.05769  
      14) ptratio>=17.6 7   465.9686 38.88571 *
      15) ptratio< 17.6 19   248.9611 47.33158 *
library(rpart.plot)
rpart.plot(mod_TRBallG)

The predict() function is used to obtain the fitted/predicted values of medv using the testing data (testingB).

mod_TRBallG_pred <- predict(mod_TRBallG, newdata = testingB)

Next, the RMSE() function returns the root mean square error for the regression tree model using the testing data.

RMSE(pred = mod_TRBallG_pred, obs =testingB$medv)
[1] 4.472291
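The RMSE() function simply takes the square root of the mean of the squared differences between the predicted and observed values, so the reported value can be verified by hand:

# Hand computation of the testing RMSE for the regression tree model
sqrt(mean((mod_TRBallG_pred - testingB$medv)^2))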

NOTE: There is a difference between the training RMSE (4.6512229) and the testing RMSE (4.4722907). What does this suggest?


Your turn now to work with the bodyfatClean data frame.

Note: Many statistical algorithms work better on transformed variables; however, the user, whether a nurse, physical therapist, or physician, should be able to use your proposed model without resorting to a spreadsheet or calculator.


  1. Use the train() function with method = "rpart", tuneLength = 10 along with myControl as the trControl to fit a regression tree named mod_TR. Use set.seed(42) for reproducibility. The goal is to predict body fat using the training data created in Problem 6. Use brozek_C as the response and use all the predictors. (Do not use the transformed predictors due to interpretation issues.)
# Type your code and comments inside the code chunk
# Regression Tree with all predictors

set.seed(42)
mod_TR <- train(y = trainingBodyfatClean$brozek_C,
                x = trainingBodyfatClean[ , -14],  # all 17 predictors (column 14 is brozek_C)
                trControl = myControl,
                method = "rpart",
                tuneLength = 10)

mod_TR
CART 

203 samples
 17 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 183, 183, 183, 182, 183, 183, ... 
Resampling results across tuning parameters:

  cp           RMSE      Rsquared   MAE     
  0.008671039  4.765842  0.6339009  3.898039
  0.008964455  4.778061  0.6310500  3.905037
  0.011322788  4.821892  0.6208741  3.942897
  0.016328028  4.868693  0.6157095  4.007299
  0.019278376  4.822923  0.6201090  3.953596
  0.019447925  4.821531  0.6202860  3.952612
  0.029918113  4.946138  0.5990657  3.964034
  0.036733177  4.987373  0.5916788  3.989885
  0.152240810  5.893586  0.4332022  4.740904
  0.478289933  6.719350  0.3665078  5.461712

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.008671039.

  1. According to the output, what criterion was used to pick the best submodel? What is the value of this criterion?

RMSE was used to pick the best submodel. The smallest value of this criterion is 4.7658, which corresponds to cp = 0.0087.


  1. Use the rpart() function from the rpart package written by Therneau and Atkinson (2019) to build the regression tree using the complexity parameter (\(c_p\)) value from mod_TR above. Name this tree mod_TRG.
# Type your code and comments inside the code chunk


# Grow the tree using the complexity parameter chosen by cross-validation in mod_TR
# Note: this formula drops age and abdomen_cm from the set of candidate predictors
mod_TRG <- rpart(brozek_C ~ . - age - abdomen_cm,
                 data = trainingBodyfatClean,
                 control = rpart.control(cp = mod_TR$bestTune))

  1. Use the rpart.plot() function from the rpart.plot package written by Milborrow (2019) to graph mod_TRG.
# Type your code and comments inside the code chunk

rpart.plot(mod_TRG)


  1. What predictors are used in the graph of mod_TRG?

The predictors that are used in the graph of mod_TRG are abdomen_wrist, hip_cm, weight_lbs, and height_in.


  1. Explain the tree.

The tree has 11 terminal nodes. To predict brozek_C for a new observation, follow the splits from the root down to a terminal node; the prediction is the mean body fat of the training observations that fall in that node.
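If a plain-text summary of the splits helps with the explanation, the rpart.rules() function from the rpart.plot package can list the decision rules of the fitted tree:

# Optional: print the splitting rules of mod_TRG in readable form
rpart.plot::rpart.rules(mod_TRG)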


  1. Compute the RMSE for mod_TRG using the testing data set.
# Type your code and comments inside the code chunk

RMSE(pred = predict(mod_TRG, newdata = testingBodyfatClean), obs = testingBodyfatClean$brozek_C)
[1] 4.521677

5.3 Random Forest model

A random forest is a collection of decorrelated trees. A tree in a random forest is constructed by considering only a random sample of the possible predictors at each split. A common choice for the number of predictors considered at each split is the square root of the total number of predictors in the data set (\(\sqrt{p}\)). The data used to grow a particular tree is not the training data itself but a bootstrap sample of the training data. The resulting trees are decorrelated since the candidate predictors at each split are selected at random; individual trees have high variance and low bias. By averaging the results from many trees, a random forest produces predictions that are less variable than those of a single tree, yet remain relatively unbiased.
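As a small sketch of this rule of thumb applied to the Boston training data (the object p below is introduced only for illustration; note also that the randomForest package itself defaults to \(p/3\) predictors per split for regression):

p <- ncol(trainingB) - 1 # 13 predictors (medv is the response)
floor(sqrt(p))           # 3 predictors per split under the square-root rule
max(floor(p / 3), 1)     # 4 predictors per split, the usual regression default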

To create a random forest, the method argument in the train() function is set to either "rf" or "ranger". The following example shows the R code for a random forest model to predict medv in the Boston data.

set.seed(31)
mod_RF <- train(y = trainingB$medv,
                x = trainingB[, 1:13],
                trControl = myControlB,
                tuneLength = 4,
                method = "rf")

mod_RF
Random Forest 

407 samples
 13 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 366, 367, 367, 365, 367, 366, ... 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared   MAE     
   2    3.664810  0.8550449  2.458839
   5    3.281327  0.8749608  2.248277
   9    3.293472  0.8710413  2.251443
  13    3.363650  0.8647896  2.299084

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 5.

Using the default arguments for rf returns an average RMSE value of 3.2813268 when the parameter mtry = 5.

Next, the predict() function is used to predict the median house value (medv) using the testing data (testingB).

RF_pred <- predict(mod_RF, newdata = testingB)

Next, the RMSE() function is used to determine the root mean square error for the random forest model using the testing data.

RMSE(pred = RF_pred, obs = testingB$medv)
[1] 2.889238

  1. Use the train() function with method = "rf", tuneLength = 4 along with myControl as the trControl to fit a random forest named mod_RF2. Use set.seed(42) for reproducibility.
# Type your code and comments inside the code chunk

set.seed(42)

mod_RF2 <- train(y = trainingBodyfatClean$brozek_C,
                 x = trainingBodyfatClean[, -14],
                 trControl = myControl,
                 tuneLength = 4,
                 method = "rf")
mod_RF2
Random Forest 

203 samples
 17 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 183, 183, 183, 182, 183, 183, ... 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared   MAE     
   2    4.428938  0.6839297  3.609938
   7    4.318894  0.6934054  3.583383
  12    4.337575  0.6906621  3.597050
  17    4.347381  0.6887710  3.601198

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 7.

  1. Use the function RMSE() in conjunction with the predict() function to find the root mean square error for the testing data.
# Type your code and comments inside the code chunk

RF2_pred <- predict(mod_RF2, newdata = testingBodyfatClean)

RMSE(pred = RF2_pred, obs =testingBodyfatClean$brozek_C)
[1] 3.973305

5.4 \(k\)-nearest neighbors model

One of the simplest methods for providing a nonparametric estimate of f in a regression setting is \(k\)-nearest neighbors (\(k\)-NN). \(k\)-NN predicts the response for an observation by averaging the responses of its \(k\) nearest neighbors. Commonly used distance measures for continuous variables include the Euclidean, Manhattan, and Minkowski distances. The Euclidean, Manhattan, and Minkowski distances between two points \(X = (x_1, x_2, \dots, x_n)\) and \(Y = (y_1, y_2, \dots, y_n)\) are defined in (5.2), (5.3), and (5.4), respectively. Note that using p = 1 in (5.4) results in (5.3), and using p = 2 in (5.4) results in (5.2).

\[\begin{equation} D_{Euclidean}(X,Y) = \sqrt{\sum_{i = 1}^n (x_i - y_i)^2}, \tag{5.2} \end{equation}\]

\[\begin{equation} D_{Manhattan}(X,Y) = \sum_{i = 1}^n |(x_i - y_i)|, \tag{5.3} \end{equation}\]

\[\begin{equation} D_{Minkowski}(X,Y) = \left[\sum_{i = 1}^n |(x_i - y_i)|^p\right]^{1/p}, \tag{5.4} \end{equation}\]
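A quick numerical check (illustration only, with toy vectors x and y) confirms that the Minkowski distance with p = 1 and p = 2 reproduces the Manhattan and Euclidean distances, and that the dist() function in base R agrees:

# Distances between two small vectors, computed by hand and with dist()
x <- c(1, 2, 3)
y <- c(4, 0, 3)
sqrt(sum((x - y)^2))     # Euclidean
sum(abs(x - y))          # Manhattan
sum(abs(x - y)^2)^(1/2)  # Minkowski with p = 2 (matches the Euclidean value)
dist(rbind(x, y), method = "minkowski", p = 1) # matches the Manhattan value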

The following R code creates a \(k\)-nearest neighbors model using the default arguments for method = "knn".

set.seed(31)
mod_KNN <- train(y = trainingB$medv,
                 x = trainingB[, -14],
                 trControl = myControlB,
                 tuneLength = 10,
                 method = "knn")

mod_KNN
k-Nearest Neighbors 

407 samples
 13 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 366, 367, 367, 365, 367, 366, ... 
Resampling results across tuning parameters:

  k   RMSE      Rsquared   MAE     
   5  6.322566  0.5557056  4.436682
   7  6.433252  0.5337296  4.545029
   9  6.548103  0.5116143  4.633764
  11  6.683594  0.4886915  4.721113
  13  6.799718  0.4703037  4.832851
  15  6.978860  0.4422369  4.976440
  17  7.094002  0.4247272  5.084296
  19  7.184806  0.4107144  5.159167
  21  7.290436  0.3928045  5.246419
  23  7.396043  0.3750593  5.343585

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 5.
plot(mod_KNN)

Using the default arguments for knn returns an average RMSE value of 6.3225663 when the parameter k = 5 (number of neighbors).

The RMSE() function, in conjunction with the predict() function, is used to find the root mean square error for the testing data.

mod_KNN_pred <- predict(mod_KNN, newdata = testingB)

RMSE(pred = mod_KNN_pred, obs =testingB$medv)
[1] 6.588785

  1. Use the train() function with method = "knn", tuneLength = 10 along with myControl as the trControl to fit a \(k\)-nearest neighbors model named mod_KNN2. Use set.seed(42) for reproducibility.
# Type your code and comments inside the code chunk


set.seed(42)

mod_KNN2 <- train(y = trainingBodyfatClean$brozek_C,
                  x = trainingBodyfatClean[, -14],
                  trControl = myControl,
                  tuneLength = 10,
                  method = "knn")
mod_KNN2
k-Nearest Neighbors 

203 samples
 17 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 183, 183, 183, 182, 183, 183, ... 
Resampling results across tuning parameters:

  k   RMSE      Rsquared   MAE     
   5  6.779020  0.2543281  5.328671
   7  6.834818  0.2383574  5.429607
   9  6.871380  0.2340215  5.453872
  11  6.864729  0.2334062  5.497645
  13  6.963452  0.2134425  5.577315
  15  7.035926  0.1947022  5.633690
  17  7.060784  0.1899469  5.667929
  19  7.059981  0.1953088  5.677141
  21  7.057629  0.1973002  5.680731
  23  7.076955  0.1951491  5.696571

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 5.

  1. For the final model, what is the value of \(k\)?

For the final model, the value of k = 5.


  1. Use the function RMSE() in conjunction with the predict() function to find the root mean square error for the testing data.
# Type your code and comments inside the code chunk


mod_KNN2_pred <- predict(mod_KNN2, newdata = testingBodyfatClean)

RMSE(pred = mod_KNN2_pred, obs =testingBodyfatClean$brozek_C)
[1] 5.703321

6 Comparing different models

Side-by-side boxplots are used to compare the MAE, RMSE, and \(R^2\) values for the models mod_TRBall, mod_RF, and mod_KNN.

ANS <- resamples(list(TREE = mod_TRBall, RF = mod_RF, KNN = mod_KNN))
summary(ANS)

Call:
summary.resamples(object = ANS)

Models: TREE, RF, KNN 
Number of resamples: 50 

MAE 
         Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
TREE 2.318344 2.898470 3.142674 3.208089 3.378693 4.544512    0
RF   1.528162 1.970816 2.234513 2.248277 2.475228 3.143329    0
KNN  3.031000 3.883723 4.375000 4.436682 4.995071 5.760000    0

RMSE 
         Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
TREE 3.011744 3.802089 4.454844 4.651223 5.401067 7.704795    0
RF   1.965239 2.662944 3.038454 3.281327 3.999440 6.548174    0
KNN  4.219073 5.389275 6.246657 6.322566 7.332001 8.562755    0

Rsquared 
          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
TREE 0.4035974 0.6743430 0.7851845 0.7465801 0.8405811 0.9066279    0
RF   0.5272393 0.8480704 0.8993478 0.8749608 0.9240706 0.9706998    0
KNN  0.2595814 0.4670864 0.5745526 0.5557056 0.6668533 0.7451750    0
bwplot(ANS, scales = "free")

dotplot(ANS, scales = "free")

The boxplots suggest the random forest model performs better on the resampled training data and is more consistent (less variable) than either the KNN or tree models.


  1. Reproduce the above boxplot with the models that you created for the body fat data.
# Type your code and comments inside the code chunk

ANS2 <- resamples(list(TREE = mod_TR, RF = mod_RF2, KNN = mod_KNN2))
summary(ANS2)

Call:
summary.resamples(object = ANS2)

Models: TREE, RF, KNN 
Number of resamples: 50 

MAE 
         Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
TREE 2.062019 3.438930 3.831872 3.898039 4.330114 5.613670    0
RF   2.559770 3.277153 3.543786 3.583383 3.833447 4.901329    0
KNN  3.967000 4.768125 5.303500 5.328671 5.812143 6.761905    0

RMSE 
         Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
TREE 2.687392 4.311763 4.811417 4.765842 5.243041 6.360536    0
RF   3.297409 3.973091 4.279290 4.318894 4.674236 5.384186    0
KNN  5.208984 6.030683 6.748072 6.779020 7.326349 9.213040    0

Rsquared 
           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
TREE 0.30693943 0.5620572 0.6502255 0.6339009 0.7128226 0.8620405    0
RF   0.43343717 0.6384383 0.7088795 0.6934054 0.7578380 0.8414202    0
KNN  0.01639896 0.1370380 0.2071063 0.2543281 0.3362243 0.6216157    0
bwplot(ANS2, scales = "free")



This material is released under an Attribution-NonCommercial-ShareAlike 3.0 United States license. Original author: Alan T. Arnholt


References

Ambler, Gareth, and Axel Benner. 2015. Mfp: Multivariable Fractional Polynomials. https://CRAN.R-project.org/package=mfp.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2017. An Introduction to Statistical Learning: With Applications in R. 1st ed. 2013, corrected 7th printing 2017. New York: Springer.
Johnson, Roger W. 1996. “Fitting Percentage of Body Fat to Simple Body Measurements.” Journal of Statistics Education 4 (1). https://doi.org/10.1080/10691898.1996.11910505.
Kuhn, Max. 2019. Caret: Classification and Regression Training. https://CRAN.R-project.org/package=caret.
Matloff, Norman. 2017. Statistical Regression and Classification: From Linear Models to Machine Learning. 1st ed. Boca Raton: Chapman & Hall/CRC.
Milborrow, Stephen. 2019. Rpart.plot: Plot ’Rpart’ Models: An Enhanced Version of ’Plot.rpart’. https://CRAN.R-project.org/package=rpart.plot.
Ripley, Brian. 2019. MASS: Support Functions and Datasets for Venables and Ripley’s MASS. https://CRAN.R-project.org/package=MASS.
Therneau, Terry, and Beth Atkinson. 2019. Rpart: Recursive Partitioning and Regression Trees. https://CRAN.R-project.org/package=rpart.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2020. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.