Type complete sentences to answer all questions inside the answer tags provided in the R Markdown document. Round all numeric answers you report inside the answer tags to four decimal places. Use inline R code to report numeric answers inside the answer tags (i.e. do not hard code your numeric answers).


In the article Fitting Percentage of Body Fat to Simple Body Measurements, Johnson (1996) uses the data at http://jse.amstat.org/datasets/fat.dat.txt provided to him by Dr. A. Garth Fischer in a personal communication on October 5, 1994, as a multiple linear regression activity with his students. A subset of the variables at http://jse.amstat.org/datasets/fat.dat.txt is available in the R package mfp by Ambler and Benner (2022) and the data set is used frequently in the text Statistical Regression and Classification by Matloff (2017).

The purpose of this activity is to have the reader create several regression models to predict the body fat of males. Load a cleaned version of the data available from https://raw.githubusercontent.com/alanarnholt/MISCD/master/bodyfatClean.csv into your R session using the read.csv() function. Use the head() function to view the first six rows of the data frame bodyfatClean.


# Type your code and comments inside the code chunk
library(readr)
bodyfatClean <- read.csv("https://raw.githubusercontent.com/alanarnholt/MISCD/master/bodyfatClean.csv")
head(bodyfatClean)
  age weight_lbs height_in neck_cm chest_cm abdomen_cm hip_cm thigh_cm knee_cm
1  23     154.25     67.75    36.2     93.1       85.2   94.5     59.0    37.3
2  22     173.25     72.25    38.5     93.6       83.0   98.7     58.7    37.3
3  22     154.00     66.25    34.0     95.8       87.9   99.2     59.6    38.9
4  26     184.75     72.25    37.4    101.8       86.4  101.2     60.1    37.3
5  24     184.25     71.25    34.4     97.3      100.0  101.9     63.2    42.2
6  24     210.25     74.75    39.0    104.5       94.4  107.8     66.0    42.0
  ankle_cm biceps_cm forearm_cm wrist_cm brozek_C bmi_C age_sq abdomen_wrist
1     21.9      32.0       27.4     17.1     12.6  23.6    529          68.1
2     23.4      30.5       28.9     18.2      6.9  23.3    484          64.8
3     24.0      28.8       25.2     16.6     24.6  24.7    484          71.3
4     22.8      32.4       29.4     18.2     10.9  24.9    676          68.2
5     24.0      32.2       27.7     17.7     27.8  25.5    576          82.3
6     25.6      35.7       30.6     18.8     20.5  26.5    576          75.6
        am
1 181.9365
2 169.1583
3 195.5067
4 182.7203
5 190.6993
6 190.7289

  1. Use the glimpse() function from the dplyr package written by Wickham et al. (2022) to view the structure of bodyfatClean.
# Type your code and comments inside the code chunk
library(dplyr)
glimpse(bodyfatClean)
Rows: 251
Columns: 18
$ age           <int> 23, 22, 22, 26, 24, 24, 26, 25, 25, 23, 26, 27, 32, 30, …
$ weight_lbs    <dbl> 154.25, 173.25, 154.00, 184.75, 184.25, 210.25, 181.00, …
$ height_in     <dbl> 67.75, 72.25, 66.25, 72.25, 71.25, 74.75, 69.75, 72.50, …
$ neck_cm       <dbl> 36.2, 38.5, 34.0, 37.4, 34.4, 39.0, 36.4, 37.8, 38.1, 42…
$ chest_cm      <dbl> 93.1, 93.6, 95.8, 101.8, 97.3, 104.5, 105.1, 99.6, 100.9…
$ abdomen_cm    <dbl> 85.2, 83.0, 87.9, 86.4, 100.0, 94.4, 90.7, 88.5, 82.5, 8…
$ hip_cm        <dbl> 94.5, 98.7, 99.2, 101.2, 101.9, 107.8, 100.3, 97.1, 99.9…
$ thigh_cm      <dbl> 59.0, 58.7, 59.6, 60.1, 63.2, 66.0, 58.4, 60.0, 62.9, 63…
$ knee_cm       <dbl> 37.3, 37.3, 38.9, 37.3, 42.2, 42.0, 38.3, 39.4, 38.3, 41…
$ ankle_cm      <dbl> 21.9, 23.4, 24.0, 22.8, 24.0, 25.6, 22.9, 23.2, 23.8, 25…
$ biceps_cm     <dbl> 32.0, 30.5, 28.8, 32.4, 32.2, 35.7, 31.9, 30.5, 35.9, 35…
$ forearm_cm    <dbl> 27.4, 28.9, 25.2, 29.4, 27.7, 30.6, 27.8, 29.0, 31.1, 30…
$ wrist_cm      <dbl> 17.1, 18.2, 16.6, 18.2, 17.7, 18.8, 17.7, 18.8, 18.2, 19…
$ brozek_C      <dbl> 12.6, 6.9, 24.6, 10.9, 27.8, 20.5, 19.0, 12.7, 5.1, 12.0…
$ bmi_C         <dbl> 23.6, 23.3, 24.7, 24.9, 25.5, 26.5, 26.2, 23.5, 24.5, 25…
$ age_sq        <int> 529, 484, 484, 676, 576, 576, 676, 625, 625, 529, 676, 7…
$ abdomen_wrist <dbl> 68.1, 64.8, 71.3, 68.2, 82.3, 75.6, 73.0, 69.7, 64.3, 69…
$ am            <dbl> 181.9365, 169.1583, 195.5067, 182.7203, 190.6993, 190.72…

Now that you have seen the structure of the data and have studied the research question, answer the following questions.


  1. How many observations and variables are in bodyfatClean?
# Type your code and comments inside the code chunk

dim(bodyfatClean)
[1] 251  18

There are 251 observations and 18 variables in the bodyfatClean data set.


  1. In the regression setting, the variable that we want to predict is called the response variable. What is the name of the response variable in your case?

The name of the response variable is brozek_C.


  1. In the regression setting, the variable(s) that we use to predict the response variable is(are) called the explanatory or predictor variable(s). How many predictor variable(s) are available to use in this data set?

There are 17 predictor variables available to use in this data set.


  1. How many of the predictor variables are numerical and how many of them are categorical?

All 17 of the predictor variables are numerical, which means that none of them are categorical.


1 Partitioning the Data

When building a predictive model with a sufficiently large data set, it is common practice to hold out some fraction (usually less than 50%) of the data as a test set. It is difficult to provide a general rule for the size of the training and testing sets as the ideal split depends on the signal-to-noise ratio in the data (Hastie, Tibshirani, and Friedman 2009).
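
As a point of reference, a simple holdout split can also be done with base R's sample() function. The short sketch below is illustrative only: it uses the built-in mtcars data and made-up object names such as train_rows, and it randomly assigns roughly 80% of the rows to a training set while holding out the rest.

# A minimal holdout sketch with base R (illustrative only; createDataPartition() is used below)
set.seed(1)
train_rows <- sample(seq_len(nrow(mtcars)), size = floor(0.80 * nrow(mtcars)))
training_demo <- mtcars[train_rows, ]   # roughly 80% of the rows
testing_demo  <- mtcars[-train_rows, ]  # remaining rows held out for testing
c(nrow(training_demo), nrow(testing_demo))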

Use the createDataPartition() function from the caret package to partition the data into training and testing sets.

For illustration purposes, the Boston data set from the MASS package written by Ripley (2022) is used to demonstrate various steps in predictive model building. The Boston help file indicates the data set consists of 506 observations on 14 different variables for houses in Boston collected in 1978. To open the Boston help file, type ?Boston at the R prompt once the MASS package has been loaded. The goal is to predict the median house price (medv) in Boston. The Boston data set is divided into a training set containing roughly 80% of the observations and a testing set containing roughly 20% of the observations. Before calling the createDataPartition() function, it is important to set a seed to ensure the data partition is reproducible.

The arguments y, p, list, and times can be used with the createDataPartition() function. These arguments specify the vector of outcomes (Boston$medv), the proportion of the data that goes to training (0.80), whether the results should be returned as a list (FALSE), and the number of partitions to create (1), respectively. The result from using createDataPartition() is a vector of indices one can use to create the training set.

library(caret) # load the caret package
library(MASS)  # load MASS package
set.seed(3178) # set seed for reproducibility

trainIndexB <- createDataPartition(y = Boston$medv,
                                   p = 0.80,
                                   list = FALSE,
                                   times = 1)

trainingB <- Boston[trainIndexB, ]
testingB <- Boston[-trainIndexB, ]

dim(trainingB) # Check the dimension of the  training set
[1] 407  14
dim(testingB) # Check the dimension of the testing set
[1] 99 14

  1. Partition the data frame bodyfatClean into training and testing partitions where roughly 80% of the data is used for training and roughly 20% of the data is used for testing. To ensure reproducibility of the partition, use set.seed(314). The response variable should be brozek_C (the computed brozek based on the reported density).
# Type your code and comments inside the code chunk

library(caret)
library(MASS)
set.seed(314)

bodyfatCleanIndex <- createDataPartition(y = bodyfatClean$brozek_C,
                                         p = 0.80,
                                         list = FALSE,
                                         times = 1)
trainingBodyfatClean <- bodyfatClean[bodyfatCleanIndex, ]
testingBodyfatClean <- bodyfatClean[-bodyfatCleanIndex, ]

  1. Use the dim() function to verify the sizes of the training and testing data sets.
# Type your code and comments inside the code chunk

dim(trainingBodyfatClean)
[1] 203  18
dim(testingBodyfatClean)
[1] 48 18

The training data set has 203 observations and the testing data set has 48 observations; each has 18 variables.


2 Pre-Processing the Data

Some algorithms work better when the predictors are on the same scale. This section considers the preProcess() function from the caret package to find potentially helpful transformations of the training predictors. Three different transformations are considered: center, scale, and BoxCox. A center transform computes the mean of a variable and subtracts the computed mean from each value of the variable. A scale transform computes the standard deviation of a variable and divides each value of the variable by the computed standard deviation. Using both a center and a scale transform standardizes a variable. That is, using both center and scale on a variable creates a variable with a mean of 0 and a standard deviation of 1. When all values of a variable are positive, a BoxCox transform will reduce the skew of a variable, making it more Gaussian.
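
As a quick check of the claim that centering and scaling standardizes a variable, the minimal sketch below standardizes a single numeric vector by hand (the built-in mtcars$mpg is used here only for illustration) and verifies that the result has mean 0 and standard deviation 1.

# Minimal sketch: center and scale a variable by hand
x <- mtcars$mpg                      # any numeric variable
x_std <- (x - mean(x)) / sd(x)       # subtract the mean, then divide by the standard deviation
round(c(mean(x_std), sd(x_std)), 4)  # should be 0 and 1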

The following R Code applies a center, scale, and BoxCox transform to all the predictors in trainingB (the training set for the Boston data) and stores the results in pp_trainingB. The computed transformations are applied to both the trainingB and the testingB data sets using the predict() function with the results stored in the objects trainingTransB and testingTransB, respectively. Note that in the Boston data set the response (medv) is the last column (\(14^{\text{th}}\)) of the training data frame and is removed before pre-processing with trainingB[ , -14].

pp_trainingB <- preProcess(trainingB[ , -14],
                           method = c("center", "scale", "BoxCox"))
pp_trainingB
Created from 407 samples and 13 variables

Pre-processing:
  - Box-Cox transformation (11)
  - centered (13)
  - ignored (0)
  - scaled (13)

Lambda estimates for Box-Cox transformation:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  -1.00   -0.15    0.20    0.40    0.90    2.00 
trainingTransB <- predict(pp_trainingB, trainingB)
testingTransB <- predict(pp_trainingB, testingB)

Your turn now to work with the bodyfatClean data frame.


  1. Provide the column number of bodyfatClean where brozek_C is stored.
# Type your code and comment inside the code chunk

which(colnames(bodyfatClean) == "brozek_C")
[1] 14

The column number of bodyfatClean where brozek_C is stored is 14.


  1. Use the preProcess() function to transform the predictors that are in the training data set created in Problem 6. Specifically, pass a vector with “center,” “scale,” and “BoxCox” to the method = argument of preProcess(). Make sure not to transform the response (brozek_C). Store the results in an object named pp_training.
# Type your code and comments inside the code chunk

pp_training <- preProcess(trainingBodyfatClean[, -14],
                                      method = c("center", "scale", "BoxCox"))

pp_training
Created from 203 samples and 17 variables

Pre-processing:
  - Box-Cox transformation (17)
  - centered (17)
  - ignored (0)
  - scaled (17)

Lambda estimates for Box-Cox transformation:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-1.6000 -1.0000 -0.4000 -0.2706  0.3000  2.0000 

  1. Use the predict() function to construct a transformed training set and a transformed testing set. Name the new transformed data sets trainingTrans and testingTrans, respectively.
# Type your code and comments inside the code chunk

trainingTrans <- predict(pp_training, trainingBodyfatClean)
testingTrans <- predict(pp_training, testingBodyfatClean)

3 \(k\)-Fold Cross Validation

\(k\)-fold cross validation divides the data into \(k\) subsets and performs the holdout method \(k\) times. Specifically, one of the \(k\) subsets is used as the test set and the other \(k - 1\) subsets are put together to form a training set. The average MSE across all \(k\) trials is computed and is denoted \(CV_{(k)}\).
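
The sketch below illustrates how \(CV_{(k)}\) is computed by coding 5-fold cross-validation by hand for a simple linear model on the built-in mtcars data; it is only meant to make the idea concrete, since the caret functions used later handle the resampling automatically.

# Minimal sketch: 5-fold cross-validation computed by hand
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))    # randomly assign each row to one of k folds
mse <- numeric(k)
for (i in 1:k) {
  fit <- lm(mpg ~ wt + hp, data = mtcars[folds != i, ]) # train on the other k - 1 folds
  pred <- predict(fit, newdata = mtcars[folds == i, ])  # predict the held-out fold
  mse[i] <- mean((mtcars$mpg[folds == i] - pred)^2)     # MSE for this fold
}
mean(mse)  # CV_(k): the average MSE across the k folds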

4 Resampling with caret

The trainControl() function from caret specifies the resampling procedure to be used inside the train() function. Resampling procedures include \(k\)-fold cross-validation (once or repeated), leave-one-out cross-validation, and bootstrapping. The following R code creates a myControlB object that signals a 10-fold cross-validation scheme repeated five times (50 resamples in total) to the train() function for the Boston data set. Note that the argument savePredictions = "final" saves the hold-out predictions for the optimal tuning parameters.

myControlB <- trainControl(method = "repeatedcv",
                          number = 10,
                          repeats = 5,
                          savePredictions = "final")

  1. Use the trainControl() function to define the resampling method (repeated cross-validation), the number of resampling iterations (10), and the number of repeats or complete sets to generate (5), storing the results in the object myControl.
# Type your code and comments inside the code chunk
# Define the type of resampling

myControl <- trainControl(method = "repeatedcv",
                          number = 10,
                          repeats = 5,
                          savePredictions = "final")

5 Building Linear Models

5.1 Least squares model

Let’s denote the response variable by \(Y\) (always quantitative). \(p\) predictor variables will be denoted by \(X_1, X_2,\ldots, X_p\) (all quantitative). Some authors refer to the response variable as the dependent variable while the predictor variables are sometimes referred to as independent variables, explanatory variables, features or covariates. The relationship between \(Y\) and \(X_1, X_2,\ldots, X_p\) will be expressed as

\[\begin{equation} Y = f(X_1, X_2,\ldots, X_p) + \epsilon. \tag{5.1} \end{equation}\]

In (5.1) \(f\) is some fixed but unknown function of \(X_1, X_2,\ldots, X_p\), and \(\epsilon\) is a random error term independent of \(X_1, X_2,\ldots, X_p\) with mean zero. For predictive models, the goal is to estimate \(f\) with \(\hat{f}\) so that \(\hat{Y} = \hat{f}(X_1, X_2,\ldots, X_p)\) yields accurate predictions for \(Y\). If one assumes \(f\) is linear in \(X_1, X_2,\ldots, X_p\), the model is expressed as the linear model as shown in (5.2).

\[\begin{equation} Y = f(X) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_pX_p + \epsilon \tag{5.2} \end{equation}\]

By assuming the functional form of \(f\) is linear, the problem becomes one of finding estimates for the parameters \(\beta_0, \beta_1, \ldots, \beta_p\) which is generally done with least squares. The least squares estimators are obtained by minimizing the sum of squared residuals (RSS), where

\[\begin{equation} \text{RSS} = \sum_{i=1}^n(y_i-\hat{y}_i)^2, \tag{5.3} \end{equation}\]

and \(\hat{y}_i = \hat\beta_0 + \hat\beta_1X_1 + \hat\beta_2X_2 + \cdots + \hat\beta_pX_p.\)
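
As a small illustration of (5.3), the sketch below fits a least squares model with lm() on the built-in mtcars data and computes the RSS directly from the fitted values; the coefficient estimates returned by lm() are exactly the values that minimize this quantity.

# Minimal sketch: the RSS of a least squares fit
fit <- lm(mpg ~ wt + hp, data = mtcars)   # least squares estimates of the betas
RSS <- sum((mtcars$mpg - fitted(fit))^2)  # equation (5.3)
c(RSS, sum(residuals(fit)^2))             # the two computations agree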

To fit a model with a particular algorithm, the name of the algorithm is given to the method argument of the train() function. The train() function accepts a formula interface provided the data is also specified in the function. The following R code fits a linear model by regressing medv on all of the predictors in the training data set using the dot indicator which selects all predictors (medv ∼ .). The preferred way to train a model is by passing the response vector to the y argument and a data frame of the predictors or a matrix of the predictors to the x argument of train(). Both approaches are shown in the R code below.

# Train linear model with  `method = "lm"` to fit a Least Squares Regression model

# Approach 1
set.seed(31)
mod_lm <- train(medv ~ .,
                data = trainingTransB,
                trControl = myControlB,
                method = "lm")

# Approach 2
set.seed(31)
mod_lm2 <- train(y = trainingTransB$medv,    # response
                 x = trainingTransB[, -14],  # predictors
                 trControl = myControlB,
                 method = "lm")

mod_lm2$results       # CV results
  intercept     RMSE  Rsquared      MAE    RMSESD RsquaredSD    MAESD
1      TRUE 4.513069 0.7653955 3.328636 0.8140112 0.07911475 0.472427
mod_lm2$results$RMSE  # RMSE
[1] 4.513069
# Approach 3
# Approaches 3 and 4 pre-process inside train(), so the transformations are
# re-estimated within each resample and the results differ slightly from Approaches 1 and 2
set.seed(31)
mod_lm3 <- train(y = trainingB$medv,
                 x = trainingB[, -14],
                 # data = trainingB,
                 trControl = myControlB,
                 preProcess = c("center", "scale", "BoxCox"),
                 method = "lm")
mod_lm3$results
  intercept     RMSE  Rsquared      MAE    RMSESD RsquaredSD     MAESD
1      TRUE 4.508361 0.7657758 3.327152 0.8079126 0.07917714 0.4632514
#
set.seed(31)
mod_lm4 <- train(medv ~. ,
                 data = trainingB,
                 trControl = myControlB,
                 preProcess = c("center", "scale", "BoxCox"),
                 method = "lm")
mod_lm4$results
  intercept     RMSE  Rsquared      MAE    RMSESD RsquaredSD     MAESD
1      TRUE 4.508361 0.7657758 3.327152 0.8079126 0.07917714 0.4632514

The average root mean squared error (RMSE) of the 50 resamples is 4.5131.

summary(mod_lm2)  # summary of lm object

Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.3154  -2.6423  -0.2905   2.1241  22.1259 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  22.5897     0.2208 102.325  < 2e-16 ***
crim          0.7459     0.6103   1.222  0.22236    
zn            0.3244     0.3195   1.015  0.31055    
indus        -0.6480     0.4265  -1.519  0.12947    
chas          0.5081     0.2285   2.224  0.02671 *  
nox          -2.3616     0.5841  -4.043 6.35e-05 ***
rm            1.8848     0.3224   5.846 1.06e-08 ***
age           0.1898     0.4186   0.453  0.65048    
dis          -3.8574     0.5130  -7.519 3.77e-13 ***
rad           1.3638     0.4755   2.868  0.00435 ** 
tax          -1.9369     0.4335  -4.469 1.03e-05 ***
ptratio      -1.6714     0.2847  -5.871 9.23e-09 ***
black         0.7120     0.2642   2.694  0.00735 ** 
lstat        -5.4608     0.4089 -13.356  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.454 on 393 degrees of freedom
Multiple R-squared:  0.7785,    Adjusted R-squared:  0.7712 
F-statistic: 106.3 on 13 and 393 DF,  p-value: < 2.2e-16

  1. Use the train() function with method = "lm" and assign the object myControl to the trControl argument of the train() function to fit a least squares regression model where the goal is to predict body fat using the pre-processed data in trainingTrans created in Problem 10. Use brozek_C as the response and store the results of train() in mod_lm. Use set.seed(42) for reproducibility. Use mod_lm$results to investigate the model with minimum RMSE. Use the summary() function to view the coefficient estimates of the final model.
# Type your code and comments inside the code chunk


set.seed(42)
mod_lm <- train(brozek_C ~., 
                data = trainingTrans,
                trControl = myControl,
                method = "lm")


mod_lm$results
  intercept     RMSE  Rsquared      MAE    RMSESD RsquaredSD     MAESD
1      TRUE 4.069335 0.7276702 3.375117 0.5279073 0.08777636 0.4501699
summary(mod_lm)

Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.6677 -2.8107 -0.5457  2.8280  9.6054 

Coefficients: (1 not defined because of singularities)
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    19.060591   0.275500  69.186   <2e-16 ***
age             0.586456   0.425226   1.379   0.1695    
weight_lbs      0.654684   3.765901   0.174   0.8622    
height_in      -4.765988   1.985589  -2.400   0.0174 *  
neck_cm        -0.786123   0.533265  -1.474   0.1421    
chest_cm       -1.859286   0.898798  -2.069   0.0400 *  
abdomen_cm    -10.158945  27.125224  -0.375   0.7084    
hip_cm         -1.587251   0.965760  -1.644   0.1020    
thigh_cm        0.033781   0.740120   0.046   0.9636    
knee_cm         0.196184   0.581074   0.338   0.7360    
ankle_cm       -0.008401   0.489140  -0.017   0.9863    
biceps_cm       0.393078   0.511790   0.768   0.4434    
forearm_cm      0.406429   0.508976   0.799   0.4256    
wrist_cm       -0.053554   2.469653  -0.022   0.9827    
bmi_C          20.221467   9.666750   2.092   0.0378 *  
age_sq                NA         NA      NA       NA    
abdomen_wrist  17.101771  25.680106   0.666   0.5063    
am            -18.865536   8.053578  -2.343   0.0202 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.925 on 186 degrees of freedom
Multiple R-squared:  0.7608,    Adjusted R-squared:  0.7402 
F-statistic: 36.97 on 16 and 186 DF,  p-value: < 2.2e-16

  1. Use the corrplot() function from the corrplot package written by Wei and Simko (2021) to identify predictors that may be linearly related in trainingTrans. Are any of the variables collinear? If so, remove the predictor that is least correlated to the response variable. Note that when method = "number" is used with corrplot(), color-coded numerical correlations are displayed.
# Type your code and comments inside the code chunk
# Identifying linearly related predictors

library(corrplot)
cor_mat <- cor(trainingTrans)  # correlation matrix of the transformed training data
corrplot(cor_mat, type = 'upper', method = "number")

# Drop abdomen_cm (column 6) and age_sq (column 16), which are collinear with
# abdomen_wrist and age; column 14 (brozek_C) is the response, not a predictor
trainingTrans_reduced <- trainingTrans[, -c(6, 14, 16)]

Variables that are collinear include age and age_sq, as well as abdomen_cm and abdomen_wrist.


5.2 Forward selection

Forward selection starts with the null model (a model created from regressing the response on the intercept) then adds predictors one at a time to the model by regressing the response on all of the possible predictors one at a time and adding the predictor that makes the RSS as small as possible or that creates the largest \(R^2\). This process continues until all variables have been added to the model or a criterion such as a p-value greater than some predetermined value (such as 0.20) tells the algorithm to stop adding variables to the model.
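
The leapForward method used below is built on the regsubsets() function from the leaps package, so forward selection can also be run directly; the minimal sketch below (built-in mtcars data, illustrative settings) shows which predictors enter at each step.

# Minimal sketch: forward selection with regsubsets() from the leaps package
library(leaps)
fs <- regsubsets(mpg ~ ., data = mtcars, nvmax = 5, method = "forward")
summary(fs)$which  # which predictors are included at each model size
coef(fs, id = 3)   # coefficients of the three-variable submodel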


  1. Use the train() function with method = "leapForward", tuneLength = 10 and assign the object myControl to the trControl argument of the train() function to fit a forward selection model where the goal is to predict body fat using the pre-processed data trainingTrans from Problem 10. Use brozek_C as the response variable and store the results of train() in mod_FS. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated.

    1. Use mod_FS$results to investigate the model that resulted in the minimum RMSE (final submodel).
    2. Use mod_FS$bestTune to find out how many variables were selected in the final submodel (nvmax).
    3. Use the summary(mod_FS) function and find the names of the variables selected in the final submodel with minimum RMSE. (Hint: Go to the part of the output under Selection Algorithm: forward, then look at the row which agrees with the nvmax from B. The variables with "*"s are the variables selected in the final submodel.) To find the actual coefficients, one can use coef(mod_FS$finalModel, id = mod_FS$bestTune$nvmax).
# Type your code and comments inside the code chunk
# Forward Selection

set.seed(42)
mod_FS <- train(y = trainingTrans$brozek_C, 
                x = trainingTrans[, -c(6,14,16)],
                trControl = myControl,
                method = "leapForward",
                tuneLength = 10)
# A. 
mod_FS$results
   nvmax     RMSE  Rsquared      MAE    RMSESD RsquaredSD     MAESD
1      2 4.026849 0.7313614 3.321276 0.5403124 0.09130784 0.4919037
2      3 4.036475 0.7297018 3.343899 0.5193163 0.08823697 0.4734673
3      4 4.052559 0.7286131 3.364212 0.5138460 0.08627183 0.4693180
4      5 4.096457 0.7224298 3.394292 0.5375727 0.09024240 0.4917228
5      6 4.092255 0.7238529 3.388433 0.5306400 0.09030599 0.4994959
6      7 4.067525 0.7273453 3.364281 0.5442005 0.09010189 0.5080472
7      8 4.099173 0.7238989 3.393444 0.5294074 0.08865337 0.5031480
8      9 4.114850 0.7214243 3.400253 0.5155552 0.09118349 0.4876967
9     10 4.103190 0.7228546 3.383874 0.5205166 0.09132070 0.4788808
10    11 4.058201 0.7289529 3.362046 0.5267089 0.08937780 0.4617132

Using mod_FS$results, the smallest RMSE belongs to model number 1 (nvmax = 2).

# B.
mod_FS$bestTune
  nvmax
1     2

Using mod_FS$bestTune, the final submodel has an nvmax of 2 variables.

# C. 

summary(mod_FS)
Subset selection object
15 Variables  (and intercept)
              Forced in Forced out
age               FALSE      FALSE
weight_lbs        FALSE      FALSE
height_in         FALSE      FALSE
neck_cm           FALSE      FALSE
chest_cm          FALSE      FALSE
hip_cm            FALSE      FALSE
thigh_cm          FALSE      FALSE
knee_cm           FALSE      FALSE
ankle_cm          FALSE      FALSE
biceps_cm         FALSE      FALSE
forearm_cm        FALSE      FALSE
wrist_cm          FALSE      FALSE
bmi_C             FALSE      FALSE
abdomen_wrist     FALSE      FALSE
am                FALSE      FALSE
1 subsets of each size up to 2
Selection Algorithm: forward
         age weight_lbs height_in neck_cm chest_cm hip_cm thigh_cm knee_cm
1  ( 1 ) " " " "        " "       " "     " "      " "    " "      " "    
2  ( 1 ) " " " "        "*"       " "     " "      " "    " "      " "    
         ankle_cm biceps_cm forearm_cm wrist_cm bmi_C abdomen_wrist am 
1  ( 1 ) " "      " "       " "        " "      " "   "*"           " "
2  ( 1 ) " "      " "       " "        " "      " "   "*"           " "
coef(mod_FS$finalModel, id = mod_FS$bestTune$nvmax)
  (Intercept)     height_in abdomen_wrist 
    19.060591     -1.466297      6.628864 

The variables used in the final model are height_in and abdomen_wrist.


  1. Compute the RMSE for mod_FS using the testing data set.
# Type your code and comments inside the code chunk
# Computing RMSE on the testing set

yhat <- predict(mod_FS, testingTrans)
y <- testingTrans$brozek_C
rmse <- sqrt(mean((y - yhat)^2))
rmse
[1] 4.20519

The RMSE for mod_FS computed on the testing data set was 4.2052.


5.3 Backward elimination

Backward elimination starts with all predictors in the model (full model) and removes the least significant variables one at a time. The algorithm stops removing variables once a stopping criterion such as p-value less than some predetermined value (such as 0.05) for all predictors in the model is achieved. When the data has more predictors (\(p\)) than observations (\(n\)), backward elimination is no longer a viable approach since the least squares estimates are not unique. While forward selection is a viable approach when \(n < p\), constrained regression is sometimes a better choice by increasing the bias for a relatively large reduction in variance.
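
Analogous to the forward selection sketch above, backward elimination can be run directly with regsubsets() from the leaps package (the engine behind caret's leapBackward method); in this minimal sketch on the built-in mtcars data, BIC is used as one possible stopping rule in place of a p-value cutoff.

# Minimal sketch: backward elimination with regsubsets() from the leaps package
library(leaps)
be <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10, method = "backward")
summary(be)$which           # predictors retained at each model size
which.min(summary(be)$bic)  # model size favored by BIC (one possible stopping rule)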


  1. Use the train() function with method = "leapBackward", tuneLength = 10 and assign the object myControl to the trControl argument of the train() function to fit a backward selection model where the goal is to predict body fat. Use brozek_C as the response and store the results of train() in mod_BE. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated.

    1. Use mod_BE$results to investigate which model results in the minimum RMSE (final submodel).
    2. Use mod_BE$bestTune to find out how many variables were selected in the final submodel (nvmax).
    3. Use the summary(mod_BE) function and find the names of the variables selected in the final submodel with minimum RMSE. (Hint: Go to the part of the output under Selection Algorithm: backward, then look at the row which agrees with the nvmax from B. The variables with "*"s are the variables selected in the final submodel.) To find the actual coefficients, one can use coef(mod_BE$finalModel, id = mod_BE$bestTune$nvmax).
# Type your code and comments inside the code chunk
# Fit model with backward elimination


set.seed(42)
mod_BE <- train(y = trainingTrans$brozek_C, 
                x = trainingTrans[, -c(6,14,16)],
                trControl = myControl,
                method = "leapBackward",
                tuneLength = 10)
# A. 
mod_BE$results
   nvmax     RMSE  Rsquared      MAE    RMSESD RsquaredSD     MAESD
1      2 4.026157 0.7330041 3.319767 0.5336515 0.08900824 0.4892633
2      3 4.035584 0.7313086 3.339755 0.5119942 0.08719440 0.4708414
3      4 4.057963 0.7291062 3.380256 0.5081399 0.08448765 0.4447725
4      5 4.051860 0.7281567 3.385981 0.5183412 0.08811846 0.4654395
5      6 4.081451 0.7242761 3.403985 0.5152274 0.08793244 0.4497012
6      7 4.016873 0.7333198 3.336306 0.5292507 0.08613017 0.4448289
7      8 4.032096 0.7320207 3.347913 0.5075824 0.08491969 0.4396981
8      9 4.033619 0.7313225 3.343201 0.5141581 0.08388861 0.4406242
9     10 3.989002 0.7370281 3.308391 0.5362734 0.08656383 0.4595357
10    11 4.015134 0.7340522 3.330248 0.5276127 0.08650233 0.4549319

Using mod_BE$results, the smallest RMSE belongs to model number 9 (nvmax = 10).

# B.

mod_BE$bestTune
  nvmax
9    10

Using mod_BE$bestTune, the final submodel has an nvmax of 10 variables.

# C.

summary(mod_BE)
Subset selection object
15 Variables  (and intercept)
              Forced in Forced out
age               FALSE      FALSE
weight_lbs        FALSE      FALSE
height_in         FALSE      FALSE
neck_cm           FALSE      FALSE
chest_cm          FALSE      FALSE
hip_cm            FALSE      FALSE
thigh_cm          FALSE      FALSE
knee_cm           FALSE      FALSE
ankle_cm          FALSE      FALSE
biceps_cm         FALSE      FALSE
forearm_cm        FALSE      FALSE
wrist_cm          FALSE      FALSE
bmi_C             FALSE      FALSE
abdomen_wrist     FALSE      FALSE
am                FALSE      FALSE
1 subsets of each size up to 10
Selection Algorithm: backward
          age weight_lbs height_in neck_cm chest_cm hip_cm thigh_cm knee_cm
1  ( 1 )  " " " "        " "       " "     " "      " "    " "      " "    
2  ( 1 )  " " " "        "*"       " "     " "      " "    " "      " "    
3  ( 1 )  " " " "        "*"       " "     " "      " "    " "      " "    
4  ( 1 )  " " " "        "*"       " "     " "      " "    " "      " "    
5  ( 1 )  " " " "        "*"       " "     "*"      " "    " "      " "    
6  ( 1 )  " " " "        "*"       " "     "*"      "*"    " "      " "    
7  ( 1 )  " " " "        "*"       " "     "*"      "*"    " "      " "    
8  ( 1 )  " " " "        "*"       "*"     "*"      "*"    " "      " "    
9  ( 1 )  "*" " "        "*"       "*"     "*"      "*"    " "      " "    
10  ( 1 ) "*" " "        "*"       "*"     "*"      "*"    " "      " "    
          ankle_cm biceps_cm forearm_cm wrist_cm bmi_C abdomen_wrist am 
1  ( 1 )  " "      " "       " "        " "      " "   "*"           " "
2  ( 1 )  " "      " "       " "        " "      " "   "*"           " "
3  ( 1 )  " "      " "       " "        " "      " "   "*"           "*"
4  ( 1 )  " "      " "       " "        " "      "*"   "*"           "*"
5  ( 1 )  " "      " "       " "        " "      "*"   "*"           "*"
6  ( 1 )  " "      " "       " "        " "      "*"   "*"           "*"
7  ( 1 )  " "      " "       " "        "*"      "*"   "*"           "*"
8  ( 1 )  " "      " "       " "        "*"      "*"   "*"           "*"
9  ( 1 )  " "      " "       " "        "*"      "*"   "*"           "*"
10  ( 1 ) " "      "*"       " "        "*"      "*"   "*"           "*"
coef(mod_BE$finalModel, id = mod_BE$bestTune$nvmax)
  (Intercept)           age     height_in       neck_cm      chest_cm 
   19.0605911     0.5383486    -4.1470837    -0.7327354    -1.8557702 
       hip_cm     biceps_cm      wrist_cm         bmi_C abdomen_wrist 
   -1.4764111     0.5588755    -0.8520107    19.9908063     7.4474726 
           am 
  -17.9423020 

The variables used in the final model are age, height_in, neck_cm, chest_cm, hip_cm, biceps_cm, wrist_cm, bmi_C, abdomen_wrist, and am.


  1. Compute the RMSE for mod_BE using the testing data set.
# Type your code and comments inside the code chunk
# Computing RMSE on the testing set

yhat <- predict(mod_BE, testingTrans)
y <- testingTrans$brozek_C
rmse <- sqrt(mean((y - yhat)^2))
rmse
[1] 4.000992

The RMSE for mod_BE computed on the testing data set was 4.0010.


5.4 Constrained Regression

Constrained regression shrinks the coefficient estimates towards zero while reducing their variance by adding a small amount of bias. Two popular forms of constrained regression are ridge regression developed by Hoerl and Kennard (1970) and the least absolute shrinkage and selection operator (LASSO) developed by Tibshirani (1996).

5.4.1 Ridge Regression

The ridge regression coefficient estimates denoted by \(\hat{\beta}^R\) are values that minimize \[\begin{equation} \sum_{i=1}^n(y_i-\hat{y}_i)^2 + \lambda\sum_{j=1}^p\beta^2_j = RSS + \lambda\sum_{j=1}^p\beta^2_j, \tag{5.4} \end{equation}\]

where \(\lambda \ge 0\) is a tuning parameter. Similar to least squares, ridge regression seeks estimates that make RSS small. However, the quantity \(\lambda\sum_{j=1}^p\beta^2_j\) is small when the estimates of the \(\beta\)’s are close to zero. That is, the quantity \(\lambda\sum_{j=1}^p\beta^2_j\) is a shrinkage penalty. When \(\lambda = 0\), there is no penalty and ridge regression will return the least squares estimates. As \(\lambda \rightarrow \infty\), the shrinkage penalty grows and the estimates of the \(\beta\)’s approach zero. While least squares regression returns one set of estimates for the \(\beta\)’s, ridge regression returns a different set of estimates, \(\hat{\beta}^R\), for each value of \(\lambda\). Selecting a good value of \(\lambda\) can be accomplished with cross-validation. The final model with ridge regression will include all \(p\) predictors. The penalty used by ridge regression is known as an \(\ell_2\) norm. Using an \(\ell_1\) norm for the shrinkage penalty instead of an \(\ell_2\) norm, results in the lasso model (Tibshirani 1996).

5.4.2 LASSO

The lasso coefficients, \(\hat{\beta}^L\), minimize the quantity

\[\begin{equation} \sum_{i=1}^n(y_i-\hat{y}_i)^2 + \lambda\sum_{j=1}^p|\beta_j| = RSS + \lambda\sum_{j=1}^p|\beta_j| \tag{5.5} \end{equation}\]

As with ridge regression, lasso returns a set of coefficient estimates, \(\hat{\beta}^L\), for each value of \(\lambda\). Selecting a good value of \(\lambda\) (known as tuning) can be accomplished with cross-validation. Unlike the final ridge regression model which includes all \(p\) predictors, the lasso model performs variable selection and returns coefficient estimates for only a subset of the original predictors making the final model easier to interpret. Unfortunately, the lasso does not handle correlated variables very well (Hastie, Tibshirani, and Wainwright 2015). The elastic net (Zou and Hastie 2005) provides a compromise between ridge and lasso penalties.
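
Because the glmnet method in caret wraps the glmnet package, the variable-selection behavior of the lasso can be seen directly with glmnet(); the minimal sketch below (built-in mtcars data, an illustrative lambda of 1) shows that some coefficient estimates are shrunk exactly to zero.

# Minimal sketch: lasso with glmnet(); some coefficients shrink exactly to zero
library(glmnet)
X <- as.matrix(mtcars[, -1])          # predictors (every column except mpg)
y <- mtcars$mpg                       # response
lasso_fit <- glmnet(X, y, alpha = 1)  # alpha = 1 gives the lasso penalty
coef(lasso_fit, s = 1)                # coefficients at lambda = 1; '.' entries are exactly zero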

5.4.3 Elastic net

The elastic net minimizes the quantity

\[\begin{equation} \frac{1}{2}\sum_{i=1}^n(y_i-\hat{y}_i)^2 + \lambda\left[\frac{1}{2}(1 - \alpha)||\beta||_2^2 + \alpha||\beta||_1 \right] , \tag{5.6} \end{equation}\]

where \(||\beta||_2^2\) is the \(\ell_2\) norm of \(\beta\), \(\sum_{j=1}^p\beta^2_j\), and \(||\beta||_1\) is the \(\ell_1\) norm of \(\beta\), \(\sum_{j=1}^p|\beta_j|\). The penalty applied to an individual coefficient ignoring the value of \(\lambda\) is

\[\begin{equation} \frac{1}{2}(1 - \alpha)\beta_j^2 + \alpha|\beta_j|. \tag{5.7} \end{equation}\]

When \(\alpha = 1\), the penalty in (5.7) reduces to the lasso penalty, and when \(\alpha = 0\), the penalty in (5.7) reduces to the ridge penalty. Selecting good values of \(\alpha\) and \(\lambda\) can be accomplished with cross-validation.
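
To make (5.7) concrete, the short sketch below evaluates the per-coefficient penalty (ignoring \(\lambda\)) for a single coefficient value at three choices of \(\alpha\); with \(\alpha = 1\) the penalty is \(|\beta_j|\) and with \(\alpha = 0\) it is \(\beta_j^2/2\).

# Minimal sketch: the elastic net penalty of (5.7) for one coefficient
en_penalty <- function(beta, alpha) 0.5 * (1 - alpha) * beta^2 + alpha * abs(beta)
beta <- 3
c(ridge = en_penalty(beta, 0),    # 0.5 * 3^2 = 4.5
  mixed = en_penalty(beta, 0.5),  # 0.25 * 9 + 0.5 * 3 = 3.75
  lasso = en_penalty(beta, 1))    # |3| = 3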

The following R code fits a ridge regression model for medv using the Boston data with the train() function and method = "glmnet". In the expand.grid() function, use the argument alpha = 0 to specify a ridge model. lambda = seq(0.01, 100, length = 100) creates a user-defined grid of lambda values used in tuning the ridge regression.

# Fit constrained model (elastic net)---Ridge model
set.seed(42)
mod_RidgeB <- train(y = trainingTransB$medv,   # response
                    x = trainingTransB[, -14], # predictors
                    trControl = myControlB,
                    method = "glmnet",
                    tuneGrid = expand.grid(alpha = 0, lambda = seq(0.01, 100, length = 100)))
head(mod_RidgeB$results)
  alpha lambda     RMSE  Rsquared      MAE   RMSESD RsquaredSD     MAESD
1     0   0.01 4.520299 0.7662803 3.240542 1.061121  0.1018991 0.5715856
2     0   1.02 4.549192 0.7640606 3.237639 1.090177  0.1054067 0.5729318
3     0   2.03 4.670858 0.7556412 3.274392 1.152884  0.1144966 0.5801409
4     0   3.04 4.796116 0.7477187 3.344541 1.181727  0.1201087 0.5831796
5     0   4.05 4.916809 0.7403484 3.422373 1.196029  0.1239887 0.5863119
6     0   5.06 5.031277 0.7333883 3.500951 1.203254  0.1268702 0.5908202
min(mod_RidgeB$results$RMSE)
[1] 4.520299
mod_RidgeB$bestTune
  alpha lambda
1     0   0.01
mod_RidgeB$bestTune$lambda
[1] 0.01
mod_RidgeB$bestTune$alpha
[1] 0

mod_RidgeB$results displays the RMSE for each value of the tuning parameter lambda (100 models in this case). mod_RidgeB$bestTune gives the lambda value of the model with the minimum RMSE. In this case, the best lambda value was 0.01, which returned the minimum RMSE of 4.520299.


  1. Use the train() function with method = "glmnet" to fit a constrained linear regression model named mod_Ridge. Assign the object myControl to the trControl argument of the train() function to fit a ridge model where the goal is to predict body fat. Use brozek_C as the response and store the results of train() in mod_Ridge. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated. Produce a plot of RMSE versus the regularization parameter (\(\lambda\)).
# Type your code and comments inside the code chunk
# Ridge model


set.seed(42)
mod_Ridge <- train(y = trainingTrans$brozek_C, 
                  x = trainingTrans[, -c(6,14,16)],
                  trControl = myControl,
                  method = "glmnet")

  1. Report the lambda value of the model with the minimum RMSE.
# Type your code and comments inside the code chunk

head(mod_Ridge$results)
  alpha     lambda     RMSE  Rsquared      MAE    RMSESD RsquaredSD     MAESD
1  0.10 0.01283119 4.066317 0.7278583 3.363177 0.5252320 0.08811865 0.4741673
2  0.10 0.12831190 4.057849 0.7291433 3.344677 0.5279196 0.08346824 0.4847986
3  0.10 1.28311897 4.223791 0.7120242 3.475669 0.5197174 0.08480505 0.4754329
4  0.55 0.01283119 4.079529 0.7264943 3.372080 0.5276779 0.08771606 0.4825501
5  0.55 0.12831190 4.013064 0.7347680 3.312633 0.5170344 0.08303518 0.4738458
6  0.55 1.28311897 4.319633 0.7093070 3.572041 0.4933920 0.09235669 0.4681452
min(mod_Ridge$results$RMSE)
[1] 3.993638
mod_Ridge$bestTune
  alpha    lambda
8     1 0.1283119
mod_Ridge$bestTune$lambda
[1] 0.1283119

The lambda value of the model with the minimum RMSE is 0.1283, which returned a minimum RMSE of 3.9936.


  1. Use the train() function with method = "glmnet" and assign the object myControl to the trControl argument of the train() function to fit a lasso model where the goal is to predict body fat. Use brozek_C as the response and store the results of train() in mod_lasso. In the expand.grid() function set the argument alpha = 1 in order to obtain a lasso model. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated. Produce a plot of RMSE versus the regularization parameter (\(\lambda\)).
# Type your code and comments inside the code chunk
# Lasso model: alpha = 1


set.seed(42)
mod_lasso <- train(y = trainingTrans$brozek_C, 
                   x = trainingTrans[, -c(6,14,16)],
                   trControl = myControl,
                   method = "glmnet",
                   tuneGrid = expand.grid(alpha = 1, lambda = seq(0.01, 100, length = 100)))


plot(mod_lasso)

  1. Report the lambda value of the model with the minimum RMSE.
# Type your code and comments inside the code chunk

head(mod_lasso$results)
  alpha     lambda     RMSE  Rsquared      MAE    RMSESD RsquaredSD     MAESD
1  0.10 0.01283119 4.066317 0.7278583 3.363177 0.5252320 0.08811865 0.4741673
2  0.10 0.12831190 4.057849 0.7291433 3.344677 0.5279196 0.08346824 0.4847986
3  0.10 1.28311897 4.223791 0.7120242 3.475669 0.5197174 0.08480505 0.4754329
4  0.55 0.01283119 4.079529 0.7264943 3.372080 0.5276779 0.08771606 0.4825501
5  0.55 0.12831190 4.013064 0.7347680 3.312633 0.5170344 0.08303518 0.4738458
6  0.55 1.28311897 4.319633 0.7093070 3.572041 0.4933920 0.09235669 0.4681452
min(mod_lasso$results$RMSE)
[1] 3.993638
mod_lasso$bestTune
  alpha    lambda
8     1 0.1283119
mod_lasso$bestTune$lambda
[1] 0.1283119

The lambda value of the model with the minimum RMSE is 0.1283, which returned a minimum RMSE of 3.9936.


  1. Use the train() function with method = "glmnet" and assign the object myControl to the trControl argument of the train() fucntion where the goal is to predict body fat using an elastic net model. Use brozek_C as the response and store the results of train() in mod_EN. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated. Use a custom tuning grid with arguments to expand.grid() of alpha = seq(0.1, 0.5, length = 11) and lambda = seq(0.01, 100, length = 20). Produce a plot of RMSE versus the regularization parameter (\(\lambda\)).
# Type your code and comments inside the code chunk
# Elastic net model



set.seed(42)
mod_EN <- train(y = trainingTrans$brozek_C,  
                x = trainingTrans[, -c(6,14,16)],
                trControl = myControl, 
                method = "glmnet",
                tuneGrid = expand.grid(alpha = seq(0.1, 0.5, length = 11), lambda = seq(0.01, 100, length = 100)))
plot(mod_EN)


  1. Report the lambda and alpha values of the model with the minimum RMSE.
# Type your code and comments inside the code chunk

head(mod_EN$results)
  alpha lambda      RMSE  Rsquared       MAE     RMSESD   RsquaredSD      MAESD
1   0.1   0.01 0.2309676 0.9994439 0.1875875 0.02774899 0.0002483409 0.02724136
2   0.1   1.02 1.0826012 0.9890666 0.8843221 0.11657826 0.0047021887 0.11263046
3   0.1   2.03 1.6785605 0.9776534 1.3826233 0.17272136 0.0092733681 0.15818371
4   0.1   3.04 2.1219204 0.9696317 1.7555153 0.21890105 0.0123563454 0.19029540
5   0.1   4.05 2.4918197 0.9640016 2.0651008 0.26110397 0.0144641787 0.21856018
6   0.1   5.06 2.8162376 0.9599188 2.3370547 0.30002053 0.0159804263 0.24295944
min(mod_EN$results$RMSE)
[1] 0.2292835
mod_EN$bestTune
    alpha lambda
801  0.42   0.01
mod_EN$bestTune$alpha
[1] 0.42
mod_EN$bestTune$lambda
[1] 0.01

The lambda value of the model with the minimum RMSE is 0.01 and the alpha value of the model with the minimum RMSE is 0.42.


This material is released under an Attribution-NonCommercial-ShareAlike 3.0 United States license. Original author: Alan T. Arnholt


References

Ambler, Gareth, and Axel Benner. 2022. Mfp: Multivariable Fractional Polynomials. https://CRAN.R-project.org/package=mfp.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. 2 edition. Springer.
Hastie, Trevor, Robert Tibshirani, and Martin Wainwright. 2015. Statistical Learning with Sparsity: The Lasso and Generalizations. 1 edition. Boca Raton: Chapman; Hall/CRC.
Hoerl, Arthur E., and Robert W. Kennard. 1970. “Ridge Regression: Biased Estimation for Nonorthogonal Problems.” Technometrics 12 (1): 55–67. https://doi.org/10.2307/1267351.
Johnson, Roger W. 1996. “Fitting Percentage of Body Fat to Simple Body Measurements.” Journal of Statistics Education 4 (1). https://doi.org/10.1080/10691898.1996.11910505.
Matloff, Norman. 2017. Statistical Regression and Classification: From Linear Models to Machine Learning. 1 edition. Boca Raton: Chapman; Hall/CRC.
Ripley, Brian. 2022. MASS: Support Functions and Datasets for Venables and Ripley’s MASS. http://www.stats.ox.ac.uk/pub/MASS4/.
Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society. Series B (Methodological) 58 (1): 267–88. http://www.jstor.org/stable/2346178.
Wei, Taiyun, and Viliam Simko. 2021. Corrplot: Visualization of a Correlation Matrix. https://github.com/taiyun/corrplot.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2022. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
Zou, Hui, and Trevor Hastie. 2005. “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2): 301–20. https://doi.org/10.1111/j.1467-9868.2005.00503.x.