Type complete sentences to answer all questions inside the answer
tags provided in the R Markdown document. Round all numeric answers you report inside the answer tags to four decimal places. Use inline R
code to report numeric answers inside the answer
tags (i.e. do not hard code your numeric answers).
In the article Fitting Percentage of Body Fat to Simple Body Measurements, Johnson (1996) uses the data at http://jse.amstat.org/datasets/fat.dat.txt provided to him by Dr. A. Garth Fischer in a personal communication on October 5, 1994, as a multiple linear regression activity with his students. A subset of the variables at http://jse.amstat.org/datasets/fat.dat.txt is available in the R package mfp by Ambler and Benner (2022) and the data set is used frequently in the text Statistical Regression and Classification by Matloff (2017).
The purpose of this activity is to have the reader create several regression models to predict the body fat of males. Load a cleaned version of the data available from https://raw.githubusercontent.com/alanarnholt/MISCD/master/bodyfatClean.csv into your R
session using the read.csv()
function. Use the head()
function to view the first six rows of the data frame bodyfatClean
.
# Type your code and comments inside the code chunk
library(readr)
<- read.csv("https://raw.githubusercontent.com/alanarnholt/MISCD/master/bodyfatClean.csv")
bodyfatClean head(bodyfatClean)
age weight_lbs height_in neck_cm chest_cm abdomen_cm hip_cm thigh_cm knee_cm
1 23 154.25 67.75 36.2 93.1 85.2 94.5 59.0 37.3
2 22 173.25 72.25 38.5 93.6 83.0 98.7 58.7 37.3
3 22 154.00 66.25 34.0 95.8 87.9 99.2 59.6 38.9
4 26 184.75 72.25 37.4 101.8 86.4 101.2 60.1 37.3
5 24 184.25 71.25 34.4 97.3 100.0 101.9 63.2 42.2
6 24 210.25 74.75 39.0 104.5 94.4 107.8 66.0 42.0
ankle_cm biceps_cm forearm_cm wrist_cm brozek_C bmi_C age_sq abdomen_wrist
1 21.9 32.0 27.4 17.1 12.6 23.6 529 68.1
2 23.4 30.5 28.9 18.2 6.9 23.3 484 64.8
3 24.0 28.8 25.2 16.6 24.6 24.7 484 71.3
4 22.8 32.4 29.4 18.2 10.9 24.9 676 68.2
5 24.0 32.2 27.7 17.7 27.8 25.5 576 82.3
6 25.6 35.7 30.6 18.8 20.5 26.5 576 75.6
am
1 181.9365
2 169.1583
3 195.5067
4 182.7203
5 190.6993
6 190.7289
Use the glimpse() function from the dplyr package written by Wickham et al. (2022) to view the structure of bodyfatClean.
# Type your code and comments inside the code chunk
library(dplyr)
glimpse(bodyfatClean)
Rows: 251
Columns: 18
$ age <int> 23, 22, 22, 26, 24, 24, 26, 25, 25, 23, 26, 27, 32, 30, …
$ weight_lbs <dbl> 154.25, 173.25, 154.00, 184.75, 184.25, 210.25, 181.00, …
$ height_in <dbl> 67.75, 72.25, 66.25, 72.25, 71.25, 74.75, 69.75, 72.50, …
$ neck_cm <dbl> 36.2, 38.5, 34.0, 37.4, 34.4, 39.0, 36.4, 37.8, 38.1, 42…
$ chest_cm <dbl> 93.1, 93.6, 95.8, 101.8, 97.3, 104.5, 105.1, 99.6, 100.9…
$ abdomen_cm <dbl> 85.2, 83.0, 87.9, 86.4, 100.0, 94.4, 90.7, 88.5, 82.5, 8…
$ hip_cm <dbl> 94.5, 98.7, 99.2, 101.2, 101.9, 107.8, 100.3, 97.1, 99.9…
$ thigh_cm <dbl> 59.0, 58.7, 59.6, 60.1, 63.2, 66.0, 58.4, 60.0, 62.9, 63…
$ knee_cm <dbl> 37.3, 37.3, 38.9, 37.3, 42.2, 42.0, 38.3, 39.4, 38.3, 41…
$ ankle_cm <dbl> 21.9, 23.4, 24.0, 22.8, 24.0, 25.6, 22.9, 23.2, 23.8, 25…
$ biceps_cm <dbl> 32.0, 30.5, 28.8, 32.4, 32.2, 35.7, 31.9, 30.5, 35.9, 35…
$ forearm_cm <dbl> 27.4, 28.9, 25.2, 29.4, 27.7, 30.6, 27.8, 29.0, 31.1, 30…
$ wrist_cm <dbl> 17.1, 18.2, 16.6, 18.2, 17.7, 18.8, 17.7, 18.8, 18.2, 19…
$ brozek_C <dbl> 12.6, 6.9, 24.6, 10.9, 27.8, 20.5, 19.0, 12.7, 5.1, 12.0…
$ bmi_C <dbl> 23.6, 23.3, 24.7, 24.9, 25.5, 26.5, 26.2, 23.5, 24.5, 25…
$ age_sq <int> 529, 484, 484, 676, 576, 576, 676, 625, 625, 529, 676, 7…
$ abdomen_wrist <dbl> 68.1, 64.8, 71.3, 68.2, 82.3, 75.6, 73.0, 69.7, 64.3, 69…
$ am <dbl> 181.9365, 169.1583, 195.5067, 182.7203, 190.6993, 190.72…
Now that you have seen the structure of the data and have studied the research question, answer the following questions.
How many observations and how many variables are in bodyfatClean?
# Type your code and comments inside the code chunk
dim(bodyfatClean)
[1] 251 18
There are 251 observations and 18 variables in the bodyfatClean
data set.
The name of the response variable is brozek_C
.
There are 17 predictor variables available to use in this data set.
All of the predictor variables are numerical, which means that none of them are categorical.
When building a predictive model with a sufficiently large data set, it is common practice to hold out some fraction (usually less than 50%) of the data as a test set. It is difficult to provide a general rule for the size of the training
and testing
sets as the ideal split depends on the signal to noise ratio in the data (Hastie, Tibshirani, and Friedman 2009).
Use the createDataPartition() function from the caret package to partition the data into training and testing.
The Boston data set from the MASS package written by Ripley (2022) is used to illustrate various steps in predictive model building. The Boston
help file indicates the data set consists of 506 observations on 14 different variables for houses in Boston collected in 1978. To open the Boston
help file, type ?Boston
at the R prompt once the MASS package has been loaded. The goal is to predict the median house price (medv
) in Boston. The Boston
data set is divided into a training set containing roughly 80% of the observations and a testing set containing roughly 20% of the observations. Before calling the createDataPartition()
function, it is important to set a seed to ensure the data partition is reproducible.
The arguments y, p, list, and times can be used with the createDataPartition() function. These arguments represent a vector of outcomes (Boston$medv), the percentage of data that goes to training (0.80), whether the results should be returned as a list (FALSE), and the number of partitions to create (1), respectively. The result from using createDataPartition() is a vector of indices one can use to create the training set.
library(caret) # load the caret package
library(MASS) # load MASS package
set.seed(3178) # set seed for reproducibility
trainIndexB <- createDataPartition(y = Boston$medv,
                                   p = 0.80,
                                   list = FALSE,
                                   times = 1)
trainingB <- Boston[trainIndexB, ]
testingB <- Boston[-trainIndexB, ]
dim(trainingB) # Check the dimension of the training set
[1] 407 14
dim(testingB) # Check the dimension of the testing set
[1] 99 14
Use the createDataPartition() function to partition bodyfatClean into training and testing partitions where roughly 80% of the data is used for training and roughly 20% of the data is used for testing. To ensure reproducibility of the partition, use set.seed(314). The response variable should be brozek_C (the computed Brozek body fat based on the reported density).
# Type your code and comments inside the code chunk
library(caret)
library(MASS)
set.seed(314)
bodyfatCleanIndex <- createDataPartition(y = bodyfatClean$brozek_C,
                                         p = 0.80,
                                         list = FALSE,
                                         times = 1)
trainingBodyfatClean <- bodyfatClean[bodyfatCleanIndex, ]
testingBodyfatClean <- bodyfatClean[-bodyfatCleanIndex, ]
Use the dim() function to verify the sizes of the training and testing data sets.
# Type your code and comments inside the code chunk
dim(trainingBodyfatClean)
[1] 203 18
dim(testingBodyfatClean)
[1] 48 18
The training data set has 203 observations and the testing data set has 48 observations.
Some algorithms work better when the predictors are on the same scale. This section considers the preProcess() function from the caret package to find potentially helpful transformations of the training predictors. Three different transformations are considered: center
, scale
, and BoxCox
. A center transform computes the mean of a variable and subtracts the computed mean from each value of the variable. A scale transform computes the standard deviation of a variable and divides each value of the variable by the computed standard deviation. Using both a center and a scale transform standardizes a variable. That is, using both center and scale on a variable creates a variable with a mean of 0 and a standard deviation of 1. When all values of a variable are positive, a BoxCox
transform will reduce the skew of a variable, making it more Gaussian.
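As a minimal sketch (not part of the assignment), the three transforms can be computed by hand on a made-up, strictly positive variable x; caret estimates the Box-Cox lambda internally, and MASS::boxcox() is used here only to show where such a lambda comes from.
# Illustration only: center, scale, and Box-Cox applied by hand to a made-up variable
set.seed(1)
x <- rexp(100, rate = 0.2)              # skewed, strictly positive values
x_centered <- x - mean(x)               # center: subtract the mean
x_standard <- (x - mean(x)) / sd(x)     # center + scale: mean 0 and standard deviation 1
c(mean(x_standard), sd(x_standard))     # approximately 0 and 1
library(MASS)
bc <- boxcox(x ~ 1, plotit = FALSE)     # profile likelihood over a grid of lambda values
lambda <- bc$x[which.max(bc$y)]         # lambda that maximizes the likelihood
x_bc <- if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1)/lambda  # Box-Cox transform of x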
The following R code applies a center
, scale
, and BoxCox
transform to all the predictors in trainingB
(the training set for the Boston data) and stores the results in pp_trainingB
. The computed transformations are applied to both the trainingB
and the testingB
data sets using the predict()
function with the results stored in the objects trainingTransB
and testingTransB
, respectively. Note that in the Boston data set the response (medv
) is the last column (\(14^{\text{th}}\)) of the training data frame and is removed before pre-processing with trainingB[ , -14]
.
pp_trainingB <- preProcess(trainingB[ , -14],
                           method = c("center", "scale", "BoxCox"))
pp_trainingB
Created from 407 samples and 13 variables
Pre-processing:
- Box-Cox transformation (11)
- centered (13)
- ignored (0)
- scaled (13)
Lambda estimates for Box-Cox transformation:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.00 -0.15 0.20 0.40 0.90 2.00
trainingTransB <- predict(pp_trainingB, trainingB)
testingTransB <- predict(pp_trainingB, testingB)
Your turn now to work with the bodyfatClean data frame.
Determine the column number of bodyfatClean where brozek_C is stored.
# Type your code and comment inside the code chunk
which(colnames(bodyfatClean) == "brozek_C")
[1] 14
The column number of bodyfatClean
where brozek_C
is stored is 14.
Use the preProcess() function to transform the predictors that are in the training data set created in Problem 6. Specifically, pass a vector with "center", "scale", and "BoxCox" to the method = argument of preProcess(). Make sure not to transform the response (brozek_C). Store the results in an object named pp_training.
# Type your code and comments inside the code chunk
pp_training <- preProcess(trainingBodyfatClean[, -14],
                          method = c("center", "scale", "BoxCox"))
pp_training
Created from 203 samples and 17 variables
Pre-processing:
- Box-Cox transformation (17)
- centered (17)
- ignored (0)
- scaled (17)
Lambda estimates for Box-Cox transformation:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.6000 -1.0000 -0.4000 -0.2706 0.3000 2.0000
Use the predict() function to construct a transformed training set and a transformed testing set. Name the new transformed data sets trainingTrans and testingTrans, respectively.
# Type your code and comments inside the code chunk
trainingTrans <- predict(pp_training, trainingBodyfatClean)
testingTrans <- predict(pp_training, testingBodyfatClean)
\(k\)-fold cross-validation divides the data into \(k\) subsets and performs the holdout method \(k\) times. Specifically, one of the \(k\) subsets is used as the test set and the other \(k - 1\) subsets are put together to form a training set. The average MSE across all \(k\) trials is computed and is denoted \(CV_{(k)}\).
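As a minimal sketch (using the trainingB data frame created above), \(CV_{(k)}\) can be computed by hand for the linear model with a short loop:
# Hand-rolled 10-fold cross-validation for the linear model on trainingB
set.seed(13)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(trainingB)))    # random fold labels
mse <- numeric(k)
for (j in 1:k) {
  fit <- lm(medv ~ ., data = trainingB[folds != j, ])      # fit on the other k - 1 folds
  pred <- predict(fit, newdata = trainingB[folds == j, ])  # predict the held-out fold
  mse[j] <- mean((trainingB$medv[folds == j] - pred)^2)    # MSE on the held-out fold
}
mean(mse)    # CV_(k): the average MSE across the k folds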
The trainControl()
function from caret specifies the resampling procedure to be used inside the train()
function. Resampling procedures include \(k\)-fold cross-validation (once or repeated), leave-one-out cross-validation, and bootstrapping. The following R
code creates a myControlB
object that will signal a 10-fold repeated five times cross-validation scheme (50 resamples in total) to the train()
function for the Boston
data set. Note that the argument savePredictions = "final"
saves the hold-out predictions for the optimal tuning parameters.
<- trainControl(method = "repeatedcv",
myControlB number = 10,
repeats = 5,
savePredictions = "final")
Use the trainControl() function to define the resampling method (repeated cross-validation), the number of resampling iterations (10), and the number of repeats or complete sets to generate (5), storing the results in the object myControl.
# Type your code and comments inside the code chunk
# Define the type of resampling
<- trainControl(method = "repeatedcv",
myControl number = 10,
repeats = 5,
savePredictions = "final")
Let’s denote the response variable by \(Y\) (always quantitative). \(p\) predictor variables will be denoted by \(X_1, X_2,\ldots, X_p\) (all quantitative). Some authors refer to the response variable as the dependent variable while the predictor variables are sometimes referred to as independent variables, explanatory variables, features or covariates. The relationship between \(Y\) and \(X_1, X_2,\ldots, X_p\) will be expressed as
\[\begin{equation} Y = f(X_1, X_2,\ldots, X_p) + \epsilon. \tag{5.1} \end{equation}\]
In (5.1) \(f\) is some fixed but unknown function of \(X_1, X_2,\ldots, X_p\), and \(\epsilon\) is a random error term independent of \(X_1, X_2,\ldots, X_p\) with mean zero. For predictive models, the goal is to estimate \(f\) with \(\hat{f}\) so that \(\hat{Y} = \hat{f}(X_1, X_2,\ldots, X_p)\) yields accurate predictions for \(Y\). If one assumes \(f\) is linear in \(X_1, X_2,\ldots, X_p\), the model is expressed as the linear model as shown in (5.2).
\[\begin{equation} Y = f(X) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_pX_p + \epsilon \tag{5.2} \end{equation}\]
By assuming the functional form of \(f\) is linear, the problem becomes one of finding estimates for the parameters \(\beta_0, \beta_1, \ldots, \beta_p\) which is generally done with least squares. The least squares estimators are obtained by minimizing the sum of squared residuals (RSS), where
\[\begin{equation} \text{RSS} = \sum_{i=1}^n(y_i-\hat{y}_i)^2, \tag{5.3} \end{equation}\]
and \(\hat{y}_i = \hat\beta_0 + \hat\beta_1x_{i1} + \hat\beta_2x_{i2} + \cdots + \hat\beta_px_{ip}.\)
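As a quick numerical check (using the trainingB data frame created above), the least squares fit and the RSS in (5.3) can be computed directly:
fit_ls <- lm(medv ~ ., data = trainingB)  # least squares estimates of beta_0, ..., beta_p
rss <- sum(residuals(fit_ls)^2)           # RSS: the sum of the squared residuals
rss
all.equal(rss, deviance(fit_ls))          # deviance() of an lm fit returns the same quantity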
To fit a model with a particular algorithm, the name of the algorithm is given to the method
argument of the train()
function. The train()
function accepts a formula interface provided the data is also specified in the function. The following R code fits a linear model by regressing medv on all of the predictors in the training data set using the dot indicator, which selects all predictors (medv ~ .)
. The preferred way to train a model is by passing the response vector to the y
argument and a data frame of the predictors or a matrix of the predictors to the x
argument of train()
. Both approaches are shown in the R code below.
# Train linear model with `method = "lm"` to fit a Least Squares Regression model
# Approach 1
set.seed(31)
mod_lm <- train(medv ~ .,
                data = trainingTransB,
                trControl = myControlB,
                method = "lm")
# Approach 2
set.seed(31)
mod_lm2 <- train(y = trainingTransB$medv,   # response
                 x = trainingTransB[, -14], # predictors
                 trControl = myControlB,
                 method = "lm")
mod_lm2$results # CV results
intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 TRUE 4.513069 0.7653955 3.328636 0.8140112 0.07911475 0.472427
mod_lm2$results$RMSE # RMSE
[1] 4.513069
# Approach 3
# Pre-processing specified inside train(): results differ slightly from Approaches 1 and 2
# because the transformations are re-estimated within each resample
set.seed(31)
mod_lm3 <- train(y = trainingB$medv,
                 x = trainingB[, -14],
                 trControl = myControlB,
                 preProcess = c("center", "scale", "BoxCox"),
                 method = "lm")
mod_lm3$results
intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 TRUE 4.508361 0.7657758 3.327152 0.8079126 0.07917714 0.4632514
# Approach 4: formula interface with pre-processing inside train()
set.seed(31)
mod_lm4 <- train(medv ~ .,
                 data = trainingB,
                 trControl = myControlB,
                 preProcess = c("center", "scale", "BoxCox"),
                 method = "lm")
mod_lm4$results
intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 TRUE 4.508361 0.7657758 3.327152 0.8079126 0.07917714 0.4632514
The average root mean squared error (RMSE) of the 50 resamples is 4.5131.
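To see where this average comes from (assuming caret's default returnResamp = "final" behavior, which stores the per-resample metrics in the resample element of the fitted object):
# Each of the 50 resamples (10 folds x 5 repeats) has its own hold-out RMSE;
# the RMSE reported in mod_lm2$results is the average of these 50 values
head(mod_lm2$resample)       # per-resample RMSE, Rsquared, and MAE
mean(mod_lm2$resample$RMSE)  # should match mod_lm2$results$RMSE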
summary(mod_lm2) # summary of lm object
Call:
lm(formula = .outcome ~ ., data = dat)
Residuals:
Min 1Q Median 3Q Max
-13.3154 -2.6423 -0.2905 2.1241 22.1259
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.5897 0.2208 102.325 < 2e-16 ***
crim 0.7459 0.6103 1.222 0.22236
zn 0.3244 0.3195 1.015 0.31055
indus -0.6480 0.4265 -1.519 0.12947
chas 0.5081 0.2285 2.224 0.02671 *
nox -2.3616 0.5841 -4.043 6.35e-05 ***
rm 1.8848 0.3224 5.846 1.06e-08 ***
age 0.1898 0.4186 0.453 0.65048
dis -3.8574 0.5130 -7.519 3.77e-13 ***
rad 1.3638 0.4755 2.868 0.00435 **
tax -1.9369 0.4335 -4.469 1.03e-05 ***
ptratio -1.6714 0.2847 -5.871 9.23e-09 ***
black 0.7120 0.2642 2.694 0.00735 **
lstat -5.4608 0.4089 -13.356 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.454 on 393 degrees of freedom
Multiple R-squared: 0.7785, Adjusted R-squared: 0.7712
F-statistic: 106.3 on 13 and 393 DF, p-value: < 2.2e-16
Use the train() function with method = "lm" and assign the object myControl to the trControl argument of the train() function to fit a least squares regression model where the goal is to predict body fat using the pre-processed data in trainingTrans created in Problem 10. Use brozek_C as the response and store the results of train() in mod_lm. Use set.seed(42) for reproducibility. Use mod_lm$results to investigate the model with minimum RMSE. Use the summary() function to view the coefficient estimates of the final model.
# Type your code and comments inside the code chunk
set.seed(42)
mod_lm <- train(brozek_C ~ .,
                data = trainingTrans,
                trControl = myControl,
                method = "lm")
mod_lm$results
intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 TRUE 4.069335 0.7276702 3.375117 0.5279073 0.08777636 0.4501699
summary(mod_lm)
Call:
lm(formula = .outcome ~ ., data = dat)
Residuals:
Min 1Q Median 3Q Max
-8.6677 -2.8107 -0.5457 2.8280 9.6054
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.060591 0.275500 69.186 <2e-16 ***
age 0.586456 0.425226 1.379 0.1695
weight_lbs 0.654684 3.765901 0.174 0.8622
height_in -4.765988 1.985589 -2.400 0.0174 *
neck_cm -0.786123 0.533265 -1.474 0.1421
chest_cm -1.859286 0.898798 -2.069 0.0400 *
abdomen_cm -10.158945 27.125224 -0.375 0.7084
hip_cm -1.587251 0.965760 -1.644 0.1020
thigh_cm 0.033781 0.740120 0.046 0.9636
knee_cm 0.196184 0.581074 0.338 0.7360
ankle_cm -0.008401 0.489140 -0.017 0.9863
biceps_cm 0.393078 0.511790 0.768 0.4434
forearm_cm 0.406429 0.508976 0.799 0.4256
wrist_cm -0.053554 2.469653 -0.022 0.9827
bmi_C 20.221467 9.666750 2.092 0.0378 *
age_sq NA NA NA NA
abdomen_wrist 17.101771 25.680106 0.666 0.5063
am -18.865536 8.053578 -2.343 0.0202 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.925 on 186 degrees of freedom
Multiple R-squared: 0.7608, Adjusted R-squared: 0.7402
F-statistic: 36.97 on 16 and 186 DF, p-value: < 2.2e-16
Use the corrplot() function from the corrplot package written by Wei and Simko (2021) to identify predictors that may be linearly related in trainingTrans. Are any of the variables collinear? If so, remove the predictor that is least correlated with the response variable. Note that when method = "number" is used with corrplot(), color-coded numerical correlations are displayed.
# Type your code and comments inside the code chunk
# Identifying linearly related predictors
library(corrplot)
cor_mat <- cor(trainingTrans)
corrplot(cor_mat, type = "upper", method = "number")
# Drop abdomen_cm (column 6) and age_sq (column 16), which are collinear with
# abdomen_wrist and age; column 14 (brozek_C) is the response, so it is also excluded
trainingTransPred <- trainingTrans[ , -c(6, 14, 16)]
Some variables that are collinear are age and age_sq, as well as abdomen_cm and abdomen_wrist.
Forward selection starts with the null model (a model created from regressing the response on the intercept) then adds predictors one at a time to the model by regressing the response on all of the possible predictors one at a time and adding the predictor that makes the RSS as small as possible or that creates the largest \(R^2\). This process continues until all variables have been added to the model or a criterion such as a p-value greater than some predetermined value (such as 0.20) tells the algorithm to stop adding variables to the model.
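The sketch below (an illustration, not the caret workflow required in the next problem) runs forward selection directly with regsubsets() from the leaps package, which is the engine behind caret's "leapForward" method:
library(leaps)
X_fs <- as.matrix(trainingTrans[ , -c(6, 14, 16)])       # predictors with the collinear columns dropped
fs <- regsubsets(x = X_fs, y = trainingTrans$brozek_C,   # forward search over submodels
                 method = "forward", nvmax = 10)
summary(fs)$which[2, ]   # variables included in the best two-predictor submodel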
Use the train()
function with method = "leapForward"
, tuneLength = 10
and assign the object myControl
to the trControl
argument of the train()
function to fit a forward selection model where the goal is to predict body fat using the pre-processed data trainingTrans
from Problem 10. Use brozek_C
as the response variable and store the results of train()
in mod_FS
. Use set.seed(42)
for reproducibility. Do not include any predictors that are perfectly correlated.
A. Use mod_FS$results to investigate the model that resulted in the minimum RMSE (final submodel).
B. Use mod_FS$bestTune to find out how many variables were selected in the final submodel (nvmax).
C. Use the summary(mod_FS) function and find the names of the variables selected in the final submodel with minimum RMSE. (Hint: Go to the part of the output under Selection Algorithm: forward, then look at the row which agrees with the nvmax from B. The variables with "*"s are the variables selected in the final submodel.) To find the actual coefficients, one can use coef(mod_FS$finalModel, id = mod_FS$bestTune$nvmax).
# Type your code and comments inside the code chunk
# Forward Selection
set.seed(42)
mod_FS <- train(y = trainingTrans$brozek_C,
                x = trainingTrans[, -c(6, 14, 16)],
                trControl = myControl,
                method = "leapForward",
                tuneLength = 10)
# A.
mod_FS$results
nvmax RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 2 4.026849 0.7313614 3.321276 0.5403124 0.09130784 0.4919037
2 3 4.036475 0.7297018 3.343899 0.5193163 0.08823697 0.4734673
3 4 4.052559 0.7286131 3.364212 0.5138460 0.08627183 0.4693180
4 5 4.096457 0.7224298 3.394292 0.5375727 0.09024240 0.4917228
5 6 4.092255 0.7238529 3.388433 0.5306400 0.09030599 0.4994959
6 7 4.067525 0.7273453 3.364281 0.5442005 0.09010189 0.5080472
7 8 4.099173 0.7238989 3.393444 0.5294074 0.08865337 0.5031480
8 9 4.114850 0.7214243 3.400253 0.5155552 0.09118349 0.4876967
9 10 4.103190 0.7228546 3.383874 0.5205166 0.09132070 0.4788808
10 11 4.058201 0.7289529 3.362046 0.5267089 0.08937780 0.4617132
Using mod_FS$results, the model with the smallest RMSE is model number 1.
# B.
mod_FS$bestTune
nvmax
1 2
Using mod_FS$bestTune, the final submodel has an nvmax of 2 variables.
# C.
summary(mod_FS)
Subset selection object
15 Variables (and intercept)
Forced in Forced out
age FALSE FALSE
weight_lbs FALSE FALSE
height_in FALSE FALSE
neck_cm FALSE FALSE
chest_cm FALSE FALSE
hip_cm FALSE FALSE
thigh_cm FALSE FALSE
knee_cm FALSE FALSE
ankle_cm FALSE FALSE
biceps_cm FALSE FALSE
forearm_cm FALSE FALSE
wrist_cm FALSE FALSE
bmi_C FALSE FALSE
abdomen_wrist FALSE FALSE
am FALSE FALSE
1 subsets of each size up to 2
Selection Algorithm: forward
age weight_lbs height_in neck_cm chest_cm hip_cm thigh_cm knee_cm
1 ( 1 ) " " " " " " " " " " " " " " " "
2 ( 1 ) " " " " "*" " " " " " " " " " "
ankle_cm biceps_cm forearm_cm wrist_cm bmi_C abdomen_wrist am
1 ( 1 ) " " " " " " " " " " "*" " "
2 ( 1 ) " " " " " " " " " " "*" " "
coef(mod_FS$finalModel, id = mod_FS$bestTune$nvmax)
(Intercept) height_in abdomen_wrist
19.060591 -1.466297 6.628864
The variables used in the final model are height_in
and abdomen_wrist
.
Compute the RMSE of mod_FS using the testing data set.
# Type your code and comments inside the code chunk
# Computing RMSE on the testing set
yhat <- predict(mod_FS, testingTrans)
y <- testingTrans$brozek_C
rmse <- sqrt(mean((y - yhat)^2))
rmse
[1] 4.20519
The RMSE of mod_FS on the testing data set was 4.2052.
Backward elimination starts with all predictors in the model (full model) and removes the least significant variables one at a time. The algorithm stops removing variables once a stopping criterion such as p-value less than some predetermined value (such as 0.05) for all predictors in the model is achieved. When the data has more predictors (\(p\)) than observations (\(n\)), backward elimination is no longer a viable approach since the least squares estimates are not unique. While forward selection is a viable approach when \(n < p\), constrained regression is sometimes a better choice by increasing the bias for a relatively large reduction in variance.
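As an illustrative sketch (not the caret workflow required in the next problem), backward elimination can also be run with step(), which starts from the full model and drops one predictor at a time; note that step() uses AIC rather than a p-value cutoff as its stopping rule:
full <- lm(brozek_C ~ ., data = trainingTrans[ , -c(6, 16)])  # full model without the collinear columns
be <- step(full, direction = "backward", trace = 0)           # drop predictors one at a time by AIC
names(coef(be))                                               # predictors retained after elimination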
Use the train()
function with method = "leapBackward"
, tuneLength = 10
and assign the object myControl
to the trControl
argument of the train()
function to fit a backward selection model where the goal is to predict body fat. Use brozek_C
as the response and store the results of train()
in mod_BE
. Use set.seed(42)
for reproducibility. Do not include any predictors that are perfectly correlated.
A. Use mod_BE$results to investigate which model results in the minimum RMSE (final submodel).
B. Use mod_BE$bestTune to find out how many variables were selected in the final submodel (nvmax).
C. Use the summary(mod_BE) function and find the names of the variables selected in the final submodel with minimum RMSE. (Hint: Go to the part of the output under Selection Algorithm: backward, then look at the row which agrees with the nvmax from B. The variables with "*"s are the variables selected in the final submodel.) To find the actual coefficients, one can use coef(mod_BE$finalModel, id = mod_BE$bestTune$nvmax).
# Type your code and comments inside the code chunk
# Fit model with backward elimination
set.seed(42)
mod_BE <- train(y = trainingTrans$brozek_C,
                x = trainingTrans[, -c(6, 14, 16)],
                trControl = myControl,
                method = "leapBackward",
                tuneLength = 10)
# A.
mod_BE$results
nvmax RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 2 4.026157 0.7330041 3.319767 0.5336515 0.08900824 0.4892633
2 3 4.035584 0.7313086 3.339755 0.5119942 0.08719440 0.4708414
3 4 4.057963 0.7291062 3.380256 0.5081399 0.08448765 0.4447725
4 5 4.051860 0.7281567 3.385981 0.5183412 0.08811846 0.4654395
5 6 4.081451 0.7242761 3.403985 0.5152274 0.08793244 0.4497012
6 7 4.016873 0.7333198 3.336306 0.5292507 0.08613017 0.4448289
7 8 4.032096 0.7320207 3.347913 0.5075824 0.08491969 0.4396981
8 9 4.033619 0.7313225 3.343201 0.5141581 0.08388861 0.4406242
9 10 3.989002 0.7370281 3.308391 0.5362734 0.08656383 0.4595357
10 11 4.015134 0.7340522 3.330248 0.5276127 0.08650233 0.4549319
Using mod_BE$results, the model with the smallest RMSE is model number 9.
# B.
mod_BE$bestTune
nvmax
9 10
Using mod_BE$bestTune, the final submodel has an nvmax of 10 variables.
# C.
summary(mod_BE)
Subset selection object
15 Variables (and intercept)
Forced in Forced out
age FALSE FALSE
weight_lbs FALSE FALSE
height_in FALSE FALSE
neck_cm FALSE FALSE
chest_cm FALSE FALSE
hip_cm FALSE FALSE
thigh_cm FALSE FALSE
knee_cm FALSE FALSE
ankle_cm FALSE FALSE
biceps_cm FALSE FALSE
forearm_cm FALSE FALSE
wrist_cm FALSE FALSE
bmi_C FALSE FALSE
abdomen_wrist FALSE FALSE
am FALSE FALSE
1 subsets of each size up to 10
Selection Algorithm: backward
age weight_lbs height_in neck_cm chest_cm hip_cm thigh_cm knee_cm
1 ( 1 ) " " " " " " " " " " " " " " " "
2 ( 1 ) " " " " "*" " " " " " " " " " "
3 ( 1 ) " " " " "*" " " " " " " " " " "
4 ( 1 ) " " " " "*" " " " " " " " " " "
5 ( 1 ) " " " " "*" " " "*" " " " " " "
6 ( 1 ) " " " " "*" " " "*" "*" " " " "
7 ( 1 ) " " " " "*" " " "*" "*" " " " "
8 ( 1 ) " " " " "*" "*" "*" "*" " " " "
9 ( 1 ) "*" " " "*" "*" "*" "*" " " " "
10 ( 1 ) "*" " " "*" "*" "*" "*" " " " "
ankle_cm biceps_cm forearm_cm wrist_cm bmi_C abdomen_wrist am
1 ( 1 ) " " " " " " " " " " "*" " "
2 ( 1 ) " " " " " " " " " " "*" " "
3 ( 1 ) " " " " " " " " " " "*" "*"
4 ( 1 ) " " " " " " " " "*" "*" "*"
5 ( 1 ) " " " " " " " " "*" "*" "*"
6 ( 1 ) " " " " " " " " "*" "*" "*"
7 ( 1 ) " " " " " " "*" "*" "*" "*"
8 ( 1 ) " " " " " " "*" "*" "*" "*"
9 ( 1 ) " " " " " " "*" "*" "*" "*"
10 ( 1 ) " " "*" " " "*" "*" "*" "*"
coef(mod_BE$finalModel, id = mod_BE$bestTune$nvmax)
(Intercept) age height_in neck_cm chest_cm
19.0605911 0.5383486 -4.1470837 -0.7327354 -1.8557702
hip_cm biceps_cm wrist_cm bmi_C abdomen_wrist
-1.4764111 0.5588755 -0.8520107 19.9908063 7.4474726
am
-17.9423020
The variables used in the final model are age
, height_in
, neck_cm
, chest_cm
, hip_cm
, biceps_cm
, wrist_cm
, bmi_C
, abdomen_wrist
, and am
.
Compute the RMSE of mod_BE using the testing data set.
# Type your code and comments inside the code chunk
# Computing RMSE on the testing set
yhat <- predict(mod_BE, testingTrans)
y <- testingTrans$brozek_C
rmse <- sqrt(mean((y - yhat)^2))
rmse
[1] 4.000992
The RMSE of mod_BE on the testing data set was 4.0010.
Constrained regression shrinks the coefficient estimates towards zero while reducing their variance by adding a small amount of bias. Two popular forms of constrained regression are ridge regression developed by Hoerl and Kennard (1970) and the least absolute shrinkage and selection operator (LASSO) developed by Tibshirani (1996).
The ridge regression coefficient estimates denoted by \(\hat{\beta}^R\) are values that minimize \[\begin{equation} \sum_{i=1}^n(y_i-\hat{y}_i)^2 + \lambda\sum_{j=1}^p\beta^2_j = RSS + \lambda\sum_{j=1}^p\beta^2_j, \tag{5.4} \end{equation}\]
where \(\lambda \ge 0\) is a tuning parameter. Similar to least squares, ridge regression seeks estimates that make RSS small. However, the quantity \(\lambda\sum_{j=1}^p\beta^2_j\) is small when the estimates of the \(\beta\)’s are close to zero. That is, the quantity \(\lambda\sum_{j=1}^p\beta^2_j\) is a shrinkage penalty. When \(\lambda = 0\), there is no penalty and ridge regression will return the least squares estimates. As \(\lambda \rightarrow \infty\), the shrinkage penalty grows and the estimates of the \(\beta\)’s approach zero. While least squares regression returns one set of estimates for the \(\beta\)’s, ridge regression returns a different set of estimates, \(\hat{\beta}^R\), for each value of \(\lambda\). Selecting a good value of \(\lambda\) can be accomplished with cross-validation. The final model with ridge regression will include all \(p\) predictors. The penalty used by ridge regression is known as an \(\ell_2\) norm. Using an \(\ell_1\) norm for the shrinkage penalty instead of an \(\ell_2\) norm, results in the lasso model (Tibshirani 1996).
The lasso coefficients, \(\hat{\beta}^L\), minimize the quantity
\[\begin{equation} \sum_{i=1}^n(y_i-\hat{y}_i)^2 + \lambda\sum_{j=1}^p|\beta_j| = RSS + \lambda\sum_{j=1}^p|\beta_j| \tag{5.5} \end{equation}\]
As with ridge regression, lasso returns a set of coefficient estimates, \(\hat{\beta}^L\), for each value of \(\lambda\). Selecting a good value of \(\lambda\) (known as tuning) can be accomplished with cross-validation. Unlike the final ridge regression model which includes all \(p\) predictors, the lasso model performs variable selection and returns coefficient estimates for only a subset of the original predictors making the final model easier to interpret. Unfortunately, the lasso does not handle correlated variables very well (Hastie, Tibshirani, and Wainwright 2015). The elastic net (Zou and Hastie 2005) provides a compromise between ridge and lasso penalties.
The elastic net minimizes the quantity
\[\begin{equation} \frac{1}{2}\sum_{i=1}^n(y_i-\hat{y}_i)^2 + \lambda\left[\frac{1}{2}(1 - \alpha)||\beta||_2^2 + \alpha||\beta||_1 \right] , \tag{5.6} \end{equation}\]
where \(||\beta||_2^2\) is the \(\ell_2\) norm of \(\beta\), \(\sum_{j=1}^p\beta^2_j\), and \(||\beta||_1\) is the \(\ell_1\) norm of \(\beta\), \(\sum_{j=1}^p|\beta_j|\). The penalty applied to an individual coefficient ignoring the value of \(\lambda\) is
\[\begin{equation} \frac{1}{2}(1 - \alpha)\beta_j^2 + \alpha|\beta_j|. \tag{5.7} \end{equation}\]
When \(\alpha = 1\), the penalty in (5.7) reduces to the lasso penalty, and when \(\alpha = 0\), the penalty in (5.7) reduces to the ridge penalty. Selecting good values of \(\alpha\) and \(\lambda\) can be accomplished with cross-validation.
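The sketch below (an illustration, assuming the glmnet package that method = "glmnet" is built on) fits the three penalties directly to the transformed Boston training data; the alpha argument moves between ridge (0), the elastic net (between 0 and 1), and the lasso (1):
library(glmnet)
X <- as.matrix(trainingTransB[ , -14])   # predictor matrix
y <- trainingTransB$medv                 # response
ridge_fit <- glmnet(X, y, alpha = 0)     # ell_2 penalty: coefficients shrink toward zero
lasso_fit <- glmnet(X, y, alpha = 1)     # ell_1 penalty: some coefficients become exactly zero
enet_fit <- glmnet(X, y, alpha = 0.5)    # elastic net compromise between the two penalties
cv_lasso <- cv.glmnet(X, y, alpha = 1)   # cross-validation over the lambda path
cv_lasso$lambda.min                      # lambda with the smallest cross-validated error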
The following R
code fits a ridge regression model for medv
using the Boston
data with the train()
function and method = "glmnet"
. In the expand.grid()
function, use the argument alpha = 0
to specify a ridge model. lambda = seq(0.01, 100, length = 100)
creates a user-defined grid of lambda
values used in tuning the ridge regression.
# Fit constrained model (elastic net)---Ridge model
set.seed(42)
mod_RidgeB <- train(y = trainingTransB$medv,   # response
                    x = trainingTransB[, -14], # predictors
                    trControl = myControlB,
                    method = "glmnet",
                    tuneGrid = expand.grid(alpha = 0, lambda = seq(0.01, 100, length = 100)))
head(mod_RidgeB$results)
alpha lambda RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 0 0.01 4.520299 0.7662803 3.240542 1.061121 0.1018991 0.5715856
2 0 1.02 4.549192 0.7640606 3.237639 1.090177 0.1054067 0.5729318
3 0 2.03 4.670858 0.7556412 3.274392 1.152884 0.1144966 0.5801409
4 0 3.04 4.796116 0.7477187 3.344541 1.181727 0.1201087 0.5831796
5 0 4.05 4.916809 0.7403484 3.422373 1.196029 0.1239887 0.5863119
6 0 5.06 5.031277 0.7333883 3.500951 1.203254 0.1268702 0.5908202
min(mod_RidgeB$results$RMSE)
[1] 4.520299
mod_RidgeB$bestTune
alpha lambda
1 0 0.01
mod_RidgeB$bestTune$lambda
[1] 0.01
mod_RidgeB$bestTune$alpha
[1] 0
mod_RidgeB$results displays the RMSE for all the models for each value of the tuning parameter lambda (100 models in this case). mod_RidgeB$bestTune gives the lambda value of the model with the minimum RMSE. In this case, the best lambda value was 0.01, which returned the minimum RMSE of 4.520299.
Use the train() function with method = "glmnet" to fit a constrained linear regression model named mod_Ridge. Assign the object myControl to the trControl argument of the train() function to fit a ridge model where the goal is to predict body fat. Use brozek_C as the response and store the results of train() in mod_Ridge. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated. Produce a plot of RMSE versus the regularization parameter (\(\lambda\)).
# Type your code and comments inside the code chunk
# Ridge model
set.seed(42)
mod_Ridge <- train(y = trainingTrans$brozek_C,
                   x = trainingTrans[, -c(6, 14, 16)],
                   trControl = myControl,
                   method = "glmnet")
Find the lambda value of the model with the minimum RMSE.
# Type your code and comments inside the code chunk
head(mod_Ridge$results)
alpha lambda RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 0.10 0.01283119 4.066317 0.7278583 3.363177 0.5252320 0.08811865 0.4741673
2 0.10 0.12831190 4.057849 0.7291433 3.344677 0.5279196 0.08346824 0.4847986
3 0.10 1.28311897 4.223791 0.7120242 3.475669 0.5197174 0.08480505 0.4754329
4 0.55 0.01283119 4.079529 0.7264943 3.372080 0.5276779 0.08771606 0.4825501
5 0.55 0.12831190 4.013064 0.7347680 3.312633 0.5170344 0.08303518 0.4738458
6 0.55 1.28311897 4.319633 0.7093070 3.572041 0.4933920 0.09235669 0.4681452
min(mod_Ridge$results$RMSE)
[1] 3.993638
mod_Ridge$bestTune
alpha lambda
8 1 0.1283119
mod_Ridge$bestTune$lambda
[1] 0.1283119
The lambda value of the model with the minimum RMSE is 0.1283, which returned a minimum RMSE of 3.9936.
Use the train() function with method = "glmnet" and assign the object myControl to the trControl argument of the train() function to fit a lasso model where the goal is to predict body fat. Use brozek_C as the response and store the results of train() in mod_lasso. In the expand.grid() function, set the argument alpha = 1 in order to obtain a lasso model. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated. Produce a plot of RMSE versus the regularization parameter (\(\lambda\)).
# Type your code and comments inside the code chunk
# Lasso model: alpha = 1
set.seed(42)
mod_lasso <- train(y = trainingTrans$brozek_C,
                   x = trainingTrans[, -c(6, 14, 16)],
                   trControl = myControl,
                   method = "glmnet",
                   tuneGrid = expand.grid(alpha = 1, lambda = seq(0.01, 100, length = 100)))
plot(mod_lasso)
Find the lambda value of the model with the minimum RMSE.
# Type your code and comments inside the code chunk
head(mod_lasso$results)
alpha lambda RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 0.10 0.01283119 4.066317 0.7278583 3.363177 0.5252320 0.08811865 0.4741673
2 0.10 0.12831190 4.057849 0.7291433 3.344677 0.5279196 0.08346824 0.4847986
3 0.10 1.28311897 4.223791 0.7120242 3.475669 0.5197174 0.08480505 0.4754329
4 0.55 0.01283119 4.079529 0.7264943 3.372080 0.5276779 0.08771606 0.4825501
5 0.55 0.12831190 4.013064 0.7347680 3.312633 0.5170344 0.08303518 0.4738458
6 0.55 1.28311897 4.319633 0.7093070 3.572041 0.4933920 0.09235669 0.4681452
min(mod_lasso$results$RMSE)
[1] 3.993638
mod_lasso$bestTune
alpha lambda
8 1 0.1283119
mod_lasso$bestTune$lambda
[1] 0.1283119
The lambda value of the model with the minimum RMSE is 0.1283, which returned a minimum RMSE of 3.9936.
Use the train() function with method = "glmnet" and assign the object myControl to the trControl argument of the train() function where the goal is to predict body fat using an elastic net model. Use brozek_C as the response and store the results of train() in mod_EN. Use set.seed(42) for reproducibility. Do not include any predictors that are perfectly correlated. Use a custom tuning grid with arguments to expand.grid() of alpha = seq(0.1, 0.5, length = 11) and lambda = seq(0.01, 100, length = 20). Produce a plot of RMSE versus the regularization parameter (\(\lambda\)).
# Type your code and comments inside the code chunk
# Elastic net model
set.seed(42)
mod_EN <- train(y = trainingTrans$brozek_C,
                x = trainingTrans[, -c(6, 14, 16)],  # exclude the response and the collinear columns
                trControl = myControl,
                method = "glmnet",
                tuneGrid = expand.grid(alpha = seq(0.1, 0.5, length = 11),
                                       lambda = seq(0.01, 100, length = 100)))
plot(mod_EN)
Find the lambda and alpha values of the model with the minimum RMSE.
# Type your code and comments inside the code chunk
head(mod_EN$results)
alpha lambda RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 0.1 0.01 0.2309676 0.9994439 0.1875875 0.02774899 0.0002483409 0.02724136
2 0.1 1.02 1.0826012 0.9890666 0.8843221 0.11657826 0.0047021887 0.11263046
3 0.1 2.03 1.6785605 0.9776534 1.3826233 0.17272136 0.0092733681 0.15818371
4 0.1 3.04 2.1219204 0.9696317 1.7555153 0.21890105 0.0123563454 0.19029540
5 0.1 4.05 2.4918197 0.9640016 2.0651008 0.26110397 0.0144641787 0.21856018
6 0.1 5.06 2.8162376 0.9599188 2.3370547 0.30002053 0.0159804263 0.24295944
min(mod_EN$results$RMSE)
[1] 0.2292835
mod_EN$bestTune
alpha lambda
801 0.42 0.01
mod_EN$bestTune$alpha
[1] 0.42
mod_EN$bestTune$lambda
[1] 0.01
The lambda value of the model with the minimum RMSE is 0.01, and the alpha value is 0.42, which returned a minimum RMSE of 3.9936.
This material is released under an Attribution-NonCommercial-ShareAlike 3.0 United States license. Original author: Alan T. Arnholt