In the article Fitting Percentage of Body Fat to Simple Body Measurements, Johnson (1996) uses the data at http://jse.amstat.org/datasets/fat.dat.txt provided to him by Dr. A. Garth Fischer in a personal communication on October 5, 1994, as a multiple linear regression activity with his students. A subset of the variables at http://jse.amstat.org/datasets/fat.dat.txt is available in the R package mfp by (R-mfp?) and the data set is used frequently in the text Statistical Regression and Classification by Matloff (2017).
The purpose of this activity is to have the reader critically question, evaluate, and clean the original data provided at http://jse.amstat.org/datasets/fat.dat.txt. The guided activities will reinforce reading data into R using the fread()
function from the data.table package written by Dowle and Srinivasan (2021), creating graphs with both packages ggplot2 and plotly written by Wickham, Chang, et al. (2021) and Sievert et al. (2021), respectively, and creating new variables with functions such as mutate()
from the dplyr package written by Wickham, François, et al. (2021).
Type complete sentences to answer all questions inside the answer
tags provided in the R Markdown document. Round all numeric answers you report inside the answer tags to four decimal places. Use inline R
code to report numeric answers inside the answer
tags (i.e. do not hard code your numeric answers).
The article by Johnson (1996) defines body-fat determined with the brozek and siri methods as well as fat free weight using Equations (1), (2), and (3), respectively.
\[\begin{equation} \text{bodyfatBrozek} = \frac{457}{\text{density}} - 414.2 \tag{1} \end{equation}\]
\[\begin{equation} \text{bodyfatSiri} = \frac{495}{\text{density}} - 450 \tag{2} \end{equation}\]
\[\begin{equation}
\text{FatFreeWeight} = \left(1 -\frac{\text{brozek}}{100}\times \text{weight_lbs}\right)
\tag{3}
\end{equation}\]
Body Mass Index (BMI
) is defined in Equation (4).
\[\begin{equation} \text{BMI} = \frac{\text{kg}}{\text{m}^2} \tag{4} \end{equation}\]
Please use the following conversion factors with this project: 0.453592 kilos per pound and 2.54 centimeters per inch.
For example, a male who weighs 185 pounds and is 71 inches tall has a weight of 83.91 kg and a height of 1.8034 m.
\[\begin{equation} 185 \text{ pounds} \times \frac{0.453592 \text{ kg}}{\text{pounds}} = 83.91452 \text{ kg} \end{equation}\]
\[\begin{equation} 71 \text{ inches} \times \frac{2.54 \text{ cm}}{\text{inches}}\times \frac{1 \text {m}}{100 \text{ cm}} = 1.8034 \text{ m} \end{equation}\]
To read tabular data from the internet, one might use the read.table()
function or the fread()
function from the data.table
package. The general structure to read data from the internet is to provide the URL in the file
or input
arguments of the read.table()
or fread()
functions, respectively.
# Example structure
<- read.table(file = "http://some.url.com")
DF1 # load data.table package
library(data.table)
<- fread(input = "http://some.url.com") DF2
fread()
function from the data.table
package to read the data from http://jse.amstat.org/datasets/fat.dat.txt into an object named bodyfat
. The data dictionary for http://jse.amstat.org/datasets/fat.dat.txt is available at http://jse.amstat.org/datasets/fat.txt.col.names
argument of fread()
:
c("case", "brozek", "siri", "density", "age", "weight_lbs", "height_in", "bmi", "fat_free_weight",
"neck_cm", "chest_cm", "abdomen_cm", "hip_cm", "thigh_cm", "knee_cm", "ankle_cm", "biceps_cm",
"forearm_cm", "wrist_cm")
.library(data.table)
<- "http://jse.amstat.org/datasets/fat.dat.txt"
url <- fread(url, col.names = c("case", "brozek", "siri", "density",
bodyfat "age", "weight_lbs", "height_in", "bmi", "fat_free_weight", "neck_cm",
"chest_cm", "abdomen_cm", "hip_cm", "thigh_cm", "knee_cm", "ankle_cm",
"biceps_cm", "forearm_cm", "wrist_cm"))
::kable(head(bodyfat)) knitr
case | brozek | siri | density | age | weight_lbs | height_in | bmi | fat_free_weight | neck_cm | chest_cm | abdomen_cm | hip_cm | thigh_cm | knee_cm | ankle_cm | biceps_cm | forearm_cm | wrist_cm |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 12.6 | 12.3 | 1.0708 | 23 | 154.25 | 67.75 | 23.7 | 134.9 | 36.2 | 93.1 | 85.2 | 94.5 | 59.0 | 37.3 | 21.9 | 32.0 | 27.4 | 17.1 |
2 | 6.9 | 6.1 | 1.0853 | 22 | 173.25 | 72.25 | 23.4 | 161.3 | 38.5 | 93.6 | 83.0 | 98.7 | 58.7 | 37.3 | 23.4 | 30.5 | 28.9 | 18.2 |
3 | 24.6 | 25.3 | 1.0414 | 22 | 154.00 | 66.25 | 24.7 | 116.0 | 34.0 | 95.8 | 87.9 | 99.2 | 59.6 | 38.9 | 24.0 | 28.8 | 25.2 | 16.6 |
4 | 10.9 | 10.4 | 1.0751 | 26 | 184.75 | 72.25 | 24.9 | 164.7 | 37.4 | 101.8 | 86.4 | 101.2 | 60.1 | 37.3 | 22.8 | 32.4 | 29.4 | 18.2 |
5 | 27.8 | 28.7 | 1.0340 | 24 | 184.25 | 71.25 | 25.6 | 133.1 | 34.4 | 97.3 | 100.0 | 101.9 | 63.2 | 42.2 | 24.0 | 32.2 | 27.7 | 17.7 |
6 | 20.6 | 20.9 | 1.0502 | 24 | 210.25 | 74.75 | 26.5 | 167.0 | 39.0 | 104.5 | 94.4 | 107.8 | 66.0 | 42.0 | 25.6 | 35.7 | 30.6 | 18.8 |
read.table()
function to read the data from http://jse.amstat.org/datasets/fat.dat.txt into an object named bodyfat2
. Pass the same vector created in Problem 1 to the col.names
argument of read.table()
.<- read.table(url, col.names = c("case", "brozek", "siri", "density",
bodyfat2 "age", "weight_lbs", "height_in", "bmi", "fat_free_weight", "neck_cm",
"chest_cm", "abdomen_cm", "hip_cm", "thigh_cm", "knee_cm", "ankle_cm",
"biceps_cm", "forearm_cm", "wrist_cm"))
::kable(head(bodyfat2)) knitr
case | brozek | siri | density | age | weight_lbs | height_in | bmi | fat_free_weight | neck_cm | chest_cm | abdomen_cm | hip_cm | thigh_cm | knee_cm | ankle_cm | biceps_cm | forearm_cm | wrist_cm |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 12.6 | 12.3 | 1.0708 | 23 | 154.25 | 67.75 | 23.7 | 134.9 | 36.2 | 93.1 | 85.2 | 94.5 | 59.0 | 37.3 | 21.9 | 32.0 | 27.4 | 17.1 |
2 | 6.9 | 6.1 | 1.0853 | 22 | 173.25 | 72.25 | 23.4 | 161.3 | 38.5 | 93.6 | 83.0 | 98.7 | 58.7 | 37.3 | 23.4 | 30.5 | 28.9 | 18.2 |
3 | 24.6 | 25.3 | 1.0414 | 22 | 154.00 | 66.25 | 24.7 | 116.0 | 34.0 | 95.8 | 87.9 | 99.2 | 59.6 | 38.9 | 24.0 | 28.8 | 25.2 | 16.6 |
4 | 10.9 | 10.4 | 1.0751 | 26 | 184.75 | 72.25 | 24.9 | 164.7 | 37.4 | 101.8 | 86.4 | 101.2 | 60.1 | 37.3 | 22.8 | 32.4 | 29.4 | 18.2 |
5 | 27.8 | 28.7 | 1.0340 | 24 | 184.25 | 71.25 | 25.6 | 133.1 | 34.4 | 97.3 | 100.0 | 101.9 | 63.2 | 42.2 | 24.0 | 32.2 | 27.7 | 17.7 |
6 | 20.6 | 20.9 | 1.0502 | 24 | 210.25 | 74.75 | 26.5 | 167.0 | 39.0 | 104.5 | 94.4 | 107.8 | 66.0 | 42.0 | 25.6 | 35.7 | 30.6 | 18.8 |
The basic structure of a function in R is
<- function(argument1, argument2, ...){
function_name
expression }
The function curve()
draws a curve corresponding to a function of the interval [from, to]
. For example, to draw the function \(y = x^2\) from -3 to 3 use the following code to obtain Figure 1.
<- function(x){x^2}
quad curve(quad, from = -3, to = 3)
Figure 1: Using the function curve()
to draw a quadratic function
To superimpose a function on an open graphics device with curve()
, use the argument add = TRUE
. For example, to graph \(y = x^2\) with a blue line and \(y = 4 + \sin(8x)\) with a red line from \(-\pi\) to \(\pi\), use the following code:
<- function(x){x^2}
quad curve(quad, from = -pi, to = pi, col = "blue", ylab = "f(x)")
<- function(x){4 + sin(8*x)}
nsin curve(nsin, add = TRUE, col = "red")
brozek
and density
given in Equation (1) with a blue line as well as the relationship between siri
and density
given in Equation (2) with a red line. First, define functions fb
and fs
for Equations (1) and (2), respectively. Use the function curve()
to draw Equations (1) and (2) on the same graphics device for density values between 1.0 \(\text{gm/cm}^3\) and 1.10 \(\text{gm/cm}^3\).<- function(density){(457/density) - 414.2}
fb <- function(density){(495/density) - 450}
fs
curve(fb, from = 1.0, to = 1.10, n = 200, xlab = expression(paste("density in ", gm/cm^3)),
ylab = "body fat", col = "blue", ylim = c(0, 50))
curve(fs, add = TRUE, col = "red")
Code to construct the requested graph using ggplot2
code is given below and the result is visible in Figure 2.
library(ggplot2)
<- ggplot(data = data.frame(x = c(1, 1.1), y = c(0, 50)), aes(x = x, y = y))
p + stat_function(fun = fb, color = "blue") +
p stat_function(fun = fs, color = "red") +
labs(x = expression(paste("density in ", gm/cm^3)), y = "bodyfat") +
theme_bw()
Figure 2: Realtionship between bodyfat and density according to Brozek and Siri definitions
The theoretical relationship is mostly linear until the density
is about 1.0625
.
Since the relationship between body fat and density for both Equations (1) and (2) is linear, one way to check for unusual observations is to plot the actual values for brozek
versus density
or siri
versus density
using the data stored in bodyfat
or bodyfat2
. Points that do not fall along a straight line should be scrutinized for possible data entry errors or other possible explanations.
The plotly package allows the user to create interactive graphs or to convert ggplot2 graphs to plotly graphs using the ggplotly()
function. Given a data frame say DF
with variables X
and Y
, the following template code will create an interactive plotly scatterplot.
library(plotly)
library(ggplot2)
# Create ggplot scatterplot named p
<- ggplot(data = DF, aes(x = X, y = Y)) +
p geom_point()
# Convert ggplot scatterplot to plotly with ggplotly()
<- ggplotly(p)
p2 # Display graph
p2
Consider an interactive scatterplot showing the distance required to stop for vehicles traveling at different speeds using the cars
data frame.
library(plotly)
library(ggplot2)
# Create ggplot scatterplot named p
<- ggplot(data = cars, aes(x = speed, y = dist)) +
p geom_point()
# Convert ggplot scatterplot to plotly with ggplotly()
<- ggplotly(p)
p2 # Display graph
p2
plotly
scatterplots of siri
versus density
with case
mapped to color
, weight_lbs
versus height_in
with case
mapped to color
, and ankle_cm
versus weight_lbs
with case
mapped to color
to help identify potential outliers. Identify the case numbers for the outliers from your three plots.# Type your code and comments inside the code chunk
# Creating interactive scatterplot of siri versus density
# Change code below to your plot
<- ggplot(data = bodyfat, aes(x = density, y = siri, color = case)) +
f geom_point()
<- ggplotly(f)
f2
f2
Figure 3: Plot of siri
versus density
# Type your code and comments inside the code chunk
# Creating interactive scatterplot of weight_lbs versus height_in
# Change code below to your plot
<- ggplot(data = bodyfat, aes(x = height_in, y = weight_lbs, color = case)) +
q geom_point()
<- ggplotly(q)
q2
q2
Figure 4: Plot of weight_lbs
versus height_in
# Type your code and comments inside the code chunk
# Creating interactive scatterplot of ankle_cm versus weight_lbs
# Change code below to your plot
<- ggplot(data = bodyfat, aes(x = weight_lbs, y = ankle_cm, color = case)) +
a geom_point()
<-ggplotly(a)
a2
a2
Figure 5: Interactive scatterplot of ankle_cm
versus weight_lbs
For the plot of siri
vs density
the outlier case
numbers are case 76, 48, 182 and 96. because they are really far out from the data and some are not in the linear path of the majority of the other data.
For the plot of weight_lbs
vs height_in
the outlier case
numbers are 42 because it does not make sense that a person can weight 205 lbs and only be 29.5 in.
For the plot of ankle_cm
vs weight_lbs
the outlier case
numbers are 86 and 31 because they are not even close to the rest of the data.
bodyfat
data frame named BF
consisting only of the cases identified as outliers in Problem 5. Use equation (2) to create a new variable in BF
named density_C
(computed density) based on the reported siri
values. Use equation (4) to reverse engineer the computed height in inches based on the values in weight_lbs
and bmi
using the conversion factors given at the start of the lab. Store the computed heights in inches in a variable named height_in_C
. Use the verb mutate
from the dplyr package to create both density_C
and height_in_C
. Show the values of the selected outliers for columns case
, density_C
, height_in
, height_in_C
, and ankle_cm
. What do you notice about the density
and density_C
values in BF
for the scatterplot you created in Figure 3? What do you notice about the height_in
values in BF
for the scatterplot you created in Figure 4? What do you notice about the ankle_cm
values in BF
for the scatterplot you created in Figure 5?library(dplyr)
<- bodyfat[c(48, 76, 96, 182, 39, 42, 31, 86), ]
BF # Computing density and height
<- BF %>%
BF mutate(density_C = 495/(siri + 450)) %>%
mutate(height_in_C = 100/2.54*sqrt(weight_lbs*0.453592/bmi)) %>%
select("case", "density", "density_C", "height_in", "height_in_C", "ankle_cm")
BF
case density density_C height_in height_in_C ankle_cm
1: 48 1.0665 1.086479 71.25 71.19157 21.9
2: 76 1.0666 1.056564 67.50 67.46501 21.4
3: 96 1.0991 1.059050 77.75 77.76549 23.2
4: 182 1.1089 1.100000 68.00 67.84516 20.2
5: 39 1.0202 1.020198 72.25 72.25827 29.6
6: 42 1.0250 1.025057 29.50 69.42890 23.7
7: 31 1.0716 1.071661 73.75 73.63405 33.9
8: 86 1.0386 1.038607 67.50 67.20020 33.7
For the values of density
and density_c
all of them except for case
182 are less than 1.1
.
For the values of height_in
case 48
which is 29.50
and is most likely a data entry error as it is not close to the rest of the data and should most likely be 69.5
.
For the values of ankle_cm
case 31
is data entry that should most likely be 23.9
and the same thing can be said for case 86
which should most likely be 23.7
.
density
values of bodyfat
based on your answers from Problem 6. Change the height_in
value for case 42 to the value you reverse engineered in Problem 6 rounding to the nearest half inch.# Type your code and comments inside the code chunk
# Updating computed bodyfat values and bmi measurements
# Replacing identified typos of density
$density[c(48, 76, 96, 182)] <- c(1.0865, 1.0566, 1.0591, 1.0890)
bodyfat
# Replacing identified typos in height_in
$height_in[42] <- 69.5
bodyfat
# Replacing identified typos in ankle_cm
$ankle_cm[c(86, 31)] <- c(23.7, 23.9) bodyfat
siri_C
, brozek_C
, and bmi_C
, that compute body-fat values rounded to one decimal place according to equations (2), (1), and (4), respectively. Replace the values in siri
and brozek
with the computed values in siri_C
and brozek_C
for case 182.# Type your code and comments inside the code chunk
<- bodyfat %>%
bodyfat mutate(siri_C = round(495/density - 450, 1),
brozek_C = round(457/density - 414.2, 1),
bmi_C = round((weight_lbs*0.453592) /
*2.54/100)^2, 1))
(height_in
182, c("brozek", "siri", "siri_C", "brozek_C")] bodyfat[
brozek siri siri_C brozek_C
1: 0 0 4.5 5.5
$brozek[182] <- bodyfat$brozek_C[182]
bodyfat
$siri[182] <- bodyfat$siri_C[182]
bodyfat
182, c("brozek", "siri", "siri_C", "brozek_C")] bodyfat[
brozek siri siri_C brozek_C
1: 5.5 4.5 4.5 5.5
brozek
and brozek_C
as well as differences between siri
and siri_C
greater than 0.11 to be some type of data entry problem. Use the verbs filter()
and select()
to show the values for case
, siri
, siri_C
, brozek
, brozek_C
, and density
that are considered data entry problems. If both the computed value for siri and brozek differ from the reported values of siri and brozek, it is likely a data entry problem with the density value. If one of either siri or brozek agrees with its computed value, the one that does not agree with the computed value is likely a data entry problem. Which cases seem to have data entry problems and for what reasons?# Type your code and comments inside the code chunk
# Identifying bodyfat typos for brozek and siri
%>%
bodyfat filter(abs(brozek - brozek_C) > 0.11 | abs(siri - siri_C) > 0.11) %>%
select(case, siri, siri_C, brozek, brozek_C, density)
case siri siri_C brozek brozek_C density
1: 6 20.9 21.3 20.6 21.0 1.0502
2: 11 7.1 7.1 7.5 7.8 1.0830
3: 33 11.8 11.8 13.4 12.1 1.0719
4: 49 13.6 13.6 13.4 13.8 1.0678
5: 98 11.3 11.3 11.1 11.7 1.0730
6: 152 19.6 19.6 19.1 19.3 1.0542
7: 169 34.3 36.2 34.7 34.7 1.0180
8: 200 23.6 23.1 22.6 22.6 1.0462
9: 235 25.8 25.8 25.5 25.1 1.0403
10: 237 24.8 24.9 24.0 24.2 1.0424
Cases 169, 200, 237 all have data entry errors with siri
not being the same as siri_C
. Cases 11, 33, 49, 98, 152 all have data entry errors with siri
not being the same as siri_C
. Case 6 is a special case as the data entry errors have something to do with the density
as siri
is not the same as siri_C
and brozek
is not the same as brozek_C
.
bodyfat
data frame. Make sure to update the siri_C
, brozek_C
, and bmi_C
variables in bodyfat
using your corrected values from Problem 9. Write code to verify that there are no differences between brozek
and brozek_C
or any differences between siri
and siri_C
greater than 0.11.# Type your code and comments inside the code chunk
# Replacing bodyfat typos for brozek and siri
$density[c(6, 237)] <- c(1.0514, 1.0426)
bodyfat$brozek[c(11, 33, 49, 98, 152, 235)] <-
bodyfatc(7.8, 12.1, 13.8, 11.7, 19.3, 25.1)
$siri[c(169, 200)] <- c(36.2, 23.1)
bodyfat# Updating siri_C, brozek_C, and bmi_C
<- bodyfat %>%
bodyfat mutate(siri_C = round(495/density - 450, 1),
brozek_C = round(457/density - 414.2, 1),
bmi_C = round((weight_lbs*0.454592) /
*2.54/100)^2, 1))
(height_in# Checking for typos according to given criteria
%>%
bodyfat filter(abs(brozek - brozek_C) > 0.11 | abs(siri - siri_C) > 0.11) %>%
select(case, siri, siri_C, brozek, brozek_C, density)
Empty data.table (0 rows and 6 cols): case,siri,siri_C,brozek,brozek_C,density
siri_C
versus density
with case
mapped to color
and brozek_C
versus density
with case
mapped to color
using the modified bodyfat
data frame. Superimpose equations (2) and (1) over their corresponding scatterplots. Comment on the scatterplots.# interactive plotly scatterplot siri_C vs. density
# Change code below to your plot
library(ggplot2)
library(plotly)
<- ggplot(bodyfat, aes(x = density, y = siri_C, color = case)) + geom_point() + stat_function(fun = fs)
d
<- ggplotly(d)
d2
d2
# interactive plotly scatterplot brozek_C vs. density
# Change code below to your plot
<- ggplot(data = bodyfat, aes(x = density, y = brozek_C, color = case)) + geom_point() + stat_function(fun = fs)
t
<- ggplotly(t)
t2
t2
The scatterplots look to be more linear as there is not as many noticeable data entries.
# Type your code and comments inside the code chunk
# Number of rounding discrepancies for siri
sum(bodyfat$siri != bodyfat$siri_C)
[1] 29
# Number of rounding discrepancies for brozek
sum(bodyfat$brozek != bodyfat$brozek_C)
[1] 39
# Number of rounding discrepancies for bmi
sum(bodyfat$bmi != bodyfat$bmi_C)
[1] 63
For the rounding errors for siri
and brozek
it may be due to the fact that they use the density
factor to calculate the bodyfats.
For the rounding errors for bmi
it may be due to the fact that there was many conversions needed to get the bmi
in lbs
and in
.
Some additional variables I might explore to check the quality of the reported data is biceps_cm
and neck_cm
.
# Code for graph suggested in 13
# Change code below to your plot
<- ggplot(data = bodyfat, aes(x = biceps_cm, y = neck_cm, color = case)) +
b geom_point()
<- ggplotly(b)
b2 b2
bodyfatClean
that excludes the variables brozek
, siri
, and bmi
from the bodyfat
data frame. Use the function write.csv()
or write_csv()
to store the bodyfatClean
object as a CSV file.# writing bodyfat to bodyfatClean.csv
<- bodyfat[, -c(2, 3, 8)] # Using Base R
bodyfatClean write.csv(bodyfatClean, "bodyfatClean.csv", row.names = FALSE)
<- bodyfat %>%
bodyfatClean2 select(-c("brozek", "siri", "bmi")) # Using tidyverse
::write_csv(bodyfatClean2, "bodyfatClean2.csv") readr
This material is released under an Attribution-NonCommercial-ShareAlike 3.0 United States license. Original author: Alan T. Arnholt