Statistics with R | Basel R Bootcamp


In this practical you’ll practice constructing and understanding regression models.

By the end of this practical you will know how to:

- Create simple and multiple linear regression models with `glm()`.
- Explore the outputs of your models, including coefficients, confidence intervals, and p-values.
- Explore the residuals and fitted values of a linear model.
- Use your model to make predictions for new data.

- Open your `BaselRBootcamp` R project. It should already have the folders `1_Data` and `2_Code`. Make sure that the data files listed in the `Datasets` section above are in your `1_Data` folder.

`# Done!`

- Open a new R script. At the top of the script, using comments, write your name and the date. Save it as a new file called `LinearModels_I_practical.R` in the `2_Code` folder.

`# Done!`

- Using `library()`, load each of the packages below (if you don’t have them, you’ll need to install them).

```
# Load packages necessary for this script
library(tidyverse)
library(broom)
library(rsq)
```

- Using the following template, load the `hitters.csv` data into R and store it as a new object called `hitters` (Hint: Don’t type the path directly! Use “tab” completion!).

```
# Load XXX.csv from the 1_Data folder
XX <- read_csv(file = "XX/XX")
```

`hitters <- read_csv("1_Data/hitters.csv")`

```
Parsed with column specification:
cols(
.default = col_double(),
League = col_character(),
Division = col_character(),
NewLeague = col_character()
)
```

`See spec(...) for full column specifications.`

- Take a look at the first few rows of the dataset by printing it to the console. It should look like the table below:

AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | CHmRun | CRuns | CRBI | CWalks | League | Division | PutOuts | Assists | Errors | Salary | NewLeague |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
215 | 51 | 4 | 19 | 18 | 11 | 1 | 215 | 51 | 4 | 19 | 18 | 11 | A | E | 116 | 5 | 12 | 70 | A |
458 | 114 | 13 | 67 | 57 | 48 | 4 | 1350 | 298 | 28 | 160 | 123 | 122 | A | W | 246 | 389 | 18 | 475 | A |
586 | 159 | 12 | 72 | 79 | 53 | 9 | 3082 | 880 | 83 | 363 | 477 | 295 | N | E | 181 | 13 | 4 | 1043 | N |
568 | 158 | 20 | 89 | 75 | 73 | 15 | 8068 | 2273 | 177 | 1045 | 993 | 732 | N | W | 105 | 290 | 10 | 775 | N |
399 | 102 | 3 | 56 | 34 | 34 | 5 | 670 | 167 | 4 | 89 | 48 | 54 | A | W | 211 | 9 | 3 | 80 | A |

```
# Print hitters object
XXX
```

`hitters`

```
# A tibble: 50 x 20
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 215 51 4 19 18 11 1 215 51 4 19
2 458 114 13 67 57 48 4 1350 298 28 160
3 586 159 12 72 79 53 9 3082 880 83 363
4 568 158 20 89 75 73 15 8068 2273 177 1045
5 399 102 3 56 34 34 5 670 167 4 89
6 579 174 7 67 78 58 6 3053 880 32 366
7 496 119 8 57 33 21 7 3358 882 36 365
8 510 126 2 42 44 35 11 5562 1578 44 703
9 551 171 13 94 83 94 13 6090 1840 128 969
10 313 78 6 32 41 12 12 3742 968 35 409
# … with 40 more rows, and 9 more variables: CRBI <dbl>, CWalks <dbl>,
# League <chr>, Division <chr>, PutOuts <dbl>, Assists <dbl>,
# Errors <dbl>, Salary <dbl>, NewLeague <chr>
```

- Use the `View()` function to view the entire dataframe in a new window.

`View(XXX)`

In this section, we will answer the question: “Is there a relationship between the number of Hits a player had in 1986 (`Hits`) and his salary in 1987 (`Salary`)?”

- Using the code below, create a scatterplot showing the relationship between Hits and Salary. Based on this plot, do you expect there to be a relationship between these variables?

```
ggplot(hitters, aes(x = Hits, y = Salary)) +
  geom_point(col = "skyblue") +
  labs(title = "Baseball player Hits vs Salary",
       subtitle = "My first plot in the 'Statistics with R' bootcamp!",
       caption = "I love R!!!!")
```

- Using the guide below, conduct a regression analysis predicting a player’s `Salary` as a function of his `Hits` value. Save the result to an object called `hits_glm`.

```
hits_glm <- glm(formula = XXX ~ XXX,
                data = XXX)
```

```
hits_glm <- glm(formula = Salary ~ Hits,
                data = hitters)
```

- Look at the class of your `hits_glm` object using the `class()` function. What class is the object?

```
# Print the class of the hits_glm object
class(XXX)
```

`class(hits_glm)`

`[1] "glm" "lm" `

- Print your `hits_glm` object to see summary information. What is the value of the coefficient for Hits and what does it mean? Is it consistent with what you expected from the plot?

`hits_glm`

```
Call: glm(formula = Salary ~ Hits, data = hitters)
Coefficients:
(Intercept) Hits
262.89 3.19
Degrees of Freedom: 49 Total (i.e. Null); 48 Residual
Null Deviance: 12600000
Residual Deviance: 11500000 AIC: 765
```

- Use the `summary()` function to print additional summary information from your `hits_glm` object.

```
# Show summary information from hits_glm
summary(XXX)
```

`summary(hits_glm)`

```
Call:
glm(formula = Salary ~ Hits, data = hitters)
Deviance Residuals:
Min 1Q Median 3Q Max
-547 -338 -128 184 1861
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 262.89 176.44 1.49 0.143
Hits 3.19 1.50 2.13 0.038 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 239728)
Null deviance: 12596020 on 49 degrees of freedom
Residual deviance: 11506920 on 48 degrees of freedom
AIC: 765.2
Number of Fisher Scoring iterations: 2
```

- Using the `tidy()` function (from the broom package), create a new object called `hits_tidy` which contains all ‘tidy’ results from your `hits_glm` object.

```
# Save 'tidy' results from hits_glm as hits_tidy
hits_tidy <- tidy(XXX)
```

`hits_tidy <- tidy(hits_glm)`

- Print your `hits_tidy` object. Which statistical outputs do you see?

`hits_tidy`

```
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 263. 176. 1.49 0.143
2 Hits 3.19 1.50 2.13 0.0382
```

- What is the p-value associated with your `Hits` coefficient? How do you interpret this? Do you think that, in the general population of players, there is a non-zero relationship between hits and salary?

```
# The p-value is 0.0382 which is 'significant' at the 0.05 level.
# We do expect there to be a positive relationship between hits and salary in all players
```

- Using the `confint()` function, print 95% confidence interval(s) for the regression coefficient(s) in your `hits_glm` object. What do they mean?

```
# Print confidence interval(s) of coefficient(s) in hits_glm
confint(XXX)
```

`confint(hits_glm)`

`Waiting for profiling to be done...`

```
2.5 % 97.5 %
(Intercept) -82.931 608.70
Hits 0.257 6.13
```

`# The confidence interval for the Hits coefficient ranges from 0.26 to 6.13`
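As a rough check, you can approximate this interval by hand with the Wald formula (estimate ± 1.96 standard errors), using the coefficient and standard error from `summary(hits_glm)`. This is only a sketch: `confint()` on a `glm` profiles the likelihood, so its endpoints differ slightly from the Wald ones.

```r
# Wald approximation to the 95% CI for the Hits coefficient:
# estimate +/- 1.96 * standard error (values from summary(hits_glm) above)
est <- 3.19
se  <- 1.50
est + c(-1.96, 1.96) * se  # roughly 0.25 to 6.13, close to confint()'s 0.257 to 6.13
```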

- Using the `fitted()` function, save the fitted values of your `glm` object as `Hits_fit`. Then print the results to see how they look.

```
# Save fitted values from hits_glm as Hits_fit
Hits_fit <- fitted(XXX)
# Print results
Hits_fit
```

```
Hits_fit <- fitted(hits_glm)
Hits_fit
```

```
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
426 627 771 768 589 819 643 665 809 512 764 621 512 595 442 681 432 509
19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
563 490 557 266 764 560 435 586 723 637 276 819 387 624 806 637 435 726
37 38 39 40 41 42 43 44 45 46 47 48 49 50
480 777 707 602 736 509 455 937 554 592 509 755 828 525
```

- Using `min()`, `max()`, `mean()`, `median()`, and `sd()`, calculate summary statistics for the fits. Do the values look like they make sense?

```
# Print summary statistics from Hits_fit
min(XXX)
max(XXX)
mean(XXX)
median(XXX)
sd(XXX)
```

`min(Hits_fit)`

`[1] 266`

`max(Hits_fit)`

`[1] 937`

`mean(Hits_fit)`

`[1] 609`

`median(Hits_fit)`

`[1] 598`

`sd(Hits_fit)`

`[1] 149`
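One of these values is no accident: for any regression model that includes an intercept, the mean of the fitted values equals the mean of the outcome, so `mean(Hits_fit)` (609) should match the mean `Salary` in `hitters`. A minimal sketch of this fact using the built-in `mtcars` data:

```r
# For a model with an intercept, mean(fitted values) equals mean(outcome)
m <- glm(mpg ~ wt, data = mtcars)
all.equal(mean(fitted(m)), mean(mtcars$mpg))  # TRUE
```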

- Using the code below, plot the relationship between the model fitted values and the true values.

```
# Create dataframe containing true salaries and model fits
data <- tibble(truth = hitters$Salary,
               fitted = Hits_fit)

# Create scatterplot
ggplot(data = data, aes(x = truth, y = fitted)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  labs(title = "Relationship between model fits and true Salaries",
       subtitle = "Formula = Salary ~ Hits",
       caption = "I still love R!!!") +
  geom_segment(aes(x = truth, y = truth, xend = truth, yend = fitted),
               col = "red") +
  xlim(c(-300, 3000)) +
  ylim(c(-300, 3000))
```

- Using the `rsq()` function, print the r-squared value from your `hits_glm` object. What does this tell you? How much of the variance in Salary can your model account for? Is this high or low?

```
# Print r-squared value from hits_glm object
rsq(XXX)
```

`rsq(hits_glm)`

`[1] 0.0865`
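For a gaussian `glm` like this one, the value `rsq()` returns can also be recovered from the deviances printed by `summary(hits_glm)`: r-squared is 1 minus the residual deviance divided by the null deviance.

```r
# R-squared for a gaussian glm, computed from the deviances shown above
1 - 11506920 / 12596020  # approximately 0.0865, matching rsq(hits_glm)
```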

In this section, we will answer the question: “Is there a relationship between the number of home runs a player scored in 1986 (`HmRun`) and his salary in 1987 (`Salary`)?”

- Using the code below, create a scatterplot showing the relationship between HmRun and Salary. Based on this plot, what do you expect the relationship to be between HmRun and Salary?

```
ggplot(hitters, aes(x = HmRun, y = Salary)) +
  geom_point(col = "tomato2") +
  labs(x = "Home Runs",
       y = "Salary",
       title = "Predicting Salary from Home Runs",
       caption = "Did you know that SPSS stands for 'Shitty Piece of Shitty Shit'?")
```

- Conduct a regression analysis predicting a player’s `Salary` as a function of his `HmRun` value. Save the result to an object called `HmRun_glm`.

```
# Create HmRun_glm, a glm model predicting Salary as a function of HmRun
HmRun_glm <- glm(formula = XXX ~ XXX,
                 data = XXX)
```

```
HmRun_glm <- glm(formula = Salary ~ HmRun,
                 data = hitters)
```

- Look at the class of your `HmRun_glm` object using the `class()` function. What class is the object? Is it ‘glm’?

```
# Print the class of the HmRun_glm object
class(XXX)
```

`class(HmRun_glm)`

`[1] "glm" "lm" `

- Print your `HmRun_glm` object to see summary information. What is the value of the beta weight for HmRun and what does it mean? Is it consistent with what you saw in the plot?

`HmRun_glm`

```
Call: glm(formula = Salary ~ HmRun, data = hitters)
Coefficients:
(Intercept) HmRun
503.34 9.55
Degrees of Freedom: 49 Total (i.e. Null); 48 Residual
Null Deviance: 12600000
Residual Deviance: 12300000 AIC: 769
```

- Use the `summary()` function to print additional summary information from your `HmRun_glm` object.

```
# Print summary information from HmRun_glm
summary(XXX)
```

`summary(HmRun_glm)`

```
Call:
glm(formula = Salary ~ HmRun, data = hitters)
Deviance Residuals:
Min 1Q Median 3Q Max
-680 -364 -105 184 1624
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 503.34 125.28 4.02 0.00021 ***
HmRun 9.55 9.31 1.03 0.30995
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 256784)
Null deviance: 12596020 on 49 degrees of freedom
Residual deviance: 12325615 on 48 degrees of freedom
AIC: 768.7
Number of Fisher Scoring iterations: 2
```

- Using the `tidy()` function (from the broom package), create a new object called `HmRun_tidy` which contains all ‘tidy’ results from your `HmRun_glm` object.

```
# Save 'tidy' results from HmRun_glm as HmRun_tidy
HmRun_tidy <- tidy(XXX)
```

`HmRun_tidy <- tidy(HmRun_glm)`

- Print your `HmRun_tidy` object. Which statistical outputs do you see?

`HmRun_tidy`

```
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 503. 125. 4.02 0.000206
2 HmRun 9.55 9.31 1.03 0.310
```

- What is the p-value associated with your `HmRun` coefficient? How do you interpret this? Do you think that, in the general population of players, there is a non-zero relationship between HmRun and Salary?

- Using the `confint()` function, print a 95% confidence interval for the regression coefficients in your `HmRun_glm` object. What does this mean?

```
# Print confidence interval(s) from HmRun_glm
confint(XXX)
```

`confint(HmRun_glm)`

`Waiting for profiling to be done...`

```
2.5 % 97.5 %
(Intercept) 257.80 748.9
HmRun -8.69 27.8
```

- Using the `fitted()` function, save the fitted values of your `glm` object as `HmRun_fit`. Then print the results to see how they look.

```
# Save fitted values from HmRun_glm as HmRun_fit
HmRun_fit <- fitted(XXX)
# Print results
HmRun_fit
```

```
HmRun_fit <- fitted(HmRun_glm)
HmRun_fit
```

```
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
542 628 618 694 532 570 580 522 628 561 694 675 570 637 561 675 551 637
19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
542 532 608 503 656 589 599 618 503 608 513 799 580 694 589 780 522 608
37 38 39 40 41 42 43 44 45 46 47 48 49 50
580 752 503 599 733 551 580 637 618 522 685 561 742 656
```

- Using `min()`, `max()`, `mean()`, `median()`, and `sd()`, calculate summary statistics for the fits. Do the values look like they make sense?

```
# Print summary statistics from HmRun_fit
min(XXX)
max(XXX)
mean(XXX)
median(XXX)
sd(XXX)
```

`min(HmRun_fit)`

`[1] 503`

`max(HmRun_fit)`

`[1] 799`

`mean(HmRun_fit)`

`[1] 609`

`median(HmRun_fit)`

`[1] 599`

`sd(HmRun_fit)`

`[1] 74.3`

- Using the code below, plot the relationship between the model fitted values and the true values.

```
# Create a dataframe containing true and fitted values
data <- tibble(truth = hitters$Salary,
               fitted = HmRun_fit)

# Create scatterplot
ggplot(data = data, aes(x = truth, y = fitted)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  labs(title = "Relationship between model fits and true salaries",
       subtitle = "Formula = Salary ~ HmRun",
       x = "True Salaries",
       y = "Model Predicted Salaries",
       caption = "Have I mentioned how great R is??") +
  geom_segment(aes(x = truth, y = truth, xend = truth, yend = fitted),
               col = "red") +
  xlim(c(-300, 3000)) +
  ylim(c(-300, 3000))
```

- Using the `rsq()` function, print the r-squared value from your `HmRun_glm` object. What does this tell you?

```
# Print r-squared value from HmRun_glm
rsq(XXX)
```

`rsq(HmRun_glm)`

`[1] 0.0215`

- Create an object called `all_glm` which predicts salary based on *all* variables in the `hitters` dataframe. To do this, use the `formula = Salary ~ .` notation. Use the following guide to help you:

```
# Create all_glm, a glm object modelling Salary based on ALL variables in hitters
all_glm <- glm(formula = XXX ~ .,
               data = XXX)
```

```
all_glm <- glm(formula = Salary ~ .,
               data = hitters)
```

- Using the `tidy()` function (from the broom package), create a new object called `all_tidy` which contains all ‘tidy’ results from your `all_glm` object.

```
# Save 'tidy' version of all_glm as all_tidy
all_tidy <- tidy(all_glm)
```

- Print your `all_tidy` object.

`all_tidy`

```
# A tibble: 20 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 664. 278. 2.39 0.0235
2 AtBat -1.81 2.15 -0.843 0.406
3 Hits 12.7 9.31 1.37 0.182
4 HmRun 14.0 26.5 0.530 0.600
5 Runs -10.8 9.89 -1.09 0.285
6 RBI -5.19 10.8 -0.479 0.635
7 Walks -8.67 7.75 -1.12 0.272
8 Years -62.4 48.1 -1.30 0.204
9 CAtBat 0.488 0.459 1.06 0.295
10 CHits -4.96 2.80 -1.77 0.0872
11 CHmRun -8.40 8.05 -1.04 0.305
12 CRuns 5.72 2.70 2.12 0.0427
13 CRBI 4.31 3.45 1.25 0.221
14 CWalks 0.375 1.26 0.298 0.768
15 LeagueN 144. 229. 0.629 0.534
16 DivisionW -75.0 158. -0.474 0.639
17 PutOuts -0.0843 0.283 -0.298 0.768
18 Assists 0.593 0.603 0.984 0.333
19 Errors -9.77 12.6 -0.772 0.446
20 NewLeagueN 224. 237. 0.943 0.353
```

- How do you interpret the results? Which variables appear to be related to salary?

- Using the `confint()` function, print a 95% confidence interval for the regression coefficients in your `all_glm` object. What does this mean?

```
# Print confidence interval(s) of coefficients from all_glm
confint(XXX)
```

`confint(all_glm)`

`Waiting for profiling to be done...`

```
2.5 % 97.5 %
(Intercept) 118.835 1209.199
AtBat -6.016 2.399
Hits -5.517 30.970
HmRun -37.864 65.905
Runs -30.141 8.623
RBI -26.395 16.023
Walks -23.855 6.515
Years -156.593 31.860
CAtBat -0.410 1.387
CHits -10.448 0.538
CHmRun -24.168 7.373
CRuns 0.423 11.010
CRBI -2.446 11.058
CWalks -2.091 2.842
LeagueN -304.834 592.897
DivisionW -385.192 235.145
PutOuts -0.639 0.471
Assists -0.589 1.776
Errors -34.558 15.018
NewLeagueN -241.066 688.462
```

- Using the `fitted()` function, save the fitted values of your `glm` object as `All_fit`. Then print the results to see how they look.

```
# Save fitted values from all_glm as All_fit
All_fit <- fitted(XXX)
# Print results
All_fit
```

```
All_fit <- fitted(all_glm)
All_fit
```

```
1 2 3 4 5 6 7 8 9 10
409.6 183.3 599.0 1085.5 -18.0 993.0 527.9 872.5 1804.7 623.7
11 12 13 14 15 16 17 18 19 20
1012.5 274.4 170.5 871.7 256.0 359.2 220.4 975.4 914.6 563.1
21 22 23 24 25 26 27 28 29 30
306.8 973.5 978.2 234.7 537.9 846.4 1344.0 531.6 566.3 1094.0
31 32 33 34 35 36 37 38 39 40
445.7 928.8 650.4 -61.7 593.0 414.5 299.8 487.9 354.1 710.2
41 42 43 44 45 46 47 48 49 50
1519.6 508.5 213.4 588.1 478.8 -118.2 492.2 705.0 847.3 269.7
```

- Using `min()`, `max()`, `mean()`, `median()`, and `sd()`, calculate summary statistics for the fits. Do the values look like they make sense?

```
# Print summary statistics from All_fit
min(XXX)
max(XXX)
mean(XXX)
median(XXX)
sd(XXX)
```

`min(All_fit)`

`[1] -118`

`max(All_fit)`

`[1] 1805`

`mean(All_fit)`

`[1] 609`

`median(All_fit)`

`[1] 551`

`sd(All_fit)`

`[1] 391`

- Using the code below, plot the relationship between the model fitted values and the true values.

```
# Create dataframe containing true and fitted values
data <- tibble(truth = hitters$Salary,
               fitted = All_fit)

# Create a scatterplot
ggplot(data = data, aes(x = truth, y = fitted)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  labs(title = "Relationship between model fits and true Salaries",
       subtitle = "Formula = Salary ~ .",
       x = "True Salaries",
       y = "Model Predicted Salaries") +
  geom_segment(aes(x = truth, y = truth, xend = truth, yend = fitted),
               col = "red") +
  xlim(c(-300, 3000)) +
  ylim(c(-300, 3000))
```

- Using the `rsq()` function, print the r-squared value from your `all_glm` object. What does this tell you?

```
# Print r-squared value from all_glm
rsq(XXX)
```

`rsq(all_glm)`

- Using the `residuals()` function, save the residuals of your regression analysis to a vector called `all_resid`:
`all_resid <- residuals(XXX)`

`all_resid <- residuals(all_glm)`

- Using the `min()`, `max()`, `mean()`, `median()`, and `sd()` functions, calculate summary statistics of the residuals `all_resid`. Do the values look like they make sense?

- Using the following code, create a histogram of the residuals from your `all_glm` object:

```
# Create a dataframe containing the residuals
data <- tibble(resid = all_resid)

ggplot(data, aes(x = resid)) +
  geom_histogram(col = "white") +
  geom_vline(xintercept = 0, col = "red") +
  labs(title = "Residuals of salary regression model",
       x = "Model residual (Truth - Predicted)")
```

- Using the `abs()` function, create a new vector called `all_abs_resid` which contains the *absolute* residuals. Use the following guide:

`all_abs_resid <- abs(residuals(XXX))`

`all_abs_resid <- abs(residuals(all_glm))`

- Using `min()`, `max()`, `mean()`, `median()`, and `sd()`, calculate summary statistics for the absolute residuals `all_abs_resid`. Do the values look like they make sense?

`min(all_abs_resid)`

`[1] 4.66`

`max(all_abs_resid)`

`[1] 1154`

`mean(all_abs_resid)`

`[1] 253`

`median(all_abs_resid)`

`[1] 190`

`sd(all_abs_resid)`

`[1] 197`

- Using the following code, create a histogram of the absolute residuals `all_abs_resid`. How do you interpret this plot?

```
# Create a dataframe containing the absolute residuals
data <- tibble(abs_resid = all_abs_resid)

ggplot(data, aes(x = abs_resid)) +
  geom_histogram(col = "white") +
  geom_vline(xintercept = 0, col = "red") +
  labs(title = "Absolute Residuals of Salary GLM",
       x = "Absolute Model Residuals (smaller is better)")
```

- Using the guide below, create a regression object `my_glm` predicting Salary based on all variables *except* for `AtBat` and `Hits`:
```
my_glm <- glm(formula = XXX ~ . - XXX - XXX,
              data = XXX)
```

```
my_glm <- glm(formula = Salary ~ . - AtBat - Hits,
              data = hitters)
```

- Using the `tidy()` function, print tidy results from your `my_glm` object. Do you see that the variables you excluded are indeed not in the model?

`tidy(my_glm)`

```
# A tibble: 18 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 681. 276. 2.47 0.0192
2 HmRun 4.24 23.7 0.179 0.859
3 Runs -1.94 5.27 -0.369 0.714
4 RBI 0.773 8.74 0.0884 0.930
5 Walks -8.19 7.29 -1.12 0.270
6 Years -63.1 47.9 -1.32 0.197
7 CAtBat 0.187 0.390 0.479 0.635
8 CHits -2.58 2.16 -1.19 0.241
9 CHmRun -6.73 7.47 -0.901 0.374
10 CRuns 3.73 2.21 1.68 0.102
11 CRBI 3.05 3.00 1.02 0.317
12 CWalks 0.602 1.21 0.496 0.623
13 LeagueN 210. 222. 0.944 0.352
14 DivisionW -122. 154. -0.794 0.433
15 PutOuts 0.00971 0.260 0.0373 0.970
16 Assists 0.779 0.581 1.34 0.189
17 Errors -14.9 11.7 -1.27 0.212
18 NewLeagueN 102. 220. 0.466 0.645
```

- Using the guide below, create a regression object `my_glm` predicting Salary based on the *interaction* between `AtBat` and `Hits`:
```
my_glm <- glm(formula = XXX ~ XXX * XXX,
              data = XXX)
```

```
my_glm <- glm(formula = Salary ~ AtBat * Hits,
              data = hitters)
```

- Using the `tidy()` function, print tidy results from your `my_glm` object. What outputs do you see? Do you see a significant interaction?

`tidy(my_glm)`

```
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 1150. 283. 4.06 0.000189
2 AtBat -2.80 1.51 -1.85 0.0713
3 Hits -7.38 7.11 -1.04 0.305
4 AtBat:Hits 0.0274 0.00784 3.49 0.00107
```

- The `hitters_new.csv` file contains values from 10 new players. Load this data as a new dataframe called `hitters_new`:
`hitters_new <- read_csv("XXX/XXX")`

`hitters_new <- read_csv("1_Data/hitters_new.csv")`

```
Parsed with column specification:
cols(
.default = col_double(),
League = col_character(),
Division = col_character(),
NewLeague = col_character()
)
```

`See spec(...) for full column specifications.`

- Print the `hitters_new` dataframe to see how it looks.

- Using the following code, print only the values of `Salary` and `Hits`:

```
hitters_new %>%
  select(Salary, Hits)
```

- You should have already created your `hits_glm` object in an earlier question. Print your `hits_glm` object again to see what the model coefficients are.

- Based on the coefficients you’ve observed, what should the predicted salary of the first player in `hitters_new` be? Answer this by calculating the value ‘by hand’ (Hint: Salary = Intercept + Coefficient * Hits).

- Using your `hits_glm` object, predict the salaries of these new players using the `predict()` function. Save the results to an object called `hitters_new_pred`:

```
hitters_new_pred <- predict(object = XXX,   # glm object
                            newdata = XXX)  # New dataframe
```

```
hitters_new_pred <- predict(object = hits_glm,
                            newdata = hitters_new)
```

- Print your `hitters_new_pred` object. What are these values?

`hitters_new_pred`

```
1 2 3 4 5 6 7 8 9 10
681 512 723 771 266 592 726 525 764 643
```

- Does the first value match the value you calculated ‘by hand’? Aren’t you glad that R can do this for you and you don’t have to constantly calculate values manually?
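If you want R itself to confirm that the ‘by hand’ arithmetic and `predict()` agree in general, here is a self-contained sketch using the built-in `mtcars` data (the same logic applies to `hits_glm` and `hitters_new`):

```r
# 'By hand' prediction vs. predict() for the first row of mtcars
m <- glm(mpg ~ wt, data = mtcars)
by_hand <- coef(m)[1] + coef(m)[2] * mtcars$wt[1]  # Intercept + Coefficient * predictor
machine <- predict(m, newdata = mtcars[1, ])
all.equal(unname(by_hand), unname(machine))  # TRUE
```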

- A colleague of yours wants to know which variable is better at predicting a player’s salary: RBI (number of runs batted in) or Walks (number of walks). Conduct the appropriate analysis to answer the question by completing the following steps:
  - Visualise the relationships in two separate scatterplots.
  - Conduct the two separate appropriate analyses using `glm()`.
  - Print the test statistic for each coefficient.
  - Calculate the mean absolute residuals for each model.
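A sketch of this comparison workflow on the built-in `mtcars` data (for the actual exercise, swap in `Salary ~ RBI` and `Salary ~ Walks` with `data = hitters`):

```r
# Fit two competing single-predictor models
m_wt <- glm(mpg ~ wt, data = mtcars)
m_hp <- glm(mpg ~ hp, data = mtcars)

# Test statistic for each coefficient (the 't value' column)
summary(m_wt)$coefficients
summary(m_hp)$coefficients

# Mean absolute residual for each model (smaller = better in-sample fit)
mean(abs(residuals(m_wt)))
mean(abs(residuals(m_hp)))
```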

- What happens if you run a linear model with two independent variables that are very highly correlated? To answer this, first run the code below to create `hitters_sim`, a copy of `hitters` with a new variable called `Buns` which is almost an exact copy of `Runs` (but with a little bit of random error).

```
# Create hitters_sim: hitters plus a column Buns highly correlated with Runs
hitters_sim <- hitters %>%
  mutate(Buns = Runs + rnorm(n = nrow(hitters), mean = 0, sd = 1))

# Visualise the relationship between Runs and Buns
ggplot(data = hitters_sim, aes(x = Runs, y = Buns)) +
  geom_point()
```

Once you have run this code, create and analyse the following three separate linear models.

```
# Runs only model
runs_glm <- glm(formula = Salary ~ Runs,
                data = hitters_sim)

# Buns only model
buns_glm <- glm(formula = Salary ~ Buns,
                data = hitters_sim)

# Runs and Buns model
runs_and_buns_glm <- glm(formula = Salary ~ Runs + Buns,
                         data = hitters_sim)
```

Compare the outputs of these three models. What do you find? What does this tell you about using highly correlated variables in linear models?
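To see the effect in isolation, here is a self-contained simulation (not the `hitters` data) showing what a near-duplicate predictor does to a coefficient's standard error:

```r
# Simulate a predictor x1, a near-copy x2, and an outcome driven by x1 only
set.seed(123)
x1 <- rnorm(50)
x2 <- x1 + rnorm(50, sd = 0.01)  # Buns-style near-duplicate of x1
y  <- 2 * x1 + rnorm(50)

# Standard error of x1's coefficient, alone vs. alongside its near-copy
se_alone <- summary(glm(y ~ x1))$coefficients["x1", "Std. Error"]
se_joint <- summary(glm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]
se_joint / se_alone  # many times larger: the model cannot tell x1 and x2 apart
```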

- Look at the coefficients in your `all_glm` model. You’ll notice that there is one called `NewLeagueN`, even though there is no such variable in your `hitters` dataframe (though there is one called `NewLeague`). Using the `table()` function, look at all possible values of `hitters$NewLeague`. What do you see? Based on this, what do you think the `NewLeagueN` variable means? Why is it called `NewLeagueN`?
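The mechanism behind this name is R's dummy coding of character/factor variables, which you can inspect directly with `model.matrix()`. A minimal sketch with a toy factor:

```r
# R turns a two-level factor into a single 0/1 column named by pasting the
# variable name and the non-baseline level; the first level is the baseline
league <- factor(c("A", "N", "A", "N"))
colnames(model.matrix(~ league))  # "(Intercept)" "leagueN"
```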

```
# Regression models in R
library(tidyverse) # For mpg data and tidy data functions
library(broom)     # For tidy()
library(rsq)       # For rsq()
# Q: Does displacement predict gas mileage?
## Visualise the data
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  labs(title = "Predicting hwy from displ")
# Model
hwy_mod <- glm(formula = hwy ~ displ,
               data = mpg)
# Summary information =================================
summary(hwy_mod)
# Call:
# glm(formula = hwy ~ displ, data = mpg)
#
# Deviance Residuals:
# Min 1Q Median 3Q Max
# -7.1039 -2.1646 -0.2242 2.0589 15.0105
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 35.6977 0.7204 49.55 <2e-16 ***
# displ -3.5306 0.1945 -18.15 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for gaussian family taken to be 14.71478)
#
# Null deviance: 8261.7 on 233 degrees of freedom
# Residual deviance: 3413.8 on 232 degrees of freedom
# AIC: 1297.2
#
# Number of Fisher Scoring iterations: 2
# Tidy results ==========================================
tidy(hwy_mod)
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) 35.7 0.720 49.6 2.12e-125
# 2 displ -3.53 0.195 -18.2 2.04e- 46
# (coefficient is negative and p-value is very small!)
# Confidence intervals ==================================
confint(hwy_mod)
# 2.5 % 97.5 %
# (Intercept) 34.285756 37.109546
# displ -3.911829 -3.149349
# Fitted values =========================================
hwy_fit <- fitted(hwy_mod)
hwy_fit
# 1 2 3 4 5
# 29.34259 29.34259 28.63647 28.63647 25.81200
# First 5 fitted values look reasonable
# Residuals =========================================
hwy_resid <- residuals(hwy_mod)
hwy_resid
# 1 2 3 4 5
# -0.3425912 -0.3425912 2.3635266 1.3635266 0.1879976
# First 5 residuals don't look so off
# True vs. Fitted values =================================
data <- tibble(truth = mpg$hwy,
               fitted = hwy_fit)

ggplot(data = data, aes(x = truth, y = fitted)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  labs(title = "Relationship between model fits and true values",
       subtitle = "Formula = hwy ~ displ") +
  geom_segment(aes(x = truth, y = truth, xend = truth, yend = fitted),
               col = "red", alpha = .4)
# R-squared value =========================================
rsq(hwy_mod)
# [1] 0.5867867
# R-square is 0.586..., meaning that around 59% of the variance in
# hwy can be accounted for by displ
```

File | Rows | Columns |
---|---|---|
hitters.csv | 50 | 20 |

The `hitters.csv`

file contains Major League Baseball Data from the 1986 and 1987 seasons. The data originally come from the `ISLR`

package.

Name | Description |
---|---|
AtBat | Number of times at bat in 1986 |
Hits | Number of hits in 1986 |
HmRun | Number of home runs in 1986 |
Runs | Number of runs in 1986 |
RBI | Number of runs batted in in 1986 |
Walks | Number of walks in 1986 |
Years | Number of years in the major leagues |
CAtBat | Number of times at bat during his career |
CHits | Number of hits during his career |
CHmRun | Number of home runs during his career |
CRuns | Number of runs during his career |
CRBI | Number of runs batted in during his career |
CWalks | Number of walks during his career |
League | A factor with levels A and N indicating player’s league at the end of 1986 |
Division | A factor with levels E and W indicating player’s division at the end of 1986 |
PutOuts | Number of put outs in 1986 |
Assists | Number of assists in 1986 |
Errors | Number of errors in 1986 |
Salary | 1987 annual salary on opening day in thousands of dollars |
NewLeague | A factor with levels A and N indicating player’s league at the beginning of 1987 |

Package | Installation |
---|---|
`tidyverse` | `install.packages("tidyverse")` |
`broom` | `install.packages("broom")` |
`rsq` | `install.packages("rsq")` |

Function | Package | Description |
---|---|---|
`glm` | `stats` | Fit a generalized linear model |
`tidy` | `broom` | Show the most important outputs from a statistical object |
`confint` | `stats` | Show confidence intervals of coefficients from a statistical object |
`fitted` | `stats` | Show fitted values from a statistical object |
`rsq` | `rsq` | Print the r-squared value from a statistical object |
`residuals` | `stats` | Show residuals (true minus fitted values) from a statistical object |

- Discovering Statistics Using R by Field et al. is an excellent resource.
- YaRrr! The Pirate’s Guide to R has good chapters on statistics with R.