
## Overview

In this practical you’ll practice constructing and understanding regression models.

By the end of this practical you will know how to:

1. Create simple and multiple linear regression models with glm().
2. Explore the outputs of your models, including coefficients, confidence intervals, and p-values.
3. Explore the residuals and fitted values of a linear model.
4. Use your model to make predictions for new data.

### A - Setup

1. Open your BaselRBootcamp R project. It should already have the folders 1_Data and 2_Code. Make sure that the data files listed in the Datasets section below are in your 1_Data folder.
# Done!
1. Open a new R script. At the top of the script, using comments, write your name and the date. Save it as a new file called LinearModels_I_practical.R in the 2_Code folder.
# Done!
1. Using library() load each of the packages below (if you don’t have them, you’ll need to install them)
# Load packages necessary for this script
library(tidyverse)
library(broom)
library(rsq)
1. Using the following template, load the hitters.csv data into R and store it as a new object called hitters (Hint: Don’t type the path directly! Use the “tab” completion!).
# Load XXX.csv from the 1_Data folder

XX <- read_csv(file = "XX/XX")
hitters <- read_csv("1_Data/hitters.csv")
Parsed with column specification:
cols(
.default = col_double(),
League = col_character(),
Division = col_character(),
NewLeague = col_character()
)
See spec(...) for full column specifications.
1. Take a look at the first few rows of the dataset(s) by printing it/them to the console; it should look like the table below:
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks League Division PutOuts Assists Errors Salary NewLeague
215 51 4 19 18 11 1 215 51 4 19 18 11 A E 116 5 12 70 A
458 114 13 67 57 48 4 1350 298 28 160 123 122 A W 246 389 18 475 A
586 159 12 72 79 53 9 3082 880 83 363 477 295 N E 181 13 4 1043 N
568 158 20 89 75 73 15 8068 2273 177 1045 993 732 N W 105 290 10 775 N
399 102 3 56 34 34 5 670 167 4 89 48 54 A W 211 9 3 80 A
# Print hitters object
XXX
hitters
# A tibble: 50 x 20
AtBat  Hits HmRun  Runs   RBI Walks Years CAtBat CHits CHmRun CRuns
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>
1   215    51     4    19    18    11     1    215    51      4    19
2   458   114    13    67    57    48     4   1350   298     28   160
3   586   159    12    72    79    53     9   3082   880     83   363
4   568   158    20    89    75    73    15   8068  2273    177  1045
5   399   102     3    56    34    34     5    670   167      4    89
6   579   174     7    67    78    58     6   3053   880     32   366
7   496   119     8    57    33    21     7   3358   882     36   365
8   510   126     2    42    44    35    11   5562  1578     44   703
9   551   171    13    94    83    94    13   6090  1840    128   969
10   313    78     6    32    41    12    12   3742   968     35   409
# … with 40 more rows, and 9 more variables: CRBI <dbl>, CWalks <dbl>,
#   League <chr>, Division <chr>, PutOuts <dbl>, Assists <dbl>,
#   Errors <dbl>, Salary <dbl>, NewLeague <chr>
1. Use the View() function to view the entire dataframe(s) in new window(s).
View(XXX)

### B - Simple regression

##### Hits and Salary

In this section, we will answer the question: “Is there a relationship between the number of Hits a player had in 1986 (Hits) and his salary in 1987 (Salary)”?

1. Using the code below, create a scatterplot showing the relationship between Hits and Salary. Based on this plot, do you expect there to be a relationship between these variables?
ggplot(hitters, aes(x = Hits, y = Salary)) +
geom_point(col = "skyblue") +
labs(title = "Baseball player Hits vs Salary",
subtitle = "My first plot in the 'Statistics with R' bootcamp!",
caption = "I love R!!!!")
1. Using the guide below, conduct a regression analysis predicting a player’s Salary as a function of his Hits value. Save the result to an object called hits_glm.
hits_glm <- glm(formula = XXX ~ XXX,
data = XXX)
hits_glm <- glm(formula = Salary ~ Hits,
data = hitters)
1. Look at the class of your hits_glm object using the class() function. What class is the object?
# Print the class of the hits_glm object

class(XXX)
class(hits_glm)
[1] "glm" "lm" 
1. Print your hits_glm object to see summary information. What is the value of the coefficient for Hits and what does it mean? Is it consistent with what you expected from the plot?
hits_glm

Call:  glm(formula = Salary ~ Hits, data = hitters)

Coefficients:
(Intercept)         Hits
262.89         3.19

Degrees of Freedom: 49 Total (i.e. Null);  48 Residual
Null Deviance:      12600000
Residual Deviance: 11500000     AIC: 765
1. Use the summary() function to print additional summary information from your hits_glm object.
# Show summary information from hits_glm
summary(XXX)
summary(hits_glm)

Call:
glm(formula = Salary ~ Hits, data = hitters)

Deviance Residuals:
Min      1Q  Median      3Q     Max
-547    -338    -128     184    1861

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   262.89     176.44    1.49    0.143
Hits            3.19       1.50    2.13    0.038 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 239728)

Null deviance: 12596020  on 49  degrees of freedom
Residual deviance: 11506920  on 48  degrees of freedom
AIC: 765.2

Number of Fisher Scoring iterations: 2
1. Using the tidy() function (from the broom package), create a new object called hits_tidy which contains all ‘tidy’ results from your hits_glm object.
# Save 'tidy' results from hits_glm as hits_tidy
hits_tidy <- tidy(XXX)
hits_tidy <- tidy(hits_glm)
1. Print your hits_tidy object, which statistical outputs do you see?
hits_tidy
# A tibble: 2 x 5
term        estimate std.error statistic p.value
<chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)   263.      176.        1.49  0.143
2 Hits            3.19      1.50      2.13  0.0382
1. What is the p-value associated with your Hits coefficient? How do you interpret this? Do you think that, in the general population of players, there is a non-zero relationship between hits and salary?
# The p-value is 0.0382 which is 'significant' at the 0.05 level.
#  We do expect there to be a positive relationship between hits and salary in all players
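To see where these numbers come from, you can recompute the test statistic and p-value by hand from the tidy output (a sketch, assuming the hits_tidy object from the step above):

```r
# Recompute the t statistic and p-value for the Hits coefficient by hand
b  <- hits_tidy$estimate[2]   # Hits coefficient, ~3.19
se <- hits_tidy$std.error[2]  # its standard error, ~1.50

t_stat <- b / se  # should match the 'statistic' column, ~2.13

# Two-sided p-value from the t distribution with 48 residual df
p_val <- 2 * pt(abs(t_stat), df = 48, lower.tail = FALSE)  # ~0.038
```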
1. Using the confint() function, print 95% confidence interval(s) for the regression coefficient(s) in your hits_glm object. What do they mean?
# Print confidence interval(s) of coefficient(s) in hits_glm
confint(XXX)
confint(hits_glm)
Waiting for profiling to be done...
              2.5 % 97.5 %
(Intercept) -82.931 608.70
Hits          0.257   6.13
# The confidence interval ranges from 0.26 to 6.13
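For a Gaussian model like this one, the profile intervals from confint() line up closely with a simple normal-approximation interval you can compute yourself from the tidy output (a sketch, assuming hits_tidy from the earlier step):

```r
# Approximate 95% CI by hand: estimate +/- 1.96 * standard error
b  <- hits_tidy$estimate[2]   # Hits coefficient, ~3.19
se <- hits_tidy$std.error[2]  # standard error, ~1.50

b + c(-1, 1) * qnorm(0.975) * se  # roughly 0.26 to 6.13, matching confint()
```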
1. Using the fitted() function, save the fitted values of your glm object as Hits_fit. Then print the results to see how they look.
# Save fitted values from hits_glm as Hits_fit
Hits_fit <- fitted(XXX)

# Print results
Hits_fit
Hits_fit <- fitted(hits_glm)

Hits_fit
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
426 627 771 768 589 819 643 665 809 512 764 621 512 595 442 681 432 509
19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
563 490 557 266 764 560 435 586 723 637 276 819 387 624 806 637 435 726
37  38  39  40  41  42  43  44  45  46  47  48  49  50
480 777 707 602 736 509 455 937 554 592 509 755 828 525 
1. Using min(), max(), mean(), median(), and sd(), calculate summary statistics for the fits. Do the values look like they make sense?
# Print summary statistics from Hits_fit

min(XXX)
max(XXX)
mean(XXX)
median(XXX)
sd(XXX)
min(Hits_fit)
[1] 266
max(Hits_fit)
[1] 937
mean(Hits_fit)
[1] 609
median(Hits_fit)
[1] 598
sd(Hits_fit)
[1] 149
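Fitted values are only half the picture; the residuals() function gives the other half. As a quick sanity check (assuming hits_glm and Hits_fit from the steps above), fitted values plus residuals should recover the true salaries, and the residuals of a model with an intercept average out to zero:

```r
# Residuals: true Salary minus fitted value
Hits_resid <- residuals(hits_glm)

mean(Hits_resid)  # essentially 0 for a model with an intercept

# Fits plus residuals should reconstruct the true salaries exactly
range(Hits_fit + Hits_resid - hitters$Salary)  # both ends near 0
```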
1. Using the code below, plot the relationship between the model fitted values and the true values
# Create dataframe containing true salaries and model fits

data <- tibble(truth = hitters$Salary,
fitted = Hits_fit)

# Create scatterplot
ggplot(data = data, aes(x = truth, y = fitted)) +
geom_point() +
geom_abline(slope = 1, intercept = 0) +
labs(title = "Relationship between model fits and true Salaries",
subtitle = "Formula = Salary ~ Hits",
caption = "I still love R!!!") +
geom_segment(aes(x = truth, y = truth, xend = truth, yend = fitted),
col = "red") +
xlim(c(-300, 3000)) +
ylim(c(-300, 3000))
1. Using the rsq() function, print the r-squared value from your hits_glm object. What does this tell you? How much of the variance in Salary can your model account for? Is this high or low?
# Print r-squared value from hits_glm object
rsq(XXX)
rsq(hits_glm)
[1] 0.0865

##### Home Runs (HmRun) and Salary

In this section, we will answer the question: “Is there a relationship between the number of home runs a player scored in 1986 (HmRun) and his salary in 1987 (Salary)?”

1. Using the code below, create a scatterplot showing the relationship between HmRun and Salary. Based on this plot, what do you expect the relationship to be between HmRun and Salary?
ggplot(hitters, aes(x = HmRun, y = Salary)) +
geom_point(col = "tomato2") +
labs(x = "Home Runs",
y = "Salary",
title = "Predicting Salary from Home Runs",
caption = "Did you know that SPSS stands for 'Shitty Piece of Shitty Shit'?")
1. Conduct a regression analysis predicting a player’s Salary as a function of his HmRun value. Save the result to an object called HmRun_glm.
# Create HmRun_glm, a glm model predicting Salary as a function of HmRun
HmRun_glm <- glm(formula = XXX ~ XXX,
data = XXX)
HmRun_glm <- glm(formula = Salary ~ HmRun,
data = hitters)
1. Look at the class of your HmRun_glm object using the class() function. What class is the object? Is it ‘glm’?
# Print the class of the HmRun_glm object
class(XXX)
class(HmRun_glm)
[1] "glm" "lm"
1. Print your HmRun_glm object to see summary information. What is the value of the beta weight for HmRun and what does it mean? Is it consistent with what you saw in the plot?
HmRun_glm

Call:  glm(formula = Salary ~ HmRun, data = hitters)

Coefficients:
(Intercept)        HmRun
503.34         9.55

Degrees of Freedom: 49 Total (i.e. Null);  48 Residual
Null Deviance:      12600000
Residual Deviance: 12300000     AIC: 769
1. Use the summary() function to print additional summary information from your HmRun_glm object.
# Print summary information from HmRun_glm
summary(XXX)
summary(HmRun_glm)

Call:
glm(formula = Salary ~ HmRun, data = hitters)

Deviance Residuals:
Min      1Q  Median      3Q     Max
-680    -364    -105     184    1624

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   503.34     125.28    4.02  0.00021 ***
HmRun           9.55       9.31    1.03  0.30995
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 256784)

Null deviance: 12596020  on 49  degrees of freedom
Residual deviance: 12325615  on 48  degrees of freedom
AIC: 768.7

Number of Fisher Scoring iterations: 2
1. Using the tidy() function (from the broom package), create a new object called HmRun_tidy which contains all ‘tidy’ results from your HmRun_glm object.
# Save 'tidy' results from HmRun_glm as HmRun_tidy
HmRun_tidy <- tidy(XXX)
HmRun_tidy <- tidy(HmRun_glm)
1. Print your HmRun_tidy object, which statistical outputs do you see?
HmRun_tidy
# A tibble: 2 x 5
term        estimate std.error statistic  p.value
<chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  503.      125.        4.02 0.000206
2 HmRun          9.55      9.31      1.03 0.310
1. What is the p-value associated with your HmRun coefficient? How do you interpret this? Do you think that, in the general population of players, there is a non-zero relationship between HmRun and Salary?
1. Using the confint() function, print a 95% confidence interval for the regression coefficients in your HmRun_glm object. What does this mean?
# Print confidence interval(s) from HmRun_glm
confint(XXX)
confint(HmRun_glm)
Waiting for profiling to be done...
             2.5 % 97.5 %
(Intercept) 257.80  748.9
HmRun        -8.69   27.8
1. Using the fitted() function, save the fitted values of your glm object as HmRun_fit. Then print the results to see how they look.
# Save fitted values from HmRun_glm as HmRun_fit
HmRun_fit <- fitted(XXX)

# Print results
HmRun_fit
HmRun_fit <- fitted(HmRun_glm)

HmRun_fit
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
542 628 618 694 532 570 580 522 628 561 694 675 570 637 561 675 551 637
 19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
542 532 608 503 656 589 599 618 503 608 513 799 580 694 589 780 522 608
 37  38  39  40  41  42  43  44  45  46  47  48  49  50
580 752 503 599 733 551 580 637 618 522 685 561 742 656
1. Using min(), max(), mean(), median(), and sd(), calculate summary statistics for the fits. Do the values look like they make sense?
# Print summary statistics from HmRun_fit
min(XXX)
max(XXX)
mean(XXX)
median(XXX)
sd(XXX)
min(HmRun_fit)
[1] 503
max(HmRun_fit)
[1] 799
mean(HmRun_fit)
[1] 609
median(HmRun_fit)
[1] 599
sd(HmRun_fit)
[1] 74.3
1. Using the code below, plot the relationship between the model fitted values and the true values
# Create a dataframe containing true and fitted values
data <- tibble(truth = hitters$Salary,
fitted  = HmRun_fit)

# Create scatterplot
ggplot(data = data, aes(x = truth, y = fitted)) +
geom_point() +
geom_abline(slope = 1, intercept = 0) +
labs(title = "Relationship between model fits and true salaries",
subtitle = "Formula = Salary ~ HmRun",
x = "True Salaries",
y = "Model Predicted Salaries",
caption = "Have I mentioned how great R is??") +
geom_segment(aes(x = truth, y = truth, xend = truth, yend = fitted),
col = "red") +
xlim(c(-300, 3000)) +
ylim(c(-300, 3000))
1. Using the rsq() function, print the r-squared value from your HmRun_glm object. What does this tell you?
# Print r-squared value from HmRun_glm
rsq(XXX)
rsq(HmRun_glm)
[1] 0.0215

### C - Multiple Regression

1. Create a glm object called all_glm which predicts Salary based on all variables in the hitters dataframe. To do this, use the formula = Salary ~ . notation. Use the following guide to help you:
# Create all_glm, a glm object modelling Salary based on ALL variables in hitters
all_glm <- glm(formula = XXX ~ .,
data = XXX)
all_glm <- glm(formula = Salary ~ .,
data = hitters)
1. Using the tidy() function (from the broom package), create a new object called all_tidy which contains all ‘tidy’ results from your all_glm object.
# Save 'tidy' version of all_glm as all_tidy
all_tidy <- tidy(all_glm)
1. Print your all_tidy object.
all_tidy
# A tibble: 20 x 5
term         estimate std.error statistic p.value
<chr>           <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)  664.       278.        2.39   0.0235
2 AtBat         -1.81       2.15     -0.843  0.406
3 Hits          12.7        9.31      1.37   0.182
4 HmRun         14.0       26.5       0.530  0.600
5 Runs         -10.8        9.89     -1.09   0.285
6 RBI           -5.19      10.8      -0.479  0.635
7 Walks         -8.67       7.75     -1.12   0.272
8 Years        -62.4       48.1      -1.30   0.204
9 CAtBat         0.488      0.459     1.06   0.295
10 CHits         -4.96       2.80     -1.77   0.0872
11 CHmRun        -8.40       8.05     -1.04   0.305
12 CRuns          5.72       2.70      2.12   0.0427
13 CRBI           4.31       3.45      1.25   0.221
14 CWalks         0.375      1.26      0.298  0.768
15 LeagueN      144.       229.        0.629  0.534
16 DivisionW    -75.0      158.       -0.474  0.639
17 PutOuts       -0.0843     0.283    -0.298  0.768
18 Assists        0.593      0.603     0.984  0.333
19 Errors        -9.77      12.6      -0.772  0.446
20 NewLeagueN   224.       237.        0.943  0.353 
1. How do you interpret the results? Which variables appear to be related to salary?
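With 19 predictors, it helps to sort the tidy output by p-value; in the output above, only the intercept and CRuns fall below the conventional .05 level. A dplyr sketch (assuming the all_tidy object from the previous step):

```r
# Sort coefficients by p-value, smallest first
all_tidy %>%
  arrange(p.value)

# Keep only the terms 'significant' at the .05 level
all_tidy %>%
  filter(p.value < .05)
```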

2. Using the confint() function, print a 95% confidence interval for the regression coefficients in your all_glm object. What does this mean?

# Print confidence interval(s) of coefficients from all_glm
confint(XXX)
confint(all_glm)
Waiting for profiling to be done...
               2.5 %   97.5 %
(Intercept)  118.835 1209.199
AtBat         -6.016    2.399
Hits          -5.517   30.970
HmRun        -37.864   65.905
Runs         -30.141    8.623
RBI          -26.395   16.023
Walks        -23.855    6.515
Years       -156.593   31.860
CAtBat        -0.410    1.387
CHits        -10.448    0.538
CHmRun       -24.168    7.373
CRuns          0.423   11.010
CRBI          -2.446   11.058
CWalks        -2.091    2.842
LeagueN     -304.834  592.897
DivisionW   -385.192  235.145
PutOuts       -0.639    0.471
Assists       -0.589    1.776
Errors       -34.558   15.018
NewLeagueN  -241.066  688.462
1. Using the fitted() function, save the fitted values of your glm object as All_fit. Then print the results to see how they look.
# Save fitted values from all_glm as All_fit
All_fit <- fitted(XXX)

# Print results
All_fit
All_fit <- fitted(all_glm)

All_fit
     1      2      3      4      5      6      7      8      9     10
409.6  183.3  599.0 1085.5  -18.0  993.0  527.9  872.5 1804.7  623.7
11     12     13     14     15     16     17     18     19     20
1012.5  274.4  170.5  871.7  256.0  359.2  220.4  975.4  914.6  563.1
21     22     23     24     25     26     27     28     29     30
306.8  973.5  978.2  234.7  537.9  846.4 1344.0  531.6  566.3 1094.0
31     32     33     34     35     36     37     38     39     40
445.7  928.8  650.4  -61.7  593.0  414.5  299.8  487.9  354.1  710.2
41     42     43     44     45     46     47     48     49     50
1519.6  508.5  213.4  588.1  478.8 -118.2  492.2  705.0  847.3  269.7 
1. Using min(), max(), mean(), median(), and sd(), calculate summary statistics for the fits. Do the values look like they make sense?
# Print summary statistics from All_fit

min(XXX)
max(XXX)
mean(XXX)
median(XXX)
sd(XXX)
min(All_fit)
[1] -118
max(All_fit)
[1] 1805
mean(All_fit)
[1] 609
median(All_fit)
[1] 551
sd(All_fit)
[1] 391
1. Using the code below, plot the relationship between the model fitted values and the true values
# Create dataframe containing true and fitted values
data <- tibble(truth = hitters$Salary,
fitted = All_fit)

# Create scatterplot
ggplot(data = data, aes(x = truth, y = fitted)) +
geom_point() +
geom_abline(slope = 1, intercept = 0) +
labs(title = "Relationship between model fits and true salaries",
subtitle = "Formula = Salary ~ .",
x = "True Salaries",
y = "Model Predicted Salaries") +
geom_segment(aes(x = truth, y = truth, xend = truth, yend = fitted),
col = "red") +
xlim(c(-300, 3000)) +
ylim(c(-300, 3000))
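The overview also promised predictions for new data. Fitted glm objects work with predict(); the sketch below invents a new player (Hits = 100 is a made-up value) and predicts his salary with the simple hits_glm model from Section B:

```r
# Predict the 1987 salary of a hypothetical new player with 100 hits
new_player <- tibble(Hits = 100)

# Prediction = intercept + slope * Hits: 262.89 + 3.19 * 100, ~582
predict(hits_glm, newdata = new_player)
```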

## Examples

# Regression models in R

library(tidyverse)  # For mpg data and tidy data functions
library(broom)      # For tidy()

# Q: Does displacement predict gas mileage?

## Visualise the data

ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point() +
labs(title = "Predicting hwy from displ")

# Model

hwy_mod <- glm(formula = hwy ~ displ,
data = mpg)

# Summary information =================================

summary(hwy_mod)

# Call:
# glm(formula = hwy ~ displ, data = mpg)
#
# Deviance Residuals:
#     Min       1Q   Median       3Q      Max
# -7.1039  -2.1646  -0.2242   2.0589  15.0105
#
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)  35.6977     0.7204   49.55   <2e-16 ***
# displ        -3.5306     0.1945  -18.15   <2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for gaussian family taken to be 14.71478)
#
#     Null deviance: 8261.7  on 233  degrees of freedom
# Residual deviance: 3413.8  on 232  degrees of freedom
# AIC: 1297.2
#
# Number of Fisher Scoring iterations: 2

# Tidy results ==========================================

tidy(hwy_mod)

#   term        estimate std.error statistic   p.value
#   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
# 1 (Intercept)    35.7      0.720      49.6 2.12e-125
# 2 displ          -3.53     0.195     -18.2 2.04e- 46

#  (coefficient is negative and p-value is very small!)

# Confidence intervals ==================================

confint(hwy_mod)

#                 2.5 %    97.5 %
# (Intercept) 34.285756 37.109546
# displ       -3.911829 -3.149349

# Fitted values =========================================

hwy_fit <- fitted(hwy_mod)

hwy_fit

#       1        2        3        4        5
# 29.34259 29.34259 28.63647 28.63647 25.81200

#  First 5 fitted values look reasonable

# Residuals =========================================

hwy_resid <- residuals(hwy_mod)

hwy_resid

#          1          2          3          4          5
# -0.3425912 -0.3425912  2.3635266  1.3635266  0.1879976

#  First 5 residuals don't look so off

# True vs. Fitted values =================================

data <- tibble(truth = mpg$hwy,
fitted  = hwy_fit)

ggplot(data = data, aes(x = truth, y = fitted)) +
geom_point() +
geom_abline(slope = 1, intercept = 0) +
labs(title = "Relationship between model fits and true values",
subtitle = "Formula = hwy ~ displ") +
geom_segment(aes(x = truth, y = truth, xend = truth, yend = fitted),
col = "red", alpha = .4)

# R-squared value =========================================

rsq(hwy_mod)

# [1] 0.5867867

# R-square is 0.586..., meaning that around 59% of the variance in
#  hwy can be accounted for by displ
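For a Gaussian glm like this one, rsq() agrees with one minus the ratio of the residual deviance to the null deviance, both of which are stored on the model object:

```r
# R-squared by hand from the deviances shown in summary(hwy_mod)
1 - hwy_mod$deviance / hwy_mod$null.deviance  # ~0.587, matching rsq(hwy_mod)
```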

## Datasets

| File | Rows | Columns |
|:------------|:-----|:--------|
| hitters.csv | 50 | 20 |

The hitters.csv file contains Major League Baseball Data from the 1986 and 1987 seasons. The data originally come from the ISLR package.

#### Variable description

| Name | Description |
|:----------|:------------|
| AtBat | Number of times at bat in 1986 |
| Hits | Number of hits in 1986 |
| HmRun | Number of home runs in 1986 |
| Runs | Number of runs in 1986 |
| RBI | Number of runs batted in in 1986 |
| Walks | Number of walks in 1986 |
| Years | Number of years in the major leagues |
| CAtBat | Number of times at bat during his career |
| CHits | Number of hits during his career |
| CHmRun | Number of home runs during his career |
| CRuns | Number of runs during his career |
| CRBI | Number of runs batted in during his career |
| CWalks | Number of walks during his career |
| League | A factor with levels A and N indicating the player’s league at the end of 1986 |
| Division | A factor with levels E and W indicating the player’s division at the end of 1986 |
| PutOuts | Number of put outs in 1986 |
| Assists | Number of assists in 1986 |
| Errors | Number of errors in 1986 |
| Salary | 1987 annual salary on opening day, in thousands of dollars |
| NewLeague | A factor with levels A and N indicating the player’s league at the beginning of 1987 |

## Functions

### Packages

| Package | Installation |
|:----------|:-------------|
| tidyverse | install.packages("tidyverse") |
| broom | install.packages("broom") |
| rsq | install.packages("rsq") |

### Functions

| Function | Package | Description |
|:----------|:--------|:------------|
| glm | stats | Fit a generalized linear model |
| tidy | broom | Show the most important outputs from a statistical object |
| confint | stats | Show confidence intervals of coefficients from a statistical object |
| fitted | stats | Show fitted values from a statistical object |
| rsq | rsq | Print the r-squared value from a statistical object |
| residuals | stats | Show residuals (observed minus fitted values) from a statistical object |