
Overview

In this practical you’ll practice constructing and understanding regression models.

By the end of this practical you will know how to:

  1. Create simple and multiple linear regression models with glm()
  2. Explore the outputs of your models, including coefficients, confidence intervals, and p-values.
  3. Explore the residuals and fitted values of a linear model.
  4. Use your model to make predictions for new data

Tasks

A - Setup

  1. Open your BaselRBootcamp R project. It should already have the folders 1_Data and 2_Code. Make sure that the data files listed in the Datasets section below are in your 1_Data folder.
# Done!
  1. Open a new R script. At the top of the script, using comments, write your name and the date. Save it as a new file called LinearModels_I_practical.R in the 2_Code folder.
# Done!
  1. Using library() load each of the packages below (if you don’t have them, you’ll need to install them)
# Load packages necessary for this script
library(tidyverse)
library(broom)
library(rsq)
  1. Using the following template, load the hitters.csv data into R and store it as a new object called hitters (Hint: Don’t type the path directly! Use the “tab” completion!).
# Load XXX.csv from the 1_Data folder

XX <- read_csv(file = "XX/XX")
hitters <- read_csv("1_Data/hitters.csv")
Parsed with column specification:
cols(
  .default = col_double(),
  League = col_character(),
  Division = col_character(),
  NewLeague = col_character()
)
See spec(...) for full column specifications.
  1. Take a look at the first few rows of the dataset(s) by printing it/them to the console. It should look like the table below:
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks League Division PutOuts Assists Errors Salary NewLeague
215 51 4 19 18 11 1 215 51 4 19 18 11 A E 116 5 12 70 A
458 114 13 67 57 48 4 1350 298 28 160 123 122 A W 246 389 18 475 A
586 159 12 72 79 53 9 3082 880 83 363 477 295 N E 181 13 4 1043 N
568 158 20 89 75 73 15 8068 2273 177 1045 993 732 N W 105 290 10 775 N
399 102 3 56 34 34 5 670 167 4 89 48 54 A W 211 9 3 80 A
# Print hitters object
XXX
hitters
# A tibble: 50 x 20
   AtBat  Hits HmRun  Runs   RBI Walks Years CAtBat CHits CHmRun CRuns
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>
 1   215    51     4    19    18    11     1    215    51      4    19
 2   458   114    13    67    57    48     4   1350   298     28   160
 3   586   159    12    72    79    53     9   3082   880     83   363
 4   568   158    20    89    75    73    15   8068  2273    177  1045
 5   399   102     3    56    34    34     5    670   167      4    89
 6   579   174     7    67    78    58     6   3053   880     32   366
 7   496   119     8    57    33    21     7   3358   882     36   365
 8   510   126     2    42    44    35    11   5562  1578     44   703
 9   551   171    13    94    83    94    13   6090  1840    128   969
10   313    78     6    32    41    12    12   3742   968     35   409
# … with 40 more rows, and 9 more variables: CRBI <dbl>, CWalks <dbl>,
#   League <chr>, Division <chr>, PutOuts <dbl>, Assists <dbl>,
#   Errors <dbl>, Salary <dbl>, NewLeague <chr>
  1. Use the View() function to view the entire dataframe(s) in new window(s).
View(XXX)

B - Simple regression

Hits and Salary

In this section, we will answer the question: “Is there a relationship between the number of Hits a player had in 1986 (Hits) and his salary in 1987 (Salary)”?

  1. Using the code below, create a scatterplot showing the relationship between Hits and Salary. Based on this plot, do you expect there to be a relationship between these variables?
ggplot(hitters, aes(x = Hits, y = Salary)) +
  geom_point(col = "skyblue") +
  labs(title = "Baseball player Hits vs Salary",
       subtitle = "My first plot in the 'Statistics with R' bootcamp!",
       caption = "I love R!!!!")
  1. Using the guide below, conduct a regression analysis predicting a player’s Salary as a function of his Hits value. Save the result to an object called hits_glm.
hits_glm <- glm(formula = XXX ~ XXX,
                data = XXX)
hits_glm <- glm(formula = Salary ~ Hits,
                data = hitters)
  1. Look at the class of your hits_glm object using the class() function. What class is the object?
# Print the class of the hits_glm object

class(XXX)
class(hits_glm)
[1] "glm" "lm" 
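As the output shows, the object carries two classes. A quick aside (a minimal sketch using R's built-in mtcars data rather than hitters): because a glm object also inherits from "lm", many functions written for lm objects accept it too.

```r
# A glm object inherits from both "glm" and "lm",
# so lm-oriented functions also work on it (sketch with built-in mtcars data)
mod <- glm(mpg ~ wt, data = mtcars)

class(mod)           # "glm" "lm"
inherits(mod, "lm")  # TRUE
```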
  1. Print your hits_glm object to see summary information. What is the value of the coefficient for Hits and what does it mean? Is it consistent with what you expected from the plot?
hits_glm

Call:  glm(formula = Salary ~ Hits, data = hitters)

Coefficients:
(Intercept)         Hits  
     262.89         3.19  

Degrees of Freedom: 49 Total (i.e. Null);  48 Residual
Null Deviance:      12600000 
Residual Deviance: 11500000     AIC: 765
  1. Use the summary() function to print additional summary information from your hits_glm object.
# Show summary information from hits_glm
summary(XXX)
summary(hits_glm)

Call:
glm(formula = Salary ~ Hits, data = hitters)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
  -547    -338    -128     184    1861  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   262.89     176.44    1.49    0.143  
Hits            3.19       1.50    2.13    0.038 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 239728)

    Null deviance: 12596020  on 49  degrees of freedom
Residual deviance: 11506920  on 48  degrees of freedom
AIC: 765.2

Number of Fisher Scoring iterations: 2
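The t values in the Coefficients table are nothing mysterious: each one is just the estimate divided by its standard error. A minimal sketch of this check (using R's built-in mtcars data rather than hitters):

```r
# Each t value in summary() is estimate / std.error
mod   <- glm(mpg ~ wt, data = mtcars)
coefs <- summary(mod)$coefficients

t_by_hand <- coefs[, "Estimate"] / coefs[, "Std. Error"]
all.equal(unname(t_by_hand), unname(coefs[, "t value"]))  # TRUE
```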
  1. Using the tidy() function (from the broom package), create a new object called hits_tidy which contains all ‘tidy’ results from your hits_glm object.
# Save 'tidy' results from hits_glm as hits_tidy
hits_tidy <- tidy(XXX)
hits_tidy <- tidy(hits_glm)
  1. Print your hits_tidy object. Which statistical outputs do you see?
hits_tidy
# A tibble: 2 x 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)   263.      176.        1.49  0.143 
2 Hits            3.19      1.50      2.13  0.0382
  1. What is the p-value associated with your Hits coefficient? How do you interpret this? Do you think that, in the general population of players, there is a non-zero relationship between Hits and Salary?
# The p-value is 0.0382, which is 'significant' at the 0.05 level.
#  We do expect there to be a positive relationship between Hits and Salary in the general population of players
  1. Using the confint() function, print 95% confidence interval(s) for the regression coefficient(s) in your hits_glm object. What do they mean?
# Print confidence interval(s) of coefficient(s) in hits_glm
confint(XXX)
confint(hits_glm)
Waiting for profiling to be done...
              2.5 % 97.5 %
(Intercept) -82.931 608.70
Hits          0.257   6.13
# The 95% confidence interval for the Hits coefficient ranges from 0.26 to 6.13
  1. Using the fitted() function, save the fitted values of your glm object as Hits_fit. Then print the results to see how they look.
# Save fitted values from hits_glm as Hits_fit
Hits_fit <- fitted(XXX)

# Print results
Hits_fit
Hits_fit <- fitted(hits_glm)

Hits_fit
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
426 627 771 768 589 819 643 665 809 512 764 621 512 595 442 681 432 509 
 19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36 
563 490 557 266 764 560 435 586 723 637 276 819 387 624 806 637 435 726 
 37  38  39  40  41  42  43  44  45  46  47  48  49  50 
480 777 707 602 736 509 455 937 554 592 509 755 828 525 
  1. Using min(), max(), mean(), median(), and sd(), calculate summary statistics for the fits. Do the values look like they make sense?
# Print summary statistics from Hits_fit

min(XXX)
max(XXX)
mean(XXX)
median(XXX)
sd(XXX)
min(Hits_fit)
[1] 266
max(Hits_fit)
[1] 937
mean(Hits_fit)
[1] 609
median(Hits_fit)
[1] 598
sd(Hits_fit)
[1] 149
  1. Using the code below, plot the relationship between the model fitted values and the true values
# Create dataframe containing true salaries and model fits

data <- tibble(truth = hitters$Salary,
               fitted  = Hits_fit)

# Create scatterplot
ggplot(data = data, aes(x = truth, y = fitted)) +  
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  labs(title = "Relationship between model fits and true Salaries",
       subtitle = "Formula = Salary ~ Hits",
       caption = "I still love R!!!") +
  geom_segment(aes(x = truth, y = truth, xend = truth, yend = fitted), 
               col = "red") +
  xlim(c(-300, 3000)) +
  ylim(c(-300, 3000))
  1. Using the rsq() function, print the r-squared value from your hits_glm object. What does this tell you? How much of the variance in Salary can your model account for? Is this high or low?
# Print r-squared value from hits_glm object
rsq(XXX)
rsq(hits_glm)
[1] 0.0865
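For a gaussian glm, you can check rsq()'s answer by hand from the deviances that summary(hits_glm) reported earlier: R-squared is one minus the ratio of residual deviance to null deviance.

```r
# R-squared = 1 - residual deviance / null deviance
# (deviances taken from the summary(hits_glm) output above)
1 - 11506920 / 12596020
# ~0.0865, matching rsq(hits_glm)
```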
Home Runs (HmRun) and Salary

In this section, we will answer the question: “Is there a relationship between the number of home runs a player scored in 1986 (HmRun) and his salary in 1987 (Salary)”?

  1. Using the code below, create a scatterplot showing the relationship between HmRun and Salary. Based on this plot, what do you expect the relationship to be between HmRun and Salary?
ggplot(hitters, aes(x = HmRun, y = Salary)) +
  geom_point(col = "tomato2") +
  labs(x = "Home Runs",
       y = "Salary",
       title = "Predicting Salary from Home Runs",
       caption = "Did you know that SPSS stands for 'Shitty Piece of Shitty Shit'?")
  1. Conduct a regression analysis predicting a player’s Salary as a function of his HmRun value. Save the result to an object called HmRun_glm.
# Create HmRun_glm, a glm model predicting Salary as a function of HmRun
HmRun_glm <- glm(formula = XXX ~ XXX,
                 data = XXX)
HmRun_glm <- glm(formula = Salary ~ HmRun,
                 data = hitters)
  1. Look at the class of your HmRun_glm object using the class() function. What class is the object? Is it ‘glm’?
# Print the class of the HmRun_glm object

class(XXX)
class(HmRun_glm)
[1] "glm" "lm" 
  1. Print your HmRun_glm object to see summary information. What is the value of the beta weight for HmRun and what does it mean? Is it consistent with what you saw in the plot?
HmRun_glm

Call:  glm(formula = Salary ~ HmRun, data = hitters)

Coefficients:
(Intercept)        HmRun  
     503.34         9.55  

Degrees of Freedom: 49 Total (i.e. Null);  48 Residual
Null Deviance:      12600000 
Residual Deviance: 12300000     AIC: 769
  1. Use the summary() function to print additional summary information from your HmRun_glm object.
# Print summary information from HmRun_glm
summary(XXX)
summary(HmRun_glm)

Call:
glm(formula = Salary ~ HmRun, data = hitters)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
  -680    -364    -105     184    1624  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   503.34     125.28    4.02  0.00021 ***
HmRun           9.55       9.31    1.03  0.30995    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 256784)

    Null deviance: 12596020  on 49  degrees of freedom
Residual deviance: 12325615  on 48  degrees of freedom
AIC: 768.7

Number of Fisher Scoring iterations: 2
  1. Using the tidy() function (from the broom package), create a new object called HmRun_tidy which contains all ‘tidy’ results from your HmRun_glm object.
# Save 'tidy' results from HmRun_glm as HmRun_tidy
HmRun_tidy <- tidy(XXX)
HmRun_tidy <- tidy(HmRun_glm)
  1. Print your HmRun_tidy object. Which statistical outputs do you see?
HmRun_tidy
# A tibble: 2 x 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   503.      125.        4.02 0.000206
2 HmRun           9.55      9.31      1.03 0.310   
  1. What is the p-value associated with your HmRun coefficient? How do you interpret this? Do you think that, in the general population of players, there is a non-zero relationship between HmRun and Salary?

  2. Using the confint() function, print a 95% confidence interval for the regression coefficients in your HmRun_glm object. What does this mean?

# Print confidence interval(s) from HmRun_glm
confint(XXX)
confint(HmRun_glm)
Waiting for profiling to be done...
             2.5 % 97.5 %
(Intercept) 257.80  748.9
HmRun        -8.69   27.8
  1. Using the fitted() function, save the fitted values of your glm object as HmRun_fit. Then print the results to see how they look.
# Save fitted values from HmRun_glm as HmRun_fit
HmRun_fit <- fitted(XXX)

# Print results
HmRun_fit
HmRun_fit <- fitted(HmRun_glm)

HmRun_fit
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
542 628 618 694 532 570 580 522 628 561 694 675 570 637 561 675 551 637 
 19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36 
542 532 608 503 656 589 599 618 503 608 513 799 580 694 589 780 522 608 
 37  38  39  40  41  42  43  44  45  46  47  48  49  50 
580 752 503 599 733 551 580 637 618 522 685 561 742 656 
  1. Using min(), max(), mean(), median(), and sd(), calculate summary statistics for the fits. Do the values look like they make sense?
# Print summary statistics from HmRun_fit

min(XXX)
max(XXX)
mean(XXX)
median(XXX)
sd(XXX)
min(HmRun_fit)
[1] 503
max(HmRun_fit)
[1] 799
mean(HmRun_fit)
[1] 609
median(HmRun_fit)
[1] 599
sd(HmRun_fit)
[1] 74.3
  1. Using the code below, plot the relationship between the model fitted values and the true values
# Create a dataframe containing true and fitted values
data <- tibble(truth = hitters$Salary,
               fitted  = HmRun_fit)

# Create scatterplot
ggplot(data = data, aes(x = truth, y = fitted)) +  
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  labs(title = "Relationship between model fits and true salaries",
       subtitle = "Formula = Salary ~ HmRun",
       x = "True Salaries",
       y = "Model Predicted Salaries",
       caption = "Have I mentioned how great R is??") +
  geom_segment(aes(x = truth, y = truth, xend = truth, yend = fitted), 
               col = "red") +
  xlim(c(-300, 3000)) +
  ylim(c(-300, 3000))
  1. Using the rsq() function, print the r-squared value from your HmRun_glm object. What does this tell you?
# Print r-squared value from HmRun_glm
rsq(XXX)
rsq(HmRun_glm)
[1] 0.0215

C - Multiple Regression

  1. Create an object called all_glm which predicts Salary based on all variables in the hitters dataframe. To do this, use the formula = Salary ~ . notation. Use the following guide to help you:
# Create all_glm, a glm object modelling Salary based on ALL variables in hitters
all_glm <- glm(formula = XXX ~ .,
              data = XXX)
all_glm <- glm(formula = Salary ~ .,
                 data = hitters)
  1. Using the tidy() function (from the broom package), create a new object called all_tidy which contains all ‘tidy’ results from your all_glm object.
# Save 'tidy' version of all_glm as all_tidy
all_tidy <- tidy(all_glm)
  1. Print your all_tidy object.
all_tidy
# A tibble: 20 x 5
   term         estimate std.error statistic p.value
   <chr>           <dbl>     <dbl>     <dbl>   <dbl>
 1 (Intercept)  664.       278.        2.39   0.0235
 2 AtBat         -1.81       2.15     -0.843  0.406 
 3 Hits          12.7        9.31      1.37   0.182 
 4 HmRun         14.0       26.5       0.530  0.600 
 5 Runs         -10.8        9.89     -1.09   0.285 
 6 RBI           -5.19      10.8      -0.479  0.635 
 7 Walks         -8.67       7.75     -1.12   0.272 
 8 Years        -62.4       48.1      -1.30   0.204 
 9 CAtBat         0.488      0.459     1.06   0.295 
10 CHits         -4.96       2.80     -1.77   0.0872
11 CHmRun        -8.40       8.05     -1.04   0.305 
12 CRuns          5.72       2.70      2.12   0.0427
13 CRBI           4.31       3.45      1.25   0.221 
14 CWalks         0.375      1.26      0.298  0.768 
15 LeagueN      144.       229.        0.629  0.534 
16 DivisionW    -75.0      158.       -0.474  0.639 
17 PutOuts       -0.0843     0.283    -0.298  0.768 
18 Assists        0.593      0.603     0.984  0.333 
19 Errors        -9.77      12.6      -0.772  0.446 
20 NewLeagueN   224.       237.        0.943  0.353 
  1. How do you interpret the results? Which variables appear to be related to salary?

  2. Using the confint() function, print a 95% confidence interval for the regression coefficients in your all_glm object. What does this mean?

# Print confidence interval(s) of coefficients from all_glm
confint(XXX)
confint(all_glm)
Waiting for profiling to be done...
               2.5 %   97.5 %
(Intercept)  118.835 1209.199
AtBat         -6.016    2.399
Hits          -5.517   30.970
HmRun        -37.864   65.905
Runs         -30.141    8.623
RBI          -26.395   16.023
Walks        -23.855    6.515
Years       -156.593   31.860
CAtBat        -0.410    1.387
CHits        -10.448    0.538
CHmRun       -24.168    7.373
CRuns          0.423   11.010
CRBI          -2.446   11.058
CWalks        -2.091    2.842
LeagueN     -304.834  592.897
DivisionW   -385.192  235.145
PutOuts       -0.639    0.471
Assists       -0.589    1.776
Errors       -34.558   15.018
NewLeagueN  -241.066  688.462
  1. Using the fitted() function, save the fitted values of your glm object as All_fit. Then print the results to see how they look.
# Save fitted values from all_glm as All_fit
All_fit <- fitted(XXX)

# Print results
All_fit
All_fit <- fitted(all_glm)

All_fit
     1      2      3      4      5      6      7      8      9     10 
 409.6  183.3  599.0 1085.5  -18.0  993.0  527.9  872.5 1804.7  623.7 
    11     12     13     14     15     16     17     18     19     20 
1012.5  274.4  170.5  871.7  256.0  359.2  220.4  975.4  914.6  563.1 
    21     22     23     24     25     26     27     28     29     30 
 306.8  973.5  978.2  234.7  537.9  846.4 1344.0  531.6  566.3 1094.0 
    31     32     33     34     35     36     37     38     39     40 
 445.7  928.8  650.4  -61.7  593.0  414.5  299.8  487.9  354.1  710.2 
    41     42     43     44     45     46     47     48     49     50 
1519.6  508.5  213.4  588.1  478.8 -118.2  492.2  705.0  847.3  269.7 
  1. Using min(), max(), mean(), median(), and sd(), calculate summary statistics for the fits. Do the values look like they make sense?
# Print summary statistics from All_fit

min(XXX)
max(XXX)
mean(XXX)
median(XXX)
sd(XXX)
min(All_fit)
[1] -118
max(All_fit)
[1] 1805
mean(All_fit)
[1] 609
median(All_fit)
[1] 551
sd(All_fit)
[1] 391
  1. Using the code below, plot the relationship between the model fitted values and the true values
# Create dataframe containing true and fitted values
data <- tibble(truth = hitters$Salary,
               fitted  = All_fit)

# Create a scatterplot
ggplot(data = data, aes(x = truth, y = fitted)) +  
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  labs(title = "Relationship between model fits and true Salaries",
       subtitle = "Formula = Salary ~ .",
       x = "True Salaries",
       y = "Model Predicted Salaries") +
  geom_segment(aes(x = truth, y = truth, xend = truth, yend = fitted), 
               col = "red") +
  xlim(c(-300, 3000)) +
  ylim(c(-300, 3000))
  1. Using the rsq() function, print the r-squared value from your all_glm object. What does this tell you?
# Print r-squared value from all_glm
rsq(XXX)
rsq(all_glm)

D - Exploring model residuals

  1. Using the residuals() function, save the residuals of your regression analysis to a vector called all_resid
all_resid <- residuals(XXX)
all_resid <- residuals(all_glm)
  1. Using the min(), max(), mean(), median() and sd() functions, calculate the summary statistics of the residuals all_resid. Do the values look like they make sense?

  2. Using the following code, create a histogram of the residuals from your all_glm object

# Create a dataframe containing the residuals
data <- tibble(resid = all_resid)

ggplot(data, aes(x = resid)) + 
  geom_histogram(col = "white") +
  geom_vline(xintercept = 0, col = "red") +
  labs(title = "Residuals of salary regression model",
       x = "Model residual (Truth - Predicted)")
  1. Using the abs() function, create a new vector called all_abs_resid which contains the absolute residuals. Use the following guide
all_abs_resid <- abs(residuals(XXX))
all_abs_resid <- abs(residuals(all_glm))
  1. Using min(), max(), mean(), median(), and sd(), calculate summary statistics for the absolute residuals all_abs_resid. Do the values look like they make sense?
min(all_abs_resid)
[1] 4.66
max(all_abs_resid)
[1] 1154
mean(all_abs_resid)
[1] 253
median(all_abs_resid)
[1] 190
sd(all_abs_resid)
[1] 197
  1. Using the following code, create a histogram of the absolute residuals all_abs_resid. How do you interpret this plot?
# Create a dataframe containing the absolute residuals
data <- tibble(abs_resid = all_abs_resid)

ggplot(data, aes(x = abs_resid)) + 
  geom_histogram(col = "white") +
  geom_vline(xintercept = 0, col = "red") +
  labs(title = "Absolute Residuals of Salary GLM",
       x = "Absolute Model Residuals (smaller is better)")
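A residual is simply the true value minus the fitted value. You can verify that residuals() returns exactly that for a gaussian model (a minimal sketch using R's built-in mtcars data rather than hitters):

```r
# For a gaussian glm, residuals() returns truth - fitted
mod <- glm(mpg ~ wt, data = mtcars)

all.equal(unname(residuals(mod)),
          mtcars$mpg - unname(fitted(mod)))  # TRUE
```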

E - Adjusting your model with formulas

Excluding variables with -
  1. Using the guides below, create a regression object my_glm predicting Salary based on all variables except for AtBat and Hits
my_glm <- glm(formula = XXX ~ . - XXX - XXX,
              data = XXX)
my_glm <- glm(formula = Salary ~ . - AtBat - Hits,
              data = hitters)
  1. Using the tidy() function, print tidy results from your my_glm object. Do you see that the variables you excluded are indeed not in the model?
tidy(my_glm)
# A tibble: 18 x 5
   term          estimate std.error statistic p.value
   <chr>            <dbl>     <dbl>     <dbl>   <dbl>
 1 (Intercept)  681.        276.       2.47    0.0192
 2 HmRun          4.24       23.7      0.179   0.859 
 3 Runs          -1.94        5.27    -0.369   0.714 
 4 RBI            0.773       8.74     0.0884  0.930 
 5 Walks         -8.19        7.29    -1.12    0.270 
 6 Years        -63.1        47.9     -1.32    0.197 
 7 CAtBat         0.187       0.390    0.479   0.635 
 8 CHits         -2.58        2.16    -1.19    0.241 
 9 CHmRun        -6.73        7.47    -0.901   0.374 
10 CRuns          3.73        2.21     1.68    0.102 
11 CRBI           3.05        3.00     1.02    0.317 
12 CWalks         0.602       1.21     0.496   0.623 
13 LeagueN      210.        222.       0.944   0.352 
14 DivisionW   -122.        154.      -0.794   0.433 
15 PutOuts        0.00971     0.260    0.0373  0.970 
16 Assists        0.779       0.581    1.34    0.189 
17 Errors       -14.9        11.7     -1.27    0.212 
18 NewLeagueN   102.        220.       0.466   0.645 
Including interactions with *
  1. Using the guides below, create a regression object my_glm predicting Salary based on the interaction between AtBat and Hits
my_glm <- glm(formula = XXX ~  XXX * XXX,
              data = XXX)
my_glm <- glm(formula = Salary ~ AtBat * Hits,
              data = hitters)
  1. Using the tidy() function, print tidy results from your my_glm object. What outputs do you see? Do you see a significant interaction?
tidy(my_glm)
# A tibble: 4 x 5
  term          estimate std.error statistic  p.value
  <chr>            <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  1150.     283.           4.06 0.000189
2 AtBat          -2.80     1.51        -1.85 0.0713  
3 Hits           -7.38     7.11        -1.04 0.305   
4 AtBat:Hits      0.0274   0.00784      3.49 0.00107 
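Note that the * in the formula is shorthand: A * B expands to A + B + A:B, i.e. both main effects plus the interaction term. A quick check (using R's built-in mtcars data as a stand-in):

```r
# Salary ~ AtBat * Hits is shorthand for Salary ~ AtBat + Hits + AtBat:Hits
m_star <- glm(mpg ~ wt * hp,         data = mtcars)
m_long <- glm(mpg ~ wt + hp + wt:hp, data = mtcars)

all.equal(coef(m_star), coef(m_long))  # TRUE
```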

F - Predicting values for new data

  1. The hitters_new.csv file contains values from 10 new players. Load this data as a new dataframe called hitters_new
hitters_new <- read_csv("XXX/XXX")
hitters_new <- read_csv("1_Data/hitters_new.csv")
Parsed with column specification:
cols(
  .default = col_double(),
  League = col_character(),
  Division = col_character(),
  NewLeague = col_character()
)
See spec(...) for full column specifications.
  1. Print the hitters_new dataframe to see how it looks.

  2. Using the following code, print only the values of Salary and Hits

hitters_new %>%
  select(Salary, Hits)
  1. You should have already created your hits_glm object in an earlier question. Print your hits_glm object again to see what the model coefficients are.

  2. Based on the coefficients you’ve observed, what should the predicted value of the first player in hitters_new be? Answer this by calculating the value ‘by hand’ (Hint: Salary = Intercept + Coefficient * Hits)

  3. Using your hits_glm object, predict the salaries of these new players using the predict() function. Save the results to an object called hitters_new_pred:

hitters_new_pred <- predict(object = XXX,    # glm object
                            newdata = XXX)   # New dataframe
hitters_new_pred <- predict(object = hits_glm, 
        newdata = hitters_new)
  1. Print your hitters_new_pred object. What are these values?
hitters_new_pred
  1   2   3   4   5   6   7   8   9  10 
681 512 723 771 266 592 726 525 764 643 
  1. Does the first value match the value you calculated ‘by hand’? Aren’t you glad that R can do this for you and you don’t have to constantly calculate values manually?
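The 'by hand' calculation above generalises: for a simple regression, predict() just computes Intercept + Coefficient * predictor for each new row. A minimal sketch of this equivalence (using R's built-in mtcars data rather than hitters_new):

```r
# predict() on a one-predictor model is just intercept + slope * x
mod      <- glm(mpg ~ wt, data = mtcars)
new_data <- data.frame(wt = 3)  # one hypothetical new observation

by_hand <- coef(mod)["(Intercept)"] + coef(mod)["wt"] * new_data$wt
all.equal(unname(by_hand),
          unname(predict(mod, newdata = new_data)))  # TRUE
```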

X - Challenges

  1. A colleague of yours wants to know which variable is better at predicting a player’s salary: RBI (number of runs batted in) or Walks (number of walks). Conduct the appropriate analysis to answer the question by completing the following steps:
    • Visualise the relationships in two separate scatterplots
    • Conduct the two separate appropriate analyses using glm()
    • Print the test statistic for each coefficient
    • Calculate the mean absolute residuals for each model.
  2. What happens if you run a linear model with two independent variables that are very highly correlated? To answer this, first run the code below to create hitters_sim, a copy of hitters with a new variable called Buns which is almost an exact copy of Runs (but with a little bit of random error).
# Add a column called Buns which is highly correlated with Runs

hitters_sim <- hitters %>%
  mutate(Buns = Runs + rnorm(n = nrow(hitters), mean = 0, sd = 1))

# Visualise the relationship between Runs and Buns

ggplot(data = hitters_sim, aes(x = Runs, y = Buns)) +
  geom_point()

Once you have run this code, create and analyse the following three separate linear models.

# Runs only model
runs_glm <- glm(formula = Salary ~ Runs,
                data = hitters)

# Buns only model

buns_glm <- glm(formula = Salary ~ Buns,
                data = hitters_sim)

# Runs and Buns model
runs_and_buns_glm <- glm(formula = Salary ~ Runs + Buns,
                         data = hitters_sim)

Compare the outputs of these three models. What do you find? What does this tell you about using highly correlated variables in linear models?

  1. Look at the coefficients in your all_glm model, you’ll notice that there is one called NewLeagueN, even though there is no such variable in your hitters dataframe (though there is one called NewLeague). Using the table() function, look at all possible values of hitters$NewLeague. What do you see? Based on this, what do you think the NewLeagueN variable means? Why is it called NewLeagueN?

Examples

# Regression models in R

library(tidyverse)  # For mpg data and tidy data functions
library(broom)      # For tidy()
library(rsq)        # For rsq()

# Q: Does displacement predict gas mileage?

## Visualise the data

ggplot(data = mpg, aes(x = displ, y = hwy)) + 
  geom_point() +
  labs(title = "Predicting hwy from displ")

# Model

hwy_mod <- glm(formula = hwy ~ displ,
               data = mpg)

# Summary information =================================

summary(hwy_mod)

# Call:
# glm(formula = hwy ~ displ, data = mpg)
# 
# Deviance Residuals: 
#     Min       1Q   Median       3Q      Max  
# -7.1039  -2.1646  -0.2242   2.0589  15.0105  
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  35.6977     0.7204   49.55   <2e-16 ***
# displ        -3.5306     0.1945  -18.15   <2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for gaussian family taken to be 14.71478)
# 
#     Null deviance: 8261.7  on 233  degrees of freedom
# Residual deviance: 3413.8  on 232  degrees of freedom
# AIC: 1297.2
# 
# Number of Fisher Scoring iterations: 2

# Tidy results ==========================================

tidy(hwy_mod)

#   term        estimate std.error statistic   p.value
#   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
# 1 (Intercept)    35.7      0.720      49.6 2.12e-125
# 2 displ          -3.53     0.195     -18.2 2.04e- 46

#  (coefficient is negative and p-value is very small!)

# Confidence intervals ==================================

confint(hwy_mod)

#                 2.5 %    97.5 %
# (Intercept) 34.285756 37.109546
# displ       -3.911829 -3.149349


# Fitted values =========================================

hwy_fit <- fitted(hwy_mod)

hwy_fit

#       1        2        3        4        5 
# 29.34259 29.34259 28.63647 28.63647 25.81200 

#  First 5 fitted values look reasonable

# Residuals =========================================

hwy_resid <- residuals(hwy_mod)

hwy_resid

#          1          2          3          4          5 
# -0.3425912 -0.3425912  2.3635266  1.3635266  0.1879976 

#  First 5 residuals don't look so off

# True vs. Fitted values =================================

data <- tibble(truth = mpg$hwy,
               fitted  = hwy_fit)

ggplot(data = data, aes(x = truth, y = fitted)) +  
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  labs(title = "Relationship between model fits and true values",
       subtitle = "Formula = hwy ~ displ") +
  geom_segment(aes(x = truth, y = truth, xend = truth, yend = fitted), 
               col = "red", alpha = .4) 


# R-squared value =========================================

rsq(hwy_mod)

# [1] 0.5867867

# R-square is 0.586..., meaning that around 59% of the variance in
#  hwy can be accounted for by displ
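As with the hitters models, this r-squared can be recovered by hand from the deviances reported by summary(hwy_mod) above:

```r
# R-squared = 1 - residual deviance / null deviance
# (deviances taken from the summary(hwy_mod) output above)
1 - 3413.8 / 8261.7
# ~0.587, matching rsq(hwy_mod)
```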

Datasets

File Rows Columns
hitters.csv 50 20
hitters_new.csv 10 20

The hitters.csv file contains Major League Baseball Data from the 1986 and 1987 seasons. The data originally come from the ISLR package.

Variable description

Name Description
AtBat Number of times at bat in 1986
Hits Number of hits in 1986
HmRun Number of home runs in 1986
Runs Number of runs in 1986
RBI Number of runs batted in in 1986
Walks Number of walks in 1986
Years Number of years in the major leagues
CAtBat Number of times at bat during his career
CHits Number of hits during his career
CHmRun Number of home runs during his career
CRuns Number of runs during his career
CRBI Number of runs batted in during his career
CWalks Number of walks during his career
League A factor with levels A and N indicating player’s league at the end of 1986
Division A factor with levels E and W indicating player’s division at the end of 1986
PutOuts Number of put outs in 1986
Assists Number of assists in 1986
Errors Number of errors in 1986
Salary 1987 annual salary on opening day in thousands of dollars
NewLeague A factor with levels A and N indicating player’s league at the beginning of 1987

Functions

Packages

Package Installation
tidyverse install.packages("tidyverse")
broom install.packages("broom")
rsq install.packages("rsq")

Functions

Function Package Description
glm stats Fit a generalized linear model
tidy broom Show most important outputs from a statistical object
confint stats Show confidence intervals of coefficients from a statistical object
fitted stats Show fitted values from a statistical object
rsq rsq Print r-squared value from a statistical object
residuals stats Show residuals (true minus fitted values) from a statistical object

Resources

Books