Trulli
from FinanceMonthly.com

Overview

In this practical you’ll practice specific hypothesis testing extensions of the linear model

By the end of this practical you will know how to:

  1. Conduct a t-test (one sample, two sample, and paired)
  2. Conduct a correlation test and compare to glm
  3. Conduct an ANOVA and compare to glm
  4. Select and run the appopriate test for the appropriate research question.

Tasks

A - Setup

  1. Open your BaselRBootcamp R project. It should already have the folders 1_Data and 2_Code. Make sure that the data files listed in the Datasets section above are in your 1_Data folder
# Done!
  1. Open a new R script. At the top of the script, using comments, write your name and the date. Save it as a new file called LinearModels_II_practical.R in the 2_Code folder.
# Done!
  1. Using library() load the packages below (if you don’t have them, you’ll need to install them first).
# Load packages necessary for this script
library(tidyverse)
library(broom)
library(rsq)
  1. Using the following template, load the trial_ACTG175.csv data into R and store it as a new object called trial (Hint: Don’t type the path directly! Use the “tab” completion!).
# Load trial_ACTG175.csv from the 1_Data folder as a new object called trial

XX <- read_csv(file = "XX/XX")
# Load trial_act.csv from the 1_Data folder

trial <- read_csv(file = "1_Data/trial_ACTG175.csv")
Parsed with column specification:
cols(
  patient_id = col_double(),
  molecule = col_character(),
  days = col_double(),
  age = col_double(),
  weight_kg = col_double(),
  drug_history = col_character(),
  race = col_character(),
  gender = col_character(),
  symptomatic = col_character(),
  cd4_baseline = col_double(),
  cd4_96_weeks = col_double(),
  cd8_baseline = col_double(),
  cd8_20_weeks = col_double()
)
  1. Take a look at the first few rows of the dataset(s) by printing it/them to the console, it should look like the table below:
patient_id molecule days age weight_kg drug_history race gender symptomatic cd4_baseline cd4_96_weeks cd8_baseline cd8_20_weeks
10056 Zidovudine and Zalcitabine 948 48 89.8 No White Female Asymptomatic 422 660 566 324
10059 Didanosine 1002 61 49.4 No White Female Asymptomatic 162 NA 392 564
10089 Didanosine 961 45 88.5 Yes White Male Asymptomatic 326 122 2063 1893
10093 Didanosine 1166 47 85.3 No White Male Asymptomatic 287 NA 1590 966
10124 Zidovudine 1090 43 66.7 No White Male Asymptomatic 504 660 870 782
# Print trial object
XXX
trial
# A tibble: 2,139 x 13
   patient_id molecule  days   age weight_kg drug_history race  gender
        <dbl> <chr>    <dbl> <dbl>     <dbl> <chr>        <chr> <chr> 
 1      10056 Zidovud…   948    48      89.8 No           White Female
 2      10059 Didanos…  1002    61      49.4 No           White Female
 3      10089 Didanos…   961    45      88.5 Yes          White Male  
 4      10093 Didanos…  1166    47      85.3 No           White Male  
 5      10124 Zidovud…  1090    43      66.7 No           White Male  
 6      10140 zidovud…  1181    46      88.9 Yes          White Male  
 7      10165 Zidovud…   794    31      73.0 No           White Male  
 8      10190 Zidovud…   957    41      66.2 Yes          White Male  
 9      10198 Didanos…   198    40      82.6 No           White Male  
10      10229 Zidovud…   188    35      78.0 No           White Male  
# … with 2,129 more rows, and 5 more variables: symptomatic <chr>,
#   cd4_baseline <dbl>, cd4_96_weeks <dbl>, cd8_baseline <dbl>,
#   cd8_20_weeks <dbl>
  1. Use the summary() function to print more details on the columns of the datasets.
# Show summary information from trial
summary(XXX)
summary(trial)
   patient_id       molecule              days           age      
 Min.   : 10056   Length:2139        Min.   :  14   Min.   :12.0  
 1st Qu.: 81446   Class :character   1st Qu.: 727   1st Qu.:29.0  
 Median :190566   Mode  :character   Median : 997   Median :34.0  
 Mean   :248778                      Mean   : 879   Mean   :35.2  
 3rd Qu.:280277                      3rd Qu.:1091   3rd Qu.:40.0  
 Max.   :990077                      Max.   :1231   Max.   :70.0  
                                                                  
   weight_kg     drug_history           race              gender         
 Min.   : 31.0   Length:2139        Length:2139        Length:2139       
 1st Qu.: 66.7   Class :character   Class :character   Class :character  
 Median : 74.4   Mode  :character   Mode  :character   Mode  :character  
 Mean   : 75.1                                                           
 3rd Qu.: 82.6                                                           
 Max.   :159.9                                                           
                                                                         
 symptomatic         cd4_baseline   cd4_96_weeks   cd8_baseline 
 Length:2139        Min.   :   0   Min.   :   0   Min.   :  40  
 Class :character   1st Qu.: 264   1st Qu.: 209   1st Qu.: 654  
 Mode  :character   Median : 340   Median : 321   Median : 893  
                    Mean   : 351   Mean   : 329   Mean   : 987  
                    3rd Qu.: 423   3rd Qu.: 440   3rd Qu.:1207  
                    Max.   :1199   Max.   :1190   Max.   :5011  
                                   NA's   :797                  
  cd8_20_weeks 
 Min.   : 124  
 1st Qu.: 632  
 Median : 865  
 Mean   : 935  
 3rd Qu.:1146  
 Max.   :6035  
               
  1. Use the View() function to view the entire dataframe(s) in window(s)
# View trial dataset in new window
View(XXX)
View(trial)

B - Correlation

Research Question: Is there a correlation between a patient’s age and their CD4 T cell count at baseline?

  1. Using the code below, create a plot showing the relationship between age and CD4 T cell count at baseline. Based on this plot, do you expect to find a relationship between the two variables? (Hint: See all of the possible named colors in R by running colors(). Then change the color of your histogram to your favorite one!)
ggplot(data = trial, mapping = aes(x = age, y = cd4_baseline)) + 
      geom_point(alpha = .2, col = "darkblue") +
  labs(title = "ACTG175 Trial",
       subtitle = "Baseline age and CD4 T cell count at baseline ",
       x = "Age (age)",
       y = "CD4 T cell count at baseline (cd4_baseline)")
  1. Using the cor.test() function, create an object called age_cd4_baseline_test which conducts a correlation test between a patient’s age at baseline (age) and their CD4 T cell count at baseline (cd4_baseline)
# Correlation between patient's age at baseline and CD4 T cell count at baseline

XXX <- cor.test(formula = ~ XXX + XXX,
                data = XXX)
# Correlation between patient's age at baseline and CD4 T cell count at baseline

age_cd4_baseline_test <- cor.test(formula = ~ age + cd4_baseline,
                          data = trial)
  1. Print the age_cd4_baseline_test object to see summary outputs.
age_cd4_baseline_test

    Pearson's product-moment correlation

data:  age and cd4_baseline
t = -2, df = 2000, p-value = 0.06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.08254  0.00208
sample estimates:
    cor 
-0.0403 
  1. Apply the names() function to your age_cd4_baseline_test object to see all of its named elements
# See all named elements of age_cd4_baseline_test
names(XXX)
# See all named elements of age_cd4_baseline_test
names(age_cd4_baseline_test)
[1] "statistic"   "parameter"   "p.value"     "estimate"    "null.value" 
[6] "alternative" "method"      "data.name"   "conf.int"   
  1. Using the XXX$XXX notation, print only the correlation estimate (estimate) in your age_cd4_baseline_test object. What does this value tell you?
# Print correlation coefficient estimate in the age_cd4_baseline_test object
XXX$XXX
# Print correlation coefficient estimate
age_cd4_baseline_test$estimate
    cor 
-0.0403 
  1. Again using the XXX$XXX notation, print only the p value (p.value) in your age_cd4_baseline_test object. What does this value tell you?
# Print the p value of the correlation coefficient in the age_cd4_baseline_test object
XXX$XXX
# Print the p value in the age_cd4_baseline_test object
age_cd4_baseline_test$conf.int
[1] -0.08254  0.00208
attr(,"conf.level")
[1] 0.95
  1. Again using the XXX$XXX notation, print only the confidence interval (conf.int) in your age_cd4_baseline_test object. What does this value tell you?
# Print the confidence interval of the correlation coefficient in the age_cd4_baseline_test object
XXX$XXX
# Print the confidence interval in the age_cd4_baseline_test object
age_cd4_baseline_test$conf.int
[1] -0.08254  0.00208
attr(,"conf.level")
[1] 0.95
  1. Using the tidy() function (from the broom package), create a ‘tidy’ version of your age_cd4_baseline_test object called age_cd4_baseline_tidy
# Create a 'tidy' version of my age_cd4_baseline_test object

age_cd4_baseline_tidy <- tidy(xxx)
# Create a 'tidy' version of my age_cd4_baseline_test object

age_cd4_baseline_tidy <- tidy(age_cd4_baseline_test)
  1. Now, print your age_cd4_baseline_tidy object, how do the results look?
age_cd4_baseline_tidy
# A tibble: 1 x 8
  estimate statistic p.value parameter conf.low conf.high method
     <dbl>     <dbl>   <dbl>     <int>    <dbl>     <dbl> <chr> 
1  -0.0403     -1.86  0.0624      2137  -0.0825   0.00208 Pears…
# … with 1 more variable: alternative <chr>
Correlation as Linear Model
  1. Conduct the same analysis you did before, but this time use the glm() function. Use the code skeleton below to help you. Do you get the same results as you did using cor.test()?
# GLM predicting a patient's CD4 T cell count at baseline form their age at baseline 
age_cd4_baseline_glm <- glm(formula = XXX ~ XXX,
                            data = XXX)

# Print 'tidy' results
tidy(XXX)
# GLM predicting a patient's CD4 T cell count at baseline form their age at baseline 
age_cd4_baseline_glm <- glm(formula = cd4_baseline ~ age,
                    data = trial)

# Print 'tidy' results
tidy(age_cd4_baseline_glm)
# A tibble: 2 x 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)  370.       10.7       34.6  8.09e-209
2 age           -0.549     0.294     -1.86 6.24e-  2

C - T-test

Research Question: Do patients with a history of intravenous drug use have higher CD4 T cell counts at baseline compared to patients without a history of intravenous drug use?

  1. Using the code below, create a plot showing the distribution of CD4 T cell counts at baseline (cd4_baseline) depending on whether or not patients were symptomatic. Based on this plot, do you expect to find that symptomatic patients generally have higher (or lower) CD4 T cell counts?
ggplot(data = trial, mapping = aes(x = symptomatic, y = cd4_baseline, col = symptomatic)) + 
  stat_summary(geom = "bar", fun.y = "mean", fill = "white", col = "black") +
  geom_jitter(width = .1, alpha = .1) +
  labs(title = "ACTG175 Trial",
       subtitle = "Baseline CD4 T cell count and symptoms",
       x = "Are patients symptomatic? (symptomatic)",
       y = "CD4 T cell count at baseline (cd4_baseline)") +
  guides(col = FALSE)
  1. Using the t.test() function, create an object called symptomatic_cd4_baseline_test which conducts a two-sample t-test comparing patients’ CD4 T cell count at baseline (cd4_baseline) between patients that are symptomatic and those that are not (symptomatic)
# T-test comparing CD4 T cell counts at baseline bewteen symptomatic and non-symptomatic patients

XXX <- t.test(formula = XXX ~ XXX,
              data = XXX)
# T-test comparing CD4 T cell counts at baseline bewteen symptomatic and non-symptomatic patients

symptomatic_cd4_baseline_test <- t.test(formula = cd4_baseline ~ symptomatic,
                           data = trial)
  1. Print the symptomatic_cd4_baseline_test object to see summary outputs.
age_cd4_baseline_test

    Pearson's product-moment correlation

data:  age and cd4_baseline
t = -2, df = 2000, p-value = 0.06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.08254  0.00208
sample estimates:
    cor 
-0.0403 
  1. Apply the names() function to your symptomatic_cd4_baseline_test object to see all of its named elements
# See all named elements of symptom_cd4_baseline_test
names(XXX)
# See all named elements of symptomatic_cd4_baseline_test
names(symptomatic_cd4_baseline_test)
[1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
[6] "null.value"  "alternative" "method"      "data.name"  
  1. Using the XXX$XXX notation, print only the correlation estimate (estimate) in your symptomatic_cd4_baseline_test object. What does the result tell you?
# Print parameter estimates in the symptomatic_cd4_baseline_test object
XXX$XXX
# Print parameter estimates in the symptomatic_cd4_baseline_test object
symptomatic_cd4_baseline_test$estimate
mean in group Asymptomatic  mean in group Symptomatic 
                       358                        317 
  1. Again using the XXX$XXX notation, print only the p value (p.value) in your symptomatic_cd4_baseline_test object. What does this value tell you?
# Print the p value of the symptomatic_cd4_baseline_test object
XXX$XXX
# Print the p value in the symptomatic_cd4_baseline_test object
symptomatic_cd4_baseline_test$p.value
[1] 1.66e-10
  1. Again using the XXX$XXX notation, print only the confidence interval (conf.int) in your symptomatic_cd4_baseline_test object. What does this value tell you?
# Print the confidence interval in the symptomatic_cd4_baseline_test object
XXX$XXX
# Print the confidence interval in the symptomatic_cd4_baseline_test object
symptomatic_cd4_baseline_test$conf.int
[1] 28.7 53.5
attr(,"conf.level")
[1] 0.95
  1. Using the tidy() function (from the broom package), create a ‘tidy’ version of your symptomatic_cd4_baseline_test object called symptomatic_cd4_baseline_tidy
# Create a 'tidy' version of my symptomatic_cd4_baseline_test object

symptomatic_cd4_baseline_tidy <- tidy(xxx)
# Create a 'tidy' version of my symptomatic_cd4_baseline_test object

symptomatic_cd4_baseline_tidy <- tidy(symptomatic_cd4_baseline_test)
  1. Now, print your symptomatic_cd4_baseline_tidy object, how do the results look?
symptomatic_cd4_baseline_tidy
# A tibble: 1 x 10
  estimate estimate1 estimate2 statistic  p.value parameter conf.low
     <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>    <dbl>
1     41.1      358.      317.      6.51 1.66e-10      573.     28.7
# … with 3 more variables: conf.high <dbl>, method <chr>,
#   alternative <chr>
T-test as Linear Model
  1. Conduct the same analysis you did before, but this time use the glm() function. Use the code skeleton below to help you. Do you get the same results as you did using t.test()? Which outputs are the same and which are different?
# GLM predicting a patient's CD4 T cell count at baseline form their age at baseline 
symptomatic_cd4_baseline_glm <- glm(formula = XXX ~ XXX,
                                    data = XXX)

# Print 'tidy' results
tidy(XXX)
# GLM predicting a patient's CD4 T cell count at baseline form their age at baseline 
symptomatic_cd4_baseline_glm <- glm(formula = cd4_baseline ~ symptomatic,
                        data = trial)

# Print 'tidy' results
tidy(symptomatic_cd4_baseline_glm)
# A tibble: 2 x 5
  term                   estimate std.error statistic       p.value
  <chr>                     <dbl>     <dbl>     <dbl>         <dbl>
1 (Intercept)               358.       2.80    128.   0            
2 symptomaticSymptomatic    -41.1      6.72     -6.11 0.00000000119

D - ANOVA

Research Question: Was there a difference between the treatment molecule in the number of days until the first occurrence of one of the following major negative events: (i) a decline in CD4 T cell count of at least 50 (ii) an event indicating progression to AIDS, or (iii) death?

  1. Using the code below, create a plot showing the distribution of the number of days until a major negative event (days) for each treatment molecule molecule Based on this plot, do you expect to find that there were differences in treatment molecules?
ggplot(data = trial, mapping = aes(x = molecule, y = days, col = molecule)) + 
  stat_summary(geom = "bar", fun.y = "mean", fill = "white", col = "black") +
  geom_jitter(width = .2, alpha = .2) +
  labs(title = "ACTG175 Trial",
       subtitle = "Did the molecule affect days until negative events?",
       x = "Treatment Arm (molecule)",
       y = "Number of days until a major negative event (days)") +
  guides(col = FALSE)
  1. Using the aov() function, create an object called days_molecule_aov which conducts an ANOVA comparing the number of days until a negative event (days) between treatment molecule (molecule).
# ANOVA comparing number of days until a negative event between treatment molecule

XXX <- aov(formula = XXX ~ XXX,
              data = XXX)
# ANOVA comparing number of days until a negative event between treatment molecule

days_molecule_aov <- aov(formula = days ~ molecule,
                      data = trial)
  1. Print the days_molecule_aov object to see summary outputs.
days_molecule_aov
Call:
   aov(formula = days ~ molecule, data = trial)

Terms:
                molecule Residuals
Sum of Squares  4.43e+06  1.78e+08
Deg. of Freedom        3      2135

Residual standard error: 289
Estimated effects may be unbalanced
  1. Apply the names() function to your days_molecule_aov object to see all of its named elements
# See all named elements of days_molecule_aov
names(XXX)
# See all named elements of symptom_cd4_baseline_test
names(days_molecule_aov)
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "contrasts"     "xlevels"       "call"          "terms"        
[13] "model"        
  1. Using the XXX$XXX notation, print the coefficients in your days_molecule_aov object. What do these values mean?
# Print coefficients in the days_molecule_aov object
XXX$XXX
# Print coefficients in the days_molecule_aov object
days_molecule_aov$coefficients
                       (Intercept)                 moleculeZidovudine 
                             893.5                              -92.2 
 moleculezidovudine and Didanosine moleculeZidovudine and Zalcitabine 
                              22.8                               12.3 
  1. Using the tidy() function (from the broom package), create a ‘tidy’ version of your days_molecule_aov object called days_molecule_tidy
# Create a 'tidy' version of my symptom_cd4_baseline_test days_molecule_aov

days_molecule_tidy <- tidy(xxx)
# Create a 'tidy' version of my days_molecule_aov object

days_molecule_tidy <- tidy(days_molecule_aov)
  1. Now, print your days_molecule_tidy object, how do the results look?
days_molecule_tidy
# A tibble: 2 x 6
  term         df      sumsq   meansq statistic   p.value
  <chr>     <dbl>      <dbl>    <dbl>     <dbl>     <dbl>
1 molecule      3   4433564. 1477855.      17.7  2.37e-11
2 Residuals  2135 178203547.   83468.      NA   NA       
Post-hoc tests
  1. In order to compare pair-wise differences between groups after conducting an ANOVA, you need to conduct post-hoc tests. Using the TukeyHSD() function, conduct pair-wise tests to see which pairs of treatment molecule differ in days. Save the result to an object called days_molecule_aov_post `
# Conduct pair-wise tests comparing every pair of treatment
#  molecule (molecule) on days until a major negative event (days)

XXX <- TukeyHSD(XXX)
# Conduct pair-wise tests comparing every pair of treatment
#  molecule (molecule) on days until a major negative event (days)

days_molecule_aov_post <- TukeyHSD(days_molecule_aov)
  1. Print your days_molecule_aov_post object, what do you see?

  2. Print ‘tidy’ results from your days_molecule_aov_post object using the tidy() function

# Print 'tidy' version of days_molecule_aov_post
tidy(XXX)
  1. Based on what you’ve seen, what are your conclusions? Which treatment molecule seem to differ?
ANOVA as Linear Model
  1. Conduct the same analysis you did before, but this time use the glm() function. Use the code skeleton below to help you. Do you get the same results as you did using aov()? Which outputs are the same and which are different?
# GLM predicting number of days until a major negative event (days) as a function of treatment arm (molecule)
XXX <- glm(formula = XXX ~ XXX,
           data = XXX)

# Print object
XXX

# Print 'tidy' results
tidy(XXX)
# GLM predicting number of days until a major negative event (days) as a function of treatment arm (molecule)
days_molecule_glm <- glm(formula = days ~ molecule,
                     data = trial)

# Print object
days_molecule_glm

Call:  glm(formula = days ~ molecule, data = trial)

Coefficients:
                       (Intercept)                  moleculeZidovudine  
                             893.5                               -92.2  
 moleculezidovudine and Didanosine  moleculeZidovudine and Zalcitabine  
                              22.8                                12.3  

Degrees of Freedom: 2138 Total (i.e. Null);  2135 Residual
Null Deviance:      1.83e+08 
Residual Deviance: 1.78e+08     AIC: 30300
# Print 'tidy' results
tidy(days_molecule_glm)
# A tibble: 4 x 5
  term                              estimate std.error statistic    p.value
  <chr>                                <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept)                          893.       12.2    73.2      0.     
2 moleculeZidovudine                   -92.2      17.5    -5.28     1.45e-7
3 moleculezidovudine and Didanosine     22.8      17.6     1.30     1.95e-1
4 moleculeZidovudine and Zalcitabi…     12.3      17.6     0.699    4.85e-1

F - Regression

  1. It’s time to conduct a regression analysis predicting day the number of days until a major negative event days based on several variables. Perform a regression analysis predicting days until a major negative event days based on the following variables: molecule, gender, age, weight_kg, drug_history, race, cd4_baseline, and cd8_baseline. Use the code chunk below to help you
# Predict days based on many variables

days_glm <- glm(formula = XXX ~ XXX + XXX + XXX ...,
                data = XXX)
# Predict days based on many variables

days_glm <- glm(formula = days ~ molecule + gender + age + weight_kg + drug_history + race + cd4_baseline + cd8_baseline,
                data = trial)
  1. Using tidy() look at the summary results. Which variables seem to be able to predict the number of days until a major negative event?
tidy(XXX)
tidy(days_glm)
# A tibble: 11 x 5
   term                               estimate std.error statistic  p.value
   <chr>                                 <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)                         6.84e+2   46.1       14.9   1.51e-47
 2 moleculeZidovudine                 -9.50e+1   17.1       -5.55  3.23e- 8
 3 moleculezidovudine and Didanosine   2.25e+1   17.2        1.31  1.91e- 1
 4 moleculeZidovudine and Zalcitabine  9.59e+0   17.2        0.558 5.77e- 1
 5 genderMale                          9.36e+0   17.6        0.531 5.96e- 1
 6 age                                 1.16e+0    0.715      1.62  1.06e- 1
 7 weight_kg                           6.92e-2    0.481      0.144 8.85e- 1
 8 drug_historyYes                    -1.88e+1   18.4       -1.02  3.06e- 1
 9 raceWhite                           2.56e+1   14.2        1.80  7.14e- 2
10 cd4_baseline                        5.15e-1    0.0530     9.72  6.97e-22
11 cd8_baseline                       -3.96e-2    0.0131    -3.01  2.62e- 3
  1. Using rsq() calculate the r-squared value of your model. Was it high or low? Recall that r-squared can range anywhere from 0 (meaning the model explains none of the variance in days), all the way up to 1 (meaning the model expllains all of the variance in days).
rsq(XXX)
rsq(days_glm)
[1] 0.0695
  1. Using the code below, visualise the residuals from the model. What does this mean?
# Save residuals as days_glm_resid
days_glm_resid <- resid(days_glm)

# Plot histogram
ggplot(data = tibble(days_glm_resid), aes(x = days_glm_resid)) +
  geom_histogram(col = "white", fill = "skyblue") +
  geom_vline(xintercept = 0) +
  labs(title = "Residuals of regression model predicting days",
       x = "Absolute residuals abs(Fit - Truth)",
       y = 'Frequency')
  1. Using the code below, calculate the mean residuals from the model. What do these mean?
# Calculate the mean residuals of the model
mean(resid(XXX))
  1. Using the code below, visualise the absolute residuals from the model. What does this mean?
# Save residuals as days_glm_resid
days_glm_absresid <- abs(resid(days_glm))

# Plot histogram
ggplot(data = tibble(days_glm_absresid), aes(x = days_glm_absresid)) +
  geom_histogram(col = "white", fill = "skyblue") +
  geom_vline(xintercept = 0) +
  labs(title = "Absolute residuals of regression model predicting days",
       x = "Absolute residuals abs(Fit - Truth)",
       y = 'Frequency')
  1. Using the code below, calculate the mean absolute residuals from the model. What do these mean?
# Calculate the mean absolute residuals of the model
mean(abs(resid(XXX)))

X - Advanced: Choose your own test!

For the remaining tasks, you choose the test! For each task, do the following:

  • Visualize the data (use the code given in the sections above for help!)
  • Conduct the appropriate test using an applied hypothesis test function (i.e.; correlation test, t-test, ANOVA or chi-square test).
  • Explore the results (e.g.; estimate, confidence interval, p-value).
  • Conduct the same test using the generalized linear model function glm()
  • Explore the results and confirm you get the same result from your applied function!
  1. Do people with a history of intravenous drug use (drug_history) tend to have major negative events (days) faster than those without a history of intravenous drug use?

  2. Do men and women (gender) have different CD8 T cell counts at baseline (cd8_baseline)?

  3. Is there a relationship between patient’s CD4 T cell counts at baseline (cd4_baseline) and at 96 weeks (cd4_96_weeks)?

  4. Did older patients have higher CD4 T cell counts at baseline (cd4_baseline) compared to younger patients?

  5. Was there an effect of treatment arm (molecule) on the change of CD4 T cell counts between baseline (cd4_baseline) and 96 weeks (`cd4_96_weeks)? Hint: to do this, you first need to create a column that shows the change in CD4 T cell counts. The code below should help you with this:

trial <- trial %>%
  mutate(cd4_96_weeks_change = cd4_96_weeks - cd4_baseline)   # Add cd4_96_weeks_change to trial, indicating change from
                                        # in cd4 T cell count from baseline to 96 weeks.

Y - Advanced: Chi-square test

Research Question: Was there an unequal distribution of gender in the treatment molecule?

  1. Using the code below, create a plot showing the percentage of patients who were male (gender == "Male") for each treatment arm molecule. Based on this plot, do you expect to find that there were differences in treatment molecule?
# Calculate percentage of Male in each treatment arm

molecule_gender <- trial %>%
  group_by(molecule) %>%
  summarise(male_p = mean(gender == "Male"))

ggplot(data = molecule_gender, mapping = aes(x = molecule, y = male_p)) + 
  geom_bar(stat = "identity") +
  ylim(c(0, 1)) +
  labs(title = "ACTG175 AIDS trial",
       subtitle = "Percent of males in each treatment arm",
       x = "Treatment Arm (molecule)",
       y = "Percent Male (gender)")
  1. Using the chisq.test() function, create an object called gender_molecule_test which conducts a chi-square test comparing the frequencies of each gender (gender) for each level of molecule (molecule).
  • Note: The chisq.test() function (unfortunately) does not support formulas, so we have to enter the data in as two separate (factor) vectors:
# Chi-square test comparing the frequencies of each gender for each molecule

XXX <- chisq.test(x = factor(trial$XXX),
                  y = factor(trial$XXX))
# Chi-square test comparing the frequencies of each gender for each molecule

gender_molecule_test <- chisq.test(x = factor(trial$gender),
                              y = factor(trial$molecule))
  1. Print the gender_molecule_test object to see summary outputs.
gender_molecule_test

    Pearson's Chi-squared test

data:  factor(trial$gender) and factor(trial$molecule)
X-squared = 1, df = 3, p-value = 0.7
  1. Apply the names() function to your gender_molecule_test object to see all of its named elements
# See all named elements of gender_molecule_test
names(XXX)
# See all named elements of gender_molecule_test
names(gender_molecule_test)
  1. Using the XXX$XXX notation, print the observed frequencies in your gender_molecule_test object. What do these values mean?
# Print observed frequencies in the gender_molecule_test object
XXX$XXX
# Print observed frequencies in the gender_molecule_test object
gender_molecule_test$observed
                    factor(trial$molecule)
factor(trial$gender) Didanosine Zidovudine zidovudine and Didanosine
              Female         91        100                        88
              Male          470        432                       434
                    factor(trial$molecule)
factor(trial$gender) Zidovudine and Zalcitabine
              Female                         89
              Male                          435
  1. Using the XXX$XXX notation, print the expected frequencies in your gender_molecule_test object. What do these values mean?
# Print expected frequencies in the gender_molecule_test object
XXX$XXX
# Print expected frequencies in the gender_molecule_test object
gender_molecule_test$expected
                    factor(trial$molecule)
factor(trial$gender) Didanosine Zidovudine zidovudine and Didanosine
              Female       96.5       91.5                      89.8
              Male        464.5      440.5                     432.2
                    factor(trial$molecule)
factor(trial$gender) Zidovudine and Zalcitabine
              Female                       90.2
              Male                        433.8
  1. Using the tidy() function (from the broom package), create a ‘tidy’ version of your gender_molecule_test object called gender_molecule_tidy
# Create a 'tidy' version of my gender_molecule_test called gender_molecule_tidy

gender_molecule_tidy <- tidy(xxx)
# Create a 'tidy' version of my gender_arm_test object

gender_molecule_tidy <- tidy(gender_molecule_test)
  1. Now, print your gender_molecule_tidy object, how do the results look?
gender_molecule_tidy
# A tibble: 1 x 4
  statistic p.value parameter method                    
      <dbl>   <dbl>     <int> <chr>                     
1      1.39   0.708         3 Pearson's Chi-squared test
Chi-square as Linear Model
  1. In the next section we’ll compare the chisq.test() result to one using glm(). However, before we do, we need to restructure the data a bit. Run the following code to create a dataframe containing the counts for each combination of molecule and gender:
# Create a dataframe showing counts for each combination of
#  molecule and gender

molecule_gender_counts <- trial %>%
                      group_by(molecule, gender) %>%
                      summarise(counts = n())
  1. Print your molecule_gender_counts object to see how it looks. What do the results mean?

  2. Conduct the same analysis you did before, but this time use the glm() function. Use the code skeleton below to help you. Do you get the same results as you did using chisq.test()? Which outputs are the same and which are different?

  • Be sure to model counts based on the interaction between molecule and gender using formula = counts ~ molecule * gender.
  • We’ll summarise the results of the model using the anova() function with the argument test = "Rao" to specify that we are conducting a score test (see https://en.wikipedia.org/wiki/Score_test for details)
# Create linear model predicting counts as a function of the 
#  interaction between molecule and gender

molecule_gender_glm <- glm(formula = XXX ~ XXX * XXX, # Model counts by the interaction of molecule and gender
                       family = "poisson",        # Use poisson model family to specify we are modelling counts
                       data = XXX)

# Print summary results as an anova table
anova(molecule_gender_glm, 
      test = 'Rao')   # Show a 'Rao" score test
# Create linear model predicting counts as a function of the 
#  interaction between molecule and gender

molecule_gender_glm <- glm(formula = counts ~ molecule * gender, 
                           family = "poisson", 
                           data = molecule_gender_counts)

# Print summary results as an anova table
anova(molecule_gender_glm, 
      test = 'Rao')   # Show a 'Rao" score test
Analysis of Deviance Table

Model: poisson, link: log

Response: counts

Terms added sequentially (first to last)

                Df Deviance Resid. Df Resid. Dev Rao Pr(>Chi)    
NULL                                7       1004                 
molecule         3        2         4       1003   2     0.61    
gender           1     1001         3          1 920   <2e-16 ***
molecule:gender  3        1         0          0   1     0.71    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Examples

library(tidyverse)
library(broom)

# We'll use the mpg dataset
mpg

# Correlation test -------------------------------------------------------------------

# Q: Is there a relationship between a car's city and highway miles per gallon? ======

my_test <- cor.test(formula = ~ cty + hwy,     # DV ~ IV
                     data = mpg)

my_test # Print result

#   Pearson's product-moment correlation
# 
# data:  cty and hwy
# t = 49.585, df = 232, p-value < 2.2e-16
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
#  0.9433129 0.9657663
# sample estimates:
#       cor 
# 0.9559159 

names(my_test)  # Print all named elements

# [1] "statistic"   "parameter"   "p.value"     "estimate"    "null.value"  "alternative" "method"     
# [8] "data.name"   "conf.int" 

my_test$estimate

#       cor 
# 0.9559159 

my_test$p.value

# [1] 1.868307e-125

tidy(my_test)

# # A tibble: 1 x 8
#   estimate statistic   p.value parameter conf.low conf.high method                           alternative
#      <dbl>     <dbl>     <dbl>     <int>    <dbl>     <dbl> <chr>                            <chr>      
# 1    0.956      49.6 1.87e-125       232    0.943     0.966 Pearson's product-moment correl… two.sided  


# 2-sample t-test -------------------------------------------------------------------

# Q: Has there been a change in highway gas mileage from 1999 to 2008?

my_ttest <- t.test(formula = hwy ~ factor(year),     # DV ~ IV
                   alternative = "two.sided",  # Two-sided test
                   data = mpg)         # The data
                   

my_ttest  # Print result

#   Welch Two Sample t-test
# 
# data:  hwy by factor(year)
# t = -0.032864, df = 231.64, p-value = 0.9738
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
#  -1.562854  1.511572
# sample estimates:
# mean in group 1999 mean in group 2008 
#           23.42735           23.45299 

names(my_ttest)   # Show all named elements

# [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"    "null.value"  "alternative"
# [8] "method"      "data.name"  

my_ttest$estimate  # Show estimates

# mean in group 1999 mean in group 2008 
#           23.42735           23.45299 

my_ttest$conf.int  # Show confidence interval of difference

# [1] -1.562854  1.511572
# attr(,"conf.level")
# [1] 0.95

my_ttest$p.value  # Show p-value

# [1] 0.9738111

tidy(my_ttest)  # Show tidy results

# # A tibble: 1 x 10
#   estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high method         alternative
#      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl> <chr>          <chr>      
# 1  -0.0256      23.4      23.5   -0.0329   0.974      232.    -1.56      1.51 Welch Two Sam… two.sided  



# ANOVA  -------------------------------------------------------------------

# Q: Is there a difference in highway miles per gallon between car manufacturers?

my_aov <- aov(formula = hwy ~ manufacturer,     # DV ~ IV
              data = mpg)                       # The data
                   

my_aov  # Print result

# Call:
#    aov(formula = hwy ~ manufacturer, data = mpg)
# 
# Terms:
#                 manufacturer Residuals
# Sum of Squares      4459.858  3801.805
# Deg. of Freedom           14       219
# 
# Residual standard error: 4.166514
# Estimated effects may be unbalanced

summary(my_aov)   # Summary results

# Analysis of Variance Table
# 
# Response: hwy
#               Df Sum Sq Mean Sq F value    Pr(>F)    
# manufacturer  14 4459.9  318.56  18.351 < 2.2e-16 ***
# Residuals    219 3801.8   17.36                      
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


names(my_aov)   # Show all named elements

#  [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"       
#  [7] "qr"            "df.residual"   "contrasts"     "xlevels"       "call"          "terms"        
# [13] "model"

my_ttest$coefficients  # Show coefficients

 #         (Intercept)  manufacturerchevrolet      manufacturerdodge       manufacturerford 
 #         26.44444444            -4.54970760            -8.49849850            -7.08444444 
 #   manufacturerhonda    manufacturerhyundai       manufacturerjeep manufacturerland rover 
 #          6.11111111             0.41269841            -8.81944444            -9.94444444 
 # manufacturerlincoln    manufacturermercury     manufacturernissan    manufacturerpontiac 
 #         -9.44444444            -8.44444444            -1.82905983            -0.04444444 
 #  manufacturersubaru     manufacturertoyota manufacturervolkswagen 
 #         -0.87301587            -1.53267974             2.77777778 


tidy(my_aov)  # Show tidy results

#   term            df sumsq meansq statistic   p.value
#   <chr>        <dbl> <dbl>  <dbl>     <dbl>     <dbl>
# 1 manufacturer    14 4460.  319.       18.4  9.22e-30
# 2 Residuals      219 3802.   17.4      NA   NA   

Datasets

File Rows Columns
trial_ACTG175.csv 2139 27

The trial_ACTG175.csv data from the ACTG 175 clinical trial. ACTG 175 was a randomized clinical trial to compare monotherapy with zidovudine or didanosine with combination therapy with zidovudine and didanosine or zidovudine and zalcitabine in adults infected with the human immunodeficiency virus type I whose CD4 T cell counts were between 200 and 500 per cubic millimeter.

The trial is documented in Hammer SM, et al. (1996), “A trial comparing nucleoside monotherapy with combination therapy in HIV-infected adults with CD4 cell counts from 200 to 500 per cubic millimeter.”, New England Journal of Medicine, 335:1081–1090. The full publication can be found at https://www.nejm.org/doi/full/10.1056/NEJM199610103351501

The data were originally obtained from the speff2trial package.

Variable description

Name Description
patient_id patient’s ID number
molecule treatment arm (0=zidovudine, 1=zidovudine and didanosine, 2=zidovudine and zalcitabine, 3=didanosine).
days number of days until the first occurrence of: (i) a decline in CD4 T cell count of at least 50 (ii) an event indicating progression to AIDS, or (iii) death.
age age in years at baseline
weight_kg weight in kg at baseline
drug_history history of intravenous drug use (0=no, 1=yes)
race race (0=white, 1=non-white)
gender gender (0=female, 1=male)
symptomatic symptomatic indicator (0=asymptomatic, 1=symptomatic)
cd4_baseline CD4 T cell count at baseline
cd4_96_weeks CD4 T cell count at 96±5 weeks (=NA if missing)
cd8_baseline CD8 T cell count at baseline
cd8_20_weeks CD8 T cell count at 20±5 weeks

Functions

Packages

Package Installation
tidyverse install.packages("tidyverse")
broom install.packages("broom")
rsq install.packages("rsq")

Functions

Function Package Description
cor.test() stats Correlation test
t.test() stats Student’s T-test
aov() stats Analysis of Variance
chisq.test() stats Chi-square test
glm() stats Generalized linear model
tidy() broom Create ‘tidy’ results from statistical objects
rsq() rsq Obtain and print R-squared values from a regression object

Resources

Vignettes