Basel R Bootcamp

from FinanceMonthly.com

In this practical you’ll practice specific hypothesis testing extensions of the linear model

By the end of this practical you will know how to:

- Conduct a t-test (one sample, two sample, and paired)
- Conduct a correlation test and compare to glm
- Conduct an ANOVA and compare to glm
- Select and run the appopriate test for the appropriate research question.

- Open your
`BaselRBootcamp`

R project. It should already have the folders`1_Data`

and`2_Code`

. Make sure that the data files listed in the`Datasets`

section above are in your`1_Data`

folder

`# Done!`

- Open a new R script. At the top of the script, using comments, write your name and the date. Save it as a new file called
`LinearModels_II_practical.R`

in the`2_Code`

folder.

`# Done!`

- Using
`library()`

load the packages below (if you don’t have them, you’ll need to install them first).

```
# Load packages necessary for this script
library(tidyverse)
library(broom)
library(rsq)
```

- Using the following template, load the
`trial_ACTG175.csv`

data into R and store it as a new object called`trial`

(Hint: Don’t type the path directly! Use the “tab” completion!).

```
# Load trial_ACTG175.csv from the 1_Data folder as a new object called trial
XX <- read_csv(file = "XX/XX")
```

```
# Load trial_act.csv from the 1_Data folder
trial <- read_csv(file = "1_Data/trial_ACTG175.csv")
```

```
Parsed with column specification:
cols(
patient_id = col_double(),
molecule = col_character(),
days = col_double(),
age = col_double(),
weight_kg = col_double(),
drug_history = col_character(),
race = col_character(),
gender = col_character(),
symptomatic = col_character(),
cd4_baseline = col_double(),
cd4_96_weeks = col_double(),
cd8_baseline = col_double(),
cd8_20_weeks = col_double()
)
```

- Take a look at the first few rows of the dataset(s) by printing it/them to the console, it should look like the table below:

patient_id | molecule | days | age | weight_kg | drug_history | race | gender | symptomatic | cd4_baseline | cd4_96_weeks | cd8_baseline | cd8_20_weeks |
---|---|---|---|---|---|---|---|---|---|---|---|---|

10056 | Zidovudine and Zalcitabine | 948 | 48 | 89.8 | No | White | Female | Asymptomatic | 422 | 660 | 566 | 324 |

10059 | Didanosine | 1002 | 61 | 49.4 | No | White | Female | Asymptomatic | 162 | NA | 392 | 564 |

10089 | Didanosine | 961 | 45 | 88.5 | Yes | White | Male | Asymptomatic | 326 | 122 | 2063 | 1893 |

10093 | Didanosine | 1166 | 47 | 85.3 | No | White | Male | Asymptomatic | 287 | NA | 1590 | 966 |

10124 | Zidovudine | 1090 | 43 | 66.7 | No | White | Male | Asymptomatic | 504 | 660 | 870 | 782 |

```
# Print trial object
XXX
```

`trial`

```
# A tibble: 2,139 x 13
patient_id molecule days age weight_kg drug_history race gender
<dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
1 10056 Zidovud… 948 48 89.8 No White Female
2 10059 Didanos… 1002 61 49.4 No White Female
3 10089 Didanos… 961 45 88.5 Yes White Male
4 10093 Didanos… 1166 47 85.3 No White Male
5 10124 Zidovud… 1090 43 66.7 No White Male
6 10140 zidovud… 1181 46 88.9 Yes White Male
7 10165 Zidovud… 794 31 73.0 No White Male
8 10190 Zidovud… 957 41 66.2 Yes White Male
9 10198 Didanos… 198 40 82.6 No White Male
10 10229 Zidovud… 188 35 78.0 No White Male
# … with 2,129 more rows, and 5 more variables: symptomatic <chr>,
# cd4_baseline <dbl>, cd4_96_weeks <dbl>, cd8_baseline <dbl>,
# cd8_20_weeks <dbl>
```

- Use the
`summary()`

function to print more details on the columns of the datasets.

```
# Show summary information from trial
summary(XXX)
```

`summary(trial)`

```
patient_id molecule days age
Min. : 10056 Length:2139 Min. : 14 Min. :12.0
1st Qu.: 81446 Class :character 1st Qu.: 727 1st Qu.:29.0
Median :190566 Mode :character Median : 997 Median :34.0
Mean :248778 Mean : 879 Mean :35.2
3rd Qu.:280277 3rd Qu.:1091 3rd Qu.:40.0
Max. :990077 Max. :1231 Max. :70.0
weight_kg drug_history race gender
Min. : 31.0 Length:2139 Length:2139 Length:2139
1st Qu.: 66.7 Class :character Class :character Class :character
Median : 74.4 Mode :character Mode :character Mode :character
Mean : 75.1
3rd Qu.: 82.6
Max. :159.9
symptomatic cd4_baseline cd4_96_weeks cd8_baseline
Length:2139 Min. : 0 Min. : 0 Min. : 40
Class :character 1st Qu.: 264 1st Qu.: 209 1st Qu.: 654
Mode :character Median : 340 Median : 321 Median : 893
Mean : 351 Mean : 329 Mean : 987
3rd Qu.: 423 3rd Qu.: 440 3rd Qu.:1207
Max. :1199 Max. :1190 Max. :5011
NA's :797
cd8_20_weeks
Min. : 124
1st Qu.: 632
Median : 865
Mean : 935
3rd Qu.:1146
Max. :6035
```

- Use the
`View()`

function to view the entire dataframe(s) in window(s)

```
# View trial dataset in new window
View(XXX)
```

`View(trial)`

Research Question: Is there a correlation between a patient’s age and their CD4 T cell count at baseline?

- Using the code below, create a plot showing the relationship between age and CD4 T cell count at baseline. Based on this plot, do you expect to find a relationship between the two variables? (Hint: See all of the possible named colors in R by running
`colors()`

. Then change the color of your histogram to your favorite one!)

```
ggplot(data = trial, mapping = aes(x = age, y = cd4_baseline)) +
geom_point(alpha = .2, col = "darkblue") +
labs(title = "ACTG175 Trial",
subtitle = "Baseline age and CD4 T cell count at baseline ",
x = "Age (age)",
y = "CD4 T cell count at baseline (cd4_baseline)")
```

- Using the
`cor.test()`

function, create an object called`age_cd4_baseline_test`

which conducts a correlation test between a patient’s age at baseline (`age`

) and their CD4 T cell count at baseline (`cd4_baseline`

)

```
# Correlation between patient's age at baseline and CD4 T cell count at baseline
XXX <- cor.test(formula = ~ XXX + XXX,
data = XXX)
```

```
# Correlation between patient's age at baseline and CD4 T cell count at baseline
age_cd4_baseline_test <- cor.test(formula = ~ age + cd4_baseline,
data = trial)
```

- Print the
`age_cd4_baseline_test`

object to see summary outputs.

`age_cd4_baseline_test`

```
Pearson's product-moment correlation
data: age and cd4_baseline
t = -2, df = 2000, p-value = 0.06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.08254 0.00208
sample estimates:
cor
-0.0403
```

- Apply the
`names()`

function to your`age_cd4_baseline_test`

object to see all of its named elements

```
# See all named elements of age_cd4_baseline_test
names(XXX)
```

```
# See all named elements of age_cd4_baseline_test
names(age_cd4_baseline_test)
```

```
[1] "statistic" "parameter" "p.value" "estimate" "null.value"
[6] "alternative" "method" "data.name" "conf.int"
```

- Using the
`XXX$XXX`

notation, print only the correlation estimate (`estimate`

) in your`age_cd4_baseline_test`

object. What does this value tell you?

```
# Print correlation coefficient estimate in the age_cd4_baseline_test object
XXX$XXX
```

```
# Print correlation coefficient estimate
age_cd4_baseline_test$estimate
```

```
cor
-0.0403
```

- Again using the
`XXX$XXX`

notation, print only the p value (`p.value`

) in your`age_cd4_baseline_test`

object. What does this value tell you?

```
# Print the p value of the correlation coefficient in the age_cd4_baseline_test object
XXX$XXX
```

```
# Print the p value in the age_cd4_baseline_test object
age_cd4_baseline_test$conf.int
```

```
[1] -0.08254 0.00208
attr(,"conf.level")
[1] 0.95
```

- Again using the
`XXX$XXX`

notation, print only the confidence interval (`conf.int`

) in your`age_cd4_baseline_test`

object. What does this value tell you?

```
# Print the confidence interval of the correlation coefficient in the age_cd4_baseline_test object
XXX$XXX
```

```
# Print the confidence interval in the age_cd4_baseline_test object
age_cd4_baseline_test$conf.int
```

```
[1] -0.08254 0.00208
attr(,"conf.level")
[1] 0.95
```

- Using the
`tidy()`

function (from the`broom`

package), create a ‘tidy’ version of your`age_cd4_baseline_test`

object called`age_cd4_baseline_tidy`

```
# Create a 'tidy' version of my age_cd4_baseline_test object
age_cd4_baseline_tidy <- tidy(xxx)
```

```
# Create a 'tidy' version of my age_cd4_baseline_test object
age_cd4_baseline_tidy <- tidy(age_cd4_baseline_test)
```

- Now, print your
`age_cd4_baseline_tidy`

object, how do the results look?

`age_cd4_baseline_tidy`

```
# A tibble: 1 x 8
estimate statistic p.value parameter conf.low conf.high method
<dbl> <dbl> <dbl> <int> <dbl> <dbl> <chr>
1 -0.0403 -1.86 0.0624 2137 -0.0825 0.00208 Pears…
# … with 1 more variable: alternative <chr>
```

- Conduct the same analysis you did before, but this time use the
`glm()`

function. Use the code skeleton below to help you. Do you get the same results as you did using`cor.test()`

?

```
# GLM predicting a patient's CD4 T cell count at baseline form their age at baseline
age_cd4_baseline_glm <- glm(formula = XXX ~ XXX,
data = XXX)
# Print 'tidy' results
tidy(XXX)
```

```
# GLM predicting a patient's CD4 T cell count at baseline form their age at baseline
age_cd4_baseline_glm <- glm(formula = cd4_baseline ~ age,
data = trial)
# Print 'tidy' results
tidy(age_cd4_baseline_glm)
```

```
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 370. 10.7 34.6 8.09e-209
2 age -0.549 0.294 -1.86 6.24e- 2
```

Research Question: Do patients with a history of intravenous drug use have higher CD4 T cell counts at baseline compared to patients without a history of intravenous drug use?

- Using the code below, create a plot showing the distribution of CD4 T cell counts at baseline (
`cd4_baseline`

) depending on whether or not patients were symptomatic. Based on this plot, do you expect to find that symptomatic patients generally have higher (or lower) CD4 T cell counts?

```
ggplot(data = trial, mapping = aes(x = symptomatic, y = cd4_baseline, col = symptomatic)) +
stat_summary(geom = "bar", fun.y = "mean", fill = "white", col = "black") +
geom_jitter(width = .1, alpha = .1) +
labs(title = "ACTG175 Trial",
subtitle = "Baseline CD4 T cell count and symptoms",
x = "Are patients symptomatic? (symptomatic)",
y = "CD4 T cell count at baseline (cd4_baseline)") +
guides(col = FALSE)
```

- Using the
`t.test()`

function, create an object called`symptomatic_cd4_baseline_test`

which conducts a two-sample t-test comparing patients’ CD4 T cell count at baseline (`cd4_baseline`

) between patients that are symptomatic and those that are not (`symptomatic`

)

```
# T-test comparing CD4 T cell counts at baseline bewteen symptomatic and non-symptomatic patients
XXX <- t.test(formula = XXX ~ XXX,
data = XXX)
```

```
# T-test comparing CD4 T cell counts at baseline bewteen symptomatic and non-symptomatic patients
symptomatic_cd4_baseline_test <- t.test(formula = cd4_baseline ~ symptomatic,
data = trial)
```

- Print the
`symptomatic_cd4_baseline_test`

object to see summary outputs.

`age_cd4_baseline_test`

```
Pearson's product-moment correlation
data: age and cd4_baseline
t = -2, df = 2000, p-value = 0.06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.08254 0.00208
sample estimates:
cor
-0.0403
```

- Apply the
`names()`

function to your`symptomatic_cd4_baseline_test`

object to see all of its named elements

```
# See all named elements of symptom_cd4_baseline_test
names(XXX)
```

```
# See all named elements of symptomatic_cd4_baseline_test
names(symptomatic_cd4_baseline_test)
```

```
[1] "statistic" "parameter" "p.value" "conf.int" "estimate"
[6] "null.value" "alternative" "method" "data.name"
```

- Using the
`XXX$XXX`

notation, print only the correlation estimate (`estimate`

) in your`symptomatic_cd4_baseline_test`

object. What does the result tell you?

```
# Print parameter estimates in the symptomatic_cd4_baseline_test object
XXX$XXX
```

```
# Print parameter estimates in the symptomatic_cd4_baseline_test object
symptomatic_cd4_baseline_test$estimate
```

```
mean in group Asymptomatic mean in group Symptomatic
358 317
```

- Again using the
`XXX$XXX`

notation, print only the p value (`p.value`

) in your`symptomatic_cd4_baseline_test`

object. What does this value tell you?

```
# Print the p value of the symptomatic_cd4_baseline_test object
XXX$XXX
```

```
# Print the p value in the symptomatic_cd4_baseline_test object
symptomatic_cd4_baseline_test$p.value
```

`[1] 1.66e-10`

- Again using the
`XXX$XXX`

notation, print only the confidence interval (`conf.int`

) in your`symptomatic_cd4_baseline_test`

object. What does this value tell you?

```
# Print the confidence interval in the symptomatic_cd4_baseline_test object
XXX$XXX
```

```
# Print the confidence interval in the symptomatic_cd4_baseline_test object
symptomatic_cd4_baseline_test$conf.int
```

```
[1] 28.7 53.5
attr(,"conf.level")
[1] 0.95
```

- Using the
`tidy()`

function (from the`broom`

package), create a ‘tidy’ version of your`symptomatic_cd4_baseline_test`

object called`symptomatic_cd4_baseline_tidy`

```
# Create a 'tidy' version of my symptomatic_cd4_baseline_test object
symptomatic_cd4_baseline_tidy <- tidy(xxx)
```

```
# Create a 'tidy' version of my symptomatic_cd4_baseline_test object
symptomatic_cd4_baseline_tidy <- tidy(symptomatic_cd4_baseline_test)
```

- Now, print your
`symptomatic_cd4_baseline_tidy`

object, how do the results look?

`symptomatic_cd4_baseline_tidy`

```
# A tibble: 1 x 10
estimate estimate1 estimate2 statistic p.value parameter conf.low
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 41.1 358. 317. 6.51 1.66e-10 573. 28.7
# … with 3 more variables: conf.high <dbl>, method <chr>,
# alternative <chr>
```

- Conduct the same analysis you did before, but this time use the
`glm()`

function. Use the code skeleton below to help you. Do you get the same results as you did using`t.test()`

? Which outputs are the same and which are different?

```
# GLM predicting a patient's CD4 T cell count at baseline form their age at baseline
symptomatic_cd4_baseline_glm <- glm(formula = XXX ~ XXX,
data = XXX)
# Print 'tidy' results
tidy(XXX)
```

```
# GLM predicting a patient's CD4 T cell count at baseline form their age at baseline
symptomatic_cd4_baseline_glm <- glm(formula = cd4_baseline ~ symptomatic,
data = trial)
# Print 'tidy' results
tidy(symptomatic_cd4_baseline_glm)
```

```
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 358. 2.80 128. 0
2 symptomaticSymptomatic -41.1 6.72 -6.11 0.00000000119
```

Research Question: Was there a difference between the treatment molecule in the number of days until the first occurrence of one of the following major negative events: (i) a decline in CD4 T cell count of at least 50 (ii) an event indicating progression to AIDS, or (iii) death?

- Using the code below, create a plot showing the distribution of the number of days until a major negative event (
`days`

) for each treatment molecule`molecule`

Based on this plot, do you expect to find that there were differences in treatment molecules?

```
ggplot(data = trial, mapping = aes(x = molecule, y = days, col = molecule)) +
stat_summary(geom = "bar", fun.y = "mean", fill = "white", col = "black") +
geom_jitter(width = .2, alpha = .2) +
labs(title = "ACTG175 Trial",
subtitle = "Did the molecule affect days until negative events?",
x = "Treatment Arm (molecule)",
y = "Number of days until a major negative event (days)") +
guides(col = FALSE)
```

- Using the
`aov()`

function, create an object called`days_molecule_aov`

which conducts an ANOVA comparing the number of days until a negative event (`days`

) between treatment molecule (`molecule`

).

```
# ANOVA comparing number of days until a negative event between treatment molecule
XXX <- aov(formula = XXX ~ XXX,
data = XXX)
```

```
# ANOVA comparing number of days until a negative event between treatment molecule
days_molecule_aov <- aov(formula = days ~ molecule,
data = trial)
```

- Print the
`days_molecule_aov`

object to see summary outputs.

`days_molecule_aov`

```
Call:
aov(formula = days ~ molecule, data = trial)
Terms:
molecule Residuals
Sum of Squares 4.43e+06 1.78e+08
Deg. of Freedom 3 2135
Residual standard error: 289
Estimated effects may be unbalanced
```

- Apply the
`names()`

function to your`days_molecule_aov`

object to see all of its named elements

```
# See all named elements of days_molecule_aov
names(XXX)
```

```
# See all named elements of symptom_cd4_baseline_test
names(days_molecule_aov)
```

```
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "contrasts" "xlevels" "call" "terms"
[13] "model"
```

- Using the
`XXX$XXX`

notation, print the coefficients in your`days_molecule_aov`

object. What do these values mean?

```
# Print coefficients in the days_molecule_aov object
XXX$XXX
```

```
# Print coefficients in the days_molecule_aov object
days_molecule_aov$coefficients
```

```
(Intercept) moleculeZidovudine
893.5 -92.2
moleculezidovudine and Didanosine moleculeZidovudine and Zalcitabine
22.8 12.3
```

- Using the
`tidy()`

function (from the`broom`

package), create a ‘tidy’ version of your`days_molecule_aov`

object called`days_molecule_tidy`

```
# Create a 'tidy' version of my symptom_cd4_baseline_test days_molecule_aov
days_molecule_tidy <- tidy(xxx)
```

```
# Create a 'tidy' version of my days_molecule_aov object
days_molecule_tidy <- tidy(days_molecule_aov)
```

- Now, print your
`days_molecule_tidy`

object, how do the results look?

`days_molecule_tidy`

```
# A tibble: 2 x 6
term df sumsq meansq statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 molecule 3 4433564. 1477855. 17.7 2.37e-11
2 Residuals 2135 178203547. 83468. NA NA
```

- In order to compare pair-wise differences between groups after conducting an ANOVA, you need to conduct post-hoc tests. Using the
`TukeyHSD()`

function, conduct pair-wise tests to see which pairs of treatment molecule differ in`days`

. Save the result to an object called`days_molecule_aov_post`

`

```
# Conduct pair-wise tests comparing every pair of treatment
# molecule (molecule) on days until a major negative event (days)
XXX <- TukeyHSD(XXX)
```

```
# Conduct pair-wise tests comparing every pair of treatment
# molecule (molecule) on days until a major negative event (days)
days_molecule_aov_post <- TukeyHSD(days_molecule_aov)
```

Print your

`days_molecule_aov_post`

object, what do you see?Print ‘tidy’ results from your

`days_molecule_aov_post`

object using the`tidy()`

function

```
# Print 'tidy' version of days_molecule_aov_post
tidy(XXX)
```

- Based on what you’ve seen, what are your conclusions? Which treatment molecule seem to differ?

- Conduct the same analysis you did before, but this time use the
`glm()`

function. Use the code skeleton below to help you. Do you get the same results as you did using`aov()`

? Which outputs are the same and which are different?

```
# GLM predicting number of days until a major negative event (days) as a function of treatment arm (molecule)
XXX <- glm(formula = XXX ~ XXX,
data = XXX)
# Print object
XXX
# Print 'tidy' results
tidy(XXX)
```

```
# GLM predicting number of days until a major negative event (days) as a function of treatment arm (molecule)
days_molecule_glm <- glm(formula = days ~ molecule,
data = trial)
# Print object
days_molecule_glm
```

```
Call: glm(formula = days ~ molecule, data = trial)
Coefficients:
(Intercept) moleculeZidovudine
893.5 -92.2
moleculezidovudine and Didanosine moleculeZidovudine and Zalcitabine
22.8 12.3
Degrees of Freedom: 2138 Total (i.e. Null); 2135 Residual
Null Deviance: 1.83e+08
Residual Deviance: 1.78e+08 AIC: 30300
```

```
# Print 'tidy' results
tidy(days_molecule_glm)
```

```
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 893. 12.2 73.2 0.
2 moleculeZidovudine -92.2 17.5 -5.28 1.45e-7
3 moleculezidovudine and Didanosine 22.8 17.6 1.30 1.95e-1
4 moleculeZidovudine and Zalcitabi… 12.3 17.6 0.699 4.85e-1
```

- It’s time to conduct a regression analysis predicting day the number of days until a major negative event
`days`

based on several variables. Perform a regression analysis predicting days until a major negative event`days`

based on the following variables:`molecule`

,`gender`

,`age`

,`weight_kg`

,`drug_history`

,`race`

,`cd4_baseline`

, and`cd8_baseline`

. Use the code chunk below to help you

```
# Predict days based on many variables
days_glm <- glm(formula = XXX ~ XXX + XXX + XXX ...,
data = XXX)
```

```
# Predict days based on many variables
days_glm <- glm(formula = days ~ molecule + gender + age + weight_kg + drug_history + race + cd4_baseline + cd8_baseline,
data = trial)
```

- Using
`tidy()`

look at the summary results. Which variables seem to be able to predict the number of days until a major negative event?

`tidy(XXX)`

`tidy(days_glm)`

```
# A tibble: 11 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 6.84e+2 46.1 14.9 1.51e-47
2 moleculeZidovudine -9.50e+1 17.1 -5.55 3.23e- 8
3 moleculezidovudine and Didanosine 2.25e+1 17.2 1.31 1.91e- 1
4 moleculeZidovudine and Zalcitabine 9.59e+0 17.2 0.558 5.77e- 1
5 genderMale 9.36e+0 17.6 0.531 5.96e- 1
6 age 1.16e+0 0.715 1.62 1.06e- 1
7 weight_kg 6.92e-2 0.481 0.144 8.85e- 1
8 drug_historyYes -1.88e+1 18.4 -1.02 3.06e- 1
9 raceWhite 2.56e+1 14.2 1.80 7.14e- 2
10 cd4_baseline 5.15e-1 0.0530 9.72 6.97e-22
11 cd8_baseline -3.96e-2 0.0131 -3.01 2.62e- 3
```

- Using
`rsq()`

calculate the r-squared value of your model. Was it high or low? Recall that r-squared can range anywhere from 0 (meaning the model explains*none*of the variance in`days`

), all the way up to 1 (meaning the model expllains*all*of the variance in`days`

).

`rsq(XXX)`

`rsq(days_glm)`

`[1] 0.0695`

- Using the code below, visualise the residuals from the model. What does this mean?

```
# Save residuals as days_glm_resid
days_glm_resid <- resid(days_glm)
# Plot histogram
ggplot(data = tibble(days_glm_resid), aes(x = days_glm_resid)) +
geom_histogram(col = "white", fill = "skyblue") +
geom_vline(xintercept = 0) +
labs(title = "Residuals of regression model predicting days",
x = "Absolute residuals abs(Fit - Truth)",
y = 'Frequency')
```

- Using the code below, calculate the mean residuals from the model. What do these mean?

```
# Calculate the mean residuals of the model
mean(resid(XXX))
```

- Using the code below, visualise the
*absolute*residuals from the model. What does this mean?

```
# Save residuals as days_glm_resid
days_glm_absresid <- abs(resid(days_glm))
# Plot histogram
ggplot(data = tibble(days_glm_absresid), aes(x = days_glm_absresid)) +
geom_histogram(col = "white", fill = "skyblue") +
geom_vline(xintercept = 0) +
labs(title = "Absolute residuals of regression model predicting days",
x = "Absolute residuals abs(Fit - Truth)",
y = 'Frequency')
```

- Using the code below, calculate the mean absolute residuals from the model. What do these mean?

```
# Calculate the mean absolute residuals of the model
mean(abs(resid(XXX)))
```

For the remaining tasks, you choose the test! For each task, do the following:

- Visualize the data (use the code given in the sections above for help!)
- Conduct the appropriate test using an applied hypothesis test function (i.e.; correlation test, t-test, ANOVA or chi-square test).
- Explore the results (e.g.; estimate, confidence interval, p-value).
- Conduct the same test using the generalized linear model function
`glm()`

- Explore the results and confirm you get the same result from your applied function!

Do people with a history of intravenous drug use (

`drug_history`

) tend to have major negative events (`days`

) faster than those without a history of intravenous drug use?Do men and women (

`gender`

) have different CD8 T cell counts at baseline (`cd8_baseline`

)?Is there a relationship between patient’s CD4 T cell counts at baseline (

`cd4_baseline`

) and at 96 weeks (`cd4_96_weeks`

)?Did older patients have higher CD4 T cell counts at baseline (

`cd4_baseline`

) compared to younger patients?Was there an effect of treatment arm (

`molecule`

) on the*change*of CD4 T cell counts between baseline (`cd4_baseline`

) and 96 weeks (`cd4_96_weeks)? Hint: to do this, you first need to create a column that shows the change in CD4 T cell counts. The code below should help you with this:

```
trial <- trial %>%
mutate(cd4_96_weeks_change = cd4_96_weeks - cd4_baseline) # Add cd4_96_weeks_change to trial, indicating change from
# in cd4 T cell count from baseline to 96 weeks.
```

Research Question: Was there an unequal distribution of gender in the treatment molecule?

- Using the code below, create a plot showing the percentage of patients who were male (
`gender == "Male")`

for each treatment arm`molecule`

. Based on this plot, do you expect to find that there were differences in treatment molecule?

```
# Calculate percentage of Male in each treatment arm
molecule_gender <- trial %>%
group_by(molecule) %>%
summarise(male_p = mean(gender == "Male"))
ggplot(data = molecule_gender, mapping = aes(x = molecule, y = male_p)) +
geom_bar(stat = "identity") +
ylim(c(0, 1)) +
labs(title = "ACTG175 AIDS trial",
subtitle = "Percent of males in each treatment arm",
x = "Treatment Arm (molecule)",
y = "Percent Male (gender)")
```

- Using the
`chisq.test()`

function, create an object called`gender_molecule_test`

which conducts a chi-square test comparing the frequencies of each gender (`gender`

) for each level of molecule (`molecule`

).

- Note: The
`chisq.test()`

function (unfortunately) does not support formulas, so we have to enter the data in as two separate (factor) vectors:

```
# Chi-square test comparing the frequencies of each gender for each molecule
XXX <- chisq.test(x = factor(trial$XXX),
y = factor(trial$XXX))
```

```
# Chi-square test comparing the frequencies of each gender for each molecule
gender_molecule_test <- chisq.test(x = factor(trial$gender),
y = factor(trial$molecule))
```

- Print the
`gender_molecule_test`

object to see summary outputs.

`gender_molecule_test`

```
Pearson's Chi-squared test
data: factor(trial$gender) and factor(trial$molecule)
X-squared = 1, df = 3, p-value = 0.7
```

- Apply the
`names()`

function to your`gender_molecule_test`

object to see all of its named elements

```
# See all named elements of gender_molecule_test
names(XXX)
```

```
# See all named elements of gender_molecule_test
names(gender_molecule_test)
```

- Using the
`XXX$XXX`

notation, print the*observed*frequencies in your`gender_molecule_test`

object. What do these values mean?

```
# Print observed frequencies in the gender_molecule_test object
XXX$XXX
```

```
# Print observed frequencies in the gender_molecule_test object
gender_molecule_test$observed
```

```
factor(trial$molecule)
factor(trial$gender) Didanosine Zidovudine zidovudine and Didanosine
Female 91 100 88
Male 470 432 434
factor(trial$molecule)
factor(trial$gender) Zidovudine and Zalcitabine
Female 89
Male 435
```

- Using the
`XXX$XXX`

notation, print the*expected*frequencies in your`gender_molecule_test`

object. What do these values mean?

```
# Print expected frequencies in the gender_molecule_test object
XXX$XXX
```

```
# Print expected frequencies in the gender_molecule_test object
gender_molecule_test$expected
```

```
factor(trial$molecule)
factor(trial$gender) Didanosine Zidovudine zidovudine and Didanosine
Female 96.5 91.5 89.8
Male 464.5 440.5 432.2
factor(trial$molecule)
factor(trial$gender) Zidovudine and Zalcitabine
Female 90.2
Male 433.8
```

- Using the
`tidy()`

function (from the`broom`

package), create a ‘tidy’ version of your`gender_molecule_test`

object called`gender_molecule_tidy`

```
# Create a 'tidy' version of my gender_molecule_test called gender_molecule_tidy
gender_molecule_tidy <- tidy(xxx)
```

```
# Create a 'tidy' version of my gender_arm_test object
gender_molecule_tidy <- tidy(gender_molecule_test)
```

- Now, print your
`gender_molecule_tidy`

object, how do the results look?

`gender_molecule_tidy`

```
# A tibble: 1 x 4
statistic p.value parameter method
<dbl> <dbl> <int> <chr>
1 1.39 0.708 3 Pearson's Chi-squared test
```

- In the next section we’ll compare the
`chisq.test()`

result to one using`glm()`

. However, before we do, we need to restructure the data a bit. Run the following code to create a dataframe containing the counts for each combination of molecule and gender:

```
# Create a dataframe showing counts for each combination of
# molecule and gender
molecule_gender_counts <- trial %>%
group_by(molecule, gender) %>%
summarise(counts = n())
```

Print your

`molecule_gender_counts`

object to see how it looks. What do the results mean?Conduct the same analysis you did before, but this time use the

`glm()`

function. Use the code skeleton below to help you. Do you get the same results as you did using`chisq.test()`

? Which outputs are the same and which are different?

- Be sure to model
`counts`

based on the*interaction*between`molecule`

and`gender`

using`formula = counts ~ molecule * gender`

. - We’ll summarise the results of the model using the
`anova()`

function with the argument`test = "Rao"`

to specify that we are conducting a score test (see https://en.wikipedia.org/wiki/Score_test for details)

```
# Create linear model predicting counts as a function of the
# interaction between molecule and gender
molecule_gender_glm <- glm(formula = XXX ~ XXX * XXX, # Model counts by the interaction of molecule and gender
family = "poisson", # Use poisson model family to specify we are modelling counts
data = XXX)
# Print summary results as an anova table
anova(molecule_gender_glm,
test = 'Rao') # Show a 'Rao" score test
```

```
# Create linear model predicting counts as a function of the
# interaction between molecule and gender
molecule_gender_glm <- glm(formula = counts ~ molecule * gender,
family = "poisson",
data = molecule_gender_counts)
# Print summary results as an anova table
anova(molecule_gender_glm,
test = 'Rao') # Show a 'Rao" score test
```

```
Analysis of Deviance Table
Model: poisson, link: log
Response: counts
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Rao Pr(>Chi)
NULL 7 1004
molecule 3 2 4 1003 2 0.61
gender 1 1001 3 1 920 <2e-16 ***
molecule:gender 3 1 0 0 1 0.71
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

```
library(tidyverse)
library(broom)
# We'll use the mpg dataset
mpg
# Correlation test -------------------------------------------------------------------
# Q: Is there a relationship between a car's city and highway miles per gallon? ======
my_test <- cor.test(formula = ~ cty + hwy, # DV ~ IV
data = mpg)
my_test # Print result
# Pearson's product-moment correlation
#
# data: cty and hwy
# t = 49.585, df = 232, p-value < 2.2e-16
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
# 0.9433129 0.9657663
# sample estimates:
# cor
# 0.9559159
names(my_test) # Print all named elements
# [1] "statistic" "parameter" "p.value" "estimate" "null.value" "alternative" "method"
# [8] "data.name" "conf.int"
my_test$estimate
# cor
# 0.9559159
my_test$p.value
# [1] 1.868307e-125
tidy(my_test)
# # A tibble: 1 x 8
# estimate statistic p.value parameter conf.low conf.high method alternative
# <dbl> <dbl> <dbl> <int> <dbl> <dbl> <chr> <chr>
# 1 0.956 49.6 1.87e-125 232 0.943 0.966 Pearson's product-moment correl… two.sided
# 2-sample t-test -------------------------------------------------------------------
# Q: Has there been a change in highway gas mileage from 1999 to 2008?
my_ttest <- t.test(formula = hwy ~ factor(year), # DV ~ IV
alternative = "two.sided", # Two-sided test
data = mpg) # The data
my_ttest # Print result
# Welch Two Sample t-test
#
# data: hwy by factor(year)
# t = -0.032864, df = 231.64, p-value = 0.9738
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -1.562854 1.511572
# sample estimates:
# mean in group 1999 mean in group 2008
# 23.42735 23.45299
names(my_ttest) # Show all named elements
# [1] "statistic" "parameter" "p.value" "conf.int" "estimate" "null.value" "alternative"
# [8] "method" "data.name"
my_ttest$estimate # Show estimates
# mean in group 1999 mean in group 2008
# 23.42735 23.45299
my_ttest$conf.int # Show confidence interval of difference
# [1] -1.562854 1.511572
# attr(,"conf.level")
# [1] 0.95
my_ttest$p.value # Show p-value
# [1] 0.9738111
tidy(my_ttest) # Show tidy results
# # A tibble: 1 x 10
# estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high method alternative
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
# 1 -0.0256 23.4 23.5 -0.0329 0.974 232. -1.56 1.51 Welch Two Sam… two.sided
# ANOVA -------------------------------------------------------------------
# Q: Is there a difference in highway miles per gallon between car manufacturers?
my_aov <- aov(formula = hwy ~ manufacturer, # DV ~ IV
data = mpg) # The data
my_aov # Print result
# Call:
# aov(formula = hwy ~ manufacturer, data = mpg)
#
# Terms:
# manufacturer Residuals
# Sum of Squares 4459.858 3801.805
# Deg. of Freedom 14 219
#
# Residual standard error: 4.166514
# Estimated effects may be unbalanced
summary(my_aov) # Summary results
# Analysis of Variance Table
#
# Response: hwy
# Df Sum Sq Mean Sq F value Pr(>F)
# manufacturer 14 4459.9 318.56 18.351 < 2.2e-16 ***
# Residuals 219 3801.8 17.36
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
names(my_aov) # Show all named elements
# [1] "coefficients" "residuals" "effects" "rank" "fitted.values" "assign"
# [7] "qr" "df.residual" "contrasts" "xlevels" "call" "terms"
# [13] "model"
my_ttest$coefficients # Show coefficients
# (Intercept) manufacturerchevrolet manufacturerdodge manufacturerford
# 26.44444444 -4.54970760 -8.49849850 -7.08444444
# manufacturerhonda manufacturerhyundai manufacturerjeep manufacturerland rover
# 6.11111111 0.41269841 -8.81944444 -9.94444444
# manufacturerlincoln manufacturermercury manufacturernissan manufacturerpontiac
# -9.44444444 -8.44444444 -1.82905983 -0.04444444
# manufacturersubaru manufacturertoyota manufacturervolkswagen
# -0.87301587 -1.53267974 2.77777778
tidy(my_aov) # Show tidy results
# term df sumsq meansq statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 manufacturer 14 4460. 319. 18.4 9.22e-30
# 2 Residuals 219 3802. 17.4 NA NA
```

File | Rows | Columns |
---|---|---|

trial_ACTG175.csv | 2139 | 27 |

The `trial_ACTG175.csv`

data from the ACTG 175 clinical trial. ACTG 175 was a randomized clinical trial to compare monotherapy with zidovudine or didanosine with combination therapy with zidovudine and didanosine or zidovudine and zalcitabine in adults infected with the human immunodeficiency virus type I whose CD4 T cell counts were between 200 and 500 per cubic millimeter.

The trial is documented in Hammer SM, et al. (1996), “A trial comparing nucleoside monotherapy with combination therapy in HIV-infected adults with CD4 cell counts from 200 to 500 per cubic millimeter.”, New England Journal of Medicine, 335:1081–1090. The full publication can be found at https://www.nejm.org/doi/full/10.1056/NEJM199610103351501

The data were originally obtained from the `speff2trial`

package.

Name | Description |
---|---|

patient_id | patient’s ID number |

molecule | treatment arm (0=zidovudine, 1=zidovudine and didanosine, 2=zidovudine and zalcitabine, 3=didanosine). |

days | number of days until the first occurrence of: (i) a decline in CD4 T cell count of at least 50 (ii) an event indicating progression to AIDS, or (iii) death. |

age | age in years at baseline |

weight_kg | weight in kg at baseline |

drug_history | history of intravenous drug use (0=no, 1=yes) |

race | race (0=white, 1=non-white) |

gender | gender (0=female, 1=male) |

symptomatic | symptomatic indicator (0=asymptomatic, 1=symptomatic) |

cd4_baseline | CD4 T cell count at baseline |

cd4_96_weeks | CD4 T cell count at 96±5 weeks (=NA if missing) |

cd8_baseline | CD8 T cell count at baseline |

cd8_20_weeks | CD8 T cell count at 20±5 weeks |

Package | Installation |
---|---|

`tidyverse` |
`install.packages("tidyverse")` |

`broom` |
`install.packages("broom")` |

`rsq` |
`install.packages("rsq")` |

Function | Package | Description |
---|---|---|

`cor.test()` |
`stats` |
Correlation test |

`t.test()` |
`stats` |
Student’s T-test |

`aov()` |
`stats` |
Analysis of Variance |

`chisq.test()` |
`stats` |
Chi-square test |

`glm()` |
`stats` |
Generalized linear model |

`tidy()` |
`broom` |
Create ‘tidy’ results from statistical objects |

`rsq()` |
`rsq` |
Obtain and print R-squared values from a regression object |

- Many of these examples come from Jonas Kristoffer Lindelov’s great blog post Common statistical tests are linear models, or: how to teach stats
- Discovering Statistics with R by Field et al is an excellent resource
- YaRrr! The Pirate’s Guide to R has good chapters on statistics with R.