Source: https://cdn.lynda.com/course/495322/495322-636154038826424503-16x9.jpg

Slides

Overview

In this practical you’ll do basic statistics in R. By the end of this practical you will know how to:

  1. Calculate basic descriptive statistics using mean(), median(), table().
  2. Conduct one- and two-sample hypothesis tests with t.test(), cor.test(), and chisq.test()
  3. Calculate regression analyses with glm()
  4. Explore statistical objects with names(), summary(), print(), predict()
  5. Use sampling functions such as rnorm() to conduct simulations

Glossary and Packages

Here are the main descriptive statistics functions we will be covering.

Function Description
table() Frequency table
mean(), median() Measures of central tendency (note: R's mode() returns an object's storage mode, not the statistical mode)
sd(), range(), var() Measures of variability
max(), min() Extreme values
summary() Several summary statistics
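As a quick sketch of how these behave (using a small made-up vector, not course data):

```r
# A small toy vector of hypothetical ages
ages <- c(23, 31, 31, 40, 52)

mean(ages)      # 35.4  - arithmetic mean
median(ages)    # 31    - middle value
range(ages)     # 23 52 - minimum and maximum
table(ages)     # frequency of each distinct value
summary(ages)   # several summary statistics at once
```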

Here are the main statistical functions we will be covering.

Function Hypothesis Test Additional Help
t.test() One and two sample t-test https://bookdown.org/ndphillips/YaRrr/htests.html#t-test-t.test
cor.test() Correlation test https://bookdown.org/ndphillips/YaRrr/htests.html#correlation-cor.test
chisq.test() Chi-Square test https://bookdown.org/ndphillips/YaRrr/htests.html#chi-square-chsq.test
glm() Regression https://bookdown.org/ndphillips/YaRrr/regression.html

Here are the random sampling functions we will use.

Function Description Additional Help
sample() Draw a random sample of values from a vector ?sample
rnorm() Draw random values from a Normal distribution ?rnorm
runif() Draw random values from a Uniform distribution ?runif
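sample() does not appear in the Examples below, so here is a minimal sketch of how it works:

```r
set.seed(100)   # make the random draws below reproducible

# Draw 5 values from 1:10 without replacement (the default)
sample(x = 1:10, size = 5)

# Simulate 8 coin flips by sampling WITH replacement
sample(x = c("heads", "tails"), size = 8, replace = TRUE)
```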

Examples

# -----------------------------------------------
# Examples of hypothesis tests on the diamonds
# ------------------------------------------------
library(tidyverse)

# First few rows of the diamonds data

diamonds

# -----
# Descriptive statistics
# -----

mean(diamonds$carat)   # What is the mean carat?
median(diamonds$price)   # What is the median price?
max(diamonds$depth)    # What is the maximum depth?
table(diamonds$color)    # How many observations for each color?

# -----
# 1-sample hypothesis test
# -----

# Q: Is the mean carat of diamonds different from .50?

htest_A <- t.test(x = diamonds$carat,     # The data
                  alternative = "two.sided",  # Two-sided test
                  mu = 0.5)                   # The null hypothesis

htest_A            # Print result
names(htest_A)     # See all attributes in object
htest_A$statistic  # Get just the test statistic
htest_A$p.value    # Get the p-value
htest_A$conf.int   # Get a confidence interval

# -----
# 2-sample hypothesis test
# -----

# Q: Is there a difference in the carats of color = E and color = I diamonds?

htest_B <- t.test(formula = carat ~ color,     # DV ~ IV
                   alternative = "two.sided",  # Two-sided test
                   data = diamonds,         # The data
                   subset = color %in% c("E", "I")) # Compare only colors E and I

htest_B  # Print result

# -----
# Correlation test
# ------

# Q: Is there a correlation between carat and price?

htest_C <- cor.test(formula = ~ carat + price,
                    data = diamonds)

htest_C

# A: Yes. r = 0.92, t(53938) = 551.51, p < .001

# -----
# Regression
# ------

# Q: Create regression equation predicting price by carat, depth, table, and x

price_glm <- glm(formula = price ~ carat + depth + table + x,
                  data = diamonds)

# Print coefficients
price_glm

# Q: Predict the price of the following 3 diamonds:

diamonds_new <- tibble(carat = c(.2, .6, .5),
                       depth = c(50, 46, 90),
                       table = c(40, 65, 70),
                       x = c(3.7, 4.2, 4.3))

predict(price_glm, 
        newdata = diamonds_new)

#       1         2         3 
# 4190.453  6088.398 -4471.817 


# -----
# Simulation
# ------

# 100 random samples from a normal distribution with mean = 0, sd = 1
samp_A <- rnorm(n = 100, mean = 0, sd = 1)

# 100 random samples from a Uniform distribution with bounds at 0, 10
samp_B <- runif(n = 100, min = 0, max = 10)

# Calculate descriptives
mean(samp_A)
sd(samp_A)

mean(samp_B)
sd(samp_B)

Tasks

Data

You’ll use one dataset in this practical: ACTG175.csv. It is contained in the data_BaselRBootcamp_Day2.zip file available through the main course page. If you haven’t already, download data_BaselRBootcamp_Day2.zip and unzip it to get the ACTG175.csv file.

Getting setup

A. Open your BaselRBootcamp project. This project should have the folders R and data in the working directory. The ACTG175.csv file should already be in the data folder. (If it isn’t, you’ll need to get it from the data_BaselRBootcamp_Day2.zip folder and move it to the data folder).

# Done!

B. Open a new R script and save it in the R folder as a new file called statistics_practical.R. At the top of the script, using comments, write your name and the date. Then, load the tidyverse and speff2trial packages with library(). Here’s how the top of your script should look:

## NAME
## DATE
## Statistics Practical

library(tidyverse)     # For tidyverse
library(speff2trial)   # For the ACTG175 data documentation

C. For this practical we’ll use the ACTG175 data, the result of a randomized clinical trial comparing the effects of different medications on adults infected with the human immunodeficiency virus. Using the following template, load the data into R and store it as a new object called trial_act.

# Load ACTG175.csv from the data folder 
# The ACTG175.csv file MUST be in a folder called data in your working directory!!

trial_act <- read_csv(file = "data/ACTG175.csv")
trial_act <- as_tibble(trial_act)

D. First things first: take a look at the data by printing it. It should look like this:

# Print the first rows of trial_act
head(trial_act)
# A tibble: 6 x 27
  pidnum   age  wtkg  hemo  homo drugs karnof oprior   z30 zprior preanti
   <int> <int> <dbl> <int> <int> <int>  <int>  <int> <int>  <int>   <int>
1  10056    48  89.8     0     0     0    100      0     0      1       0
2  10059    61  49.4     0     0     0     90      0     1      1     895
3  10089    45  88.5     0     1     1     90      0     1      1     707
4  10093    47  85.3     0     1     0    100      0     1      1    1399
5  10124    43  66.7     0     1     0    100      0     1      1    1352
6  10140    46  88.9     0     1     1    100      0     1      1    1181
# ... with 16 more variables: race <int>, gender <int>, str2 <int>,
#   strat <int>, symptom <int>, treat <int>, offtrt <int>, cd40 <int>,
#   cd420 <int>, cd496 <int>, r <int>, cd80 <int>, cd820 <int>,
#   cens <int>, days <int>, arms <int>

E. Before we can analyse the data, we need to make one important change: we need to tell R that the arms column is a factor, not an integer. Run the following code to make this change.

# trial_act data cleaning

trial_act <- trial_act %>%
  mutate(
    
    # Convert arms to a factor
    
    arms = factor(arms, 
                  levels = c(0, 1, 2, 3),
                  labels = c("zid", "zid_did", "zid_zal", "did"))
  )

F. (Optional): If you’d like to, try converting other columns such as race and gender to factors with appropriate labels. Look at the help menu for ACTG175 by running ?ACTG175 to see how these columns are coded.
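One way the conversion could look. The labels below are assumptions based on the coding described in ?ACTG175 (race: 0 = white, 1 = non-white; gender: 0 = female, 1 = male), and the example uses a small toy tibble so it runs on its own:

```r
library(dplyr)
library(tibble)

# Toy stand-in for trial_act with hypothetical values
toy <- tibble(race   = c(0, 1, 0),
              gender = c(1, 0, 1))

# Same mutate() + factor() pattern as the arms conversion above;
# labels are assumed from the ?ACTG175 documentation
toy <- toy %>%
  mutate(race   = factor(race,   levels = c(0, 1),
                         labels = c("white", "non-white")),
         gender = factor(gender, levels = c(0, 1),
                         labels = c("female", "male")))

toy
```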

Descriptive statistics

  1. What was the mean age (age) of all patients?
mean(trial_act$age)
[1] 35.24825

  2. What was the median weight (wtkg) of all patients?

median(trial_act$wtkg)
  3. What was the mean CD4 T cell count at baseline (cd40)? What was it at 20 weeks (cd420)?
mean(trial_act$cd40)
[1] 350.5012
mean(trial_act$cd420)
[1] 371.3072
  4. How many patients have a history of intravenous drug use (drugs) and how many do not? (Hint: use table())
table(trial_act$drugs)

   0    1 
1858  281 

T tests with t.test()

  1. Conduct a one-sample t-test comparing the age (age) of the patients versus a null hypothesis of 40 years.
t.test(x = trial_act$age, 
       mu = 40)

    One Sample t-test

data:  trial_act$age
t = -25.234, df = 2138, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 40
95 percent confidence interval:
 34.87896 35.61753
sample estimates:
mean of x 
 35.24825 
  2. Now, compare the mean age to a null hypothesis of 35 years. What has changed?
t.test(x = trial_act$age,
       mu = 35)

    One Sample t-test

data:  trial_act$age
t = 1.3183, df = 2138, p-value = 0.1875
alternative hypothesis: true mean is not equal to 35
95 percent confidence interval:
 34.87896 35.61753
sample estimates:
mean of x 
 35.24825 
  3. A researcher wants to make sure that men and women in the clinical study are similar in terms of age. Conduct a two-sample t-test comparing the age of men versus women to test if they are indeed similar or not.
  • Women are coded as 0 in gender, and men are coded as 1.
  • Be sure to use the formula notation formula = age ~ gender
age_gender_ht <- t.test(formula = age ~ gender,
                           data = trial_act,
                           alternative = "two.sided")

age_gender_ht

    Welch Two Sample t-test

data:  age by gender
t = -2.3524, df = 554.82, p-value = 0.019
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.0618384 -0.1854089
sample estimates:
mean in group 0 mean in group 1 
       34.31793        35.44156 
  4. Do people with a history of intravenous drug use (drugs) have a different number of days until a major negative event (days) than those who do not? Answer this with a two-sample t-test.
t.test(formula = days ~ drugs,
       data = trial_act)

    Welch Two Sample t-test

data:  days by drugs
t = 1.0058, df = 368.61, p-value = 0.3152
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -18.05634  55.86764
sample estimates:
mean in group 0 mean in group 1 
       881.5818        862.6762 

Correlation test with cor.test()

  1. Do heavier people tend to have fewer days until a major negative event? Conduct a correlation test between weight (wtkg) and days until a major negative event (days). What is your conclusion?
cor.test(formula = ~ days + wtkg,
         data = trial_act)

    Pearson's product-moment correlation

data:  days and wtkg
t = 0.42646, df = 2137, p-value = 0.6698
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.03317080  0.05158712
sample estimates:
        cor 
0.009224728 
  2. We would expect a correlation between CD4 T cell count at baseline (cd40) and at 20 weeks (cd420). But how strong is the correlation? Answer this question by conducting a correlation test between CD4 T cell count at baseline (cd40) and CD4 T cell count at 20 weeks (cd420).
cor.test(formula = ~ cd40 + cd420,
         data = trial_act)

    Pearson's product-moment correlation

data:  cd40 and cd420
t = 33.221, df = 2137, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5549209 0.6108523
sample estimates:
      cor 
0.5835783 
  3. Only considering women, is there a correlation between CD4 T cell count at baseline (cd40) and CD8 T cell count at baseline (cd80)? (Hint: Include the argument subset = gender == 0 to restrict the analysis to women)
cor.test(formula = ~ cd40 + cd80,
         data = trial_act,
         subset = gender == 0)

    Pearson's product-moment correlation

data:  cd40 and cd80
t = 5.3518, df = 366, p-value = 1.54e-07
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1719038 0.3616702
sample estimates:
      cor 
0.2694002 
  4. Now, repeat the previous test, but only for men (gender == 1). How do the results compare?
cor.test(formula = ~ cd40 + cd80,
         data = trial_act,
         subset = gender == 1)

    Pearson's product-moment correlation

data:  cd40 and cd80
t = 8.8997, df = 1769, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1619967 0.2511713
sample estimates:
      cor 
0.2070139 

Chi-square test with chisq.test()

  1. Do men and women (gender) have different distributions of race (race)? That is, does the percentage of women who are white differ from the percentage of men who are white? Answer this with a chi-square test.
  • Be sure to create a table of gender and race values with table(trial_act$gender, trial_act$race)
chisq.test(table(trial_act$gender, 
                 trial_act$race))

    Pearson's Chi-squared test with Yates' continuity correction

data:  table(trial_act$gender, trial_act$race)
X-squared = 180.86, df = 1, p-value < 2.2e-16
  2. Is there a relationship between a history of intravenous drug use (drugs) and hemophilia (hemo)? Answer this by conducting a chi-square test.
chisq.test(table(trial_act$hemo, 
                 trial_act$drugs))

    Pearson's Chi-squared test with Yates' continuity correction

data:  table(trial_act$hemo, trial_act$drugs)
X-squared = 17.505, df = 1, p-value = 2.866e-05

Linear Regression with glm()

  1. Which demographic variables predict days until the first occurrence of a major negative event (days)? To answer this, conduct a linear regression analysis predicting days as a function of age (age), weight (wtkg), race (race) and gender (gender). Call the object days_A_glm.
days_A_glm <- glm(formula = days ~ age + wtkg + race + gender,
                     data = trial_act)
  2. Print the days_A_glm object to see the coefficients.
days_A_glm

Call:  glm(formula = days ~ age + wtkg + race + gender, data = trial_act)

Coefficients:
(Intercept)          age         wtkg         race       gender  
  856.99143      0.72094      0.02547    -30.59783      4.35692  

Degrees of Freedom: 2138 Total (i.e. Null);  2134 Residual
Null Deviance:      182600000 
Residual Deviance: 182100000    AIC: 30360
  3. Using the summary() function, explore the days_A_glm object to see which coefficients are significant.
summary(days_A_glm)

Call:
glm(formula = days ~ age + wtkg + race + gender, data = trial_act)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-870.6  -160.6   120.7   211.6   370.8  

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 856.99143   43.60167  19.655   <2e-16 ***
age           0.72094    0.73470   0.981   0.3266    
wtkg          0.02547    0.49450   0.052   0.9589    
race        -30.59783   14.63215  -2.091   0.0366 *  
gender        4.35692   17.96571   0.243   0.8084    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 85316.48)

    Null deviance: 182637111  on 2138  degrees of freedom
Residual deviance: 182065366  on 2134  degrees of freedom
AIC: 30364

Number of Fisher Scoring iterations: 2
  4. Using the names() function, look at what the named elements in the days_A_glm object are.
names(days_A_glm)
 [1] "coefficients"      "residuals"         "fitted.values"    
 [4] "effects"           "R"                 "rank"             
 [7] "qr"                "family"            "linear.predictors"
[10] "deviance"          "aic"               "null.deviance"    
[13] "iter"              "weights"           "prior.weights"    
[16] "df.residual"       "df.null"           "y"                
[19] "converged"         "boundary"          "model"            
[22] "call"              "formula"           "terms"            
[25] "data"              "offset"            "control"          
[28] "method"            "contrasts"         "xlevels"          
  5. Using the days_A_glm$ notation, print a vector of the model coefficients.
days_A_glm$coefficients
 (Intercept)          age         wtkg         race       gender 
856.99143293   0.72093510   0.02547379 -30.59783308   4.35691955 
  6. Look at the $data, $fitted.values and $residuals elements of the days_A_glm object. What do you think these are? (Bonus: can you calculate the mean squared residual?)
# Mean squared residual
mean_squared_residual <- mean(days_A_glm$residuals ^ 2)
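If it isn't obvious what $fitted.values and $residuals contain, a useful sanity check (sketched here on the built-in mtcars data rather than trial_act) is that, for a gaussian glm, each fitted value plus its residual reconstructs the observed dependent variable:

```r
# Fit a simple gaussian glm on built-in data
m <- glm(formula = mpg ~ wt, data = mtcars)

# Each observed value equals its fitted value plus its residual
all.equal(unname(m$fitted.values + m$residuals), mtcars$mpg)
```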
  7. Now create a new regression object days_B_glm where you model days as a function of the treatment arm (arms). Explore the object. Do you find that there was an effect of the treatment arm on days?
days_B_glm <- glm(formula = days ~ arms,
                  data = trial_act)


summary(days_B_glm)

Call:
glm(formula = days ~ arms, data = trial_act)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-879.5  -158.7   106.2   203.6   429.8  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   801.24      12.53  63.967  < 2e-16 ***
armszid_did   115.00      17.80   6.461 1.28e-10 ***
armszid_zal   104.52      17.78   5.878 4.81e-09 ***
armsdid        92.24      17.48   5.276 1.45e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 83467.7)

    Null deviance: 182637111  on 2138  degrees of freedom
Residual deviance: 178203547  on 2135  degrees of freedom
AIC: 30316

Number of Fisher Scoring iterations: 2
  8. Does the effect of treatment arm on days hold if you only consider men (gender == 1) above the age of 30 (age > 30)? To answer this, repeat your previous regression, but include the appropriate condition in the subset argument.
days_B_glm <- glm(formula = days ~ arms,
                  data = trial_act,
                  subset = gender == 1 & age > 30)

summary(days_B_glm)
  9. Now create a regression object days_C_glm that predicts days based on all variables in the dataset. Explore the object. What do you find?
days_C_glm <- glm(formula = days ~ .,
                  data = trial_act)


summary(days_C_glm)

Call:
glm(formula = days ~ ., data = trial_act)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-670.09   -55.52     9.28    66.64   427.23  

Coefficients: (3 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.116e+02  7.310e+01  12.472  < 2e-16 ***
pidnum      -4.446e-05  2.060e-05  -2.158  0.03110 *  
age          1.105e-01  4.508e-01   0.245  0.80642    
wtkg         4.832e-01  3.160e-01   1.529  0.12641    
hemo         2.766e+01  2.052e+01   1.348  0.17797    
homo        -2.072e+00  1.340e+01  -0.155  0.87710    
drugs       -1.354e+01  1.217e+01  -1.113  0.26601    
karnof      -4.481e-01  6.815e-01  -0.658  0.51093    
oprior      -1.080e+00  2.703e+01  -0.040  0.96813    
z30          2.414e+01  1.915e+01   1.261  0.20768    
zprior              NA         NA      NA       NA    
preanti     -2.098e-02  1.628e-02  -1.289  0.19777    
race        -7.649e+00  9.218e+00  -0.830  0.40683    
gender       1.120e+01  1.530e+01   0.732  0.46434    
str2        -3.384e+01  2.622e+01  -1.291  0.19703    
strat        3.955e+01  1.542e+01   2.565  0.01042 *  
symptom      8.297e+00  1.003e+01   0.827  0.40850    
treat        1.781e+01  1.057e+01   1.684  0.09233 .  
offtrt      -8.354e+01  9.613e+00  -8.690  < 2e-16 ***
cd40        -6.233e-03  4.526e-02  -0.138  0.89049    
cd420        5.820e-02  4.160e-02   1.399  0.16212    
cd496        1.764e-01  3.436e-02   5.134 3.26e-07 ***
r                   NA         NA      NA       NA    
cd80        -3.946e-02  1.333e-02  -2.961  0.00312 ** 
cd820        3.618e-02  1.417e-02   2.553  0.01080 *  
cens        -3.799e+02  1.051e+01 -36.130  < 2e-16 ***
armszid_did  6.232e+00  1.041e+01   0.599  0.54958    
armszid_zal  1.540e+01  1.035e+01   1.489  0.13676    
armsdid             NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 18020.86)

    Null deviance: 74984714  on 1341  degrees of freedom
Residual deviance: 23715452  on 1316  degrees of freedom
  (797 observations deleted due to missingness)
AIC: 16987

Number of Fisher Scoring iterations: 2
  10. Create a fictional dataset of 10 new patients called trial_new by randomly selecting 10 patients from the original dataset. You can do this using the following code. After you run it, print the trial_new object to see how it looks.
# Draw 10 patients at random from trial_act
#  and store as trial_new

trial_new <- trial_act %>%
  sample_n(10)

trial_new
# A tibble: 10 x 27
   pidnum   age  wtkg  hemo  homo drugs karnof oprior   z30 zprior preanti
    <int> <int> <dbl> <int> <int> <int>  <int>  <int> <int>  <int>   <int>
 1 250464    47  88.0     0     1     0     80      0     1      1     322
 2 231071    34 101.      0     1     0    100      0     0      1       0
 3  50236    31  83.0     0     1     0    100      0     0      1       0
 4 110581    37  47.8     1     0     0    100      1     0      1     391
 5  90109    45  54.0     0     0     0     90      0     1      1     479
 6 320391    31  56.3     0     1     1     90      0     1      1     510
 7 220492    37  78.5     0     1     0    100      0     0      1       0
 8 540021    43  79.8     0     0     1     90      0     0      1       0
 9 110697    31  90.4     0     1     0    100      0     1      1     976
10 300507    43  97.5     0     0     0    100      0     0      1       0
# ... with 16 more variables: race <int>, gender <int>, str2 <int>,
#   strat <int>, symptom <int>, treat <int>, offtrt <int>, cd40 <int>,
#   cd420 <int>, cd496 <int>, r <int>, cd80 <int>, cd820 <int>,
#   cens <int>, days <int>, arms <fct>
  11. Now, using days_C_glm, predict the number of days until a major negative event for these 10 randomly selected patients using the following template:
# Predict days until a major negative event from the trial_new data
#   using the days_C_glm object

predict(object = ___, 
        newdata = ___)
predict(days_C_glm, 
        newdata = trial_new)

Logistic regression

  1. You can calculate a logistic regression, where the dependent variable is binary (0 or 1), by including the family = 'binomial' argument to glm(). Using the template below, calculate a logistic regression predicting whether or not someone has taken intravenous drugs (drugs) based on their age (age), gender (gender) and race (race). Then, explore the object to see if any variables reliably predict drug use.
# Logistic regression predicting intraveneous drug use
#  based on age, gender, and race

drugs_glm_binom <- glm(formula = ___,
                       data = ___,
                       family = "___")
# Logistic regression predicting intraveneous drug use
#  based on age, gender, and race

drugs_glm_binom <- glm(formula = drugs ~ age + gender + race,
                 data = trial_act,
                 family = "binomial")

summary(drugs_glm_binom)

Call:
glm(formula = drugs ~ age + gender + race, family = "binomial", 
    data = trial_act)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0597  -0.5323  -0.4677  -0.4113   2.3384  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.41495    0.30425  -7.937 2.07e-15 ***
age          0.02994    0.00721   4.152 3.29e-05 ***
gender      -0.85074    0.15387  -5.529 3.22e-08 ***
race         0.33542    0.14367   2.335   0.0196 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1664.1  on 2138  degrees of freedom
Residual deviance: 1605.9  on 2135  degrees of freedom
AIC: 1613.9

Number of Fisher Scoring iterations: 4
  2. A scientist wants to know whether there was an effect of treatment arm on the probability that patients had more than 1000 days before the first occurrence of a major negative event. Answer this question using a logistic regression. (Hint: First create a new column called days_1000_bin, a binary variable indicating whether or not days > 1000. You can do this using mutate())
days_glm_binom <- glm(formula = days_1000_bin ~ arms,
                       data = trial_act %>%
                                mutate(days_1000_bin = days > 1000),
                        family = "binomial")

summary(days_glm_binom)

Call:
glm(formula = days_1000_bin ~ arms, family = "binomial", data = trial_act %>% 
    mutate(days_1000_bin = days > 1000))

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.236  -1.173  -1.021   1.126   1.343  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.38046    0.08828  -4.310 1.64e-05 ***
armszid_did  0.50322    0.12444   4.044 5.26e-05 ***
armszid_zal  0.51809    0.12435   4.166 3.10e-05 ***
armsdid      0.36977    0.12217   3.027  0.00247 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2964.7  on 2138  degrees of freedom
Residual deviance: 2941.9  on 2135  degrees of freedom
AIC: 2949.9

Number of Fisher Scoring iterations: 4

Generating random samples from distributions

  1. You can easily generate random samples from statistical distributions in R. To see all of them, run ?distributions. For example, to generate samples from the well known Normal distribution, you can use rnorm(). Look at the help menu for rnorm() to see its arguments.
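For instance, besides rnorm(), the list under ?distributions includes samplers such as rbinom() and rpois(); a quick sketch:

```r
rbinom(n = 5, size = 10, prob = 0.5)  # number of heads in 10 coin flips, repeated 5 times
rpois(n = 5, lambda = 3)              # 5 counts from a Poisson distribution with mean 3
runif(n = 5, min = 0, max = 1)        # 5 values from a Uniform distribution on [0, 1]
```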

  2. Using rnorm(), create a new object samp_10 which is 10 samples from a Normal distribution with mean 10 and standard deviation 5. Print the object to see what the elements look like. What should the mean and standard deviation of this sample be? Test it by evaluating its mean and standard deviation directly using the appropriate functions. Then, do a one-sample t-test on this sample against the null hypothesis that the true mean is 12. What are the results?

# Note that your results might be different due to random chance!

samp_10 <- rnorm(n = 10, mean = 10, sd = 5)

samp_10
 [1]  6.452053 14.623421 16.391728 11.752879  4.435956  7.402795  8.280792
 [8] 23.155857  6.655332  5.163521
mean(samp_10)
[1] 10.43143
sd(samp_10)
[1] 6.003185
t.test(x = samp_10, mu = 12)

    One Sample t-test

data:  samp_10
t = -0.82627, df = 9, p-value = 0.43
alternative hypothesis: true mean is not equal to 12
95 percent confidence interval:
  6.137014 14.725853
sample estimates:
mean of x 
 10.43143 
  3. Look at the code below. Before you run it, what do you expect the value of the coefficients of the regression equation will be? Test your prediction by running the code and exploring the my_glm object.
# Generate independent variables
x1 <- rnorm(n = 100, mean = 10, sd = 1)
x2 <- rnorm(n = 100, mean = 20, sd = 10)
x3 <- rnorm(n = 100, mean = -5, sd = 5)

# Generate noise
noise <- rnorm(n = 100, mean = 0, sd = 1)

# Create dependent variable
y <- 3 * x1 + 2 * x2 - 5 * x3 + 100 + noise

# Combine all into a tibble
my_data <- tibble(x1, x2, x3, y)

# Calculate glm
my_glm <- glm(formula = y ~ .,
              data = my_data)

summary(my_glm)

Call:
glm(formula = y ~ ., data = my_data)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.42475  -0.77862  -0.07247   0.69691   2.35518  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 99.46598    1.15244   86.31   <2e-16 ***
x1           3.05026    0.11391   26.78   <2e-16 ***
x2           1.99367    0.01071  186.18   <2e-16 ***
x3          -5.03939    0.02343 -215.09   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 1.187054)

    Null deviance: 84864.07  on 99  degrees of freedom
Residual deviance:   113.96  on 96  degrees of freedom
AIC: 306.85

Number of Fisher Scoring iterations: 2
  4. Adjust the code above so that the coefficients of the regression equation will be (Intercept) = -50, x1 = -3, x2 = 10, x3 = 15.
# Generate independent variables
x1 <- rnorm(n = 100, mean = 10, sd = 1)
x2 <- rnorm(n = 100, mean = 20, sd = 10)
x3 <- rnorm(n = 100, mean = -5, sd = 5)

# Generate noise
noise <- rnorm(n = 100, mean = 0, sd = 1)

# Create dependent variable
y <- -3 * x1 + 10 * x2 - 15 * x3 + -50 + noise

# Combine all into a tibble
my_data <- tibble(x1, x2, x3, y)

# Calculate glm
my_glm <- glm(formula = y ~ .,
              data = my_data)

summary(my_glm)

Call:
glm(formula = y ~ ., data = my_data)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.55143  -0.73330   0.04009   0.69817   3.02294  

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -49.44544    1.20587  -41.00   <2e-16 ***
x1           -3.09004    0.11184  -27.63   <2e-16 ***
x2           10.00571    0.01206  829.64   <2e-16 ***
x3          -15.03252    0.02289 -656.80   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 1.250026)

    Null deviance: 1255498  on 99  degrees of freedom
Residual deviance:     120  on 96  degrees of freedom
AIC: 312.02

Number of Fisher Scoring iterations: 2

Additional reading