Source: https://cdn.lynda.com/course/495322/495322-636154038826424503-16x9.jpg

Source: https://cdn.lynda.com/course/495322/495322-636154038826424503-16x9.jpg

Slides

Overview

In this practical you’ll do basic statistics in R. By the end of this practical you will know how to:

  1. Calculate basic descriptive statistics using mean(), median(), table().
  2. Conduct 1 and 2 sample hypothesis tests with t.test(), cor.test() and chisq.test()
  3. Calculate regression analyses with glm()
  4. Explore statistical objects with names(), summary(), print(), predict()
  5. Use sampling functions such as rnorm() to conduct simulations

Glossary and Packages

Here are the main descriptive statistics functions we will be covering.

Function Description
table() Frequency table
mean(), median(), mode() Measures of central tendency
sd(), range(), var() Measures of variability
max(), min() Extreme values
summary() Several summary statistics

Here are the main statistical functions we will be covering.

Function Hypothesis Test Additional Help
t.test() One and two sample t-test https://bookdown.org/ndphillips/YaRrr/htests.html#t-test-t.test
cor.test() Correlation test https://bookdown.org/ndphillips/YaRrr/htests.html#correlation-cor.test
chisq.test() Chi-Square test https://bookdown.org/ndphillips/YaRrr/htests.html#chi-square-chsq.test
glm() Regression https://bookdown.org/ndphillips/YaRrr/regression.html

Here are the random sampling functions we will use

Function Description Additional Help
sample() Draw a random sample of values from a vector ?sample
rnorm() Draw random values from a Normal distribution ?rnorm()
runif Draw random values from a Uniform distribution ?runif()

Examples

# -----------------------------------------------
# Examples of hypothesis tests on the diamonds
# ------------------------------------------------
library(tidyverse)

# First few rows of the diamonds data

diamonds

# -----
# Descriptive statistics
# -----

mean(diamonds$carat)   # What is the mean carat?
median(diamonds$price)   # What is the median price?
max(diamonds$depth)    # What is the maximum depth?
table(diamonds$color)    # How many observations for each color?

# -----
# 1-sample hypothesis test
# -----

# Q: Is the mean carat of diamonds different from .50?

htest_A <- t.test(x = diamonds$carat,     # The data
                  alternative = "two.sided",  # Two-sided test
                  mu = 0.5)                   # The null hyopthesis

htest_A            # Print result
names(htest_A)     # See all attributes in object
htest_A$statistic  # Get just the test statistic
htest_A$p.value    # Get the p-value
htest_A$conf.int   # Get a confidence interval

# -----
# 2-sample hypothesis test
# -----

# Q: Is there a difference in the carats of color = E and color = I diamonds?

htest_B <- t.test(formula = carat ~ color,     # DV ~ IV
                   alternative = "two.sided",  # Two-sided test
                   data = diamonds,         # The data
                   subset = color %in% c("E", "I")) # Compare Diet 1 and Diet 2

htest_B  # Print result

# -----
# Correlation test
# ------

# Q: Is there a correlation between carat and price?

htest_C <- cor.test(formula = ~ carat + price,
                    data = diamonds)

htest_C

# A: Yes. r = 0.92, t(53938) = 551.51, p < .001

# -----
# Regression
# ------

# Q: Create regression equation predicting price by carat, depth, table, and x

price_glm <- glm(formula = price ~ carat + depth + table + x,
                  data = diamonds)

# Print coefficients
price_glm

# Q: Predict the price of the following 3 diamonds:

diamonds_new <- tibble(carat = c(.2, .6, .5),
                       depth = c(50, 46, 90),
                       table = c(40, 65, 70),
                       x = c(3.7, 4.2, 4.3))

predict(price_glm, 
        newdata = diamonds_new)

#       1         2         3 
# 4190.453  6088.398 -4471.817 


# -----
# Simulation
# ------

# 100 random samples from a normal distribution with mean = 0, sd = 1
samp_A <- rnorm(n = 100, mean = 0, sd = 1)

# 100 random samples from a Uniform distribution with bounds at 0, 10
samp_B <- runif(n = 100, min = 0, max = 10)

# Calculate descriptives
mean(samp_A)
sd(samp_A)

mean(samp_B)
sd(samp_B)

Tasks

Data

You’ll use one dataset in this practical: ACTG175.csv. It is available in the data_BaselRBootcamp_Day2.zip file available through the main course page. If you haven’t already, download the data_BaselRBootcamp_Day2.zip folder and unzip it to get the ACTG175.csv file.

Getting setup

A. Open your BaselRBootcamp project. This project should have the folders R and data in the working directory. The ACTG175.csv file should already be in the data folder. (If it isn’t, you’ll need to get it from the data_BaselRBootcamp_Day2.zip folder and move it to the data folder).

A. Open a new R script and save it in the R folder you just created as a new file called statistics_practical.R. At the top of the script, using comments, write your name and the date. The, load the set of tidyverse and speff2trial packages with library(). Here’s how the top of your script should look:

## NAME
## DATE
## Statistics Practical

library(tidyverse)     # For tidyverse
library(speff2trial)   # For the ACTG175 data documentation

B. For this practical, we’ll use the ACTG175 data, this is the result of a randomized clinical trial comparing the effects of different medications on adults infected with the human immunodeficiency virus. Using the following template, load the data into R and store it as a new object called trial_act.

# Load ACTG175.csv from the data folder 
# The ACTG175.csv file MUST be in a folder called data in your working directory!!

trial_act <- read_csv(file = "data/ACTG175.csv")

C. First thing’s first, take a look at the data by printing it. It should look like this

# A tibble: 6 x 27
  pidnum   age  wtkg  hemo  homo drugs karnof oprior   z30 zprior preanti
   <int> <int> <dbl> <int> <int> <int>  <int>  <int> <int>  <int>   <int>
1  10056    48  89.8     0     0     0    100      0     0      1       0
2  10059    61  49.4     0     0     0     90      0     1      1     895
3  10089    45  88.5     0     1     1     90      0     1      1     707
4  10093    47  85.3     0     1     0    100      0     1      1    1399
5  10124    43  66.7     0     1     0    100      0     1      1    1352
6  10140    46  88.9     0     1     1    100      0     1      1    1181
# ... with 16 more variables: race <int>, gender <int>, str2 <int>,
#   strat <int>, symptom <int>, treat <int>, offtrt <int>, cd40 <int>,
#   cd420 <int>, cd496 <int>, r <int>, cd80 <int>, cd820 <int>,
#   cens <int>, days <int>, arms <int>

D. Before we can analyse the data, we need to make one important change. Specifically. we need to clarify that the arms column is a factor, not an integer. Run the following code to make this change.

# traial_act data cleaning

trial_act <- trial_act %>%
  mutate(
    
    # Convert arms to a factor
    
    arms = factor(arms, 
                  levels = c(0, 1, 2, 3),
                  labels = c("zid", "zid_did", "zid_zal", "did"))
  )

E. (Optional): If you’d like to, try converting other columns such as race and gender to factors with appropriate labels. Look at the help menu for the ACTG175 by running ?ACTG175 to see how these columns are coded.

Descriptive statistics

  1. What was the mean age (age) of all patients?

  2. What was the median weight (wtkg) of all patients?

  3. What was the mean CD4 T cell count at baseline (cd40) ? What was it at 20 weeks (cd420)?

  4. How many patients have a history of intravenous drug use (drugs) and how many do not? (Hint: use table())

T tests with t.test()

  1. Conduct a one-sample t-test comparing the age (age) of the patients versus a null hypothesis of 40 years.

  2. Now, compare the mean age to a null hypothesis of 35 years. What has changed?

  3. A researcher wants to make sure that men and women in the clinical study are similar in terms of age. Conduct a two-sample t-test comparing the age of men versus women to test if they are indeed similar or not.

  • Women are coded as 0 in gender, and men are coded as 1.
  • Be sure to use the formula notation formula = age ~ gender
  1. Do people with a history of intravenous drug use have different numbers of days than those who do not have a history of intravenous drug use? Answer this with a two-sample t-test.

Correlation test with cor.test()

  1. Do heavier people tend to have fewer days until a major negative event? Conduct a correlation test between weight (wtkg) and age (days). What is your conclusion?

  2. We would expect a correlation between CD4 T cell count at baseline (cd40) and at 20 weeks (cd420). But how strong is the correlation? Answer this question by conducting a correlation test between CD4 T cell count at baseline (cd40) and CD4 T cell count at 20 weeks (cd420).

  3. Only considering men, is there a correlation between CD4 T cell count at baseline (cd40)and CD8 T cell count at baseline (cd80)? (Hint: Include the argument subset = gender == 0 to restrict the analysis to men)

  4. Now, repeat the previous test, but only for women. How do the results compare?

Chi-square test with chisq.test()

  1. Do men and women (gender) have different distributions of race (race)? That is, is the percentage of women who are white differ from the percentage of men who are white? Answer this with a chi-square test.
  • Be sure to create a table of gender and race values with table(trial_act$gender, trial_act$race)
  1. Is there a relationship between a history of intravenous drug use (drugs) and hemophilia (hemo)? Answer this by conducting a chi-square test.

Linear Regression with glm()

  1. Which demographic variables predict days until the first occurrence of a major negative event (days)? To answer this, conduct a linear regression analysis predicting days as a function of age (age), weight (wtkg), race (race) and gender (gender). Call the object days_A_glm.

  2. Print the days_A_glm object to see the coefficients.

  3. Using the summary() function, explore the days_A_glm object to see which coefficients are significant.

  4. Using the names() function, look at what the named elements in the days_A_glm object are.

  5. Using the days_A_glm$ notation, print a vector of the model coefficients.

  6. Look at the $data, $fitted.values and $residuals elements of the days_A_glm object. What do you think these are? (Bonus: can you calculate the mean squared residual?)

  7. Now create a new regression object days_B_glm where you model days as a function of the treatment arm (arms). Explore the object. Do you find that there was an effect of the treatment arm on days?

  8. Does the effect of treatment arm on days hold if you only consider men (gender == 1) above the age of 30 (age > 30)? To answer this, repeat your previous regression, but include the appropriate arguments to subset

  9. Now create a regression object days_C_glm that predicts days based on all variables in the dataset. Explore the object. What do you find?

  10. Create a fictional dataset of 10 new patients called trial_new by randomly selecting 10 patients from the original dataset. You can do this using the following code. After you run it, print the trial_new object to see how it looks

# Draw 10 patients at random from trial_act
#  and store as trial_new

trial_new <- trial_act %>%
  sample_n(10)

trial_new
  1. Now, using days_C_glm, predict the number of days until a major negative event for these 10 randomly selected patients using the following template:
# Predict days until a major negative event from the trial_new data
#   using the days_C_glm object

predict(object = ___, 
        newdata = ___)

Logistic regression

  1. You can calculate a logistic regression, where the dependent variable is binary (0 or 1) by including the family = 'binomial' argument to glm(). Using the template below, calculate a logistic regression predicting whether or not someone has taken intravenous drugs based on their age (age), gender (gender) and (race). Then, explore the object to see if any variables reliably predict drug use.
# Logistic regression predicting intraveneous drug use
#  based on age, gender, and race

drugs_glm_binom <- glm(formula = ___,
                       data = ___,
                       family = "___")
  1. A scientist wants to know whether there was an effect of treatment arm on the probability that patients had more than 1000 days before the first occurrence of a major negative event. Answer this question using a logistic regression. (Hint: To do this, you should first create a new column called days_1000_bin that is a binary variable indicating whether or not days > 1000. You can do this using mutate())

Generating random samples from distributions

  1. You can easily generate random samples from statistical distributions in R. To see all of them, run ?distributions. For example, to generate samples from the well known Normal distribution, you can use rnorm(). Look at the help menu for rnorm() to see its arguments.

  2. Using rnorm(),create a new object samp_10 which is 10 samples from a Normal distribution with mean 10 and standard deviation 5. Print the object to see what the elements look like. What should the mean and standard deviation of this sample? be? Test it by evaluating its mean and standard deviation directly using the appropriate functions. Then, do a one-sample t-test on this sample against the null hypothesis that the true mean is 12. What are the results?

  3. Look at the code below. Before you run it, what do you expect the value of the coefficients of the regression equation will be? Test your prediction by running the code and exploring the my_glm object.

# Generate independent variables
x1 <- rnorm(n = 100, mean = 10, sd = 1)
x2 <- rnorm(n = 100, mean = 20, sd = 10)
x3 <- rnorm(n = 100, mean = -5, sd = 5)

# Generate noise
noise <- rnorm(n = 100, mean = 0, sd = 1)

# Create dependent variable
y <- 3 * x1 + 2 * x2 - 5 * x3 + 100 + noise

# Combine all into a tibble
my_data <- tibble(x1, x2, x3, y)

# Calculate glm
my_glm <- glm(formula = y ~.,
              data = my_data)
  1. Adjust the code above so that the coefficients for the regression equation will be (Intercept) = -50, x1 = -3, x2 = 10, x3 = 15
# Generate independent variables
x1 <- rnorm(n = 100, mean = 10, sd = 1)
x2 <- rnorm(n = 100, mean = 20, sd = 10)
x3 <- rnorm(n = 100, mean = -5, sd = 5)

# Generate noise
noise <- rnorm(n = 100, mean = 0, sd = 1)

# Create dependent variable
y <- -3 * x1 + 10 * x2 - 15 * x3 + -50 + noise

# Combine all into a tibble
my_data <- tibble(x1, x2, x3, y)

# Calculate glm
my_glm <- glm(formula = y ~.,
              data = my_data)

summary(my_glm)

Call:
glm(formula = y ~ ., data = my_data)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.66213  -0.55951  -0.01822   0.68138   1.99547  

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -49.92642    1.08431  -46.05   <2e-16 ***
x1           -2.99908    0.10342  -29.00   <2e-16 ***
x2            9.98863    0.00848 1177.93   <2e-16 ***
x3          -15.02720    0.01808 -831.04   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.7824771)

    Null deviance: 2.0680e+06  on 99  degrees of freedom
Residual deviance: 7.5118e+01  on 96  degrees of freedom
AIC: 265.18

Number of Fisher Scoring iterations: 2

Additional reading