In this practical you’ll do basic statistics in R. By the end of this practical you will know how to:
mean()
, median()
, table()
.t.test()
, cor.test()
and chisq.test()
glm()
names()
, summary()
, print()
, predict()
rnorm()
to conduct simulationsHere are the main descriptive statistics functions we will be covering.
Function | Description |
---|---|
table() |
Frequency table |
mean(), median(), mode() |
Measures of central tendency |
sd(), range(), var() |
Measures of variability |
max(), min() |
Extreme values |
summary() |
Several summary statistics |
Here are the main statistical functions we will be covering.
Function | Hypothesis Test | Additional Help |
---|---|---|
t.test() |
One and two sample t-test | https://bookdown.org/ndphillips/YaRrr/htests.html#t-test-t.test |
cor.test() |
Correlation test | https://bookdown.org/ndphillips/YaRrr/htests.html#correlation-cor.test |
chisq.test() |
Chi-Square test | https://bookdown.org/ndphillips/YaRrr/htests.html#chi-square-chsq.test |
glm() |
Regression | https://bookdown.org/ndphillips/YaRrr/regression.html |
Here are the random sampling functions we will use
Function | Description | Additional Help |
---|---|---|
sample() |
Draw a random sample of values from a vector | ?sample |
rnorm() |
Draw random values from a Normal distribution | ?rnorm() |
runif |
Draw random values from a Uniform distribution | ?runif() |
# -----------------------------------------------
# Examples of hypothesis tests on the diamonds
# ------------------------------------------------
library(tidyverse)
# First few rows of the diamonds data
diamonds
# -----
# Descriptive statistics
# -----
mean(diamonds$carat) # What is the mean carat?
median(diamonds$price) # What is the median price?
max(diamonds$depth) # What is the maximum depth?
table(diamonds$color) # How many observations for each color?
# -----
# 1-sample hypothesis test
# -----
# Q: Is the mean carat of diamonds different from .50?
htest_A <- t.test(x = diamonds$carat, # The data
alternative = "two.sided", # Two-sided test
mu = 0.5) # The null hyopthesis
htest_A # Print result
names(htest_A) # See all attributes in object
htest_A$statistic # Get just the test statistic
htest_A$p.value # Get the p-value
htest_A$conf.int # Get a confidence interval
# -----
# 2-sample hypothesis test
# -----
# Q: Is there a difference in the carats of color = E and color = I diamonds?
htest_B <- t.test(formula = carat ~ color, # DV ~ IV
alternative = "two.sided", # Two-sided test
data = diamonds, # The data
subset = color %in% c("E", "I")) # Compare Diet 1 and Diet 2
htest_B # Print result
# -----
# Correlation test
# ------
# Q: Is there a correlation between carat and price?
htest_C <- cor.test(formula = ~ carat + price,
data = diamonds)
htest_C
# A: Yes. r = 0.92, t(53938) = 551.51, p < .001
# -----
# Regression
# ------
# Q: Create regression equation predicting price by carat, depth, table, and x
price_glm <- glm(formula = price ~ carat + depth + table + x,
data = diamonds)
# Print coefficients
price_glm
# Q: Predict the price of the following 3 diamonds:
diamonds_new <- tibble(carat = c(.2, .6, .5),
depth = c(50, 46, 90),
table = c(40, 65, 70),
x = c(3.7, 4.2, 4.3))
predict(price_glm,
newdata = diamonds_new)
# 1 2 3
# 4190.453 6088.398 -4471.817
# -----
# Simulation
# ------
# 100 random samples from a normal distribution with mean = 0, sd = 1
samp_A <- rnorm(n = 100, mean = 0, sd = 1)
# 100 random samples from a Uniform distribution with bounds at 0, 10
samp_B <- runif(n = 100, min = 0, max = 10)
# Calculate descriptives
mean(samp_A)
sd(samp_A)
mean(samp_B)
sd(samp_B)
You’ll use one dataset in this practical: ACTG175.csv
. It is available in the data_BaselRBootcamp_Day2.zip
file available through the main course page. If you haven’t already, download the data_BaselRBootcamp_Day2.zip
folder and unzip it to get the ACTG175.csv
file.
A. Open your BaselRBootcamp
project. This project should have the folders R
and data
in the working directory. The ACTG175.csv
file should already be in the data
folder. (If it isn’t, you’ll need to get it from the data_BaselRBootcamp_Day2.zip
folder and move it to the data
folder).
A. Open a new R script and save it in the R
folder you just created as a new file called statistics_practical.R
. At the top of the script, using comments, write your name and the date. The, load the set of tidyverse
and speff2trial
packages with library()
. Here’s how the top of your script should look:
## NAME
## DATE
## Statistics Practical
library(tidyverse) # For tidyverse
library(speff2trial) # For the ACTG175 data documentation
B. For this practical, we’ll use the ACTG175
data, this is the result of a randomized clinical trial comparing the effects of different medications on adults infected with the human immunodeficiency virus. Using the following template, load the data into R and store it as a new object called trial_act
.
# Load ACTG175.csv from the data folder
# The ACTG175.csv file MUST be in a folder called data in your working directory!!
trial_act <- read_csv(file = "data/ACTG175.csv")
C. First thing’s first, take a look at the data by printing it. It should look like this
# A tibble: 6 x 27
pidnum age wtkg hemo homo drugs karnof oprior z30 zprior preanti
<int> <int> <dbl> <int> <int> <int> <int> <int> <int> <int> <int>
1 10056 48 89.8 0 0 0 100 0 0 1 0
2 10059 61 49.4 0 0 0 90 0 1 1 895
3 10089 45 88.5 0 1 1 90 0 1 1 707
4 10093 47 85.3 0 1 0 100 0 1 1 1399
5 10124 43 66.7 0 1 0 100 0 1 1 1352
6 10140 46 88.9 0 1 1 100 0 1 1 1181
# ... with 16 more variables: race <int>, gender <int>, str2 <int>,
# strat <int>, symptom <int>, treat <int>, offtrt <int>, cd40 <int>,
# cd420 <int>, cd496 <int>, r <int>, cd80 <int>, cd820 <int>,
# cens <int>, days <int>, arms <int>
D. Before we can analyse the data, we need to make one important change. Specifically. we need to clarify that the arms
column is a factor, not an integer. Run the following code to make this change.
# traial_act data cleaning
trial_act <- trial_act %>%
mutate(
# Convert arms to a factor
arms = factor(arms,
levels = c(0, 1, 2, 3),
labels = c("zid", "zid_did", "zid_zal", "did"))
)
E. (Optional): If you’d like to, try converting other columns such as race
and gender
to factors with appropriate labels. Look at the help menu for the ACTG175
by running ?ACTG175
to see how these columns are coded.
What was the mean age (age
) of all patients?
What was the median weight (wtkg
) of all patients?
What was the mean CD4 T cell count at baseline (cd40
) ? What was it at 20 weeks (cd420
)?
How many patients have a history of intravenous drug use (drugs
) and how many do not? (Hint: use table()
)
Conduct a one-sample t-test comparing the age (age
) of the patients versus a null hypothesis of 40 years.
Now, compare the mean age to a null hypothesis of 35 years. What has changed?
A researcher wants to make sure that men and women in the clinical study are similar in terms of age. Conduct a two-sample t-test comparing the age of men versus women to test if they are indeed similar or not.
gender
, and men are coded as 1.formula = age ~ gender
days
than those who do not have a history of intravenous drug use? Answer this with a two-sample t-test.Do heavier people tend to have fewer days until a major negative event? Conduct a correlation test between weight (wtkg
) and age (days
). What is your conclusion?
We would expect a correlation between CD4 T cell count at baseline (cd40
) and at 20 weeks (cd420
). But how strong is the correlation? Answer this question by conducting a correlation test between CD4 T cell count at baseline (cd40
) and CD4 T cell count at 20 weeks (cd420
).
Only considering men, is there a correlation between CD4 T cell count at baseline (cd40
)and CD8 T cell count at baseline (cd80
)? (Hint: Include the argument subset = gender == 0
to restrict the analysis to men)
Now, repeat the previous test, but only for women. How do the results compare?
gender
) have different distributions of race (race
)? That is, is the percentage of women who are white differ from the percentage of men who are white? Answer this with a chi-square test.table(trial_act$gender, trial_act$race)
drugs
) and hemophilia (hemo
)? Answer this by conducting a chi-square test.Which demographic variables predict days until the first occurrence of a major negative event (days
)? To answer this, conduct a linear regression analysis predicting days
as a function of age (age
), weight (wtkg
), race (race
) and gender (gender
). Call the object days_A_glm
.
Print the days_A_glm
object to see the coefficients.
Using the summary()
function, explore the days_A_glm
object to see which coefficients are significant.
Using the names()
function, look at what the named elements in the days_A_glm
object are.
Using the days_A_glm$
notation, print a vector of the model coefficients.
Look at the $data
, $fitted.values
and $residuals
elements of the days_A_glm
object. What do you think these are? (Bonus: can you calculate the mean squared residual?)
Now create a new regression object days_B_glm
where you model days
as a function of the treatment arm (arms
). Explore the object. Do you find that there was an effect of the treatment arm on days
?
Does the effect of treatment arm on days hold if you only consider men (gender == 1
) above the age of 30 (age > 30
)? To answer this, repeat your previous regression, but include the appropriate arguments to subset
Now create a regression object days_C_glm
that predicts days
based on all variables in the dataset. Explore the object. What do you find?
Create a fictional dataset of 10 new patients called trial_new
by randomly selecting 10 patients from the original dataset. You can do this using the following code. After you run it, print the trial_new
object to see how it looks
# Draw 10 patients at random from trial_act
# and store as trial_new
trial_new <- trial_act %>%
sample_n(10)
trial_new
days_C_glm
, predict the number of days until a major negative event for these 10 randomly selected patients using the following template:# Predict days until a major negative event from the trial_new data
# using the days_C_glm object
predict(object = ___,
newdata = ___)
family = 'binomial'
argument to glm()
. Using the template below, calculate a logistic regression predicting whether or not someone has taken intravenous drugs based on their age (age
), gender (gender
) and (race
). Then, explore the object to see if any variables reliably predict drug use.# Logistic regression predicting intraveneous drug use
# based on age, gender, and race
drugs_glm_binom <- glm(formula = ___,
data = ___,
family = "___")
days_1000_bin
that is a binary variable indicating whether or not days > 1000
. You can do this using mutate()
)You can easily generate random samples from statistical distributions in R. To see all of them, run ?distributions
. For example, to generate samples from the well known Normal distribution, you can use rnorm()
. Look at the help menu for rnorm()
to see its arguments.
Using rnorm()
,create a new object samp_10
which is 10 samples from a Normal distribution with mean 10 and standard deviation 5. Print the object to see what the elements look like. What should the mean and standard deviation of this sample? be? Test it by evaluating its mean and standard deviation directly using the appropriate functions. Then, do a one-sample t-test on this sample against the null hypothesis that the true mean is 12. What are the results?
Look at the code below. Before you run it, what do you expect the value of the coefficients
of the regression equation will be? Test your prediction by running the code and exploring the my_glm
object.
# Generate independent variables
x1 <- rnorm(n = 100, mean = 10, sd = 1)
x2 <- rnorm(n = 100, mean = 20, sd = 10)
x3 <- rnorm(n = 100, mean = -5, sd = 5)
# Generate noise
noise <- rnorm(n = 100, mean = 0, sd = 1)
# Create dependent variable
y <- 3 * x1 + 2 * x2 - 5 * x3 + 100 + noise
# Combine all into a tibble
my_data <- tibble(x1, x2, x3, y)
# Calculate glm
my_glm <- glm(formula = y ~.,
data = my_data)
(Intercept) = -50
, x1 = -3
, x2 = 10
, x3 = 15
# Generate independent variables
x1 <- rnorm(n = 100, mean = 10, sd = 1)
x2 <- rnorm(n = 100, mean = 20, sd = 10)
x3 <- rnorm(n = 100, mean = -5, sd = 5)
# Generate noise
noise <- rnorm(n = 100, mean = 0, sd = 1)
# Create dependent variable
y <- -3 * x1 + 10 * x2 - 15 * x3 + -50 + noise
# Combine all into a tibble
my_data <- tibble(x1, x2, x3, y)
# Calculate glm
my_glm <- glm(formula = y ~.,
data = my_data)
summary(my_glm)
Call:
glm(formula = y ~ ., data = my_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.66213 -0.55951 -0.01822 0.68138 1.99547
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -49.92642 1.08431 -46.05 <2e-16 ***
x1 -2.99908 0.10342 -29.00 <2e-16 ***
x2 9.98863 0.00848 1177.93 <2e-16 ***
x3 -15.02720 0.01808 -831.04 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.7824771)
Null deviance: 2.0680e+06 on 99 degrees of freedom
Residual deviance: 7.5118e+01 on 96 degrees of freedom
AIC: 265.18
Number of Fisher Scoring iterations: 2
For more details on hypothesis tests in R, check out the chapter on hypothesis tests in YaRrr! The Pirate’s Guide to R YaRrr! Chapter Link
For more advanced mixed level ANOVAs with random effects, consult the afex
and lmer
packages.
To do Bayesian versions of common hypothesis tests, try using the BayesFactor
package. BayesFactor Guide Link