In this practical you’ll do basic statistics in R. By the end of this practical you will know how to:

- Calculate descriptive statistics with `mean()`, `median()`, and `table()`
- Conduct hypothesis tests with `t.test()`, `cor.test()`, and `chisq.test()`
- Run regression analyses with `glm()`
- Explore statistical objects with `names()`, `summary()`, `print()`, and `predict()`
- Use sampling functions such as `rnorm()` to conduct simulations

Here are the main descriptive statistics functions we will be covering.
Function | Description |
---|---|
`table()` | Frequency table |
`mean()`, `median()` | Measures of central tendency (note: base R's `mode()` returns an object's storage mode, not the statistical mode) |
`sd()`, `range()`, `var()` | Measures of variability |
`max()`, `min()` | Extreme values |
`summary()` | Several summary statistics |
Here are the main statistical functions we will be covering.

Function | Hypothesis Test | Additional Help |
---|---|---|
`t.test()` | One- and two-sample t-test | https://bookdown.org/ndphillips/YaRrr/htests.html#t-test-t.test |
`cor.test()` | Correlation test | https://bookdown.org/ndphillips/YaRrr/htests.html#correlation-cor.test |
`chisq.test()` | Chi-square test | https://bookdown.org/ndphillips/YaRrr/htests.html#chi-square-chsq.test |
`glm()` | Regression | https://bookdown.org/ndphillips/YaRrr/regression.html |
Here are the random sampling functions we will use.

Function | Description | Additional Help |
---|---|---|
`sample()` | Draw a random sample of values from a vector | `?sample` |
`rnorm()` | Draw random values from a Normal distribution | `?rnorm` |
`runif()` | Draw random values from a Uniform distribution | `?runif` |
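Of these three, `sample()` is the only one not demonstrated in the examples below, so here is a quick illustration of its two most common uses (plain base R, no extra packages needed):

```r
# Draw 5 values from the vector 1:10 without replacement (the default)
sample(x = 1:10, size = 5)

# Simulate 10 coin flips by sampling with replacement
sample(x = c("H", "T"), size = 10, replace = TRUE)
```

Because the draws are random, your results will differ from run to run unless you first fix the seed with `set.seed()`.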
# -----------------------------------------------
# Examples of hypothesis tests on the diamonds data
# ------------------------------------------------
library(tidyverse)
# First few rows of the diamonds data
diamonds
# -----
# Descriptive statistics
# -----
mean(diamonds$carat) # What is the mean carat?
median(diamonds$price) # What is the median price?
max(diamonds$depth) # What is the maximum depth?
table(diamonds$color) # How many observations for each color?
# -----
# 1-sample hypothesis test
# -----
# Q: Is the mean carat of diamonds different from .50?
htest_A <- t.test(x = diamonds$carat, # The data
alternative = "two.sided", # Two-sided test
mu = 0.5) # The null hypothesis
htest_A # Print result
names(htest_A) # See all attributes in object
htest_A$statistic # Get just the test statistic
htest_A$p.value # Get the p-value
htest_A$conf.int # Get a confidence interval
# -----
# 2-sample hypothesis test
# -----
# Q: Is there a difference in the carats of color = E and color = I diamonds?
htest_B <- t.test(formula = carat ~ color, # DV ~ IV
alternative = "two.sided", # Two-sided test
data = diamonds, # The data
subset = color %in% c("E", "I")) # Compare colors E and I
htest_B # Print result
# -----
# Correlation test
# ------
# Q: Is there a correlation between carat and price?
htest_C <- cor.test(formula = ~ carat + price,
data = diamonds)
htest_C
# A: Yes. r = 0.92, t(53938) = 551.51, p < .001
# -----
# Regression
# ------
# Q: Create regression equation predicting price by carat, depth, table, and x
price_glm <- glm(formula = price ~ carat + depth + table + x,
data = diamonds)
# Print coefficients
price_glm
# Q: Predict the price of the following 3 diamonds:
diamonds_new <- tibble(carat = c(.2, .6, .5),
depth = c(50, 46, 90),
table = c(40, 65, 70),
x = c(3.7, 4.2, 4.3))
predict(price_glm,
newdata = diamonds_new)
# 1 2 3
# 4190.453 6088.398 -4471.817
# -----
# Simulation
# ------
# 100 random samples from a normal distribution with mean = 0, sd = 1
samp_A <- rnorm(n = 100, mean = 0, sd = 1)
# 100 random samples from a Uniform distribution with bounds at 0, 10
samp_B <- runif(n = 100, min = 0, max = 10)
# Calculate descriptives
mean(samp_A)
sd(samp_A)
mean(samp_B)
sd(samp_B)
You’ll use one dataset in this practical: `ACTG175.csv`. It is available in the `data_BaselRBootcamp_Day2.zip` file on the main course page. If you haven’t already, download the `data_BaselRBootcamp_Day2.zip` file and unzip it to get the `ACTG175.csv` file.
A. Open your `BaselRBootcamp` project. This project should have the folders `R` and `data` in the working directory. The `ACTG175.csv` file should already be in the `data` folder. (If it isn’t, you’ll need to get it from the `data_BaselRBootcamp_Day2.zip` folder and move it to the `data` folder.)
# Done!
A. Open a new R script and save it in the `R` folder as a new file called `statistics_practical.R`. At the top of the script, using comments, write your name and the date. Then, load the `tidyverse` and `speff2trial` packages with `library()`. Here’s how the top of your script should look:
## NAME
## DATE
## Statistics Practical
library(tidyverse) # For tidyverse
library(speff2trial) # For the ACTG175 data documentation
B. For this practical, we’ll use the `ACTG175` data, the result of a randomized clinical trial comparing the effects of different medications on adults infected with the human immunodeficiency virus. Using the following template, load the data into R and store it as a new object called `trial_act`.
# Load ACTG175.csv from the data folder
# The ACTG175.csv file MUST be in a folder called data in your working directory!!
trial_act <- read_csv(file = "data/ACTG175.csv")
trial_act <- as_tibble(trial_act)
C. First things first: take a look at the data by printing it. It should look like this:
# Print trial_act
head(trial_act)
# A tibble: 6 x 27
pidnum age wtkg hemo homo drugs karnof oprior z30 zprior preanti
<int> <int> <dbl> <int> <int> <int> <int> <int> <int> <int> <int>
1 10056 48 89.8 0 0 0 100 0 0 1 0
2 10059 61 49.4 0 0 0 90 0 1 1 895
3 10089 45 88.5 0 1 1 90 0 1 1 707
4 10093 47 85.3 0 1 0 100 0 1 1 1399
5 10124 43 66.7 0 1 0 100 0 1 1 1352
6 10140 46 88.9 0 1 1 100 0 1 1 1181
# ... with 16 more variables: race <int>, gender <int>, str2 <int>,
# strat <int>, symptom <int>, treat <int>, offtrt <int>, cd40 <int>,
# cd420 <int>, cd496 <int>, r <int>, cd80 <int>, cd820 <int>,
# cens <int>, days <int>, arms <int>
D. Before we can analyse the data, we need to make one important change. Specifically, we need to clarify that the `arms` column is a factor, not an integer. Run the following code to make this change.
# trial_act data cleaning
trial_act <- trial_act %>%
mutate(
# Convert arms to a factor
arms = factor(arms,
levels = c(0, 1, 2, 3),
labels = c("zid", "zid_did", "zid_zal", "did"))
)
E. (Optional): If you’d like to, try converting other columns such as `race` and `gender` to factors with appropriate labels. Look at the help menu for the `ACTG175` data by running `?ACTG175` to see how these columns are coded.
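As a sketch of what such a conversion could look like (the 0/1 codings in the labels below are assumptions to verify against `?ACTG175`, and the toy tibble merely stands in for `trial_act`):

```r
library(tibble)
library(dplyr)

# Toy stand-in for trial_act; the same mutate() call works on the real data.
# Assumed codings (verify with ?ACTG175): race 0 = white, 1 = non-white;
# gender 0 = female, 1 = male.
toy <- tibble(race = c(0, 1, 0), gender = c(1, 1, 0))

toy <- toy %>%
  mutate(race   = factor(race,   levels = c(0, 1),
                         labels = c("white", "non_white")),
         gender = factor(gender, levels = c(0, 1),
                         labels = c("female", "male")))

toy
```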
1. What was the mean age (`age`) of all patients?

mean(trial_act$age)
[1] 35.24825
2. What was the median weight (`wtkg`) of all patients?

median(trial_act$wtkg)
3. What was the mean CD4 T cell count at baseline (`cd40`)? What was it at 20 weeks (`cd420`)?

mean(trial_act$cd40)
[1] 350.5012
mean(trial_act$cd420)
[1] 371.3072
4. How many patients have a history of intravenous drug use (`drugs`), and how many do not? (Hint: use `table()`)

table(trial_act$drugs)
0 1
1858 281
Conduct a one-sample t-test on the mean age (`age`) of the patients versus a null hypothesis of 40 years.

t.test(x = trial_act$age,
mu = 40)
One Sample t-test
data: trial_act$age
t = -25.234, df = 2138, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 40
95 percent confidence interval:
34.87896 35.61753
sample estimates:
mean of x
35.24825
Now repeat the test against a null hypothesis of 35 years:
t.test(x = trial_act$age,
mu = 35)
One Sample t-test
data: trial_act$age
t = 1.3183, df = 2138, p-value = 0.1875
alternative hypothesis: true mean is not equal to 35
95 percent confidence interval:
34.87896 35.61753
sample estimates:
mean of x
35.24825
Is there an age difference between men and women? Conduct a two-sample t-test to find out. Note that in this dataset women are coded as 0 in the column `gender`, and men are coded as 1. (Hint: use `formula = age ~ gender`)
age_gender_ht <- t.test(formula = age ~ gender,
data = trial_act,
alternative = "two.sided")
age_gender_ht
Welch Two Sample t-test
data: age by gender
t = -2.3524, df = 554.82, p-value = 0.019
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.0618384 -0.1854089
sample estimates:
mean in group 0 mean in group 1
34.31793 35.44156
Do patients with a history of intravenous drug use have a different number of days until a major negative event (`days`) than those who do not have a history of intravenous drug use? Answer this with a two-sample t-test.

t.test(formula = days ~ drugs,
data = trial_act)
Welch Two Sample t-test
data: days by drugs
t = 1.0058, df = 368.61, p-value = 0.3152
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-18.05634 55.86764
sample estimates:
mean in group 0 mean in group 1
881.5818 862.6762
Conduct a correlation test between weight (`wtkg`) and the number of days until a major negative event (`days`). What is your conclusion?

cor.test(formula = ~ days + wtkg,
data = trial_act)
Pearson's product-moment correlation
data: days and wtkg
t = 0.42646, df = 2137, p-value = 0.6698
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.03317080 0.05158712
sample estimates:
cor
0.009224728
One would expect a positive correlation between CD4 T cell count at baseline (`cd40`) and at 20 weeks (`cd420`). But how strong is the correlation? Answer this question by conducting a correlation test between CD4 T cell count at baseline (`cd40`) and CD4 T cell count at 20 weeks (`cd420`).

cor.test(formula = ~ cd40 + cd420,
data = trial_act)
Pearson's product-moment correlation
data: cd40 and cd420
t = 33.221, df = 2137, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.5549209 0.6108523
sample estimates:
cor
0.5835783
Is there a correlation between CD4 T cell count at baseline (`cd40`) and CD8 T cell count at baseline (`cd80`) for women? What about for men? (Hint: Include the argument `subset = gender == 0` to restrict the analysis to women, and `subset = gender == 1` for men; recall that women are coded as 0.)

cor.test(formula = ~ cd40 + cd80,
data = trial_act,
subset = gender == 0)
Pearson's product-moment correlation
data: cd40 and cd80
t = 5.3518, df = 366, p-value = 1.54e-07
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1719038 0.3616702
sample estimates:
cor
0.2694002
cor.test(formula = ~ cd40 + cd80,
data = trial_act,
subset = gender == 1)
Pearson's product-moment correlation
data: cd40 and cd80
t = 8.8997, df = 1769, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1619967 0.2511713
sample estimates:
cor
0.2070139
Do men and women (`gender`) have different distributions of race (`race`)? That is, does the percentage of women who are white differ from the percentage of men who are white? Answer this with a chi-square test.

table(trial_act$gender, trial_act$race)
chisq.test(table(trial_act$gender,
trial_act$race))
Pearson's Chi-squared test with Yates' continuity correction
data: table(trial_act$gender, trial_act$race)
X-squared = 180.86, df = 1, p-value < 2.2e-16
Is there a relationship between a history of intravenous drug use (`drugs`) and hemophilia (`hemo`)? Answer this by conducting a chi-square test.

chisq.test(table(trial_act$hemo,
trial_act$drugs))
Pearson's Chi-squared test with Yates' continuity correction
data: table(trial_act$hemo, trial_act$drugs)
X-squared = 17.505, df = 1, p-value = 2.866e-05
Which variables predict the number of days until a major negative event (`days`)? To answer this, conduct a linear regression analysis predicting `days` as a function of age (`age`), weight (`wtkg`), race (`race`), and gender (`gender`). Call the object `days_A_glm`.

days_A_glm <- glm(formula = days ~ age + wtkg + race + gender,
data = trial_act)
Print the `days_A_glm` object to see the coefficients.

days_A_glm
Call: glm(formula = days ~ age + wtkg + race + gender, data = trial_act)
Coefficients:
(Intercept) age wtkg race gender
856.99143 0.72094 0.02547 -30.59783 4.35692
Degrees of Freedom: 2138 Total (i.e. Null); 2134 Residual
Null Deviance: 182600000
Residual Deviance: 182100000 AIC: 30360
Using the `summary()` function, explore the `days_A_glm` object to see which coefficients are significant.

summary(days_A_glm)
Call:
glm(formula = days ~ age + wtkg + race + gender, data = trial_act)
Deviance Residuals:
Min 1Q Median 3Q Max
-870.6 -160.6 120.7 211.6 370.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 856.99143 43.60167 19.655 <2e-16 ***
age 0.72094 0.73470 0.981 0.3266
wtkg 0.02547 0.49450 0.052 0.9589
race -30.59783 14.63215 -2.091 0.0366 *
gender 4.35692 17.96571 0.243 0.8084
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 85316.48)
Null deviance: 182637111 on 2138 degrees of freedom
Residual deviance: 182065366 on 2134 degrees of freedom
AIC: 30364
Number of Fisher Scoring iterations: 2
Using the `names()` function, look at what the named elements in the `days_A_glm` object are.

names(days_A_glm)
[1] "coefficients" "residuals" "fitted.values"
[4] "effects" "R" "rank"
[7] "qr" "family" "linear.predictors"
[10] "deviance" "aic" "null.deviance"
[13] "iter" "weights" "prior.weights"
[16] "df.residual" "df.null" "y"
[19] "converged" "boundary" "model"
[22] "call" "formula" "terms"
[25] "data" "offset" "control"
[28] "method" "contrasts" "xlevels"
Using the `days_A_glm$` notation, print a vector of the model coefficients.

days_A_glm$coefficients
(Intercept) age wtkg race gender
856.99143293 0.72093510 0.02547379 -30.59783308 4.35691955
Look at the `$data`, `$fitted.values`, and `$residuals` elements of the `days_A_glm` object. What do you think these are? (Bonus: can you calculate the mean squared residual?)

# Mean squared residual
mean_squared_residual <- mean(days_A_glm$residuals ^ 2)
Now create a regression object called `days_B_glm` where you model `days` as a function of the treatment arm (`arms`). Explore the object. Do you find that there was an effect of the treatment arm on `days`?

days_B_glm <- glm(formula = days ~ arms,
data = trial_act)
summary(days_B_glm)
Call:
glm(formula = days ~ arms, data = trial_act)
Deviance Residuals:
Min 1Q Median 3Q Max
-879.5 -158.7 106.2 203.6 429.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 801.24 12.53 63.967 < 2e-16 ***
armszid_did 115.00 17.80 6.461 1.28e-10 ***
armszid_zal 104.52 17.78 5.878 4.81e-09 ***
armsdid 92.24 17.48 5.276 1.45e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 83467.7)
Null deviance: 182637111 on 2138 degrees of freedom
Residual deviance: 178203547 on 2135 degrees of freedom
AIC: 30316
Number of Fisher Scoring iterations: 2
Does the treatment arm affect `days` for men (`gender == 1`) above the age of 30 (`age > 30`)? To answer this, repeat your previous regression, but include the appropriate arguments to `subset`.

days_B_glm <- glm(formula = days ~ arms,
data = trial_act,
subset = gender == 1 & age > 30)
summary(days_B_glm)
Now create a regression object called `days_C_glm` that predicts `days` based on all variables in the dataset. Explore the object. What do you find?

days_C_glm <- glm(formula = days ~ .,
data = trial_act)
summary(days_C_glm)
Call:
glm(formula = days ~ ., data = trial_act)
Deviance Residuals:
Min 1Q Median 3Q Max
-670.09 -55.52 9.28 66.64 427.23
Coefficients: (3 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.116e+02 7.310e+01 12.472 < 2e-16 ***
pidnum -4.446e-05 2.060e-05 -2.158 0.03110 *
age 1.105e-01 4.508e-01 0.245 0.80642
wtkg 4.832e-01 3.160e-01 1.529 0.12641
hemo 2.766e+01 2.052e+01 1.348 0.17797
homo -2.072e+00 1.340e+01 -0.155 0.87710
drugs -1.354e+01 1.217e+01 -1.113 0.26601
karnof -4.481e-01 6.815e-01 -0.658 0.51093
oprior -1.080e+00 2.703e+01 -0.040 0.96813
z30 2.414e+01 1.915e+01 1.261 0.20768
zprior NA NA NA NA
preanti -2.098e-02 1.628e-02 -1.289 0.19777
race -7.649e+00 9.218e+00 -0.830 0.40683
gender 1.120e+01 1.530e+01 0.732 0.46434
str2 -3.384e+01 2.622e+01 -1.291 0.19703
strat 3.955e+01 1.542e+01 2.565 0.01042 *
symptom 8.297e+00 1.003e+01 0.827 0.40850
treat 1.781e+01 1.057e+01 1.684 0.09233 .
offtrt -8.354e+01 9.613e+00 -8.690 < 2e-16 ***
cd40 -6.233e-03 4.526e-02 -0.138 0.89049
cd420 5.820e-02 4.160e-02 1.399 0.16212
cd496 1.764e-01 3.436e-02 5.134 3.26e-07 ***
r NA NA NA NA
cd80 -3.946e-02 1.333e-02 -2.961 0.00312 **
cd820 3.618e-02 1.417e-02 2.553 0.01080 *
cens -3.799e+02 1.051e+01 -36.130 < 2e-16 ***
armszid_did 6.232e+00 1.041e+01 0.599 0.54958
armszid_zal 1.540e+01 1.035e+01 1.489 0.13676
armsdid NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 18020.86)
Null deviance: 74984714 on 1341 degrees of freedom
Residual deviance: 23715452 on 1316 degrees of freedom
(797 observations deleted due to missingness)
AIC: 16987
Number of Fisher Scoring iterations: 2
Create a new dataset called `trial_new` by randomly selecting 10 patients from the original dataset. You can do this using the following code. After you run it, print the `trial_new` object to see how it looks.

# Draw 10 patients at random from trial_act
# and store as trial_new
trial_new <- trial_act %>%
sample_n(10)
trial_new
# A tibble: 10 x 27
pidnum age wtkg hemo homo drugs karnof oprior z30 zprior preanti
<int> <int> <dbl> <int> <int> <int> <int> <int> <int> <int> <int>
1 250464 47 88.0 0 1 0 80 0 1 1 322
2 231071 34 101. 0 1 0 100 0 0 1 0
3 50236 31 83.0 0 1 0 100 0 0 1 0
4 110581 37 47.8 1 0 0 100 1 0 1 391
5 90109 45 54.0 0 0 0 90 0 1 1 479
6 320391 31 56.3 0 1 1 90 0 1 1 510
7 220492 37 78.5 0 1 0 100 0 0 1 0
8 540021 43 79.8 0 0 1 90 0 0 1 0
9 110697 31 90.4 0 1 0 100 0 1 1 976
10 300507 43 97.5 0 0 0 100 0 0 1 0
# ... with 16 more variables: race <int>, gender <int>, str2 <int>,
# strat <int>, symptom <int>, treat <int>, offtrt <int>, cd40 <int>,
# cd420 <int>, cd496 <int>, r <int>, cd80 <int>, cd820 <int>,
# cens <int>, days <int>, arms <fct>
Using the regression model `days_C_glm`, predict the number of days until a major negative event for these 10 randomly selected patients using the following template:

# Predict days until a major negative event from the trial_new data
# using the days_C_glm object
predict(object = ___,
newdata = ___)

predict(days_C_glm,
newdata = trial_new)
You can conduct logistic regression by including the `family = "binomial"` argument to `glm()`. Using the template below, calculate a logistic regression predicting whether or not someone has taken intravenous drugs based on their age (`age`), gender (`gender`), and race (`race`). Then, explore the object to see if any variables reliably predict drug use.

# Logistic regression predicting intravenous drug use
# based on age, gender, and race
drugs_glm_binom <- glm(formula = ___,
data = ___,
family = "___")
# Logistic regression predicting intravenous drug use
# based on age, gender, and race
drugs_glm_binom <- glm(formula = drugs ~ age + gender + race,
data = trial_act,
family = "binomial")
summary(drugs_glm_binom)
Call:
glm(formula = drugs ~ age + gender + race, family = "binomial",
data = trial_act)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0597 -0.5323 -0.4677 -0.4113 2.3384
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.41495 0.30425 -7.937 2.07e-15 ***
age 0.02994 0.00721 4.152 3.29e-05 ***
gender -0.85074 0.15387 -5.529 3.22e-08 ***
race 0.33542 0.14367 2.335 0.0196 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1664.1 on 2138 degrees of freedom
Residual deviance: 1605.9 on 2135 degrees of freedom
AIC: 1613.9
Number of Fisher Scoring iterations: 4
Conduct a logistic regression analysis predicting whether or not patients made it past 1000 days as a function of treatment arm. (Hint: You'll first need to create a new variable `days_1000_bin` that is a binary variable indicating whether or not `days > 1000`. You can do this using `mutate()`.)

days_glm_binom <- glm(formula = days_1000_bin ~ arms,
data = trial_act %>%
mutate(days_1000_bin = days > 1000),
family = "binomial")
summary(days_glm_binom)
Call:
glm(formula = days_1000_bin ~ arms, family = "binomial", data = trial_act %>%
mutate(days_1000_bin = days > 1000))
Deviance Residuals:
Min 1Q Median 3Q Max
-1.236 -1.173 -1.021 1.126 1.343
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.38046 0.08828 -4.310 1.64e-05 ***
armszid_did 0.50322 0.12444 4.044 5.26e-05 ***
armszid_zal 0.51809 0.12435 4.166 3.10e-05 ***
armsdid 0.36977 0.12217 3.027 0.00247 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2964.7 on 2138 degrees of freedom
Residual deviance: 2941.9 on 2135 degrees of freedom
AIC: 2949.9
Number of Fisher Scoring iterations: 4
You can easily generate random samples from statistical distributions in R. To see all of them, run `?distributions`. For example, to generate samples from the well-known Normal distribution, you can use `rnorm()`. Look at the help menu for `rnorm()` to see its arguments.
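For reference, `rnorm()` has the signature `rnorm(n, mean = 0, sd = 1)`, so `mean` and `sd` default to the standard Normal distribution:

```r
# 5 draws from the standard Normal(0, 1): mean and sd use their defaults
rnorm(n = 5)

# 5 draws from a Normal distribution with mean 100 and sd 15
rnorm(n = 5, mean = 100, sd = 15)
```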
Using `rnorm()`, create a new object `samp_10` which contains 10 samples from a Normal distribution with mean 10 and standard deviation 5. Print the object to see what the elements look like. What should the mean and standard deviation of this sample be? Test it by evaluating its mean and standard deviation directly using the appropriate functions. Then, do a one-sample t-test on this sample against the null hypothesis that the true mean is 12. What are the results?
# Note that your results might be different due to random chance!
samp_10 <- rnorm(n = 10, mean = 10, sd = 5)
samp_10
[1] 6.452053 14.623421 16.391728 11.752879 4.435956 7.402795 8.280792
[8] 23.155857 6.655332 5.163521
mean(samp_10)
[1] 10.43143
sd(samp_10)
[1] 6.003185
t.test(x = samp_10, mu = 12)
One Sample t-test
data: samp_10
t = -0.82627, df = 9, p-value = 0.43
alternative hypothesis: true mean is not equal to 12
95 percent confidence interval:
6.137014 14.725853
sample estimates:
mean of x
10.43143
Look at the following code. Before running it, what do you expect the `coefficients` of the regression equation will be? Test your prediction by running the code and exploring the `my_glm` object.

# Generate independent variables
x1 <- rnorm(n = 100, mean = 10, sd = 1)
x2 <- rnorm(n = 100, mean = 20, sd = 10)
x3 <- rnorm(n = 100, mean = -5, sd = 5)
# Generate noise
noise <- rnorm(n = 100, mean = 0, sd = 1)
# Create dependent variable
y <- 3 * x1 + 2 * x2 - 5 * x3 + 100 + noise
# Combine all into a tibble
my_data <- tibble(x1, x2, x3, y)
# Calculate glm
my_glm <- glm(formula = y ~ .,
data = my_data)
# Generate independent variables
x1 <- rnorm(n = 100, mean = 10, sd = 1)
x2 <- rnorm(n = 100, mean = 20, sd = 10)
x3 <- rnorm(n = 100, mean = -5, sd = 5)
# Generate noise
noise <- rnorm(n = 100, mean = 0, sd = 1)
# Create dependent variable
y <- 3 * x1 + 2 * x2 - 5 * x3 + 100 + noise
# Combine all into a tibble
my_data <- tibble(x1, x2, x3, y)
# Calculate glm
my_glm <- glm(formula = y ~ .,
data = my_data)
summary(my_glm)
Call:
glm(formula = y ~ ., data = my_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.42475 -0.77862 -0.07247 0.69691 2.35518
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 99.46598 1.15244 86.31 <2e-16 ***
x1 3.05026 0.11391 26.78 <2e-16 ***
x2 1.99367 0.01071 186.18 <2e-16 ***
x3 -5.03939 0.02343 -215.09 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 1.187054)
Null deviance: 84864.07 on 99 degrees of freedom
Residual deviance: 113.96 on 96 degrees of freedom
AIC: 306.85
Number of Fisher Scoring iterations: 2
Now repeat the previous exercise, but adjust the code so that the true coefficients are `(Intercept) = -50`, `x1 = -3`, `x2 = 10`, and `x3 = -15`. Does the regression recover these values?

# Generate independent variables
x1 <- rnorm(n = 100, mean = 10, sd = 1)
x2 <- rnorm(n = 100, mean = 20, sd = 10)
x3 <- rnorm(n = 100, mean = -5, sd = 5)
# Generate noise
noise <- rnorm(n = 100, mean = 0, sd = 1)
# Create dependent variable
y <- -3 * x1 + 10 * x2 - 15 * x3 - 50 + noise
# Combine all into a tibble
my_data <- tibble(x1, x2, x3, y)
# Calculate glm
my_glm <- glm(formula = y ~ .,
data = my_data)
summary(my_glm)
Call:
glm(formula = y ~ ., data = my_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.55143 -0.73330 0.04009 0.69817 3.02294
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -49.44544 1.20587 -41.00 <2e-16 ***
x1 -3.09004 0.11184 -27.63 <2e-16 ***
x2 10.00571 0.01206 829.64 <2e-16 ***
x3 -15.03252 0.02289 -656.80 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 1.250026)
Null deviance: 1255498 on 99 degrees of freedom
Residual deviance: 120 on 96 degrees of freedom
AIC: 312.02
Number of Fisher Scoring iterations: 2
For more details on hypothesis tests in R, check out the chapter on hypothesis tests in YaRrr! The Pirate’s Guide to R: YaRrr! Chapter Link

For more advanced mixed-level ANOVAs with random effects, consult the `afex` and `lme4` packages.

To do Bayesian versions of common hypothesis tests, try using the `BayesFactor` package: BayesFactor Guide Link