class: center, middle, inverse, title-slide

# Statistics
### Basel R Bootcamp
www.therbootcamp.com
@therbootcamp
### July 2018

---
layout: true

<div class="my-footer"><span>
<a href="https://therbootcamp.github.io/"><font color="#7E7E7E">BaselRBootcamp, July 2018</font></a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href="https://therbootcamp.github.io/"><font color="#7E7E7E">www.therbootcamp.com</font></a>
</span></div>

---

# Stats? There is a package for that!

.pull-left45[

<br>

| Package| Models|
|------:|:----|
| `stats`| Generalized linear model|
| `afex`| ANOVAs|
| `lme4`| Mixed effects regression|
| `rpart`| Decision trees|
| `BayesFactor`| Bayesian statistics|
| `igraph`| Network analysis|
| `neuralnet`| Neural networks|
| `MatchIt`| Matching and causal inference|
| `survival`| Longitudinal survival analysis|
| ...| Anything you can ever want!|

]

.pull-right5[

<p align="center">
<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/statistical_procedures.png" height="450px" vspace="20">
</p>

]

---

.pull-left45[

# In this session...

1 - Basic structure and arguments of most <high>statistical functions</high>

  - `formula`: Specify your <high>variables</high>
  - `data`: A <high>data frame</high> containing the variables

2 - Simple <high>`htest`</high> objects

  - `t.test()`, correlation (`cor.test()`)

3 - (Generalized) <high>linear models</high>

  - `lm()`, `glm()`, `aov()`

4 - Explore statistical objects

  - `MODEL$NAME`, `print()`, `summary()`, `names()`, `predict()`, `plot()`

5 - Conduct simulations

]

.pull-right5[

```r
# t-test comparing height based on gender
t.test(formula = height ~ sex,
       data = baselers)

# Regression model
inc_glm <- glm(formula = income ~ .,
               data = baselers %>% select(-id))

# Summary information
summary(inc_glm)

# Dissect
inc_glm$coefficients  # Access coefficients
inc_glm$residuals     # Access residuals

### Generate random data
x1 <- rnorm(n = 100, mean = 10, sd = 5)
x2 <- rnorm(n = 100, mean = 5, sd = 1)
noise <- rnorm(n = 100, mean = 0, sd = 10)

# Create y as a function of x1, x2, and noise
y <- x1 + x2 + noise

df <- data.frame(x1, x2, y)

# Regress y on x1 and x2
lm(formula = y ~ .,
   data = df)
```

]

---

# Basic structure of statistical functions

.pull-left4[

Statistical functions always require a <high>data frame</high> passed to the `data` argument, e.g.,...

|sex    | age| height| weight| income|
|:------|---:|------:|------:|------:|
|male   |  44|  174.3|  113.4|   6300|
|male   |  65|  180.3|   75.2|  10900|
|female |  31|  168.3|   55.5|   5100|

<br>

They also require a <high>formula</high> that specifies a <high>dependent</high> variable (y) as a function of one or more <high>independent</high> variables (x1, x2, ...) in the form:

<p align='center'><font size = 6>formula = y ~ x1 + x2 + ...</font></p>

]

.pull-right55[

How to create a statistical object:

```r
# Example: Create regression object (my_glm)
my_glm <- glm(formula = income ~ age + height,
              data = baselers)
```

![](https://raw.githubusercontent.com/therbootcamp/Erfurt_2018June/master/_sessions/_image/formula_description.png)<!-- -->

]

---

.pull-left35[

# Look for optional arguments

Statistical functions usually have many optional arguments. Each of these has a <high>default</high> value.

To customise a test, <high>look at the help menu</high> and specify arguments explicitly.

]

.pull-right6[

<br><br>
<u>Default vs. customised `glm()` (Generalized linear model)</u>

```r
# Default
glm(formula = income ~ age + education,
    data = baselers)

# Customised
glm(formula = eyecor ~ age + education,
    data = baselers,
    family = "binomial")  # Logistic regression
```

<u>Default vs. customised `t.test()`</u>

```r
# Default
t.test(formula = age ~ sex,
       data = baselers)

# Customised
t.test(formula = age ~ sex,
       data = baselers,
       alternative = "less",  # One-sided test
       var.equal = TRUE)      # Assume equal variance
```
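If you would rather not open the help page, base R can also list a function's arguments and their default values in the console. A minimal sketch using `args()`:

```r
# Print the argument lists (with defaults) of glm() and chisq.test()
args(glm)
args(chisq.test)
```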
]

---

.pull-left35[

# Look for optional arguments

```r
?glm
```

<p align="center">
<img src="https://raw.githubusercontent.com/therbootcamp/Erfurt_2018June/master/_sessions/_image/glm_help.jpg">
</p>

]

.pull-right6[

<br><br>
<u>Default vs. customised `glm()` (Generalized linear model)</u>

```r
# Default
glm(formula = income ~ age + education,
    data = baselers)

# Customised
glm(formula = eyecor ~ age + education,
    data = baselers,
    family = "binomial")  # Logistic regression
```

<u>Default vs. customised `t.test()`</u>

```r
# Default
t.test(formula = age ~ sex,
       data = baselers)

# Customised
t.test(formula = age ~ sex,
       data = baselers,
       alternative = "less",  # One-sided test
       var.equal = TRUE)      # Assume equal variance
```

]

---

# Simple hypothesis tests

.pull-left45[

All of the basic <high>one- and two-sample hypothesis tests</high> are included in the `stats` package.

These tests take either a <high>formula</high> for the argument `formula`, or <high>individual vectors</high> for the arguments `x` and `y`.

<br>

.pull-left6[

| Hypothesis Test| R Function|
|------------:|------------:|
| t-test| `t.test()`|
| Correlation Test| `cor.test()`|
| Chi-Square Test| `chisq.test()`|

]

]

.pull-right5[

### t-test with `t.test()`

```r
# 2-sample t-test
t.test(formula = income ~ sex,
       data = baselers)
```

```
## 
##  Welch Two Sample t-test
## 
## data:  income by sex
## t = 4, df = 8500, p-value = 6e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  120.6 352.2
## sample estimates:
## mean in group female   mean in group male 
##                 7650                 7414
```

]

---

# Simple hypothesis tests

.pull-left45[

All of the basic <high>one- and two-sample hypothesis tests</high> are included in the `stats` package.

These tests take either a <high>formula</high> for the argument `formula`, or <high>individual vectors</high> for the arguments `x` and `y`.

<br>

.pull-left6[

| Hypothesis Test| R Function|
|------------:|------------:|
| t-test| `t.test()`|
| Correlation Test| `cor.test()`|
| Chi-Square Test| `chisq.test()`|

]

]

.pull-right5[

### Correlation test with `cor.test()`

```r
# Correlation test
cor.test(x = baselers$age,
         y = baselers$income)

# Version using formula (same result as above)
cor.test(formula = ~ age + income,
         data = baselers)
```

```
## 
##  Pearson's product-moment correlation
## 
## data:  baselers$age and baselers$income
## t = 180, df = 8500, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8882 0.8968
## sample estimates:
##    cor 
## 0.8926
```
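The table on the left also lists `chisq.test()`. A minimal sketch of how it could be called on two categorical columns (`sex` and `eyecor` are borrowed from earlier examples; no output shown here):

```r
# Chi-squared test of independence between sex and eyecor
chisq.test(x = table(baselers$sex, baselers$eyecor))
```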
]

---

# Regression with `glm()`, `lm()`

.pull-left35[

How do you <high>create a regression model</high> predicting, e.g., how much money people spend on `food` as a function of their `income`?

<br>

Part of the `baselers` data frame:

.pull-left6[

| food| income| happiness|
|----:|------:|---------:|
|  610|   6300|         5|
| 1550|  10900|         7|
|  720|   5100|         7|
|  680|   4200|         7|
|  260|   4000|         5|

]

<!-- `$$\Large food = \beta_{0} + \beta_{1} \times Inc + \beta_{2} \times Hap + \epsilon$$` -->

]

.pull-right6[

### Generalized regression with `glm()`

```r
# food (y) on income (x1) and happiness (x2)
food_glm <- glm(formula = food ~ income + happiness,
                data = baselers)

# Print food_glm
food_glm
```

```
## 
## Call:  glm(formula = food ~ income + happiness, data = baselers)
## 
## Coefficients:
## (Intercept)       income    happiness  
##    -302.089        0.101       52.205  
## 
## Degrees of Freedom: 8509 Total (i.e. Null);  8507 Residual
##   (1490 observations deleted due to missingness)
## Null Deviance:       1.27e+09 
## Residual Deviance: 6.06e+08   AIC: 119000
```

]

---

# Customising formulas

Add more independent variables to a formula by "adding" them with <high>`+`</high>

```r
# Include multiple terms with +
my_glm <- glm(formula = income ~ food + alcohol + happiness + hiking,
              data = baselers)
```

To <high>include all variables</high> in a data frame, use the catch-all notation <high>`formula = y ~ .`</high>

```r
# Use y ~ . to include ALL variables
my_glm <- glm(formula = income ~ .,
              data = baselers)
```

To include <high>interaction terms</high>, use `x1:x2` (just the interaction) or `x1 * x2` (interaction plus both main effects) instead of `x1 + x2`

```r
# Include an interaction term between food and alcohol
my_glm <- glm(formula = income ~ food * alcohol,
              data = baselers)
```
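Formulas can also <high>drop terms</high> with `-`. A small sketch, assuming you want every predictor except the `id` column (as with `select(-id)` earlier):

```r
# Use y ~ . - x to include all variables EXCEPT x
my_glm <- glm(formula = income ~ . - id,
              data = baselers)
```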
---

# Exploring statistical objects

.pull-left35[

Explore statistical objects using <high>generic</high> functions such as `print()`, `summary()`, `predict()` and `plot()`.

<high>Generic</high> functions do different things depending on the <high>class label</high> of the object.

```r
# Create statistical object
obj <- STAT_FUN(formula = ...,
                data = ...)

names(obj)         # Elements
print(obj)         # Print
summary(obj)       # Summary
plot(obj)          # Plotting
predict(obj, ...)  # Predict
```

]

.pull-right6[

```r
# Create a glm object
my_glm <- glm(formula = income ~ happiness + age,
              data = baselers)

# Print the my_glm object
print(my_glm)
```

```
## 
## Call:  glm(formula = income ~ happiness + age, data = baselers)
## 
## Coefficients:
## (Intercept)    happiness          age  
##        1575         -100          149  
## 
## Degrees of Freedom: 8509 Total (i.e. Null);  8507 Residual
##   (1490 observations deleted due to missingness)
## Null Deviance:       6.33e+10 
## Residual Deviance: 1.28e+10   AIC: 145000
```

]

---

# Exploring statistical objects

.pull-left35[

Explore statistical objects using <high>generic</high> functions such as `print()`, `summary()`, `predict()` and `plot()`.

<high>Generic</high> functions do different things depending on the <high>class label</high> of the object.

```r
# Create statistical object
obj <- STAT_FUN(formula = ...,
                data = ...)

names(obj)         # Elements
print(obj)         # Print
summary(obj)       # Summary
plot(obj)          # Plotting
predict(obj, ...)  # Predict
```

]

.pull-right6[

```r
# Create a glm object
my_glm <- glm(formula = income ~ happiness + age,
              data = baselers)

# Show summary of the my_glm object
summary(my_glm)
```

```
## 
## Call:
## glm(formula = income ~ happiness + age, data = baselers)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
##  -4045    -835       3     814    4899  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1575.497      94.363   16.70  < 2e-16 ***
## happiness   -100.431      12.520   -8.02  1.2e-15 ***
## age          149.312       0.815  183.31  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1501842)
## 
##     Null deviance: 6.3307e+10  on 8509  degrees of freedom
## Residual deviance: 1.2776e+10  on 8507  degrees of freedom
##   (1490 observations deleted due to missingness)
## AIC: 145186
## 
## Number of Fisher Scoring iterations: 2
```

]

---

# Exploring statistical objects

.pull-left35[

Explore statistical objects using <high>generic</high> functions such as `print()`, `summary()`, `predict()` and `plot()`.

<high>Generic</high> functions do different things depending on the <high>class label</high> of the object.

```r
# Create statistical object
obj <- STAT_FUN(formula = ...,
                data = ...)

names(obj)         # Elements
print(obj)         # Print
summary(obj)       # Summary
plot(obj)          # Plotting
predict(obj, ...)  # Predict
```

]

.pull-right6[

Many <high>statistical objects are lists</high>. Show their elements with `names()`, access them with <high>`$`</high>.

```r
# What are the named elements
names(my_glm)
```

```
##  [1] "coefficients"      "residuals"         "fitted.values"     "effects"           "R"                
##  [6] "rank"              "qr"                "family"            "linear.predictors" "deviance"
```

<p align="left">
<img src="https://raw.githubusercontent.com/therbootcamp/Erfurt_2018June/master/_sessions/_image/list_tab.png" height="180px" vspace="10px">
</p>

]

---

# Exploring statistical objects

.pull-left35[

Explore statistical objects using <high>generic</high> functions such as `print()`, `summary()`, `predict()` and `plot()`.

<high>Generic</high> functions do different things depending on the <high>class label</high> of the object.

```r
# Create statistical object
obj <- STAT_FUN(formula = ...,
                data = ...)

names(obj)         # Elements
print(obj)         # Print
summary(obj)       # Summary
plot(obj)          # Plotting
predict(obj, ...)  # Predict
```

]

.pull-right6[

```r
# Look at coefficients
my_glm$coefficients
```

```
## (Intercept)   happiness         age 
##      1575.5      -100.4       149.3
```

```r
# Fitted values (first few shown)
my_glm$fitted.values
```

```
##     1     2     3     4     5     6     7     8 
##  7643 10578  5501  4904  4657 10279 11373  6994
```

```r
# Residuals (first few shown)
my_glm$residuals
```

```
##       1       2       3       4       5       6 
## -1343.1   322.2  -401.2  -703.9  -656.8  1120.8
```

]

---

# `predict()`

.pull-left4[

`predict(model, newdata)` allows you to use your `model` to <high>predict outcomes</high> for `newdata`.

```r
last_year
```

| id| age| fitness| tattoos| income|
|--:|---:|-------:|-------:|------:|
|  1|  44|       7|       6|   6300|
|  2|  65|       8|       5|  10900|

```r
this_year
```

|  id| age| fitness| tattoos|income |
|---:|---:|-------:|-------:|:------|
| 101|  21|       3|       4|NA     |
| 102|  23|       6|       8|NA     |

]

.pull-right55[

<high>Fit `model`</high> based on `last_year`

```r
# Create regression model predicting income
model <- lm(formula = income ~ age + tattoos,
            data = last_year)

model$coefficients
```

```
## (Intercept)         age     tattoos 
##      1418.3       145.7      -175.5
```

Now use `model` to <high>`predict`</high> values for `this_year`

```r
# Predict the income of people in this_year
predict(object = model,
        newdata = this_year)
```

```
##    1    2 
## 3776 3366
```
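`newdata` can be any data frame containing the model's predictors. A small sketch predicting income for two hypothetical people (the values below are made up for illustration):

```r
# Hypothetical new observations with the predictors used in model
new_people <- data.frame(age = c(30, 50),
                         tattoos = c(0, 2))

predict(object = model, newdata = new_people)
```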
]

---

.pull-left4[

# `tidy()`

The `tidy()` function from the `broom` package <high>converts</high> the most important results of many statistical objects, such as those from `glm()`, into a <high>data frame</high>.

```r
# Install and load broom
install.packages('broom')
library(broom)
```

<p align="center">
<img src="https://raw.githubusercontent.com/therbootcamp/Erfurt_2018June/master/_sessions/_image/broom_hex.png" height="200px" vspace="10">
</p>

]

.pull-right55[

<br><br>

```r
# Original printout
my_glm
```

```
## 
## Call:  glm(formula = income ~ happiness + age, data = baselers)
## 
## Coefficients:
## (Intercept)    happiness          age  
##        1575         -100          149  
## 
## Degrees of Freedom: 8509 Total (i.e. Null);  8507 Residual
##   (1490 observations deleted due to missingness)
## Null Deviance:       6.33e+10 
## Residual Deviance: 1.28e+10   AIC: 145000
```

```r
# Tidy printout
tidy(my_glm)
```

```
##          term estimate std.error statistic   p.value
## 1 (Intercept)   1575.5   94.3629    16.696 1.328e-61
## 2   happiness   -100.4   12.5197    -8.022 1.180e-15
## 3         age    149.3    0.8145   183.315 0.000e+00
```

]

---

# Sampling Functions

.pull-left4[

R gives you a host of functions for sampling data from common <high>statistical distributions</high> (see `?distributions`). Use these to easily <high>simulate data</high>:

.pull-left55[

| Distribution| R Function|
|------:|------------:|
| Normal| `rnorm()`|
| Uniform| `runif()`|
| Beta| `rbeta()`|
| Binomial| `rbinom()`|

]

]

.pull-right5[

<p align="left">
<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/distributions_help.png" height = "450px">
</p>

]

---

.pull-left45[

# `sample()`

Use `sample()` to <high>draw a random sample</high> from a vector.

```r
# Simulate 8 flips of a fair coin
coin_flips <- sample(x = c("H", "T"),
                     size = 8,
                     prob = c(.5, .5),
                     replace = TRUE)

coin_flips
```

```
## [1] "T" "H" "T" "H" "H" "H" "T" "H"
```

```r
# Table of counts
table(coin_flips)
```

```
## coin_flips
## H T 
## 5 3
```

]

.pull-right5[

<br>
<p align="left">
<font size=5><high>The Birthday problem</high></font>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<img src="https://raw.githubusercontent.com/therbootcamp/Erfurt_2018June/master/_sessions/_image/balloons.png" height = "100px">
</p>

```r
# Create an empty room
birthdays <- c()

# While none of the birthdays are the same,
#  keep adding new ones
while(all(!duplicated(birthdays))) {

  # Get new birthday
  new_day <- sample(x = 1:365, size = 1)

  # Add new_day to birthdays
  birthdays <- c(birthdays, new_day)

}

# Done! How many are in the room?
length(birthdays)
```

```
## [1] 51
```
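One run gives one room size. A small sketch of how you could repeat the simulation many times to estimate the typical room size (`simulate_room()` is a helper written here for illustration, not part of the code above):

```r
# Wrap the birthday simulation in a function...
simulate_room <- function() {
  birthdays <- c()
  while (all(!duplicated(birthdays))) {
    birthdays <- c(birthdays, sample(x = 1:365, size = 1))
  }
  length(birthdays)
}

# ...then repeat it and summarise the results
room_sizes <- replicate(n = 1000, expr = simulate_room())
mean(room_sizes)  # Average room size when a birthday first repeats
```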
]

---

# `rnorm()`, `runif()`, ...

.pull-left5[

Use the `rnorm()`, `runif()`, ... functions to draw random <high>samples from probability distributions</high>.

```r
# Random sample from Normal distribution
mysamp <- rnorm(n = 100,    # Number of samples
                mean = 10,  # Mean of pop
                sd = 5)     # SD of pop

mysamp[1:5]   # First 5 values
```

```
## [1]  4.265  6.066 10.210  9.924 18.168
```

```r
mean(mysamp)  # Mean
```

```
## [1] 10.24
```

```r
sd(mysamp)    # Standard deviation
```

```
## [1] 5.065
```

]

.pull-right45[

<img src="Statistics_files/figure-html/unnamed-chunk-47-1.png" width="100%" />

]

---

.pull-left4[

# Other great statistics packages

<br>

|Package|Description|
|:----|:-----|
|`afex`|Factorial experiments|
|`lme4`|Mixed effects models|
|`rstanarm`|Bayesian mixed effects models|
|`BayesFactor`|Bayesian models|
|`forecast`|Time series|
|`lavaan`|Latent variable and structural equation modelling|

]

.pull-right55[

<br><br>

<p align="center">
<img src="https://raw.githubusercontent.com/therbootcamp/Erfurt_2018June/master/_sessions/_image/BayesFactor_ss.png" height= "120px">
</p>

<p align="center">
<img src="https://raw.githubusercontent.com/therbootcamp/Erfurt_2018June/master/_sessions/_image/lavaan_ss.jpg" height= "320px" vspace="35px">
</p>

]

---

# Summary

.pull-left45[

1 - There are <high>packages for every statistical procedure</high> you can imagine in R.

2 - Most have <high>formula</high> and <high>data</high> arguments (among many others).

3 - Use <high>help files</high> to understand the arguments of functions!

4 - Once you've created a statistical object, use <high>generic functions</high> to explore it: `print()`, `names()`, `summary()`, etc.

5 - Use <high>random sampling</high> functions to run simulations.

]

.pull-right5[

```r
?t.test
```

<p align="left">
<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/ttesthelp_ss.png" height= "380px">
</p>

]

---

# Practical

<p>
<font size=6>
<a href="https://therbootcamp.github.io/BaselRBootcamp_2018July/_sessions/Statistics/Statistics_practical.html"><b>Link to practical</b></a>
</font>
</p>