Recap

class: center, middle, inverse, title-slide

# Recap
### The R Bootcamp @ Erfurt <a href='https://therbootcamp.github.io'>www.therbootcamp.com</a> <a href='https://twitter.com/therbootcamp'>@therbootcamp</a>
### June 2018

---

layout: true

<div class="my-footer">
<a href="https://therbootcamp.github.io/">Erfurt, June 2018</a>
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;
<a href="https://therbootcamp.github.io/">www.therbootcamp.com</a>
</div>

---

# Essentials: The 16 Rules of the R Bootcamp

.pull-left4[
1. Everything is an object
2. Use `<-` to create/change objects
3. Name objects using `_`
4. Objects have classes
5. Everything happens through functions
6. Functions have arguments
7. Arguments can have defaults
8. Functions expect certain object classes
9. View help files using `?`
10. Data is stored in data frames
11. Select variables (vectors) using `$`
12. Use RStudio and projects
13. Code in the editor
14. First load packages and data
15. Comment and format for readability
16. Struggle, ask for help, struggle...
]
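
A few of these rules in action, as a minimal sketch in base R. The data frame `study_df` and its values are made up for illustration and are not part of the bootcamp data.

```r
# Rules 2 and 3: create an object with <- and name it using _
study_df <- data.frame(id = 1:3, age = c(37, 65, 57))

# Rule 4: objects have classes
class(study_df)

# Rule 11: select a variable (vector) using $
age_vec <- study_df$age

# Rules 5 to 7: functions have arguments, and some have defaults
age_mean <- mean(age_vec)                             # na.rm defaults to FALSE
age_mean_narm <- mean(c(age_vec, NA), na.rm = TRUE)   # override the default

# Rule 9: view the help file
# ?mean
```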

---

# The almighty **tidyverse**

Among the many packages available for R is the [tidyverse](https://www.tidyverse.org/), a collection of high-performance, easy-to-use packages (libraries) designed specifically for handling data. The tidyverse includes:
1. `ggplot2` -- creating graphics.
2. `dplyr` -- data manipulation.
3. `tidyr` -- tidying data.
4. `readr` -- reading in data from files.
5. `purrr` -- functional programming.
6. `tibble` -- modern data frame.

---

.pull-left5[

# What is Wrangling?

Transform
- Adding new columns
- Combining columns
- Splitting columns

Organise
- Moving data between columns and rows
- Merging several dataframes
- Sorting data by columns

Aggregate
- Grouping data by variables
- Summarising data across groups

]

.pull-right5[

]

---

.pull-left35[
# dplyr

`dplyr` is a combination of 3 things:

1. **`objects`** like dataframes
2. **`functions`** (verbs) that **do** things to objects.
3. **`pipes`** `%>%` that string together objects and functions.

]

.pull-right6[
 
## The pipe %>%
<br2>

`dplyr` makes extensive use of the 'pipe' `%>%`, which passes the object on its left as the first argument to the function on its right.

```r
# FUN1 to FUN6, x, y and X, Y, Z are placeholders
data %>%      # Start with data, AND THEN...
  FUN1 %>%    # Do FUN1, AND THEN...
  FUN2 %>%    # Do FUN2, AND THEN...
  FUN3 %>%    # Do FUN3, AND THEN...
  group_by(x, y) %>%  # Group by variables x, y, AND THEN...
  summarise(          # Calculate summaries per group
    VAR_A_New = FUN4(X),
    VAR_B_New = FUN5(Y),
    VAR_C_New = FUN6(Z)
  )
```

]
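
The template above can be made concrete as follows. This is only a sketch: `trial_df` and its columns are hypothetical data invented for the example.

```r
library(dplyr)

# Hypothetical data: two study arms with a test score
trial_df <- tibble(
  arm   = c("drug", "drug", "placebo", "placebo"),
  age   = c(37, 65, 57, 34),
  score = c(10, 14, 8, 12)
)

arm_summary <- trial_df %>%   # Start with trial_df, AND THEN...
  filter(age > 35) %>%        # Keep participants older than 35, AND THEN...
  group_by(arm) %>%           # Group by study arm, AND THEN...
  summarise(score_mean = mean(score))  # Calculate the mean score per arm

arm_summary
```

Each step receives the result of the previous step as its first argument, so no intermediate objects are needed.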

---

.pull-left4[
# Transformation Functions

| Function| Description|
|:-------------|:----|
| `rename()` | Change column names |
| `mutate()`|   Create a new column from existing columns|
| `case_when()`|  Recode values of a vector into new values|
| `left_join()` | Combine two dataframes by matching key columns|

]

.pull-right55[
 
### patients_df

```r
patients_df   # Demographic data
```

```
## # A tibble: 5 x 3
##      id     b     c
##   <dbl> <dbl> <dbl>
## 1    1.   37.    1.
## 2    2.   65.    2.
## 3    3.   57.    2.
## 4    4.   34.    1.
## 5    5.   45.    2.
```

]
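
The four transformation functions might be chained like this. It is a sketch under assumptions: that `b` holds age and `c` holds the study arm (coded 1 = placebo, 2 = drug) is a guess made for illustration, and `results_df` is a hypothetical second table.

```r
library(dplyr)

# patients_df with the values printed above
patients_df <- tibble(
  id = c(1, 2, 3, 4, 5),
  b  = c(37, 65, 57, 34, 45),
  c  = c(1, 2, 2, 1, 2)
)

# Hypothetical second table sharing the id column (no row for id 4)
results_df <- tibble(
  id = c(1, 2, 3, 5),
  t1 = c(123, 143, 135, 118)
)

combined_df <- patients_df %>%
  rename(age = b, arm = c) %>%          # new_name = old_name (assumed meanings)
  mutate(
    age_decades = age / 10,             # new column from an existing column
    arm = case_when(arm == 1 ~ "placebo",   # assumed coding
                    arm == 2 ~ "drug")
  ) %>%
  left_join(results_df, by = "id")      # keeps every row of patients_df

combined_df
```

Because `left_join()` keeps all rows of the left table, the patient with no match in `results_df` gets `NA` for `t1`.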

---

.pull-left55[

# Organisation Functions

Organisation functions help you rearrange your data: sort rows by columns, filter rows based on criteria, select columns, and so on.

| Function| Purpose|Example|
|:--------|:----|:------------|
| `arrange()` |Sort rows by columns|`df %>%` `arrange(arm, age)`|
| `slice()`| Select rows by location|`df %>%` `slice(1:10)`|
| `filter()` | Select specific rows by criteria|`df %>%` `filter(age > 50)`|
| `select()`| Select specific columns|`df %>%` `select(arm, t1)`|

]

.pull-right4[

Organise the `combined_df` dataframe

```r
combined_df %>%
  
  # Sort rows by arm then in 
  #  descending order of age
  arrange(arm, desc(age)) %>%
  
  # Only include age > 50
  filter(age > 50) %>%
  
  # Select these columns
  select(arm, age, t1, t2)
```

]

---

.pull-left4[

# Aggregation functions

| Function| Action|
|:-------------|:----|
| `group_by()` |Group data by levels of one or more variables |
| `summarise()` | Calculate summary statistics |

### Statistical functions

| Function| Action|
|:-------------|:----|
| `min(), max()`| Minimum, maximum |
| `mean(), median()` |Mean, median |
| `sd()` |Standard deviation|
| `sum()` | Sum|
| `n()`| Number of cases|
]

.pull-right55[

```r
# First 2 rows of raw data
combined_df %>% slice(1:2)

# Group and summarise!
combined_df %>%
  group_by(arm) %>%
  summarise(
    N = n(),
    age_mean = mean(age),
    t1_mean = mean(t1, na.rm = TRUE),
    t2_mean = mean(t2, na.rm = TRUE)
  )
```

]
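
To try the grouping step without the course data, here is a minimal sketch with a made-up stand-in for `combined_df` (all values are invented). It also shows why `na.rm = TRUE` matters.

```r
library(dplyr)

# Made-up stand-in for combined_df, with one missing t1 value
combined_df <- tibble(
  arm = c("placebo", "drug", "drug", "placebo", "drug"),
  age = c(37, 65, 57, 34, 45),
  t1  = c(123, 143, 135, NA, 118)
)

arm_stats <- combined_df %>%
  group_by(arm) %>%
  summarise(
    N = n(),                            # number of cases per arm
    age_mean = mean(age),
    t1_mean = mean(t1, na.rm = TRUE),   # ignores the missing value
    t1_mean_raw = mean(t1)              # NA whenever a group contains an NA
  )

arm_stats
```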

---

## If you want to do statistics in R, there is a package for that!

.pull-left5[

| Package| Models|
|------:|:----|
| `stats`|Generalized linear model|
|     `afex`|   ANOVAs|
|     `lme4`|   Mixed effects regression|
|     `rpart`|    Decision Trees|
|     `BayesFactor`| Bayesian statistics|
|     `igraph`| Network analysis|
|     `neuralnet`| Neural networks|
|     `MatchIt`| Matching and causal inference|
|     `survival`| Longitudinal survival analysis|
|     ...| Anything you can ever want!|

]

.pull-right45[

]

---

## Customising formulas

You can keep adding terms to formulas by "adding" them with `+`.

```r
# Include multiple terms with +
my_glm <- glm(formula = income ~ food + alcohol + happiness + hiking,
 data = baselers)
```

To include *all* other variables in a dataframe as predictors, use the catch-all notation `formula = y ~ .`

```r
# Use y ~ . to include ALL variables
my_glm <- glm(formula = income ~ .,
 data = baselers)
```

To specify *interaction terms*, use `x1:x2` (the interaction only) or `x1 * x2` (both main effects plus their interaction) instead of `x1 + x2`.

```r
# Include main effects of food and alcohol plus their interaction
my_glm <- glm(formula = income ~ food * alcohol,
 data = baselers)
```

---

# Today

<a href="https://therbootcamp.github.io/Erfurt_2018June">Schedule</a>