Wrangling

class: center, middle, inverse, title-slide

# Wrangling
### Basel R Bootcamp <a href='https://therbootcamp.github.io'>www.therbootcamp.com</a> <a href='https://twitter.com/therbootcamp'>@therbootcamp</a>
### July 2018

---

layout: true

<div class="my-footer">
<a href="https://therbootcamp.github.io/">Basel, July 2018</a>
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;
<a href="https://therbootcamp.github.io/">www.therbootcamp.com</a>
</div>

---

.pull-left4[

# Where you're at...

1 - Loaded packages (like `tidyverse`) with `library()`

2 - Loaded external files as a new dataframe

3 - Explore dataframes

4 - Calculate descriptive statistics on specific columns

What's next?... <high>Wrangling!</high>

]

.pull-right55[

```r
# Step 0) Load libraries

library(tidyverse)

# Step 1) Read file called baslers.txt 
#  in a data folder with read_csv()
#  and save as new object baslers

baslers <- read_csv(file = "data/baslers.txt")

# Step 2) Explore data

View(baslers)   # Open in new window
dim(baslers)    # Show number of rows and columns
names(baslers)  # Show names

# Step 3) Calculate descriptives on named colums

mean(baslers$age)   # What is the mean age?
table(baslers$sex)  # How many of each sex?

# Step 4) ...
```

]

---
.pull-left45[

# What is wrangling?

<high>Transform</high>

Change column names

Add new columns

<high>Organise</high>

Sort data by columns

Merging data from two separate dataframes

Move data between columns and rows

<high>Aggregate</high>

Group data and summarise
]

.pull-right5[
 

 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/organise_transform_aggregate.png" height = "550px">

]

---

.pull-left45[

# `dplyr` & `tidyr`

To wrangle data in R, we will use the `dplyr` and `tidyr` packages.

```r
# Load packages individually

# install.packages('dplyr')
# install.packages('tidyr')

library(dplyr)
library(tidyr)

# Or just use the tidyverse!

# install.packages('tidyverse')

library(tidyverse)
```

]
 
.pull-right5[

]

---

# The Pipe! <high>`%>%`</high>

.pull-left4[

`dplyr` makes extensive use of a new operator called the "Pipe" <high>`%>%`</high>

Read the "Pipe" <high>`%>%`</high> as "And Then..."

```r
# Start with data
data %>% # AND THEN...
  
DO_SOMETHING %>% # AND THEN...
  
DO_SOMETHING %>% # AND THEN...
  
DO_SOMETHING %>% # AND THEN...
```

]

.pull-right55[

<img src="https://upload.wikimedia.org/wikipedia/en/thumb/b/b9/MagrittePipe.jpg/300px-MagrittePipe.jpg" width = "450px"> 
 This is not a pipe (but %>% is!)

]

---

# The Pipe! <high>`%>%`</high>

.pull-left4[

Task: Calculate the mean of a vector of `scores`

```r
# Create a vector score
score <- c(8, 4, 6, 3, 7, 3)
```

Base-R method

```r
mean(x = score)
```

```
## [1] 5.167
```

Pipe <high>%>%</high> method

```r
score %>%  # AND THEN 
  mean()  
```

```
## [1] 5.167
```

]

.pull-right55[

]

---

# The Pipe! <high>`%>%`</high>

.pull-left4[

Task: Calculate the mean of a vector of `scores` and round to 1 digit.

```r
# Create a vector score
score <- c(8, 4, 6, 3, 7, 3)
```

Base-R method

```r
round(x = mean(score), digits = 1)
```

```
## [1] 5.2
```

Pipe <high>%>%</high> method

```r
score %>%        # AND THEN
  mean() %>%       # AND THEN
  round(digits = 1)
```

```
## [1] 5.2
```

]

.pull-right55[

]

---

.pull-left35[

# `dplyr` Functions

There are <high>dozens of wrangling functions</high> in `dplyr`.

For an overview, check out <a href='http://dplyr.tidyverse.org/'>dplyr.tidyverse.org</a>

<a href = 'https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf'>
 <img src="https://image.slidesharecdn.com/data-wrangling-cheatsheet-160705210122/95/data-wrangling-with-dplyr-and-tidyr-cheat-sheet-2-638.jpg?cb=1467752577g" height="270px"></a>
 <a href= 'https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf'> Wrangling Cheat Sheet</a>

]

.pull-right65[
 

 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/dplyr_functions.png" height="420px">

]

---

# Transformation Functions

.pull-left4[

| Function| Description|
|:-------------|:----|
| `rename()` | Change column names |
| `mutate()`|   Create a new column from existing columns|
| `case_when()`|  Recode values from a vector to another|
| `left_join()` | Combine multiple dataframes|

]

.pull-right55[

```r
patients_df   # Demographic data
```

```
## # A tibble: 5 x 3
## id b c
## <dbl> <dbl> <dbl>
## 1 1 37 1
## 2 2 65 2
## 3 3 57 2
## 4 4 34 1
## 5 5 45 2
```

]

---

# `rename`

.pull-left45[

Change <high>column names</high> with `rename()`.

```r
df %>%
  rename(NEW = OLD,
         NEW = OLD)
```

```r
patients_df  # Original
```

```
## # A tibble: 5 x 3
## id b c
## <dbl> <dbl> <dbl>
## 1 1 37 1
## 2 2 65 2
## 3 3 57 2
## 4 4 34 1
## 5 5 45 2
```
]

.pull-right45[

Change the old name "b" to "age", and "c" to "arm"

```r
# 0) Start with patients_df data
patients_df %>%

# 1) Change column names with rename()
  rename(age = b,  # New = Old
         arm = c)  # New = Old
```

```
## # A tibble: 5 x 3
## id age arm
## <dbl> <dbl> <dbl>
## 1 1 37 1
## 2 2 65 2
## 3 3 57 2
## 4 4 34 1
## 5 5 45 2
```

]

---

# `mutate`

.pull-left35[

Calculate <high>new columns</high>, or change existing ones, with `mutate()`.

```r
df %>%
  mutate(
   NEW1 = DEFINITION1,
   NEW2 = DEFINITION2,
   NEW3 = DEFINITION3,
   ...
  )
```
]

.pull-right6[

Calculate two new columns `age_months` and `age_years`

```r
patients_df %>%

rename(age = b, 
         arm = c) %>%   # AND THEN...
  
# Create new columns with mutate()
  mutate(age_months = age * 12,
         age_decades = age / 10)
```

```
## # A tibble: 5 x 5
## id age arm age_months age_decades
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 37 1 444 3.7
## 2 2 65 2 780 6.5
## 3 3 57 2 684 5.7
## 4 4 34 1 408 3.4
## 5 5 45 2 540 4.5
```

]

---

# `case_when`

.pull-left4[

Use `case_when()` with `mutate()` to define <high>new columns based on logical conditions</high>.

```r
# Using mutate(case_when())
df %>%
  mutate(
    NEW = case_when(
      COND1 ~ VAL1,
      COND2 ~ VAL2
    ))
```

]

.pull-right55[

```r
patients_df %>%

rename(age = b, 
         arm = c) %>%
  
  # Create column arm_char based on values of arm
  
  mutate(arm_char = case_when(arm == 1 ~ "placebo",
                              arm == 2 ~ "drug"))
```

```
## # A tibble: 5 x 4
## id age arm arm_char
## <dbl> <dbl> <dbl> <chr> 
## 1 1 37 1 placebo 
## 2 2 65 2 drug 
## 3 3 57 2 drug 
## 4 4 34 1 placebo 
## 5 5 45 2 drug
```

]

---

# Joining data
 

 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/joining_data.png" height="450px">

---

# `left_join`

.pull-left35[

Use `left_join()` to <high>combine two data frames</high> based on one or more key columns

```r
# Join df2 to df1
#   using KEY as the key column
df1 %>%
  left_join(df2, 
      by = c("KEY"))
```

]

.pull-right6[

```r
# Join patients_df with results_df to create combined_df
combined_df <- patients_df %>% 
 rename(age = b, arm = c) %>%
 mutate(arm_char = case_when(arm == 1 ~ "placebo",
 arm == 2 ~ "drug")) %>%
 
# Join with results_df with left_join()
 left_join(results_df, by = "id")

# Show a few columns
combined_df %>%
  select(id, arm, age, t1, t2)
```

```
## # A tibble: 5 x 5
## id arm age t1 t2
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 37 123 135
## 2 2 2 65 143 140
## 3 3 2 57 NA NA
## 4 4 1 34 100 105
## 5 5 2 45 NA NA
```

]

---

# Keep in mind

.pull-left35[

1 - Don't forget to start by assigning to a new (or existing) object with <-

2 - Keep adding new functions connected by the pipe <high>%>%</high>

3 - Order matters! You can refer to new columns in later code

]

.pull-right6[

```r
# 0) Start with patients_df data
combined_df <- patients_df %>%
 
# 1) Change column names with rename()
 rename(age = b, 
 arm = c) %>% # AND THEN...
 
# 2) Create new columns with mutate()
 mutate(age_months = age * 12,
 age_decades = age / 10,
 arm_char = case_when(arm == 0 ~ "placebo",
 arm == 1 ~ "drug")
 ) %>% # AND THEN..
 
# 3) Add data from results_df with left_join()
 left_join(results_df, by = "id")
```

]

---

# Organisation Functions

.pull-left35[

Organisation functions help you shuffle your data by <high>sorting rows</high> by columns, <high>filter rows</high> based on criteria, <high>select columns</high> (etc).

]

.pull-right6[

| Function| Purpose|Example|
|:--------|:----|:-------------|
| `arrange()` |Sort rows by columns|`df %>%` `arrange(arm, age)`|
| `slice()`| Select rows by location|`df %>%` `slice(1:10)`|
| `filter()` | Select specific rows by criteria|`df %>%` `filter(age > 50)`|
| `select()`| Select specific columns|`df %>%` `select(arm, t1)`|

]

---

# `arrange()`

.pull-left3[

Use `arrange()` to <high>arrange (aka, sort) rows</high> in increasing or decreasing order of one (or more) columns.

To sort in descending order, use `desc()`

]

.pull-right65[

```r
# Sort combined_df by id 
# then in descending order of age
combined_df %>%
  arrange(id, desc(age))
```

```
## # A tibble: 5 x 6
## id age arm arm_char t1 t2
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 1 37 1 placebo 123 135
## 2 2 65 2 drug 143 140
## 3 3 57 2 drug NA NA
## 4 4 34 1 placebo 100 105
## 5 5 45 2 drug NA NA
```

]

---

# `filter()`

.pull-left4[

Use `filter()` to <high>select rows</high> (and remove others) based on criteria

For complex conditions, chain multiple logical comparison operators with `&` (AND) and `|` (OR)

<high>`==`</high> - is equal to 
<high>`<`</high>, <high>`>`</high> - smaller/greater than 
<high>`≤`</high>, <high>`≥`</high> - smaller/greater than or equal 
<high>`&`</high>, <high>`&&`</high> - logical AND 
<high>`|`</high>, <high>`||`</high> - logical OR

]

.pull-right55[

Select only patients over 30 given drug.

```r
# Filter patients older than 30 given drug
combined_df %>%
  arrange(id, desc(age)) %>%
  filter(age > 30 & arm_char == "drug")
```

```
## # A tibble: 3 x 6
## id age arm arm_char t1 t2
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 2 65 2 drug 143 140
## 2 3 57 2 drug NA NA
## 3 5 45 2 drug NA NA
```

]

---

# `select()`

.pull-left4[

Use `select()` to <high>select columns</high> (and remove all others)

```r
# Select columns id, age
df %>% 
  select(id, age)
```

<high>Remove columns</high> with `-`.

```r
# Select everything BUT sex and id
df %>% 
  select(-sex, -id)
```

]

.pull-right55[

Remove `age_months` and `age_decades` colummns.

```r
combined_df %>%
  
  arrange(id, desc(age)) %>%
  
  filter(arm_char == "drug" & age > 30) %>%
  
# Drop age and arm columns
  select(-age, -arm)
```

```
## # A tibble: 3 x 4
## id arm_char t1 t2
## <dbl> <chr> <dbl> <dbl>
## 1 2 drug 143 140
## 2 3 drug NA NA
## 3 5 drug NA NA
```

]

---

# Grouped Aggregation

---

# `group_by()`, `summarise()`

.pull-left45[

Use `group_by()` to <high>group data</high> according to one or more columns

After grouping data, use `summarise()` to <high>calculate summary statistics</high> across groups of data

Statistical functions

| Function| asdf |
|:-------------|:---|
| `n()`| Number of cases in each group|
| `mean()`, `median()`, `max()`, `min()` `sum()` | Summary stats|
]

.pull-right45[

```r
# Group data by arm, and calculate many
#  summary statistics
combined_df %>%
  group_by(arm) %>%
  summarise(
    N = n(),
    age_mean = mean(age),
    t1_mean = mean(t1, na.rm = TRUE),
    t2_mean = mean(t2, na.rm = TRUE)
  )
```

```
## # A tibble: 2 x 5
## arm N age_mean t1_mean t2_mean
## <chr> <int> <dbl> <dbl> <dbl>
## 1 drug 3 55.7 143 140
## 2 placebo 2 35.5 112. 120
```

]

---

# Reshaping data

.pull-left45[

Two key functions that allow you to <high>`reshape`</high> a dataframe between 'wide'and 'long' formats.

Some functions require data to be in a certain shape.

#### Two key tidyr functions

| Function | Result |
|:----------|:-------|
|`gather()`|Move data from 'wide' to 'long' format|
|`spread()`|Move data from 'long' to 'wide' format|

]

.pull-right5[

#### 'Wide' vs. 'Long' data

```r
# Wide format
stock_w
```

```
##   id t1 t2
## 1  a 10 20
## 2  b 20 26
## 3  c 15 30
```

```r
# Long format
stock_l
```

```
##   id time measure
## 1  a   t1      10
## 2  b   t1      20
## 3  c   t1      15
## 4  a   t2      20
## 5  b   t2      26
## 6  c   t2      30
```

]

---

.pull-left45[

# `gather()`

```r
# Show wide data
stock_w
```

```
##   id t1 t2
## 1  a 10 20
## 2  b 20 26
## 3  c 15 30
```

```r
# "Gather" wide data to long
stock_w %>% 
  gather(time,    # New group column
         measure, # New target column
         -id)     # ID column
```

```
##   id time measure
## 1  a   t1      10
## 2  b   t1      20
## 3  c   t1      15
## 4  a   t2      20
## 5  b   t2      26
## 6  c   t2      30
```

]

.pull-right45[

# `spread()`

```r
# Show long data
stock_l
```

```
##   id time measure
## 1  a   t1      10
## 2  b   t1      20
## 3  c   t1      15
## 4  a   t2      20
## 5  b   t2      26
## 6  c   t2      30
```

```r
# "Spread" long data to wide
stock_l %>% 
  spread(time,    # Old group column
         measure) # Old target column
```

```
##   id t1 t2
## 1  a 10 20
## 2  b 20 26
## 3  c 15 30
```

]

---

# Summary

.pull-left4[

1 - Start by assigning your result to a new object to save it!

2 - "Keep the pipe <high>%>%</high> going" to continue working with your data frame.

3 - The output of dplyr functions will (almost) always be a <high>tibble</high>.

4 - You can almost always include <high>multiple operations</high> within each function.

]

.pull-right55[

```r
# Assign result to baslers_agg

baslers_agg <- baselers %>% 
 
 # Change column names with rename()
 rename(age_years = age,
 weight_kg = weight) %>% # PIPE!

# Select specific rows with filter()
 filter(age_years < 40) %>% # PIPE!
 
 # Create new columns witb mutate()
 mutate(debt_ratio = debt / income) %>% # PIPE!
 
 # Group data with group_by()
 group_by(sex) %>% # PIPE!
 
 # Calculate summary statistics with summarise()
 summarise(income_mean = mean(income),
 debt_mean = mean(debt),
 dr_mean = mean(dr))
```

]

---

# Practical

<a href="https://therbootcamp.github.io/BaselRBootcamp_2018July/_sessions/Wrangling/Wrangling_practical.html">Link to practical</a>