Wrangling

class: center, middle, inverse, title-slide

# Wrangling
### Introduction to Data Science with R <a href='https://therbootcamp.github.io'>www.therbootcamp.com</a> <a href='https://twitter.com/therbootcamp'>@therbootcamp</a>
### October 2018

---

layout: true

<div class="my-footer">
<a href="https://therbootcamp.github.io/">Introduction to Data Science with R, October 2018</a>
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;
<a href="https://therbootcamp.github.io/">www.therbootcamp.com</a>
</div>

---

.pull-left4[

# Where you're at...

1 - Loaded packages (like `tidyverse`) with `library()`

2 - Loaded external files as a new dataframe

3 - Explore dataframes

4 - Calculate descriptive statistics on specific columns

What's next?... <high>Wrangling!</high>

]

.pull-right55[

```r
# Step 0) Load libraries

library(tidyverse)

# Step 1) Read file called baslers.txt 
#  in a data folder with read_csv()
#  and save as new object baslers

baslers <- read_csv(file = "data/baslers.txt")

# Step 2) Explore data

View(baslers)   # Open in new window
dim(baslers)    # Show number of rows and columns
names(baslers)  # Show names

# Step 3) Calculate descriptives on named colums

mean(baslers$age)   # What is the mean age?
table(baslers$sex)  # How many of each sex?

# Step 4) ...
```

]

---
.pull-left45[

# What is wrangling?

<high>Transform</high>

Change column names

Add new columns

<high>Organise</high>

Sort data by columns

Merging data from two separate dataframes

Move data between columns and rows

<high>Aggregate and summarise</high>

Group data and calculate and summary stats

*We'll do this tomorrow!*

]

.pull-right5[
 

 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/organise_transform_aggregate.png" height = "550px">

]

---

.pull-left45[

# `dplyr` & `tidyr`

To wrangle data in R, we will use the `dplyr` and `tidyr` packages.

```r
# Load packages individually

# install.packages('dplyr')
# install.packages('tidyr')

library(dplyr)
library(tidyr)

# Or just use the tidyverse!

# install.packages('tidyverse')

library(tidyverse)
```

]
 
.pull-right5[

]

---

# The Pipe! <high>`%>%`</high>

.pull-left4[

`dplyr` makes extensive use of a new operator called the "Pipe" <high>`%>%`</high>

Read the "Pipe" <high>`%>%`</high> as "And Then..."

```r
# Start with data
data %>% # AND THEN...
  
DO_SOMETHING %>% # AND THEN...
  
DO_SOMETHING %>% # AND THEN...
  
DO_SOMETHING %>% # AND THEN...
```

]

.pull-right55[

<img src="https://upload.wikimedia.org/wikipedia/en/thumb/b/b9/MagrittePipe.jpg/300px-MagrittePipe.jpg" width = "450px"> 
 This is not a pipe (but %>% is!)

]

---

# The Pipe! <high>`%>%`</high>

.pull-left4[

Task: Calculate the mean of a vector of `scores`

```r
# Create a vector score
score <- c(8, 4, 6, 3, 7, 3)
```

Base-R method

```r
mean(x = score)
```

```
## [1] 5.167
```

Pipe <high>%>%</high> method

```r
score %>%  # AND THEN 
  mean()  
```

```
## [1] 5.167
```

]

.pull-right55[

]

---

# The Pipe! <high>`%>%`</high>

.pull-left4[

Task: Calculate the mean of a vector of `scores` and round to 1 digit.

```r
# Create a vector score
score <- c(8, 4, 6, 3, 7, 3)
```

Base-R method

```r
round(x = mean(score), digits = 1)
```

```
## [1] 5.2
```

Pipe <high>%>%</high> method

```r
score %>%        # AND THEN
  mean() %>%       # AND THEN
  round(digits = 1)
```

```
## [1] 5.2
```

]

.pull-right55[

]

---

.pull-left35[

# `dplyr` Functions

There are <high>dozens of wrangling functions</high> in `dplyr`.

For an overview, check out <a href='http://dplyr.tidyverse.org/'>dplyr.tidyverse.org</a>

<a href = 'https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf'>
 <img src="https://image.slidesharecdn.com/data-wrangling-cheatsheet-160705210122/95/data-wrangling-with-dplyr-and-tidyr-cheat-sheet-2-638.jpg?cb=1467752577g" height="270px"></a>
 <a href= 'https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf'> Wrangling Cheat Sheet</a>

]

.pull-right65[
 

 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/dplyr_functions.png" height="420px">

]

---

# Transformation Functions

.pull-left4[

| Function| Description|
|:-------------|:----|
| `rename()` | Change column names |
| `mutate()`|   Create a new column from existing columns|
| `case_when()`|  Recode values from a vector to another|
| `left_join()` | Combine multiple dataframes|

]

.pull-right55[

```r
patients_df   # Demographic data
```

```
## # A tibble: 5 x 3
## id b c
## <dbl> <dbl> <dbl>
## 1 1 37 1
## 2 2 65 2
## 3 3 57 2
## 4 4 34 1
## 5 5 45 2
```

]

---

# `rename()`

.pull-left45[

Change <high>column names</high> with `rename()`.

```r
df %>%
  rename(NEW = OLD,
         NEW = OLD)
```

```r
patients_df  # Original
```

```
## # A tibble: 5 x 3
## id b c
## <dbl> <dbl> <dbl>
## 1 1 37 1
## 2 2 65 2
## 3 3 57 2
## 4 4 34 1
## 5 5 45 2
```
]

.pull-right45[

Change the old name "b" to "age", and "c" to "arm"

```r
# 0) Start with patients_df data
patients_df %>%

# 1) Change column names with rename()
  rename(age = b,  # New = Old
         arm = c)  # New = Old
```

```
## # A tibble: 5 x 3
## id age arm
## <dbl> <dbl> <dbl>
## 1 1 37 1
## 2 2 65 2
## 3 3 57 2
## 4 4 34 1
## 5 5 45 2
```

]

---

# `mutate()`

.pull-left35[

Calculate <high>new columns</high>, or change existing ones, with `mutate()`.

```r
df %>%
  mutate(
   NEW1 = DEFINITION1,
   NEW2 = DEFINITION2,
   NEW3 = DEFINITION3,
   ...
  )
```
]

.pull-right6[

Calculate two new columns `age_months` and `age_decades`

```r
patients_df %>%

rename(age = b, 
         arm = c) %>%   # AND THEN...
  
# Create new columns with mutate()
  mutate(age_months = age * 12,
         age_decades = age / 10)
```

```
## # A tibble: 5 x 5
## id age arm age_months age_decades
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 37 1 444 3.7
## 2 2 65 2 780 6.5
## 3 3 57 2 684 5.7
## 4 4 34 1 408 3.4
## 5 5 45 2 540 4.5
```

]

---

# `case_when()`

.pull-left4[

Use `case_when()` with `mutate()` to define <high>new columns based on logical conditions</high>.

```r
# Using mutate(case_when())
df %>%
  mutate(
    NEW = case_when(
      COND1 ~ VAL1,
      COND2 ~ VAL2
    ))
```

]

.pull-right55[

Create `arm_char`, which shows arm as a meaningful character rather than an integer.

```r
patients_df %>%

rename(age = b, 
         arm = c) %>%
  
  # Create column arm_char based on values of arm
  
  mutate(arm_char = case_when(arm == 1 ~ "placebo",
                              arm == 2 ~ "drug"))
```

```
## # A tibble: 5 x 4
## id age arm arm_char
## <dbl> <dbl> <dbl> <chr> 
## 1 1 37 1 placebo 
## 2 2 65 2 drug 
## 3 3 57 2 drug 
## 4 4 34 1 placebo 
## 5 5 45 2 drug
```

]

---

# Joining data
 

 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/joining_data.png" height="450px">

---

# `left_join()`

.pull-left35[

Use `left_join()` to <high>combine two data frames</high> based on one or more key columns

```r
# Join df2 to df1
#   using KEY as the key column
df1 %>%
  left_join(df2, 
      by = c("KEY"))
```

]

.pull-right6[

```r
# Join patients_df with results_df to create combined_df
combined_df <- patients_df %>% 
 rename(age = b, arm = c) %>%
 mutate(arm_char = case_when(arm == 1 ~ "placebo",
 arm == 2 ~ "drug")) %>%
 
# Join with results_df with left_join()
 left_join(results_df, by = "id")

# Show a few columns
combined_df %>%
  select(id, arm, age, t1, t2)
```

```
## # A tibble: 5 x 5
## id arm age t1 t2
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 37 123 135
## 2 2 2 65 143 140
## 3 3 2 57 NA NA
## 4 4 1 34 100 105
## 5 5 2 45 NA NA
```

]

---

.pull-left35[

# Keep in mind

1 - Don't forget to start by assigning to a new (or existing) object with <-

2 - Keep adding new functions connected by the pipe <high>%>%</high>

3 - Order matters! You can refer to new columns in later code

]

.pull-right6[

```r
# 0) Start with patients_df data
combined_df <- patients_df %>%
 
# 1) Change column names with rename()
 rename(age = b, 
 arm = c) %>% # AND THEN...
 
# 2) Create new columns with mutate()
 mutate(age_months = age * 12,
 age_decades = age / 10,
 arm_char = case_when(arm == 0 ~ "placebo",
 arm == 1 ~ "drug")
 ) %>% # AND THEN..
 
# 3) Add data from results_df with left_join()
 left_join(results_df, by = "id")
```

]

---

.pull-left5[
# Quiz 1

Do these two chunks do the same thing?

### Chunk A

```r
baselers <- baselers %>%
 rename(salary = income)

baselers <- baselers %>%
 rename(weight = weight_kg)
```

]

.pull-right45[
 
### Chunk B

```r
baselers <- baselers %>%
 rename(salary = income,
 weight = weight_kg)
```

]

---

.pull-left5[
# Quiz 1

Do these two chunks do the same thing?

### Chunk A

```r
baselers <- baselers %>%
 rename(salary = income)

baselers <- baselers %>%
 rename(weight = weight_kg)
```

*This is a long way to do it*

]

.pull-right45[
 <high>Answer: Yes!</high>
### Chunk B

```r
baselers <- baselers %>%
 rename(salary = income,
 weight = weight_kg)
```

*This is simpler, you can do multiple rename() operations at the same time*

]

---

.pull-left5[
# Quiz 2

Do these two chunks do the same thing?

### Chunk A

```r
baselers <- baselers %>%
 rename(salary = income)

baselers <- baselers %>%
 mutate(age_months = age * 12)
```

]

.pull-right45[
 
### Chunk B

```r
baselers <- baselers %>%
 rename(salary = income) %>%
 mutate(age_months = age * 12)
```

]

---

.pull-left5[
# Quiz 2

Do these two chunks do the same thing?

### Chunk A

```r
baselers <- baselers %>%
 rename(salary = income)

baselers <- baselers %>%
 mutate(age_months = age * 12)
```

*This is the long way to do multiple operations.*

]

.pull-right45[
 <high>Answer: Yes!</high>
### Chunk B

```r
baselers <- baselers %>%
 rename(salary = income) %>%
 mutate(age_months = age * 12)
```
*This is the short way that uses the pipe to continue working on the same dataframe*

]

---

.pull-left5[
# Quiz 3

Do these two chunks do the same thing?

### Chunk A

```r
baselers %>%
      rename(salary = income) %>%
      mutate(age_months = age * 12)
```

]

.pull-right45[
 
### Chunk B

```r
baselers <- baselers %>%
 rename(salary = income) %>%
 mutate(age_months = age * 12)
```

]

---

.pull-left5[
# Quiz 3

Do these two chunks do the same thing?

### Chunk A

```r
baselers %>%
      rename(salary = income) %>%
      mutate(age_months = age * 12)
```

*This just prints a result without changing anything*

]

.pull-right45[
 <high>Answer: No!</high>
### Chunk B

```r
baselers <- baselers %>%
 rename(salary = income) %>%
 mutate(age_months = age * 12)
```
*This actually changes the baselers dataframe because we used assignment <- !*

]

---

# Organisation Functions

.pull-left35[

Organisation functions help you shuffle your data by <high>sorting rows</high> by columns, <high>filter rows</high> based on criteria, <high>select columns</high> (etc).

]

.pull-right6[

| Function| Purpose|Example|
|:--------|:----|:-------------|
| `arrange()` |Sort rows by columns|`df %>%` `arrange(arm, age)`|
| `slice()`| Select rows by location|`df %>%` `slice(1:10)`|
| `filter()` | Select specific rows by criteria|`df %>%` `filter(age > 50)`|
| `select()`| Select specific columns|`df %>%` `select(arm, t1)`|

]

---

# `arrange()`

.pull-left4[

Use `arrange()` to <high>arrange (aka, sort) rows</high> in increasing or decreasing order of one (or more) columns.

```r
df %>%
  arrange(A, B)
```

To sort in descending order, use `desc()`

```r
df %>%
  arrange(desc(A), B)
```

]

.pull-right55[

Sort by `arm`.

```r
combined_df %>%
  arrange(arm)   # Sort by arm
```

```
## # A tibble: 5 x 6
## id age arm arm_char t1 t2
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 1 37 1 placebo 123 135
## 2 4 34 1 placebo 100 105
## 3 2 65 2 drug 143 140
## 4 3 57 2 drug NA NA
## 5 5 45 2 drug NA NA
```

]

---

# `arrange()`

.pull-left4[

Use `arrange()` to <high>arrange (aka, sort) rows</high> in increasing or decreasing order of one (or more) columns.

```r
df %>%
  arrange(A, B)
```

To sort in descending order, use `desc()`

```r
df %>%
  arrange(desc(A), B)
```

]

.pull-right55[

Sort by `arm` and then `age`.

```r
combined_df %>%
  arrange(arm, age)  # Sort by arm then age
```

```
## # A tibble: 5 x 6
## id age arm arm_char t1 t2
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 4 34 1 placebo 100 105
## 2 1 37 1 placebo 123 135
## 3 5 45 2 drug NA NA
## 4 3 57 2 drug NA NA
## 5 2 65 2 drug 143 140
```

]

---

# `slice()`

.pull-left4[

Use `slice()` to <high>select rows</high> (and remove others) by row number.

Use functions like `c()`, `a:b` and `seq()` to create row numbers

```r
# Specific numbers
c(2, 6, 10)
```

```
## [1]  2  6 10
```

```r
# Integers from 0 to 5
0:5
```

```
## [1] 0 1 2 3 4 5
```

]

.pull-right55[

Select rows `3` and `5`.

```r
# Rows 3 and 5 only
combined_df %>%
  slice(c(3, 5))
```

```
## # A tibble: 2 x 6
## id age arm arm_char t1 t2
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 3 57 2 drug NA NA
## 2 5 45 2 drug NA NA
```

]

---

# `slice()`

.pull-left4[

Use `slice()` to <high>select rows</high> (and remove others) by row number.

Use functions like `c()`, `a:b` and `seq()` to create row numbers

```r
# Specific numbers
c(2, 6, 10)
```

```
## [1]  2  6 10
```

```r
# Integers from 0 to 5
0:5
```

```
## [1] 0 1 2 3 4 5
```

]

.pull-right55[

Select rows `1` through `5`.

```r
# First 5 rows
combined_df %>%
  slice(1:5)
```

```
## # A tibble: 5 x 6
## id age arm arm_char t1 t2
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 1 37 1 placebo 123 135
## 2 2 65 2 drug 143 140
## 3 3 57 2 drug NA NA
## 4 4 34 1 placebo 100 105
## 5 5 45 2 drug NA NA
```

]

---

# `filter()`

.pull-left4[

Use `filter()` to <high>select rows</high> (and remove others) based on criteria

For complex conditions, chain multiple logical comparison operators with `&` (AND) and `|` (OR)

<high>`==`</high> - is equal to 
<high>`<`</high>, <high>`>`</high> - smaller/greater than 
<high>`≤`</high>, <high>`≥`</high> - smaller/greater than or equal 
<high>`&`</high>, <high>`&&`</high> - logical AND 
<high>`|`</high>, <high>`||`</high> - logical OR

]

.pull-right55[

Select only patients over `30`.

```r
# Filter patients older than 30
combined_df %>%
  filter(age > 30)
```

]

---

# `filter()`

.pull-left4[

Use `filter()` to <high>select rows</high> (and remove others) based on criteria

For complex conditions, chain multiple logical comparison operators with `&` (AND) and `|` (OR)

]

.pull-right55[

Select only patients over `30` given drug.

```r
# Filter patients older than 30 given drug
combined_df %>%
  filter(age > 30 & arm_char == "drug")
```

```
## # A tibble: 3 x 6
## id age arm arm_char t1 t2
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 2 65 2 drug 143 140
## 2 3 57 2 drug NA NA
## 3 5 45 2 drug NA NA
```

]

---

# `select()`

.pull-left4[

Use `select()` to <high>select columns</high> (and remove all others)

```r
# Select columns A, B
df %>% 
  select(A, B)
```

<high>Remove columns</high> with `-`.

```r
# Select everything BUT A
df %>% 
  select(-A)
```

]

.pull-right55[

Select columns `id` and `arm`

```r
combined_df %>%
  select(id, arm) # Select id and arm columns
```

```
## # A tibble: 5 x 2
## id arm
## <dbl> <dbl>
## 1 1 1
## 2 2 2
## 3 3 2
## 4 4 1
## 5 5 2
```

]

---

# `select()`

.pull-left4[

Use `select()` to <high>select columns</high> (and remove all others)

```r
# Select columns A, B
df %>% 
  select(A, B)
```

<high>Remove columns</high> with `-`.

```r
# Select everything BUT A
df %>% 
  select(-A)
```

]

.pull-right55[

Select everything except id

```r
combined_df %>%
  select(-id) # Everything BUT id
```

```
## # A tibble: 5 x 5
## age arm arm_char t1 t2
## <dbl> <dbl> <chr> <dbl> <dbl>
## 1 37 1 placebo 123 135
## 2 65 2 drug 143 140
## 3 57 2 drug NA NA
## 4 34 1 placebo 100 105
## 5 45 2 drug NA NA
```

]

---

# Quiz 4

.pull-left45[

Here is part of the baselers dataframe

```r
baselers %>%
  select(id, sex, age, height) %>%
  slice(1:5)
```

```
## # A tibble: 5 x 4
## id sex age height
## <int> <chr> <int> <dbl>
## 1 1 male 44 174.
## 2 2 male 65 180.
## 3 3 female 31 168.
## 4 4 male 27 209 
## 5 5 male 24 177.
```

]

.pull-right5[

How do I calculate the following table of the top 5 tallest Baselers?

```
## # A tibble: 5 x 3
## id height sex 
## <int> <dbl> <chr>
## 1 9676 219. male 
## 2 5623 213. male 
## 3 7214 213. male 
## 4 7059 212. male 
## 5 9538 210. male
```

]

---

# Quiz 4

.pull-left45[

Here is part of the baselers dataframe

```r
baselers %>%
  select(id, sex, age, height) %>%
  slice(1:5)
```

]

.pull-right5[

How do I calculate the following table of the top 5 tallest Baselers?

```r
baselers %>%
  arrange(desc(height)) %>%
  select(id, height, sex) %>%
  slice(1:5)
```

```
## # A tibble: 5 x 3
## id height sex 
## <int> <dbl> <chr>
## 1 9676 219. male 
## 2 5623 213. male 
## 3 7214 213. male 
## 4 7059 212. male 
## 5 9538 210. male
```

]

---

# Quiz 5

.pull-left45[

Here is part of the baselers dataframe

```r
baselers %>%
  select(id, sex, age, height) %>%
  slice(1:5)
```

]

.pull-right5[

How do I calculate the following table of the top 5 tallest *female* Baselers?

```
## # A tibble: 5 x 3
## id height sex 
## <int> <dbl> <chr> 
## 1 6936 198. female
## 2 8450 196. female
## 3 385 196. female
## 4 3203 195. female
## 5 4392 194 female
```

]

---

# Quiz 5

.pull-left45[

Here is part of the baselers dataframe

```r
baselers %>%
  select(id, sex, age, height) %>%
  slice(1:5)
```

]

.pull-right5[

How do I calculate the following table of the top 5 tallest *female* Baselers?

```r
baselers %>%
  filter(sex == "female") %>%
  arrange(desc(height)) %>%
  select(id, height, sex) %>%
  slice(1:5)
```

]

---

# Reshaping data

.pull-left45[

Two key functions that allow you to <high>`reshape`</high> a dataframe between 'wide'and 'long' formats.

Some functions require data to be in a certain shape.

#### Two key tidyr functions

| Function | Result |
|:----------|:-------|
|`gather()`|Move data from 'wide' to 'long' format|
|`spread()`|Move data from 'long' to 'wide' format|

]

.pull-right5[

<high>Wide vs. Long</high> data

```r
# Wide format
stock_w
```

```
##   id t1 t2
## 1  a 10 20
## 2  b 20 26
## 3  c 15 30
```

```r
# Long format
stock_l
```

```
##   id time measure
## 1  a   t1      10
## 2  b   t1      20
## 3  c   t1      15
## 4  a   t2      20
## 5  b   t2      26
## 6  c   t2      30
```

]

---

.pull-left45[

# `gather()`

```r
# Show wide data
stock_w
```

```
##   id t1 t2
## 1  a 10 20
## 2  b 20 26
## 3  c 15 30
```

```r
# "Gather" wide data to long
stock_w %>% 
  gather(time,    # New group column
         measure, # New target column
         -id)     # ID column
```

```
##   id time measure
## 1  a   t1      10
## 2  b   t1      20
## 3  c   t1      15
## 4  a   t2      20
## 5  b   t2      26
## 6  c   t2      30
```

]

.pull-right45[

# `spread()`

```r
# Show long data
stock_l
```

```
##   id time measure
## 1  a   t1      10
## 2  b   t1      20
## 3  c   t1      15
## 4  a   t2      20
## 5  b   t2      26
## 6  c   t2      30
```

```r
# "Spread" long data to wide
stock_l %>% 
  spread(time,    # Old group column
         measure) # Old target column
```

```
##   id t1 t2
## 1  a 10 20
## 2  b 20 26
## 3  c 15 30
```

]

---

.pull-left4[

# Summary

1 - Start by assigning your result to a new object to save it!

2 - "Keep the pipe <high>%>%</high> going" to continue working with your data frame.

3 - The output of dplyr functions will (almost) always be a <high>tibble</high>.

4 - You can almost always include <high>multiple operations</high> within each function.

]

.pull-right55[

```r
# Assign result to baslers_agg

baslers_agg <- baselers %>% 
 
 # Change column names with rename()
 rename(age_years = age,
 weight_kg = weight) %>% # PIPE!

# Select specific rows with filter()
 filter(age_years < 40) %>% # PIPE!
 
 # Create new columns witb mutate()
 mutate(debt_ratio = debt / income)
```

]

---

# Practical

<a href="https://therbootcamp.github.io/Intro2DataScience_2018Oct/_sessions/Wrangling/Wrangling_practical.html">Link to practical</a>