Data Wrangling

class: center, middle, inverse, title-slide

# Data Wrangling
### The R Bootcamp Twitter: <a href='https://twitter.com/therbootcamp'>@therbootcamp</a>
### April 2018

---

.pull-left5[

# What is Wrangling?

Transform
- Adding new columns
- Combining columns
- Splitting columns

Organise
- Moving data between columns and rows
- Merging several dataframes
- Sorting data by columns

Aggregation
- Aggregate data according to variables
- Summarizing data across groups

]

.pull-right5[

]

---

.pull-left4[
# How do we wrangle data in R?

Answer: dplyr

Anytime you want to transform, organize, or aggregate data, use `dplyr`

]

.pull-right5[

]

---

.pull-left35[
# dplyr

dplyr is a combination of 3 things:

1. **`objects`** like dataframes
2. **`functions`** that **do** things to objects.
3. **`pipes`** `%>%` that string together objects and verbs

]

.pull-right6[

## The pipe %>%

dplyr makes extensive use of the 'pipe' %>%

```r
# Applying function without pipe
FUN(OBJECT)

# Applying function with pipe
OBJECT %>% FUN
```

Example

```r
# Calculate mean of a vector without pipe
mean(ToothGrowth$len)
```

```
## [1] 18.81
```

```r
# Same thing with pipe
ToothGrowth$len %>% mean()
```

```
## [1] 18.81
```

]

---
## The pipe

.pull-left3[
 
 Here is some trial data called trial

```r
# Clinical trial data
trial
```

```
##   id sex arm age outcome
## 1  1   m   1  54       3
## 2  2   m   1  37       6
## 3  3   f   2  65       8
## 4  4   f   2  48       4
## 5  5   m   1  64       9
## 6  6   f   2  57       6
## 7  7   f   1  36       8
## 8  8   m   2  56      10
```

]

.pull-right65[

Task 1: Give me the ids of women above the age of 50

```r
# Base R Method without pipe (%>%)

trial$id[trial$sex == "f" & trial$age > 50]
```

```
## [1] 3 6
```

```r
# dplyr method with pipe (%>%)

trial %>%
  filter(sex == "f" & age > 50) %>%
  select(id)
```

```
##   id
## 1  3
## 2  6
```

]

---
## The pipe

.pull-left3[
 
 Here is some trial data called trial

```r
# Clinical trial data
trial
```

]

.pull-right65[

Task 2: Separately for the two arms, calculate mean age and outcome

```r
# Base R Method without pipe (%>%)

# Get aggregate statistics
age_agg <- aggregate(age ~ arm, data = trial, FUN = mean)
out_agg <- aggregate(outcome ~ arm, data = trial, FUN = mean)

# Combine results
data.frame(arm = age_agg$arm,
           age_mean = age_agg$age,
           outcome_mean = out_agg$outcome)
```

```
##   arm age_mean outcome_mean
## 1   1    47.75          6.5
## 2   2    56.50          7.0
```
]

---
## The pipe

.pull-left3[
 
 Here is some trial data called trial

```r
# Clinical trial data
trial
```

]

.pull-right65[

Task 2: Separately for the two arms, calculate mean age and outcome

```r
# dplyr method with pipe (%>%)

trial %>%                             # Start with data trial
  group_by(arm) %>%               # Group data by arm
  summarise(                      # Calculate summary statistics
    age_mean = mean(age),         # age_mean is...
    outcome_mean = mean(outcome)  # outcome_mean is...
  )
```

```
## # A tibble: 2 x 3
## arm age_mean outcome_mean
## <dbl> <dbl> <dbl>
## 1 1. 47.8 6.50
## 2 2. 56.5 7.00
```
]

---

.pull-left25[
 
Load the tidyverse

<br2>

Load data

Start with data

Connect multiple functions with pipes %>%

Result will always be a tibble

]

.pull-right7[

Here is how dplyr looks in practice

```r
library(tidyverse) # Load tidyverse (includes dplyr)

# load .csv file as trial
trial <- read_csv("data/mydata.csv")

# Start with original dataframe 
trial %>%   # AND THEN
  
  # Change column names with rename()
  rename(...)  %>%  # AND THEN...

# Select specific rows with filter()
  filter(...)  %>%  # AND THEN...
  
  # Create new columns witb mutate()
  mutate(...)  %>%  # ...
  
  # Group data with group_by()
  group_by(...) %>%  #...
  
  # Calculate summary statistics with summarise()
  summarise(...) %>% # ...

#....
```

]

---

.pull-left4[
 
## Wrangling Functions in dplyr

- There are dozens functions in dplyr that allow you to wrangle data.

- For an overview, look at the cheatsheet..

- We will now show you how to use ~10 of the most common ones

]

.pull-right55[

### Wrangling / dplyr cheatsheet

]

---

.pull-left4[
# Transformation Functions

| Function| Description|
|:-------------|:----|
| `rename()` | Change column names |
| `mutate()`|   Create a new column from existing columns|
| `case_when()`|  Recode values from a vector to another|
| `left_join()` | Combine multiple dataframes|

]

.pull-right55[

```r
patients_df   # Demographic data
```

```
## # A tibble: 5 x 3
## id b c
## <dbl> <dbl> <dbl>
## 1 13. 38. 0.
## 2 22. 64. 1.
## 3 5. 55. 0.
## 4 67. 34. 0.
## 5 34. 29. 1.
```

### Goal:

- Create better column names with `rename()`
- Add new columns with `mutate()`
- Recode values with `case_when()`
- Merge with a new dataframe using `left_join()`

]

---

.pull-left3[
# rename()

Change column names with rename() 
New = Old, New = Old, ...

```r
patients_df  # Original
```

```
## # A tibble: 5 x 3
## id b c
## <dbl> <dbl> <dbl>
## 1 13. 38. 0.
## 2 22. 64. 1.
## 3 5. 55. 0.
## 4 67. 34. 0.
## 5 34. 29. 1.
```
]

.pull-right65[
1) Create better column names with rename()

```r
# 0) Start with patients_df data
patients_df %>%
  
# 1) Change column names with rename()
  rename(age = b,  # New = Old
         arm = c)  # New = Old
```

```
## # A tibble: 5 x 3
## id age arm
## <dbl> <dbl> <dbl>
## 1 13. 38. 0.
## 2 22. 64. 1.
## 3 5. 55. 0.
## 4 67. 34. 0.
## 5 34. 29. 1.
```

]

---

.pull-left3[
# mutate()

Add new columns with mutate() 
Add as many you'd like 
Example:

```r
df %>%
  mutate(
   NEW = DEFINITION,
   Year = Month * 12,
   BMI = weight / height ^ 2
  )
```
]

.pull-right65[
2) Add new columns with mutate()

```
## # A tibble: 5 x 5
## id age arm age_months age_decades
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 13. 38. 0. 456. 3.80
## 2 22. 64. 1. 768. 6.40
## 3 5. 55. 0. 660. 5.50
## 4 67. 34. 0. 408. 3.40
## 5 34. 29. 1. 348. 2.90
```

]

---

.pull-left3[
# case_when()

Use case_when() in combination with mutate() 
Allows you to define values based on logical conditions 
Example:

```r
df %>%
  mutate(
    NEW = case_when(
      COND1 ~ VAL1,
      COND2 ~ VAL2
    )
  )
```
]

.pull-right65[
3) Recode values with case_when()

```r
# 0) Start with patients_df data
patients_df %>%
  
# 1) Change column names with rename()
  rename(age = b,  
         arm = c) %>%
  
# 2) Create new columns with mutate()
  mutate(age_months = age * 12,
         age_decades = age / 10,
         arm_char = case_when(arm == 0 ~ "placebo",
                              arm == 1 ~ "drugtrial")
         )
```

```
## # A tibble: 5 x 6
## id age arm age_months age_decades arm_char 
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 
## 1 13. 38. 0. 456. 3.80 placebo 
## 2 22. 64. 1. 768. 6.40 drugtrial
## 3 5. 55. 0. 660. 5.50 placebo 
## 4 67. 34. 0. 408. 3.40 placebo 
## 5 34. 29. 1. 348. 2.90 drugtrial
```

]

---
# Merging data

.pull-left45[

Patient basics

```r
patients_df
```

| id| age| arm|
|--:|---:|---:|
| 13|  38|   0|
| 22|  64|   1|
|  5|  55|   0|
| 67|  34|   0|
| 34|  29|   1|

Goal: Add study results data to patient data
<br3>

]

.pull-right45[

Patient results

```r
results_df
```

| id| time1| time2|
|--:|-----:|-----:|
| 67|     2|     4|
|  5|     3|     3|
| 34|     3|     5|
| 13|     4|     6|
| 22|     7|     5|

For each matching id in patients_df and results_df, add results_df data to patients_df.

]

---

.pull-left3[
# left_join()

Use left_join() to combine data from 2 dataframes 
Must include at least one 'key' column used to match rows 
Example:

```r
# Add df2 data to df1
#   using KEY as the key
df1 %>%
  left_join(df2, 
      by = c("KEY"))
```

]

.pull-right65[
4) Merge patients_df with results_df

```
## # A tibble: 5 x 8
## id age arm age_months age_decades arm_char time1 time2
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 13. 38. 0. 456. 3.80 placebo 4. 6.
## 2 22. 64. 1. 768. 6.40 drugtrial 7. 5.
## 3 5. 55. 0. 660. 5.50 placebo 3. 3.
## 4 67. 34. 0. 408. 3.40 placebo 2. 4.
## 5 34. 29. 1. 348. 2.90 drugtrial 3. 5.
```

]

---

.pull-left3[

# Notes

1) Continue as long as you want with more pipes %>%
 
 2) Order of functions matters 
 
 3) You can refer to new objects in later code 
 
4) To update a dataframe, make sure to assign it to itself with <-

]

.pull-right65[
 
Make sure to assign again using <- to change patients_df

```r
# 0) Start with patients_df data
patients_df <- patients_df %>%
 
# 1) Change column names with rename()
 rename(age = b, 
 arm = c) %>%
 
# 2) Create new columns with mutate()
 mutate(age_months = age * 12,
 age_decades = age / 10,
 arm_char = case_when(arm == 0 ~ "placebo",
 arm == 1 ~ "drugtrial")
 ) %>%
 
# 3) Add data from results_df with left_join()
 left_join(results_df, by = "id")
```

]

---
.pull-left3[

# Organisation Functions

| Function| action|
|:-------------|:----|
| `arrange()` |Sort rows |
| `filter()` | Select specific rows |
| `select()`|   Select specific columns|

]

.pull-right65[

```r
patients_df 
```

### Goal:

- Sort rows by `id` then by `age`
- Only use drugtrial over the age of 30
- Get rid of the columns `age_months` and `age_decades`

]

---

.pull-left3[
# arrange()

The arrange() function is used to sort (aka, arrange) rows of a dataframe
 
You can sort by as many conditions as you want
 
To sort in descending order, use desc()

]

.pull-right65[

1) Sort rows by id then in descending order of age

```r
# 0) Start with patients_df data
patients_df %>%
  
# 1) Sort by id then age (descending)
  arrange(id, desc(age))
```

```
## # A tibble: 5 x 8
## id age arm age_months age_decades arm_char time1 time2
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 5. 55. 0. 660. 5.50 placebo 3. 3.
## 2 13. 38. 0. 456. 3.80 placebo 4. 6.
## 3 22. 64. 1. 768. 6.40 drugtrial 7. 5.
## 4 34. 29. 1. 348. 2.90 drugtrial 3. 5.
## 5 67. 34. 0. 408. 3.40 placebo 2. 4.
```

]

---

.pull-left3[
# filter()

The filter() function is used to select specific rows
 
For complex conditions, chain multiple logical comparison operators (==, !=, <, >, <=, >=, &, |)

]

.pull-right65[

2) Select drugtrial over 30

```r
# 0) Start with patients_df data
patients_df %>%
  
# 1) Sort by id then age (descending)
  arrange(id, desc(age)) %>%
  
# 2) Only drugtrial over 30
  filter(arm_char == "drugtrial" & age > 30)
```

```
## # A tibble: 1 x 8
## id age arm age_months age_decades arm_char time1 time2
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 22. 64. 1. 768. 6.40 drugtrial 7. 5.
```

]

---

.pull-left3[
# select()

Keep columns with select(COL1, COL2, ...)
 
Remove columns with select(-COL1, -COL2, ...)

]

.pull-right65[

3) Remove age_months and age_decades

```
## # A tibble: 1 x 6
## id age arm arm_char time1 time2
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 22. 64. 1. drugtrial 7. 5.
```

]

---
.pull-left4[

# Aggregation functions

| Function| action|
|:-------------|:----|
| `group_by()` |Group data by levels of specific variables |
| `summarise()` | Calculate summary statistics |

### Statistical functions

| Function| action|
|:-------------|:----|
| `min(), max()`| Minimum, maximum |
| `mean(), median()` |Mean, Median |
| `sd()` |Standard deviation|
| `sum()` | Sum|
| `n()`| Number of cases|
]

.pull-right55[

```r
patients_df
```

```
## # A tibble: 5 x 5
## id age arm time1 time2
## <dbl> <dbl> <chr> <dbl> <dbl>
## 1 13. 38. placebo 4. 6.
## 2 22. 64. drugtrial 7. 5.
## 3 5. 55. placebo 3. 3.
## 4 67. 34. placebo 2. 4.
## 5 34. 29. drugtrial 3. 5.
```

### Goal:

- Group data by arm
- Calculate summary statistics from time1 and time1

]

---
.pull-left3[
# group_by()

Used to combine data into groups
 
You can group by as many variables as you wish
 
You won't see any changes to the data until you use summarise()

]

.pull-right65[

1) Group data by arm

```r
# 0) Start with patients_df data
patients_df %>%
  
# 1) Group data by arm with group_by
  group_by(arm)
```

```
## # A tibble: 5 x 5
## # Groups: arm [2]
## id age arm time1 time2
## <dbl> <dbl> <chr> <dbl> <dbl>
## 1 13. 38. placebo 4. 6.
## 2 22. 64. drugtrial 7. 5.
## 3 5. 55. placebo 3. 3.
## 4 67. 34. placebo 2. 4.
## 5 34. 29. drugtrial 3. 5.
```

]

---
.pull-left3[
# summarise()

Used to summarise data from groups 
Works like mutate() to create new variables, but with summary functions

```r
df %>%
  group_by(VAR1) %>%
  summarise(
    SUMMARY = FUN(x),       # 
    age_mean = mean(age),   # Ex 1
    height_min = min(height) # Ex 2 
  )
```

]

.pull-right65[

2) Calculate summary statistics from times

```r
# 0) Start with patients_df data
patients_df %>%
  
# 1) Group data by arm with group_by
  group_by(arm) %>%
  
# 2) Calculate summary columns
  summarise(
    time1_mean = mean(time1),
    time2_mean = mean(time2),
    diff = time2_mean - time1_mean,
    N = n()
  )
```

```
## # A tibble: 2 x 5
## arm time1_mean time2_mean diff N
## <chr> <dbl> <dbl> <dbl> <int>
## 1 drugtrial 5. 5.00 0. 2
## 2 placebo 3. 4.33 1.33 3
```

]

---

.pull-left4[

#dplyr summary

dplyr is great for elegantly performing sequential operations on data.

The 'pipe' operator `%>%` helps you string multiple *objects* (like dataframes) and *verbs* (summarise, order, aggregate...) together.

]

.pull-right5[

Basic structure of dplyr commands:

```r
data %>%    # Start with data, AND THEN...
  VERB1 %>% # Do VERB1, AND THEN...
  VERB2 %>% # Do VERB2, AND THEN...
  VERB3 %>% # Do VERB3, AND THEN...
  group_by(x, y) %>%  # Group by variables x, y
    summarise(
      VAR_A_New = fun(trial),
      VAR_B_New = fun(Y)
    )
  )
```

]

---

.pull-left3[

## Important!

You can call many dplyr functions directly without the pipe %>%

However, we recommend always using the pipe %>% so you can string many operations together

]

.pull-right65[

Use the pipes %>%!

```r
## Using dplyr functions without pipes %>%
##     Avoid this if you can

data <- mutate(data, age_years = age_months * 12)
data <- filter(data, sex == "m")

## Using dplyr functions with pipes %>%
##    Much better!!!

data <- data %>% mutate(age_years = age_months * 12)
data <- data %>% filter(sex == "m")

# Why? Because with pipes you can easily put multiple
#  functions together if you choose.

data <- data %>% 
 mutate(age_years = age_months * 12) %>%
 filter(sex == "m") %>%
 # ....
```

]

---

# Live Demo & Practical

<a href="https://therbootcamp.github.io/BaselRBootcamp_2018April/_sessions/D2S1_Wrangling/Wrangling_practical.html">Link to Wrangling practical</a>

---

# Additional Slides

---
# Vector functions

---

.pull-left5[
# Vector to scaler functions

- Each takes a numeric vector `x`, and returns a numeric scaler

| Function| action|
|:-------------|:----|
|     `sum(x)`, `prod(x)`|   Sum or product|
|     `min(x), max(x)`|    Minimum, maximum|
|     `mean(x), median(x)`|    Arithmetic mean or median| 
|     `var(x)`, `sd(x)`|   Variance, standard deviation|

]

.pull-right45[

### Examples

```r
# Sum of all weights?
sum(ChickWeight$weight)
```

```
## [1] 70411
```

```r
# Mean weight?
mean(ChickWeight$weight)
```

```
## [1] 121.8
```

```r
# Standard deviation of weights?
sd(ChickWeight$weight)
```

```
## [1] 71.07
```

]

---

.pull-left5[
# Vector to vector / table functions

- Each takes a numeric vector `x`, and returns a vector or table

| Function| action|
|:-------------|:----|
|     `unique(x)`|   What are the unique values of x?|
|     `table(x)`|    How many are there of each unique value in x?|
|     `rank(x)|    What is the rank order of values in x?|

]

.pull-right45[

### Examples

```r
# Sum of all weights?
sum(ChickWeight$weight)
```

```
## [1] 70411
```

```r
# Mean weight?
mean(ChickWeight$weight)
```

```
## [1] 121.8
```

```r
# Standard deviation of weights?
sd(ChickWeight$weight)
```

```
## [1] 71.07
```

]

---

.pull-left4[
# dplyr

dplyr is a combination of 3 things:

1. **`objects`** like dataframes
2. **`verbs`** that **do** things to objects.
3. **`pipes`** `%>%` that string together objects and verbs

]

.pull-right5[

dplyr is meant to be sequential and work like language

> Take data trial, 
>   then do Y, 
>   then do Z,
>   ...

]

---

.pull-left4[
# dplyr

### Question:

> From the ChickWeight dataframe, calculate the mean weight and time for each diet

]

.pull-right55[

### Answer:

```r
library(dplyr)

x <- ChickWeight %>% # Start with ChickWeight
 group_by(Diet) %>% # Group by Diet
 summarise( # Get ready to summarise....
 weight.mean = mean(weight), # Mean weight
 time.mean = mean(Time), # Mean time
 N = n() # Number of cases
 )

x
```

```
## # A tibble: 4 x 4
## Diet weight.mean time.mean N
## <fct> <dbl> <dbl> <int>
## 1 1 103. 10.5 220
## 2 2 123. 10.9 120
## 3 3 143. 10.9 120
## 4 4 135. 10.8 118
```

]

---

# Common dplyr verbs

| verb| action| example |
|:---|:----|:----------------|
|     `filter()`|    Select rows based on some criteria| `filter(age > 40 & treatment == "drug")`|
|     `arrange()`|    Sort rows| `arrange(date, group)`|
|     `select()`|    Select columns (and ignore all others)| `select(age, treatment)`|
|     `rename()`|    Rename columns| `rename(DATE_MONTHS_trial24, date`)|
|     `mutate()`|    Add new columns| `mutate(height.m = height.cm / 100)`|
|     `case_when()`|    Recode values of a column| `treat_char = case_when(treatment == 0 ~ "placebo", treatment == 1 ~ "drug")`|
|     `group_by(), summarise()`|   Group data and then calculate summary statistics|`group_by(treatment) %>% summarise(...)` |

---

#### Example 1

*Add a column called `weight_d_time` that is weight divided by time*

```r
library(dplyr)

x <- ChickWeight %>% # Start with the ChickWeight data
 mutate( # Create new columns...
 weight_d_time = weight / Time
 )

head(x)   # Print the result
```

```
##   weight Time Chick Diet weight_d_time
## 1     42    0     1    1           Inf
## 2     51    2     1    1         25.50
## 3     59    4     1    1         14.75
## 4     64    6     1    1         10.67
## 5     76    8     1    1          9.50
## 6     93   10     1    1          9.30
```

---

#### Example 2

*Add a column called `weight_d_time` that is weight divided by time AND `time_d` that is time in days*

```r
x <- ChickWeight %>% # Start with the ChickWeight data
 mutate( # Create new columns...
 weight_d_time = weight / Time, # weight_d_time is weight divided by Time
 time_d = Time * 7 # time_d is Time times 7
 )

head(x)   # Print the result
```

```
##   weight Time Chick Diet weight_d_time time_d
## 1     42    0     1    1           Inf      0
## 2     51    2     1    1         25.50     14
## 3     59    4     1    1         14.75     28
## 4     64    6     1    1         10.67     42
## 5     76    8     1    1          9.50     56
## 6     93   10     1    1          9.30     70
```

---

### Recoding values with case_when()

.pull-left4[
Recoding values is a common data wrangling task. You can easily do this with `case_when()`:

```r
data %>%
  mutate(
  var_new = case_when(
    var_old == OLD_A ~ NEW_A,
    var_old == OLD_B ~ NEW_B
  )
```

For example, in a dataset, the column `sex` might be coded with 1s and 0s.

You might want to create a new column `sex_new` where 1 = "female" and 0 = "male":

| sex| sex_new|
|------:|----:|
|    1|   "female"|
|    0|   "male"|

]

.pull-right5[

To change the value of 1 to `"female"`, and 0 to `"male"`, you can use `case_when()`:

```r
# Add a column sex_new to data

data <- data %>% 
 mutate(
 sex_new = case_when(
 arm == 1 ~ "drugtrial",
 arm == 0 ~ "placebo"
 )
 )
```

You can think about the code above as follows:

- Create a new column `sex_new` where
    - If `sex == 1`, then set the value to `"female"`
    - If `sex == 0`, then set the value to `"male"`

]

---

.pull-left4[

#### Example 3

*Create a new variable Diet_name which shows Diet in text format. Here is a table of the values*

| Diet| Diet_name|
|------:|----:|
|    1|   "fruit"|
|    2|   "vegetables"|
|    3|   "meat"|
|    4|   "grains"|

]

.pull-right5[

```r
ChickWeight <- ChickWeight %>% # Start with the ChickWeight data
 mutate(
 Diet_name = case_when(
 Diet == 1 ~ "fruit",
 Diet == 2 ~ "vegetables",
 Diet == 3 ~ "meat",
 Diet == 4 ~ "grains"
 )
 
 )

head(ChickWeight)
```

```
##   weight Time Chick Diet Diet_name
## 1     42    0     1    1     fruit
## 2     51    2     1    1     fruit
## 3     59    4     1    1     fruit
## 4     64    6     1    1     fruit
## 5     76    8     1    1     fruit
## 6     93   10     1    1     fruit
```

]

---

#### Example 4

*For each Diet, calculate the mean weight*

```r
ChickWeight %>%           # Start with the ChickWeight data
  group_by(Diet) %>%      # Group the data by Diet
  summarise(              # Now summarise....
    weight.mean = mean(weight) # Mean weight
  )
```

```
## # A tibble: 4 x 2
## Diet weight.mean
## <fct> <dbl>
## 1 1 103.
## 2 2 123.
## 3 3 143.
## 4 4 135.
```

---

#### Example 5

*For each time period less than 10, calculate the mean weight*

```r
ChickWeight %>% # Start with the ChickWeight data
 filter(Time < 10) %>% # Only Time periods less than 10
 group_by(Time) %>% # Group the data by Diet
 summarise( # Now summarise....
 weight.mean = mean(weight) # Mean weight
 )
```

```
## # A tibble: 5 x 2
## Time weight.mean
## <dbl> <dbl>
## 1 0. 41.1
## 2 2. 49.2
## 3 4. 60.0
## 4 6. 74.3
## 5 8. 91.2
```

---
#### Example 6

*For each Diet, calculate the mean weight, maximum time, and the number of chicks on each diet*:

```r
ChickWeight %>%           # Start with the ChickWeight data
  group_by(Diet) %>%      # Group the data by Diet
  summarise(              # Now summarise....
    weight.mean = mean(weight), # Mean weight
    time.max = max(Time),       # Max time
    N = n()                     # Number of observations
  )
```

```
## # A tibble: 4 x 4
## Diet weight.mean time.max N
## <fct> <dbl> <dbl> <int>
## 1 1 103. 21. 220
## 2 2 123. 21. 120
## 3 3 143. 21. 120
## 4 4 135. 21. 118
```

---

# Other dplyr verbs

| verb| action| example |
|:---|:----|:----------------|
|     `sample_n()`|    Select a random sample of n rows| `sample_n(10)`|
|     `sample_frac()`|    Select a random fraction of rows| `sample_frac(.20)`|
|     `first(), last()`|    Give the first (or last) observation| `first(), last()`|

---

#### Example 7

*Give me a random sample of 10 rows from the ChickWeight dataframe, but only show me the values for Chick and weight*

```r
# Give me a random sample of 10 rows, but only show me columns Chick and weight

ChickWeight %>% 
  select(Chick, weight) %>%
  sample_n(10)
```

```
##     Chick weight
## 410    36    227
## 449    40     41
## 308    28     73
## 416    37     68
## 209    20     41
## 291    26    236
## 530    46    238
## 437    39     42
## 565    49    233
## 236    22     77
```

---
.pull-left4[

#dplyr

dplyr operations (almost) always return a dataframe which you can assign to a new object:

> *Create a dataframe with the average weight for each time period and nothing else!!*

]

.pull-right5[

```r
# Create a new object called time_agg

time_agg <- ChickWeight %>%
 group_by(Time) %>%
 summarise(
 weight.mean = mean(weight)
 )

head(time_agg)
```

```
## # A tibble: 6 x 2
## Time weight.mean
## <dbl> <dbl>
## 1 0. 41.1
## 2 2. 49.2
## 3 4. 60.0
## 4 6. 74.3
## 5 8. 91.2
## 6 10. 108.
```

]