class: center, middle, inverse, title-slide # Wrangling ### R for Data Science
Basel R Bootcamp
### February 2019 --- layout: true <div class="my-footer"> <span style="text-align:center"> <span> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/> </span> <a href="https://therbootcamp.github.io/"> <span style="padding-left:82px"> <font color="#7E7E7E"> www.therbootcamp.com </font> </span> </a> <a href="https://therbootcamp.github.io/"> <font color="#7E7E7E"> R For Data Science | February 2019 </font> </a> </span> </div> --- # What is wrangling? <p align = "center"> <img src="image/wrangling.jpeg" height=450px><br> <font style="font-size:10px">from <a href="https://datasciencebe.com/tag/data-wrangling/">datasciencebe.com</a></font> </p> --- .pull-left45[ # This is wrangling! <div style="padding-bottom:10px"></div> #### <high>Transform</high> Change variable names Add new variables #### <high>Organise</high></font> Sort data by variables Merging data from two separate dataframes Move data between variables and rows #### <high>Aggregate and summarise</high></font> Group data and calculate and summary stats ] .pull-right5[ <br> <p align="center"> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/organise_transform_aggregate.png" height = "530px"> </p> ] --- # `dplyr` + `tidyr` .pull-left5[ To wrangle data in R, we will use the <high><mono>dplyr</mono></high> and <high><mono>tidyr</mono></high> packages, which are part of the <high><mono>tidyverse</mono></high>. | Package | Function| Function| |:-------------|:----| |<b>dplyr</b> | Transformation | `rename()`, `mutate()`, `case_when()`, `*_join()` | |<b>dplyr</b> | Organisation | `arrange()`, `slice()`, `filter()`, `select()` | |<b>tidyr</b> | Organisation | `gather()`, `spread()` | |<b>dplyr</b> | Aggregation | `group_by()`, `summarise()` | ] .pull-right4[ <p align = "center"> <img src="image/packages.png" height=320px> </p> ] --- # The Pipe! <high>`%>%`</high> .pull-left4[ `dplyr` makes extensive use of a new operator called the <high><i>Pipe</i> `%>%`</high><br> Read the <high><i>Pipe</i> `%>%`</high> as "And Then..." <br> ```r # Start with data data %>% # AND THEN... DO_SOMETHING %>% # AND THEN... DO_SOMETHING %>% # AND THEN... DO_SOMETHING %>% # AND THEN... ``` ] .pull-right55[ <p align="center"> <img src="https://upload.wikimedia.org/wikipedia/en/thumb/b/b9/MagrittePipe.jpg/300px-MagrittePipe.jpg" width = "400px"><br> <br><i>This is not a pipe. But <mono>%>%</mono> is!</i> </p> ] --- # The Pipe! <high>`%>%`</high> .pull-left4[ ```r # Vector of `scores` score <- c(8, 4, 6, 3, 7, 3) score ``` ``` ## [1] 8 4 6 3 7 3 ``` ```r # Mean: Base-R-way mean(score) ``` ``` ## [1] 5.167 ``` ```r # Mean: Tidyverse-style (with %>%) score %>% # AND THEN mean() ``` ``` ## [1] 5.167 ``` ] .pull-right55[ <p align="center"> <img src="image/pipe.png" height = "400px"> </p> ] --- # The Pipe! <high>`%>%`</high> .pull-left4[ ```r # Vector of `scores` score <- c(8, 4, 6, 3, 7, 3) score ``` ``` ## [1] 8 4 6 3 7 3 ``` ```r # Mean: Base-R-way round(mean(score), digits = 1) ``` ``` ## [1] 5.2 ``` ```r # Mean: Tidyverse-style (with %>%) score %>% # AND THEN mean() %>% # AND THEN round(digits = 1) ``` ``` ## [1] 5.2 ``` ] .pull-right55[ <p align="center"> <img src="image/pipe.png" height = "400px"> </p> ] --- .pull-left4[ # 2 dirty data sets ### Goals <b>1</b> <i>Rename</i> - Give meaningful <high>variable names</high>.<br><br> <b>2</b> <i>Recode</i> - Use appropriate <high>units</high> and <high>labels</high> for nominal variables.<br><br> <b>3</b> <i>Join</i> - <high>Combine</high> datasets.<br><br> <b>4</b> <i>Sort</i> - <high>Sort</high> tibble by age. <br><br> <b>5</b> <i>Filter</i> - Select relevant <high>cases</high>. <br><br> <b>6</b> <i>Select</i> - Select relevant <high>variables</high>.<br><br> <b>7</b> <i>Gather</i> - Change to <high>long format</high>.<br><br> ] .pull-right45[ <br><br2> ```r # patients tibble patients ``` ``` ## # A tibble: 5 x 3 ## id X1 X2 ## <dbl> <dbl> <dbl> ## 1 1 37 1 ## 2 2 65 2 ## 3 3 57 2 ## 4 4 34 1 ## 5 5 45 2 ``` ```r # results tibble results ``` ``` ## # A tibble: 5 x 3 ## id t_1 t_2 ## <dbl> <dbl> <dbl> ## 1 4 100 105 ## 2 92 134 150 ## 3 1 123 135 ## 4 2 143 140 ## 5 99 102 68 ``` ] --- # Transformation .pull-left4[ Transformation functions are used to <high>alter the content</high> of a `tibble`. <br2> | Function| Description| |:-------------|:----| | `rename()` | <high>Change names</high> of variables | | `mutate()`| <high>Create variable</high> from existing variables| | `case_when()`| <high>Recode values</high> from a vector to another| | `left_join()` | <high>Combine tibbbles</high> | ] .pull-right5[ ```r patients # patients data ``` ``` ## # A tibble: 5 x 3 ## id X1 X2 ## <dbl> <dbl> <dbl> ## 1 1 37 1 ## 2 2 65 2 ## 3 3 57 2 ## 4 4 34 1 ## 5 5 45 2 ``` ] --- # `rename()` .pull-left4[ <high>Change variable names</high> with `rename()`. ```r patients %>% rename(NEW = OLD, NEW = OLD) ``` ```r patients # Original ``` ``` ## # A tibble: 5 x 3 ## id X1 X2 ## <dbl> <dbl> <dbl> ## 1 1 37 1 ## 2 2 65 2 ## 3 3 57 2 ## 4 4 34 1 ## 5 5 45 2 ``` ] .pull-right5[ Change `X1` to `age`, and `X2` to `arm`. ```r # 0) Start with patients data patients %>% # 1) Change variable names with rename() rename(age = X1, # New = Old arm = X2) # New = Old ``` ``` ## # A tibble: 5 x 3 ## id age arm ## <dbl> <dbl> <dbl> ## 1 1 37 1 ## 2 2 65 2 ## 3 3 57 2 ## 4 4 34 1 ## 5 5 45 2 ``` ] --- # `mutate()` .pull-left4[ Create <high>new variables</high>, or <high>change existing ones</high>, with `mutate()`. ```r tibble %>% mutate( NEW1 = DEFINITION1, NEW2 = DEFINITION2, NEW3 = DEFINITION3, ... ) ``` ] .pull-right5[ Calculate two new variables `age_months` and `age_decades`. ```r patients %>% rename(age = X1, arm = X2) %>% # AND THEN... # Create new variables with mutate() mutate(age_months = age * 12, age_decades = age / 10) ``` ``` ## # A tibble: 5 x 5 ## id age arm age_months age_decades ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 37 1 444 3.7 ## 2 2 65 2 780 6.5 ## 3 3 57 2 684 5.7 ## 4 4 34 1 408 3.4 ## 5 5 45 2 540 4.5 ``` ] --- # `case_when()` .pull-left4[ Use `case_when()` with `mutate()` to define <high>new variables based on logical conditions</high>. ```r # Using mutate(case_when()) tibble %>% mutate( NEW = case_when( COND1 ~ VAL1, COND2 ~ VAL2 )) ``` ] .pull-right55[ Create `arm_lab` that carries `'placebo'` for `arm == 1` and `'drug'` for `arm == 2`. ```r patients %>% rename(age = X1, arm = X2) %>% # Create arm_lab from arm mutate(arm_lab = case_when(arm == 1 ~ "placebo", arm == 2 ~ "drug")) ``` ``` ## # A tibble: 5 x 4 ## id age arm arm_lab ## <dbl> <dbl> <dbl> <chr> ## 1 1 37 1 placebo ## 2 2 65 2 drug ## 3 3 57 2 drug ## 4 4 34 1 placebo ## 5 5 45 2 drug ``` ] --- # Joining data <p align="center"> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/joining_data.png" height="450px"> </p> --- # `left_join()` .pull-left35[ Use `left_join()` to <high>combine two data frames</high> based on one or more key variables. ```r # Join tibble_2 to tibble_1 # matched by KEY tibble_1 %>% left_join(tibble_2, by = c("KEY")) ``` Other `*_join()` functions: `right_join()`, `full_join()`, `inner_join()`, `anti_join()`, `semi_join()`. ] .pull-right6[ ```r # Join patients with results to create combined combined <- patients %>% rename(age = X1, arm = X2) %>% mutate(arm_lab = case_when(arm == 1 ~ "placebo", arm == 2 ~ "drug")) %>% # Join with results with left_join() left_join(results, by = "id") ``` ```r # Show combined data set combined ``` ``` ## # A tibble: 5 x 6 ## id age arm arm_lab t_1 t_2 ## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> ## 1 1 37 1 placebo 123 135 ## 2 2 65 2 drug 143 140 ## 3 3 57 2 drug NA NA ## 4 4 34 1 placebo 100 105 ## 5 5 45 2 drug NA NA ``` ] --- # Organisation .pull-left4[ Organisation functions help you change the organisation of your data by <high>sorting rows</high> by variables, <high>filter rows</high> based on criteria, <high>select variables</high> (etc). | Function| Description| |:--------|:----|:-------------| | `arrange()` |<high>Sort rows</high> by variables| | `slice()`| <high>Select rows</high> by location| | `filter()` | <high>Select rows</high> by criteria| | `select()`| <high>Select variables</high>| ] .pull-right55[ ```r # combined tibble combined ``` ``` ## # A tibble: 5 x 6 ## id age arm arm_lab t_1 t_2 ## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> ## 1 1 37 1 placebo 123 135 ## 2 2 65 2 drug 143 140 ## 3 3 57 2 drug NA NA ## 4 4 34 1 placebo 100 105 ## 5 5 45 2 drug NA NA ``` ] --- # `arrange()` .pull-left4[ Use `arrange()` to <high>sort rows</high> in increasing or decreasing (using `desc()`) order of one or more variables. ```r tibble %>% arrange(A, B) ``` To sort in descending order, use `desc()` ```r tibble %>% arrange(desc(A), B) ``` ] .pull-right55[ Sort by `arm`. ```r combined %>% arrange(arm) # Sort by arm ``` ``` ## # A tibble: 5 x 6 ## id age arm arm_lab t_1 t_2 ## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> ## 1 1 37 1 placebo 123 135 ## 2 4 34 1 placebo 100 105 ## 3 2 65 2 drug 143 140 ## 4 3 57 2 drug NA NA ## 5 5 45 2 drug NA NA ``` ] --- # `arrange()` .pull-left4[ Use `arrange()` to <high>sort rows</high> in increasing or decreasing (using `desc()`) order of one or more variables. ```r tibble %>% arrange(A, B) ``` To sort in descending order, use `desc()` ```r tibble %>% arrange(desc(A), B) ``` ] .pull-right55[ Sort by `arm` and then `age`. ```r combined %>% arrange(arm, age) # Sort by arm then age ``` ``` ## # A tibble: 5 x 6 ## id age arm arm_lab t_1 t_2 ## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> ## 1 4 34 1 placebo 100 105 ## 2 1 37 1 placebo 123 135 ## 3 5 45 2 drug NA NA ## 4 3 57 2 drug NA NA ## 5 2 65 2 drug 143 140 ``` ] --- # `slice()` .pull-left4[ Use `slice()` to <high>select rows</high> (or remove) by row number. Use `c()`, `a:b`, or `seq()` to create row numbers ```r # Integer vector c(2, 6, 10) ``` ``` ## [1] 2 6 10 ``` ```r # Integer vector of 0 to 5 0:5 ``` ``` ## [1] 0 1 2 3 4 5 ``` ] .pull-right55[ Select rows `3` and `5`. ```r # Rows 3 and 5 only combined %>% slice(c(3, 5)) ``` ``` ## # A tibble: 2 x 6 ## id age arm arm_lab t_1 t_2 ## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> ## 1 3 57 2 drug NA NA ## 2 5 45 2 drug NA NA ``` ] --- # `slice()` .pull-left4[ Use `slice()` to <high>select rows</high> (or remove) by row number. Use `c()`, `a:b`, or `seq()` to create row numbers ```r # Integer vector c(2, 6, 10) ``` ``` ## [1] 2 6 10 ``` ```r # Integer vector of 0 to 5 0:5 ``` ``` ## [1] 0 1 2 3 4 5 ``` ] .pull-right55[ Select rows `1` through `4`. ```r # First 4 rows combined %>% slice(1:4) ``` ``` ## # A tibble: 4 x 6 ## id age arm arm_lab t_1 t_2 ## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> ## 1 1 37 1 placebo 123 135 ## 2 2 65 2 drug 143 140 ## 3 3 57 2 drug NA NA ## 4 4 34 1 placebo 100 105 ``` ] --- # `filter()` .pull-left4[ Use `filter()` to <high>select rows</high> (or remove) based on logical statements. <high>Chain</high> logical comparison operators with `&` (AND) and `|` (OR). <high>`==`</high> - is equal to<br> <high>`<`</high>, <high>`>`</high> - smaller/greater than<br> <high>`≤`</high>, <high>`≥`</high> - smaller/greater than or equal<br> <high>`&`</high>, <high>`&&`</high> - logical AND<br> <high>`|`</high>, <high>`||`</high> - logical OR<br> ] .pull-right55[ Select <i>only</i> patients over `30`. ```r # Filter patients older than 30 combined %>% filter(age > 30) ``` ``` ## # A tibble: 5 x 6 ## id age arm arm_lab t_1 t_2 ## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> ## 1 1 37 1 placebo 123 135 ## 2 2 65 2 drug 143 140 ## 3 3 57 2 drug NA NA ## 4 4 34 1 placebo 100 105 ## 5 5 45 2 drug NA NA ``` ] --- # `filter()` .pull-left4[ Use `filter()` to <high>select rows</high> (or remove) based on logical statements. <high>Chain</high> logical comparison operators with `&` (AND) and `|` (OR). <high>`==`</high> - is equal to<br> <high>`<`</high>, <high>`>`</high> - smaller/greater than<br> <high>`<=`</high>, <high>`>=`</high> - smaller/greater than or equal<br> <high>`&`</high>, <high>`&&`</high> - logical AND<br> <high>`|`</high>, <high>`||`</high> - logical OR<br> ] .pull-right55[ Select <i>only</i> patients over `30` given arm is `'drug'`. ```r # Filter patients older than 30 given drug combined %>% filter(age > 30 & arm_lab == "drug") ``` ``` ## # A tibble: 3 x 6 ## id age arm arm_lab t_1 t_2 ## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> ## 1 2 65 2 drug 143 140 ## 2 3 57 2 drug NA NA ## 3 5 45 2 drug NA NA ``` ] --- # `select()` .pull-left4[ Use `select()` to <high>select variables</high> (and remove all others) ```r # Select variables A, B tibble %>% select(A, B) ``` <high>Remove variables</high> with `-`. ```r # Select everything BUT A tibble %>% select(-A) ``` ] .pull-right55[ Select variables `id` and `arm`. ```r combined %>% select(id, arm) # Select id and arm variables ``` ``` ## # A tibble: 5 x 2 ## id arm ## <dbl> <dbl> ## 1 1 1 ## 2 2 2 ## 3 3 2 ## 4 4 1 ## 5 5 2 ``` ] --- # `select()` .pull-left4[ Use `select()` to <high>select variables</high> (and remove all others) ```r # Select variables A, B tibble %>% select(A, B) ``` <high>Remove variables</high> with `-`. ```r # Select everything BUT A tibble %>% select(-A) ``` ] .pull-right55[ Select everything <i>except</i> id ```r combined %>% select(-id) # Everything BUT id ``` ``` ## # A tibble: 5 x 5 ## age arm arm_lab t_1 t_2 ## <dbl> <dbl> <chr> <dbl> <dbl> ## 1 37 1 placebo 123 135 ## 2 65 2 drug 143 140 ## 3 57 2 drug NA NA ## 4 34 1 placebo 100 105 ## 5 45 2 drug NA NA ``` ] --- # Long and wide formats .pull-left4[ Some functions require data to be in a certain shape, that is to be either in a wide or a long format. Use <high>`gather()`</high> and <high>`spread()`</high> from the `tidyr` package to change a tibble between <high>wide</high> and <high>long</high> formats. <br> | Function | Result | |:----------|:-------| |`gather()`|<high>wide → long</high> format| |`spread()`|<high>long → wide</high> format| ] .pull-right55[ <p align="center"> <img src="image/longwide.png" height=420px> </p> ] --- # `gather()` .pull-left45[ ```r # Show original data (wide) combined ``` ``` ## # A tibble: 5 x 6 ## id age arm arm_lab t_1 t_2 ## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> ## 1 1 37 1 placebo 123 135 ## 2 2 65 2 drug 143 140 ## 3 3 57 2 drug NA NA ## 4 4 34 1 placebo 100 105 ## 5 5 45 2 drug NA NA ``` ] .pull-right45[ ```r # "Gather" wide data to long combined %>% gather(time, # New group variable value, # New target variable -id) # Omit id ``` ``` ## # A tibble: 25 x 3 ## id time value ## <dbl> <chr> <chr> ## 1 1 age 37 ## 2 2 age 65 ## 3 3 age 57 ## 4 4 age 34 ## 5 5 age 45 ## 6 1 arm 1 ## 7 2 arm 2 ## 8 3 arm 2 ## 9 4 arm 1 ## 10 5 arm 2 ## # … with 15 more rows ``` ] --- # `gather()` .pull-left45[ ```r # Show original data (wide) combined ``` ``` ## # A tibble: 5 x 6 ## id age arm arm_lab t_1 t_2 ## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> ## 1 1 37 1 placebo 123 135 ## 2 2 65 2 drug 143 140 ## 3 3 57 2 drug NA NA ## 4 4 34 1 placebo 100 105 ## 5 5 45 2 drug NA NA ``` ] .pull-right45[ ```r # "Gather" wide data to long combined %>% gather(time, # New group variable value, # New target variable -id, -age, -arm, -arm_lab) # Omit variables ``` ``` ## # A tibble: 10 x 6 ## id age arm arm_lab time value ## <dbl> <dbl> <dbl> <chr> <chr> <dbl> ## 1 1 37 1 placebo t_1 123 ## 2 2 65 2 drug t_1 143 ## 3 3 57 2 drug t_1 NA ## 4 4 34 1 placebo t_1 100 ## 5 5 45 2 drug t_1 NA ## 6 1 37 1 placebo t_2 135 ## 7 2 65 2 drug t_2 140 ## 8 3 57 2 drug t_2 NA ## 9 4 34 1 placebo t_2 105 ## 10 5 45 2 drug t_2 NA ``` ] --- # `spread()` .pull-left45[ ```r # Show long data combined %>% gather(time, # New group variable value, # New target variable -id, -age, -arm, -arm_lab) # Omit variables ``` ``` ## # A tibble: 10 x 6 ## id age arm arm_lab time value ## <dbl> <dbl> <dbl> <chr> <chr> <dbl> ## 1 1 37 1 placebo t_1 123 ## 2 2 65 2 drug t_1 143 ## 3 3 57 2 drug t_1 NA ## 4 4 34 1 placebo t_1 100 ## 5 5 45 2 drug t_1 NA ## 6 1 37 1 placebo t_2 135 ## 7 2 65 2 drug t_2 140 ## 8 3 57 2 drug t_2 NA ## 9 4 34 1 placebo t_2 105 ## 10 5 45 2 drug t_2 NA ``` ] .pull-right45[ ```r # "Gather" wide data to long long_combined = combined %>% gather(time, # New group variable value, # New target variable -id, -age, -arm, -arm_lab) # Omit variables # "Spread" long data to wide long_combined %>% spread(time, # Old group variable value) # Old target variable ``` ``` ## # A tibble: 5 x 6 ## id age arm arm_lab t_1 t_2 ## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> ## 1 1 37 1 placebo 123 135 ## 2 2 65 2 drug 143 140 ## 3 3 57 2 drug NA NA ## 4 4 34 1 placebo 100 105 ## 5 5 45 2 drug NA NA ``` ] --- class: middle, center <h1><a href="https://therbootcamp.github.io/R4DS_2019Feb/_sessions/Wrangling/Wrangling_practical.html">Practical</a></h1>