class: center, middle, inverse, title-slide

# Recap
### The R Bootcamp @ Erfurt
www.therbootcamp.com
@therbootcamp
### June 2018

---
layout: true

<div class="my-footer"><span> <a href="https://therbootcamp.github.io/"><font color="#7E7E7E">Erfurt, June 2018</font></a>                                           <a href="https://therbootcamp.github.io/"><font color="#646464">www.therbootcamp.com</font></a> </span></div>

---

# Essentials: The 2<sup>4</sup> Rules of the R Bootcamp

.pull-left4[

1. Everything is an object
2. Use `<-` to create/change objects
3. Name objects using `_`
4. Objects have classes
5. Everything happens through functions
6. Functions have arguments
7. Arguments can have defaults
8. Functions expect certain object classes
9. View help files using `?`
10. Data is stored in data frames
11. Select variables (vectors) using `$`
12. Use RStudio and projects
13. Code in the editor
14. First load packages and data
15. Comment and format for readability
16. Struggle, ask for help, struggle...

]

---

# The almighty **tidyverse**

Among the many packages available for R is a collection of high-performance, easy-to-use packages (libraries) designed specifically for handling data, known as the [tidyverse](https://www.tidyverse.org/).

The tidyverse includes:

1. `ggplot2` -- creating graphics.
2. `dplyr` -- manipulating data.
3. `tidyr` -- tidying data.
4. `readr` -- reading data from files.
5. `purrr` -- functional programming.
6. `tibble` -- modern data frames.
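
As a minimal sketch of how two of these packages work together, the example below builds a small tibble and summarises it with `dplyr`. The data (`toy_df`) and its column names are made up for illustration and are not part of the bootcamp materials; the code assumes `tibble` and `dplyr` are installed.

```r
# Load two of the packages listed above (assumes they are installed)
library(tibble)
library(dplyr)

# Hypothetical toy data, invented for this example
toy_df <- tibble(
  id  = 1:4,
  arm = c("placebo", "placebo", "drug", "drug"),
  age = c(37, 65, 57, 34)
)

# Keep rows with age > 35, then calculate the mean age per arm
toy_summary <- toy_df %>%
  filter(age > 35) %>%
  group_by(arm) %>%
  summarise(age_mean = mean(age))

toy_summary
```

Loading `library(tidyverse)` instead would attach all six packages at once.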
<br><br> <img src="http://d33wubrfki0l68.cloudfront.net/0ab849ed51b0b866ef6895c253d3899f4926d397/dbf0f/images/hex-ggplot2.png" height="200px" /><img src="http://d33wubrfki0l68.cloudfront.net/071952491ec4a6a532a3f70ecfa2507af4d341f9/c167c/images/hex-dplyr.png" height="200px" /><img src="http://d33wubrfki0l68.cloudfront.net/5f8c22ec53a1ac61684f3e8d59c623d09227d6b9/b15de/images/hex-tidyr.png" height="200px" /><img src="http://d33wubrfki0l68.cloudfront.net/66d3133b4a19949d0b9ddb95fc48da074b69fb07/7dfb6/images/hex-readr.png" height="200px" /><img src="http://d33wubrfki0l68.cloudfront.net/9221ddead578362bd17bafae5b85935334984429/37a68/images/hex-purrr.png" height="200px" /><img src="http://d33wubrfki0l68.cloudfront.net/f55c43407ae8944b985e2547fe868e5e2b3f9621/720bb/images/hex-tibble.png" height="200px" />

---

.pull-left5[

# What is Wrangling?

Transform

- Adding new columns
- Combining columns
- Splitting columns

Organise

- Moving data between columns and rows
- Merging several dataframes
- Sorting data by columns

Aggregate

- Aggregating data according to variables
- Summarising data across groups

]

.pull-right5[

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/organise_transform_aggregate.png" width="100%" style="display: block; margin: auto;" />

]

---

.pull-left35[

# dplyr

`dplyr` is a combination of 3 things:

1. **`objects`** like dataframes
2. **`functions`** that **do** things to objects
3. **`pipes`** `%>%` that string together objects and functions

<br>

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/pipe.jpg" width="70%" style="display: block; margin: auto;" />

]

.pull-right6[

<br><br><br><br>

## The pipe %>%

<br2>

`dplyr` makes extensive use of the 'pipe' `%>%`, which passes objects between functions.

<br><br>

```r
data %>%              # Start with data, AND THEN...
  FUN1 %>%            # Do FUN1, AND THEN...
  FUN2 %>%            # Do FUN2, AND THEN...
  FUN3 %>%            # Do FUN3, AND THEN...
  group_by(x, y) %>%  # Group by variables x, y
  summarise(          # Calculate one summary per group
    VAR_A_New = FUN4(X),
    VAR_B_New = FUN5(Y),
    VAR_C_New = FUN6(Z)
  )
```

]

---

.pull-left4[

# Transformation Functions

| Function| Description|
|:-------------|:----|
| `rename()` | Change column names |
| `mutate()`| Create a new column from existing columns|
| `case_when()`| Recode the values of a vector into new values|
| `left_join()` | Combine two dataframes by a shared key column|

]

.pull-right55[

<br><br><br><br>

### patients_df

```r
patients_df # Demographic data
```

```
## # A tibble: 5 x 3
##      id     b     c
##   <dbl> <dbl> <dbl>
## 1    1.   37.    1.
## 2    2.   65.    2.
## 3    3.   57.    2.
## 4    4.   34.    1.
## 5    5.   45.    2.
```

]

---

.pull-left55[

# Organisation Functions

Organisation functions help you rearrange your data by sorting rows by columns, filtering rows based on criteria, selecting columns, and so on.

| Function| Purpose|Example|
|:--------|:----|:------------|
| `arrange()` |Sort rows by columns|`df %>%`<br>`arrange(arm, age)`|
| `slice()`| Select rows by location|`df %>%`<br>`slice(1:10)`|
| `filter()` | Select specific rows by criteria|`df %>%`<br>`filter(age > 50)`|
| `select()`| Select specific columns|`df %>%`<br>`select(arm, t1)`|

]

.pull-right4[

<br><br>

Organise the `combined_df` dataframe

```r
combined_df %>%

  # Sort rows by arm then in
  # descending order of age
  arrange(arm, desc(age)) %>%

  # Only include age > 50
  filter(age > 50) %>%

  # Select these columns
  select(arm, age, t1, t2)
```

]

---

.pull-left4[

# Aggregation functions

| Function| Action|
|:-------------|:----|
| `group_by()` |Group data by levels of one or more variables |
| `summarise()` | Calculate summary statistics |

### Statistical functions

| Function| Action|
|:-------------|:----|
| `min(), max()`| Minimum, maximum |
| `mean(), median()` |Mean, median |
| `sd()` |Standard deviation|
| `sum()` | Sum|
| `n()`| Number of cases|

]

.pull-right55[

<br>

```r
# First 2 rows of raw data
combined_df %>%
  slice(1:2)

# Group and summarise!
combined_df %>%
  group_by(arm) %>%
  summarise(
    N = n(),
    age_mean = mean(age),
    t1_mean = mean(t1, na.rm = TRUE),
    t2_mean = mean(t2, na.rm = TRUE)
  )
```

]

---

## If you want to do statistics in R, there is a package for that!

.pull-left5[

<br>

| Package| Models|
|------:|:----|
| `stats`|Generalized linear models|
| `afex`| ANOVAs|
| `lme4`| Mixed effects regression|
| `rpart`| Decision trees|
| `BayesFactor`| Bayesian statistics|
| `igraph`| Network analysis|
| `neuralnet`| Neural networks|
| `MatchIt`| Matching and causal inference|
| `survival`| Longitudinal survival analysis|
| ...| Anything you can ever want!|

]

.pull-right45[

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/statistical_procedures.png" width="100%" style="display: block; margin: auto;" />

]

---

## Customising formulas

You can keep adding terms to a formula by "adding" them with `+`

```r
# Include multiple terms with +
my_glm <- glm(formula = income ~ food + alcohol + happiness + hiking,
              data = baselers)
```

To include *all* variables in a dataframe, use the catch-all notation <b>`formula = y ~ .`</b>

```r
# Use y ~ . to include ALL variables
my_glm <- glm(formula = income ~ .,
              data = baselers)
```

To specify *interaction terms*, use `x1:x2` for the interaction alone, or `x1 * x2` for both main effects plus their interaction, instead of `x1 + x2`

```r
# Include main effects of food and alcohol plus their interaction
my_glm <- glm(formula = income ~ food * alcohol,
              data = baselers)
```

---

# Today

<p><font size=6><b><a href="https://therbootcamp.github.io/Erfurt_2018June">Schedule</a>