class: center, middle, inverse, title-slide

# Wrangling
### Intro to data visualization with ggplot2
The R Bootcamp
### November 2021

---

layout: true

<div class="my-footer">
<span style="text-align:center">
<span>
<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/>
</span>
<a href="https://therbootcamp.github.io/">
<span style="padding-left:82px">
<font color="#7E7E7E">
therbootcamp.github.io
</font>
</span>
</a>
<a href="therbootcamp.github.io">
<font color="#7E7E7E">
Intro to data visualization with ggplot2 | November 2021
</font>
</a>
</span>
</div>

---

<!---
.pull-left45[

# What is "Wrangling"?

<ul>
<li class="m1"><span><high>Transform</high>
<br><br>
<ul class="level">
<li><span>Change column names</span></li>
<li><span>Create new variables</span></li>
</ul></span></li>
<li class="m2"><span><high>Organize</high>
<br><br>
<ul class="level">
<li><span>Sort rows</span></li>
<li><span>Join data sets</span></li>
<li><span>Transpose data</span></li>
</ul></span></li>
<li class="m3"><span><high>Aggregate</high>
<br><br>
<ul class="level">
<li><span>Build groups</span></li>
<li><span>Calculate statistics</span></li>
</ul></span></li>
</ul>

]

.pull-right5[

<br>

<p align="center">
<img src="image/wrangling_eng.png" height = "530px">
</p>

]
--->

.pull-left3[

# Tidyverse

<ul>
<li class="m1"><span>The tidyverse is...</span></li><br>
<ul class="level">
<li><span>A collection of user-friendly <high>packages</high> for analyzing <high>tidy data</high></span></li><br>
<li><span>An <high>ecosystem</high> for analytics and data science with common design principles</span></li><br>
<li><span>A <high>dialect</high> of the R language</span></li>
</ul>
</ul>

]

.pull-right65[

<br><br>

<p align="center">
<img src="image/tidyverse_wrangling.png" height = "520px">
</p>

]

---

# <mono>%>%</mono>

.pull-left45[

<ul>
<li class="m1"><span>The <high>novel pipe operator</high> from the <a href="https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html"><mono>magrittr</mono></a> package
makes chaining commands easy.</span></li>
</ul>

<br>

<p align="center">
<img src="image/magrittr_hex.png" height = "280px">
</p>

]

.pull-right45[

```r
# Numeric vector
score <- c(8, 4, 6, 3, 7, 3)
score
```

```
## [1] 8 4 6 3 7 3
```

```r
# Mean: Base-R-style
mean(score)
```

```
## [1] 5.167
```

```r
# Mean: pipe-style
score %>% mean()
```

```
## [1] 5.167
```

]

---

# <mono>%>%</mono>

.pull-left45[

<ul>
<li class="m1"><span>The <high>novel pipe operator</high> from the <a href="https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html"><mono>magrittr</mono></a> package makes chaining commands easy.</span></li>
</ul>

<br>

<p align="center">
<img src="image/magrittr_hex.png" height = "280px">
</p>

]

.pull-right45[

<p align="center">
<img src="image/pipe_en.png" height = "400px">
</p>

]

---

# <mono>readr</mono>

.pull-left45[

<ul>
<li class="m1"><span>Benefits over <mono>read.csv</mono>:</span></li>
<ul class="level">
<li><span>Better type inference</span></li>
<li><span>Avoids <mono>factors</mono></span></li>
<li><span>Produces <highm>tibble</highm></span></li>
</ul>
</ul>

<br>

<p align="center">
<img src="https://github.com/rstudio/hex-stickers/raw/master/PNG/readr.png" height=240px>
</p>

]

.pull-right45[

```r
# Read in SDG hackathon data
project_sdgs <- read_csv("1_Data/sdg_hackathon_data.zip")
project_sdgs
```

```
## # A tibble: 276,262 × 13
##   project_number project_title
##            <dbl> <chr>
## 1         184044 A safe haven for …
## 2         184044 A safe haven for …
## 3         184044 A safe haven for …
## 4         184044 A safe haven for …
## 5         184044 A safe haven for …
## # … with 276,257 more rows, and 11
## #   more variables:
## #   keywords <chr>,
## #   responsible_applicant <chr>,
## #   start_date <date>,
## #   end_date <date>,
## #   university <chr>, …
```

]

---

# <mono>tibble</mono>

.pull-left45[

<ul>
<li class="m1"><span>Benefits over <mono>data.frame</mono>:</span></li>
<ul class="level">
<li><span><high>Better printing</high>: More informative and cleaner</span></li>
<li><span>More consistent subsetting</span></li>
</ul>
</ul>

<br>

<p align="center">
<img src="https://github.com/rstudio/hex-stickers/raw/master/PNG/tibble.png" height=240px>
</p>

]

.pull-right45[

```r
# Read in SDG hackathon data
project_sdgs <- read_csv("1_Data/sdg_hackathon_data.zip")
project_sdgs
```

```
## # A tibble: 276,262 × 13
##   project_number project_title
##            <dbl> <chr>
## 1         184044 A safe haven for …
## 2         184044 A safe haven for …
## 3         184044 A safe haven for …
## 4         184044 A safe haven for …
## 5         184044 A safe haven for …
## # … with 276,257 more rows, and 11
## #   more variables:
## #   keywords <chr>,
## #   responsible_applicant <chr>,
## #   start_date <date>,
## #   end_date <date>,
## #   university <chr>, …
```

]

---

.pull-left45[

# <mono>dplyr</mono>

<ul>
<li class="m1"><span>Benefits over Base R:</span></li>
<ul class="level">
<li><span><high>No more brackets</high></span></li>
<li><span><high>Data masking</high></span></li>
<li><span>Tidy selection</span></li>
<li><span>Intuitively named functions</span></li>
</ul>
</ul>

<br>

<p align="center">
<img src="https://github.com/rstudio/hex-stickers/raw/master/PNG/dplyr.png" height=240px>
</p>

]

.pull-right5[

<br><br>

<table cellspacing="0" cellpadding="0" class="clean_table" width="100%">
<col width="42%">
<col width="58%">
<tr>
<td><b>Key verbs</b></td>
<td><b>Purpose</b></td>
</tr>
<tr>
<td style="padding-top:20px"><i>Transformation</i></td>
<td></td>
</tr>
<tr>
<td><mono>rename()</mono></td>
<td>Rename columns</td>
</tr>
<tr>
<td><mono>mutate()</mono></td>
<td>Create/change columns</td>
</tr>
<tr>
<td style="padding-top:20px"><i>Organization</i></td>
<td></td>
</tr>
<tr>
<td><mono>arrange()</mono></td>
<td>Sort</td>
</tr>
<tr>
<td><mono>select()</mono></td>
<td>Select variables</td>
</tr>
<tr>
<td><mono>slice()</mono>, <mono>filter()</mono></td>
<td>Select rows</td>
</tr>
<tr>
<td><mono>distinct()</mono></td>
<td>Retain unique cases</td>
</tr>
<tr>
<td><mono>left_join()</mono>,
<mono>inner_join()</mono>, etc.</td>
<td>Join data sets</td>
</tr>
<tr>
<td style="padding-top:20px"><i>Aggregation</i></td>
<td></td>
</tr>
<tr>
<td><mono>summarize()</mono></td>
<td>Calculate statistics</td>
</tr>
<tr>
<td><mono>group_by()</mono></td>
<td>Summarize group-wise</td>
</tr>
</table>

]

---

# `mutate()`

.pull-left4[

```r
# Create new column
TIBBLE %>%
  mutate(NAME1 = MUTATE_FUN(VAR1),
         NAME2 = MUTATE_FUN(VAR2))
```

]

.pull-right5[

```r
project_sdgs %>%
  # Create year column (year() is from lubridate)
  mutate(year = year(start_date))
```

```
## # A tibble: 276,262 × 14
##   project_number project_title
##            <dbl> <chr>
## 1         184044 A safe haven for …
## 2         184044 A safe haven for …
## 3         184044 A safe haven for …
## 4         184044 A safe haven for …
## 5         184044 A safe haven for …
## 6         184044 A safe haven for …
## 7         184044 A safe haven for …
## 8         184044 A safe haven for …
## # … with 276,254 more rows, and 12
## #   more variables:
## #   keywords <chr>,
## #   responsible_applicant <chr>,
## #   start_date <date>,
## #   end_date <date>, …
```

]

---

# `select()`

.pull-left4[

```r
# Select two columns
TIBBLE %>%
  select(VAR1, VAR2)

# Select everything but VAR1
TIBBLE %>%
  select(-VAR1)
```

]

.pull-right5[

```r
project_sdgs %>%
  mutate(year = year(start_date)) %>%
  # Select columns
  select(project_number, year)
```

```
## # A tibble: 276,262 × 2
##   project_number  year
##            <dbl> <dbl>
## 1         184044  2019
## 2         184044  2019
## 3         184044  2019
## 4         184044  2019
## 5         184044  2019
## 6         184044  2019
## 7         184044  2019
## 8         184044  2019
## # … with 276,254 more rows
```

]

---

# `distinct()`

.pull-left4[

```r
# Retain distinct cases
TIBBLE %>%
  distinct()

# Retain distinct cases of variable
TIBBLE %>%
  distinct(VAR1)
```

]

.pull-right5[

```r
project_sdgs %>%
  mutate(year = year(start_date)) %>%
  select(project_number, year) %>%
  # Retain distinct cases
  distinct()
```

```
## # A tibble: 32,375 × 2
##   project_number  year
##            <dbl> <dbl>
## 1         184044  2019
## 2         147530  2013
## 3         130596  2010
## 4         130258  2010
## 5         200917  2021
## 6         154235  2014
## 7         140013  2011
## 8         198085  2020
## # … with 32,367 more rows
```

]

---

# `summarize()`

.pull-left4[

```r
# Create new summary variables
TIBBLE %>%
  summarize(
    NAME1 = SUMMARY_FUN(VAR1),
    NAME2 = SUMMARY_FUN(VAR2)
  )
```

]

.pull-right5[

```r
project_sdgs %>%
  mutate(year = year(start_date)) %>%
  select(project_number, year) %>%
  distinct() %>%
  # Calculate statistics
  summarize(n = n())
```

```
## # A tibble: 1 × 1
##       n
##   <int>
## 1 32375
```

]

---

# `group_by()`

.pull-left4[

```r
# Create grouped summary variables
TIBBLE %>%
  group_by(GROUP_VAR) %>%
  summarize(
    NAME1 = SUMMARY_FUN(VAR1),
    NAME2 = SUMMARY_FUN(VAR2)
  )
```

]

.pull-right5[

```r
project_sdgs %>%
  mutate(year = year(start_date)) %>%
  select(project_number, year) %>%
  distinct() %>%
  # Calculate group-wise statistics
  group_by(year) %>%
  summarize(n = n())
```

```
## # A tibble: 20 × 2
##    year     n
##   <dbl> <int>
## 1  2001     1
## 2  2003     3
## 3  2005     4
## 4  2006    32
## 5  2007    41
## 6  2008   442
## 7  2009  1506
## 8  2010  2036
## # … with 12 more rows
```

]

---

# `arrange()`

.pull-left4[

```r
# Sort ascending
TIBBLE %>%
  arrange(VAR1, VAR2)

# Sort descending w/ desc()
TIBBLE %>%
  arrange(desc(VAR1), VAR2)
```

]

.pull-right5[

```r
project_sdgs %>%
  mutate(year = year(start_date)) %>%
  select(project_number, year) %>%
  distinct() %>%
  group_by(year) %>%
  summarize(n = n()) %>%
  # Sort descending by year
  arrange(desc(year))
```

```
## # A tibble: 20 × 2
##    year     n
##   <dbl> <int>
## 1  2022    19
## 2  2021  1679
## 3  2020  2401
## 4  2019  2643
## 5  2018  2594
## 6  2017  2810
## 7  2016  2700
## 8  2015  2839
## # … with 12 more rows
```

]

---

# `filter()`

.pull-left4[

```r
# Filter using logical comparisons
TIBBLE %>%
  filter(VAR1 == VAL1,
         VAR2 > VAL2,
         VAR3 < VAL3,
         VAR4 == VAL4 | VAR5 < VAL5)
```

]

.pull-right5[

```r
project_sdgs %>%
  mutate(year = year(start_date)) %>%
  select(project_number, year) %>%
  distinct() %>%
  group_by(year) %>%
  summarize(n = n()) %>%
  # Keep only the years 2009 to 2021
  filter(year > 2008 & year < 2022)
```

```
## # A tibble: 13 × 2
##    year     n
##   <dbl> <int>
## 1  2009  1506
## 2  2010  2036
## 3  2011  2098
## 4  2012  2806
## 5  2013  2822
## 6  2014  2899
## 7  2015  2839
## 8  2016  2700
## # … with 5 more rows
```

]

---

.pull-left4[

# `*_join()`

```r
# Left join: keep all rows of TIBBLE1
TIBBLE1 %>%
  left_join(TIBBLE2, by = c("KEY1" = "KEY2"))

# Right join: keep all rows of TIBBLE2
TIBBLE1 %>%
  right_join(TIBBLE2, by = c("KEY1" = "KEY2"))
```

]

.pull-right5[

<br>

```r
# Create year variable
project_sdgs <- project_sdgs %>%
  mutate(year = year(start_date))

# Count projects per year
project_sdgs %>%
  select(project_number, year) %>%
  distinct() %>%
  group_by(year) %>%
  summarize(n = n()) %>%
  # Join back to project_sdgs tibble
  right_join(project_sdgs, by = "year")
```

```
## # A tibble: 276,262 × 15
##    year     n project_number
##   <dbl> <int>          <dbl>
## 1  2001     1          65917
## 2  2001     1          65917
## 3  2001     1          65917
## 4  2001     1          65917
## 5  2001     1          65917
## 6  2001     1          65917
## # … with 276,256 more rows, and 12
## #   more variables:
## #   project_title <chr>,
## #   keywords <chr>,
## #   responsible_applicant <chr>, …
```

]

<!---
# <mono>tidyr</mono>

.pull-left4[

<ul>
<li class="m1"><span>Benefits over Base R:</span></li>
<ul class="level">
<li><span>Did not exist before.</span></li>
</ul>
</ul>

<br>

<p align="center">
<img src="https://github.com/rstudio/hex-stickers/raw/master/PNG/tidyr.png" height=240px>
</p>

]

.pull-right5[

<p align="center">
<img src="https://github.com/gadenbuie/tidyexplain/raw/master/images/tidyr-spread-gather.gif" height=420px><br>
<font style="font-size:10px">adapted from <a href="https://github.com/gadenbuie/tidyexplain">tidyexplain</a></font>
</p>

]

# `pivot_longer()`

.pull-left4[

```r
# wide to long
TIBBLE %>%
  pivot_longer(cols = VARS,
               names_to = NAME1,
               values_to = NAME2)
```

]

.pull-right5[

```r
# wide to long
basel %>%
  select(year, quarter, income_mean, wealth_mean) %>%
  pivot_longer(c(income_mean, wealth_mean))
```

]
--->

---

class: middle, center

<h1><a href="https://therbootcamp.github.io/SDGDataViz_2021Nov">Schedule</a></h1>
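
---

# Putting the verbs together

A minimal sketch of the pipeline built up over the `mutate()` to `filter()` slides, run on a small made-up tibble so that it works without the hackathon data. The `toy_projects` tibble, its values, and the year cut-offs are assumptions for illustration only; `year()` is taken from the <mono>lubridate</mono> package.

```r
library(dplyr)      # verbs, %>% and tibble()
library(lubridate)  # year()

# Toy data (made up): projects repeat across rows,
# as they do in the hackathon data
toy_projects <- tibble(
  project_number = c(1, 1, 2, 3, 3, 4),
  start_date = as.Date(c("2019-03-01", "2019-03-01", "2019-06-15",
                         "2020-01-10", "2020-01-10", "2021-09-01"))
)

projects_per_year <- toy_projects %>%
  # Create year column
  mutate(year = year(start_date)) %>%
  # Keep one row per project and year
  select(project_number, year) %>%
  distinct() %>%
  # Count projects per year
  group_by(year) %>%
  summarize(n = n()) %>%
  # Newest year first
  arrange(desc(year)) %>%
  # Keep only 2019 and 2020
  filter(year > 2018 & year < 2021)

projects_per_year
```

Because `distinct()` runs before `group_by()`, each project is counted once per year even though it appears in several rows.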