# Wrangling
### Intro to data visualization with ggplot2 <a href='https://therbootcamp.github.io'> The R Bootcamp </a>
### November 2021

---

<div class="my-footer">
 
 
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/>
 
 <a href="https://therbootcamp.github.io/">
 
 
 therbootcamp.github.io
 
 
 </a>
 <a href="therbootcamp.github.io">
 
 Intro to data visualization with ggplot2 | November 2021
 
 </a>
 
 </div>

---

<!---
.pull-left45[

# What is "Wrangling"?

<ul>
 <li class="m1"><high>Transform</high>
 
 <ul class="level">
 <li>Change column names</li>
 <li>Create new variables</li>
 </ul></li>
 <li class="m2"><high>Organize</high>
 
 <ul class="level">
 <li>Sort rows</li>
 <li>Join data sets</li>
 <li>Transpose data</li>
 </ul></li>
 <li class="m3"><high>Aggregate</high>
 
 <ul class="level">
 <li>Build groups</li>
 <li>Calculate statistics</li>
 </ul></li>
</ul>

]

]

--->

# Tidyverse

<ul>
 <li class="m1">The tidyverse is...
 <ul class="level">
 <li>A collection of user-friendly <high>packages</high> for analyzing <high>tidy data</high></li>
 <li>An <high>ecosystem</high> for analytics and data science with common design principles</li>
 <li>A <high>dialect</high> of the R language</li>
 </ul></li>
</ul>

---

# <mono>%>%</mono>

<ul>
 <li class="m1">The <high>pipe operator</high> <mono>%>%</mono> from the <a href="https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html"><mono>magrittr</mono></a> package passes the value on its left as the first argument of the function on its right, which makes chaining commands easy.</li>
</ul>

```r
# Numeric vector
score <- c(8, 4, 6, 3, 7, 3)
score
```

```
## [1] 8 4 6 3 7 3
```

```r
# Mean: Base-R-style
mean(score)
```

```
## [1] 5.167
```

```r
# Mean: pipe-style (%>% is provided by magrittr and re-exported by dplyr)
library(magrittr)

score %>%
  mean()
```

```
## [1] 5.167
```
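
The benefit of the pipe shows once several functions are chained. The following is a minimal sketch that reuses the `score` vector from above; the variable names `nested` and `piped` are made up for illustration.

```r
library(magrittr)

score <- c(8, 4, 6, 3, 7, 3)

# Nested base-R call: has to be read from the inside out
nested <- round(mean(head(sort(score), 3)), 1)

# Same computation as a pipeline: reads from top to bottom
piped <- score %>%
  sort() %>%      # order the scores ascending
  head(3) %>%     # keep the three lowest scores
  mean() %>%      # average them
  round(1)        # round to one decimal place
```

Both versions compute the same value; the pipeline only changes the order in which the steps are written.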

---


# <mono>readr</mono>

<ul>
 <li class="m1">Benefits over <mono>read.csv</mono>:
 <ul class="level">
 <li>Better type inference</li>
 <li>Does not convert strings to <mono>factors</mono></li>
 <li>Produces a <highm>tibble</highm></li>
 </ul></li>
</ul>

```r
# Read in SDG hackathon data (read_csv() can read a zipped CSV directly)
library(readr)

project_sdgs <- 
  read_csv("1_Data/sdg_hackathon_data.zip")

project_sdgs
```

```
## # A tibble: 276,262 × 13
## project_number project_title 
## <dbl> <chr> 
## 1 184044 A safe haven for …
## 2 184044 A safe haven for …
## 3 184044 A safe haven for …
## 4 184044 A safe haven for …
## 5 184044 A safe haven for …
## # … with 276,257 more rows, and 11
## # more variables:
## # keywords <chr>,
## # responsible_applicant <chr>,
## # start_date <date>,
## # end_date <date>,
## # university <chr>, …
```
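
Type inference can be tried without any file on disk. The sketch below assumes readr 2.0 or later, where a string wrapped in `I()` is treated as literal data and `show_col_types` is available; the CSV content and the name `toy` are invented for illustration.

```r
library(readr)

# Hypothetical inline CSV; I() marks the string as literal data, not a path
csv_text <- I("id,name,joined\n1,Ada,2021-11-01\n2,Bo,2021-11-15\n")

# read_csv() guesses a type for every column and returns a tibble
toy <- read_csv(csv_text, show_col_types = FALSE)

# Inspect the guessed column classes
sapply(toy, class)
```

Here `name` stays a character column rather than becoming a factor, and `joined` is parsed as a date.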

---

# <mono>tibble</mono>

<ul>
 <li class="m1">Benefits over <mono>data.frame</mono>:
 <ul class="level">
 <li><high>Better print</high>: More informative and cleaner</li>
 <li>More consistent subsetting</li>
 </ul></li>
</ul>

```r
# Read in SDG hackathon data
project_sdgs <- 
  read_csv("1_Data/sdg_hackathon_data.zip")

project_sdgs
```
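
What "more consistent subsetting" means can be sketched with a small comparison. The objects `df` and `tb` are toy data made up for this example.

```r
library(tibble)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- tibble(x = 1:3, y = c("a", "b", "c"))

# Selecting a single column with [ , ]
df_col <- df[, "x"]  # data.frame: silently drops to a plain vector
tb_col <- tb[, "x"]  # tibble: stays a one-column tibble

class(df_col)
class(tb_col)
```

With a tibble, `[` always returns a tibble, so code that follows does not have to handle two possible result types.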

---

# <mono>dplyr</mono>

<ul>
 <li class="m1">Benefits over Base R:
 <ul class="level">
 <li><high>No more brackets</high></li>
 <li><high>Data masking</high></li>
 <li>Tidy selection</li>
 <li>Intuitively named functions</li>
 </ul></li>
</ul>

<table cellspacing="0" cellpadding="0" class="clean_table" width="100%">
<col width="42%">
<col width="58%">
<tr>
<td>Key verbs</td>
<td>Purpose</td>
</tr>
<tr>
<td style="padding-top:20px">Transformation</td>
<td></td>
</tr>
<tr>
<td><mono>rename()</mono></td>
<td>Rename columns</td>
</tr>
<tr>
<td><mono>mutate()</mono></td>
<td>Create/change columns</td>
</tr>
<tr>
<td style="padding-top:20px">Organization</td>
<td></td>
</tr>
<tr>
<td><mono>arrange()</mono></td>
<td>Sort</td>
</tr>
<tr>
<td><mono>select()</mono></td>
<td>Select variables</td>
</tr>
<tr>
<td><mono>slice()</mono>, <mono>filter()</mono></td>
<td>Select rows</td>
</tr>
<tr>
<td><mono>distinct()</mono></td>
<td>Retain unique cases</td>
</tr>
<tr>
<td><mono>left_join()</mono>, <mono>inner_join()</mono>, etc.</td>
<td>Join data sets</td>
</tr>
<tr>
<td style="padding-top:20px">Aggregation</td>
<td></td>
</tr>
<tr>
<td><mono>summarize()</mono></td>
<td>Calculate statistics</td>
</tr>
<tr>
<td><mono>group_by()</mono></td>
<td>Define groups for group-wise summaries</td>
</tr>
</table>
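
To illustrate "no more brackets" and "data masking", here is a minimal sketch that does the same row and column selection in Base R and in dplyr. The tibble `toy` and its columns are hypothetical.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  project = c("A", "B", "C"),
  year    = c(2019, 2020, 2021),
  budget  = c(10, 25, 40)
)

# Base R: brackets, quoted names, and the data name repeated
base_result <- toy[toy$year > 2019, c("project", "budget")]

# dplyr: data masking lets column names be written bare
dplyr_result <- toy %>%
  filter(year > 2019) %>%
  select(project, budget)
```

Both approaches return the same two rows; the dplyr version names each step with a verb from the table above.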

---

# `mutate()`

```r
# Create new column
TIBBLE %>% 
  mutate(NAME1 = MUTATE_FUN(VAR1),
         NAME2 = MUTATE_FUN(VAR2))
```

```r
# year() is provided by lubridate
library(lubridate)

project_sdgs %>%
  
  # Create year column
  mutate(year = year(start_date))
```

```
## # A tibble: 276,262 × 14
## project_number project_title 
## <dbl> <chr> 
## 1 184044 A safe haven for …
## 2 184044 A safe haven for …
## 3 184044 A safe haven for …
## 4 184044 A safe haven for …
## 5 184044 A safe haven for …
## 6 184044 A safe haven for …
## 7 184044 A safe haven for …
## 8 184044 A safe haven for …
## # … with 276,254 more rows, and 12
## # more variables:
## # keywords <chr>,
## # responsible_applicant <chr>,
## # start_date <date>,
## # end_date <date>, …
```
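
A single `mutate()` call can create several columns, and later columns can use earlier ones. The sketch below runs on a hypothetical two-row tibble that mirrors the date columns of `project_sdgs`; `duration_days` and `long_project` are made-up names.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data with the same date columns as project_sdgs
toy <- tibble(
  project_number = c(1, 2),
  start_date     = as.Date(c("2019-03-01", "2021-10-01")),
  end_date       = as.Date(c("2021-02-28", "2022-09-30"))
)

toy_mutated <- toy %>%
  mutate(
    year          = year(start_date),                   # extract the calendar year
    duration_days = as.numeric(end_date - start_date),  # date difference in days
    long_project  = duration_days > 365                 # uses the column created just above
  )
```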

---

# `select()`

```r
# Select two columns
TIBBLE %>% 
  select(VAR1, VAR2)

# Select everything but VAR1
TIBBLE %>% 
  select(-VAR1)
```

```r
project_sdgs %>%
  mutate(year = year(start_date)) %>% 
  
  # Select columns
  select(project_number, year)
```

```
## # A tibble: 276,262 × 2
## project_number year
## <dbl> <dbl>
## 1 184044 2019
## 2 184044 2019
## 3 184044 2019
## 4 184044 2019
## 5 184044 2019
## 6 184044 2019
## 7 184044 2019
## 8 184044 2019
## # … with 276,254 more rows
```
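
Beyond listing columns by name, `select()` supports tidy-selection helpers and renaming on the fly. This is a small sketch on hypothetical toy data; the new name `id` is chosen only for illustration.

```r
library(dplyr)

# Hypothetical toy data mirroring a few columns of project_sdgs
toy <- tibble(
  project_number = c(1, 2),
  start_date     = as.Date(c("2019-03-01", "2021-10-01")),
  end_date       = as.Date(c("2021-02-28", "2022-09-30"))
)

# Tidy selection: every column whose name ends in "_date"
dates_only <- toy %>%
  select(ends_with("_date"))

# Rename while selecting: new_name = old_name
renamed <- toy %>%
  select(id = project_number, start_date)
```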

---

# `distinct()`

```r
# Retain distinct cases
TIBBLE %>% 
  distinct()

# Retain distinct cases of variable
TIBBLE %>% 
  distinct(VAR1)
```

```r
project_sdgs %>%
  mutate(year = year(start_date)) %>% 
  select(project_number, year) %>% 
  
  # Retain distinct cases
  distinct()
```

```
## # A tibble: 32,375 × 2
## project_number year
## <dbl> <dbl>
## 1 184044 2019
## 2 147530 2013
## 3 130596 2010
## 4 130258 2010
## 5 200917 2021
## 6 154235 2014
## 7 140013 2011
## 8 198085 2020
## # … with 32,367 more rows
```
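
When `distinct()` is given a variable, it returns only that variable by default. The sketch below shows the `.keep_all = TRUE` argument, which keeps the remaining columns from the first row of each distinct case. The `sdg` column and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: project 1 appears twice
toy <- tibble(
  project_number = c(1, 1, 2),
  sdg            = c("SDG-03", "SDG-13", "SDG-07")
)

# Only the distinct values of project_number
ids <- toy %>%
  distinct(project_number)

# One row per project_number, other columns taken from the first match
first_rows <- toy %>%
  distinct(project_number, .keep_all = TRUE)
```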

---

# `summarize()`

```r
# Create new summary variables
TIBBLE %>%
  summarise(
    NAME1 = SUMMARY_FUN(VAR1),
    NAME2 = SUMMARY_FUN(VAR2)
  )
```

```r
project_sdgs %>%
  mutate(year = year(start_date)) %>% 
  select(project_number, year) %>% 
  distinct() %>% 
  
  # Calculate statistics 
  summarize(n = n())
```

```
## # A tibble: 1 × 1
## n
## <int>
## 1 32375
```

---

# `group_by()`

```r
# Create grouped summary variables
TIBBLE %>%
  group_by(GROUP_VAR) %>%
  summarise(
    NAME1 = SUMMARY_FUN(VAR1),
    NAME2 = SUMMARY_FUN(VAR2)
  )
```

```r
project_sdgs %>%
  mutate(year = year(start_date)) %>% 
  select(project_number, year) %>% 
  distinct() %>% 
  
  # Calculate statistics per year
  group_by(year) %>% 
  summarize(n = n())
```

```
## # A tibble: 20 × 2
## year n
## <dbl> <int>
## 1 2001 1
## 2 2003 3
## 3 2005 4
## 4 2006 32
## 5 2007 41
## 6 2008 442
## 7 2009 1506
## 8 2010 2036
## # … with 12 more rows
```
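
Counting rows per group is common enough that dplyr offers `count()` as a shortcut for `group_by()` followed by `summarize(n = n())`. A minimal sketch on hypothetical toy data:

```r
library(dplyr)

# Hypothetical toy data: one row per project
toy <- tibble(
  project_number = 1:5,
  year           = c(2019, 2019, 2020, 2021, 2021)
)

# Explicit version, as used on the slide
by_hand <- toy %>%
  group_by(year) %>%
  summarize(n = n())

# Shortcut that groups, counts, and ungroups in one step
shortcut <- toy %>%
  count(year)
```

The explicit version remains the more general pattern, since `summarize()` accepts any summary function, not only `n()`.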

---

# `arrange()`

```r
# Sort ascending
TIBBLE %>%
  arrange(VAR1, VAR2)

# Sort descending w/ desc()
TIBBLE %>%
  arrange(desc(VAR1), VAR2)
```

```r
project_sdgs %>%
  mutate(year = year(start_date)) %>% 
  select(project_number, year) %>% 
  distinct() %>% 
  group_by(year) %>% 
  summarize(n = n()) %>%

  # Sort descending by year
  arrange(desc(year))
```

```
## # A tibble: 20 × 2
## year n
## <dbl> <int>
## 1 2022 19
## 2 2021 1679
## 3 2020 2401
## 4 2019 2643
## 5 2018 2594
## 6 2017 2810
## 7 2016 2700
## 8 2015 2839
## # … with 12 more rows
```

---

# `filter()`

```r
# Filter using logical comparisons
# (conditions separated by commas are combined with AND)
TIBBLE %>%
  filter(VAR1 == VAL1,
         VAR2 > VAL2,
         VAR3 < VAL3,
         VAR4 == VAL4 | VAR5 < VAL5)
```

```r
project_sdgs %>%
 mutate(year = year(start_date)) %>% 
 select(project_number, year) %>% 
 distinct() %>% 
 group_by(year) %>% 
 summarize(n = n()) %>% 
 
 # Keep only the years 2009 to 2021
 filter(year > 2008 & year < 2022)
```

```
## # A tibble: 13 × 2
## year n
## <dbl> <int>
## 1 2009 1506
## 2 2010 2036
## 3 2011 2098
## 4 2012 2806
## 5 2013 2822
## 6 2014 2899
## 7 2015 2839
## 8 2016 2700
## # … with 5 more rows
```
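
The same range condition can be written in several equivalent ways. The sketch below compares `&`, comma-separated conditions, and dplyr's `between()`, which includes both bounds. The tibble `yearly` is hypothetical toy data.

```r
library(dplyr)

# Hypothetical yearly counts
yearly <- tibble(
  year = as.numeric(2007:2023),
  n    = 1:17
)

# Explicit AND, as used on the slide
with_and <- yearly %>%
  filter(year > 2008 & year < 2022)

# Comma-separated conditions are also combined with AND
with_commas <- yearly %>%
  filter(year > 2008, year < 2022)

# between() is inclusive on both ends, so the bounds shift by one
with_between <- yearly %>%
  filter(between(year, 2009, 2021))
```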

---

# `*_join()`

```r
# Left join: keep all rows of TIBBLE1
TIBBLE1 %>%
  left_join(TIBBLE2, 
            by = c("KEY1" = "KEY2"))

# Right join: keep all rows of TIBBLE2
TIBBLE1 %>%
  right_join(TIBBLE2, 
             by = c("KEY1" = "KEY2"))
```

```r
# Create year variable
project_sdgs <- project_sdgs %>%
  mutate(year = year(start_date))

# Count projects per year
project_sdgs %>% 
  select(project_number, year) %>% 
  distinct() %>% 
  group_by(year) %>% 
  summarize(n = n()) %>%

  # Join back to project_sdgs tibble
  right_join(project_sdgs, by = "year")
```

```
## # A tibble: 276,262 × 15
## year n project_number
## <dbl> <int> <dbl>
## 1 2001 1 65917
## 2 2001 1 65917
## 3 2001 1 65917
## 4 2001 1 65917
## 5 2001 1 65917
## 6 2001 1 65917
## # … with 276,256 more rows, and 12
## # more variables:
## # project_title <chr>,
## # keywords <chr>,
## # responsible_applicant <chr>, …
```
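
The join types differ in which rows they keep when a key has no match. A minimal sketch with two hypothetical tibbles, `projects` and `counts`, where the year 2021 is missing from `counts`:

```r
library(dplyr)

# Hypothetical toy data
projects <- tibble(
  project_number = c(1, 2, 3),
  year           = c(2019, 2020, 2021)
)
counts <- tibble(
  year = c(2019, 2020),
  n    = c(10L, 20L)
)

# left_join(): keeps every row of projects; unmatched rows get NA
left <- projects %>%
  left_join(counts, by = "year")

# inner_join(): keeps only rows with a match in both tibbles
inner <- projects %>%
  inner_join(counts, by = "year")
```

Here `left` has three rows with a missing `n` for 2021, while `inner` drops that project entirely.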

<!---

# <mono>tidyr</mono>

<ul>
 <li class="m1">Benefits over Base R:</li>
 <ul class="level">
 <li>Did not exist before.</li>
 </ul></li>
</ul>

]

<img src="https://github.com/gadenbuie/tidyexplain/raw/master/images/tidyr-spread-gather.gif" height=420px> 
adapted from <a href="https://github.com/gadenbuie/tidyexplain">tidyexplain</a>

]

# `pivot_longer()`

```r
# wide to long
TIBBLE %>% 
  pivot_longer(cols = VARS,
               names_to = NAME1,
               values_to = NAME2)
```

]

```r
# wide to long
basel %>% 
  select(year, quarter, 
         income_mean, wealth_mean) %>% 
  pivot_longer(c(income_mean, wealth_mean))
```
]

--->

---

<h1><a href="https://therbootcamp.github.io/SDGDataViz_2021Nov">Schedule</a></h1>