Wrangling II

# Wrangling II
### Exploratory Data Analysis with R <a href='https://therbootcamp.github.io'>The R Bootcamp</a>
### March 2022

---

<div class="my-footer">
 
 
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/>
 
 <a href="https://therbootcamp.github.io/">
 
 
 www.therbootcamp.com
 
 
 </a>
 <a href="https://therbootcamp.github.io/">
 
 Exploratory Data Analysis with R | March 2022
 
 </a>
 
 </div>

---

# Even more `dplyr`

<ul>
 <li class="m1"><high>Transformation & Organisation</high>
 
 <ul class="level">
 <li>Replace / remove missing values</li>
 <li>Change all variables that...</li>
 <li>Rows to columns or columns to rows</li>
 </ul>
 </li>
 <li class="m2"><high>Aggregation</high>
 
 <ul class="level">
 <li>Group by variables</li>
 <li>Compute descriptive statistics</li>
 </ul>
 </li>
</ul>

]

.pull-right45[

<img src="image/wrangling.jpeg"> 
from <a href="https://datasciencebe.com/tag/data-wrangling/">datasciencebe.com</a>

]

---

# Transformation & Organisation

<ul>
 <li class="m1"><high>Transformation</high>
 
 <ul class="level">
 <li><mono>mutate_if()</mono></li>
 <li><mono>replace_na()</mono></li>
 <li><mono>drop_na()</mono></li>
 </ul>
 </li>
 <li class="m2"><high>Organisation</high>
 
 <ul class="level">
 <li><mono>starts_with(), contains(), :</mono></li>
 <li><mono>pivot_longer(), pivot_wider()</mono></li>
 </ul>
 </li>
</ul>

]

```r
patienten_ergebnisse
```

```
## # A tibble: 5 × 6
## id alter bedingung bed_label t_1 t_2
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 1 37 1 placebo 123 135
## 2 2 65 2 medikament 143 140
## 3 3 57 2 medikament NA NA
## 4 4 34 1 placebo 100 105
## 5 5 45 2 medikament NA NA
```

]

---

# `mutate_if()`

<ul>
 <li class="m1">Transformation
 
 <ul class="level">
 <li><high><mono>mutate_if()</mono></high></li>
 <li><mono>replace_na()</mono></li>
 <li><mono>drop_na()</mono></li>
 </ul>
 </li>
 <li class="m2">Organisation
 
 <ul class="level">
 <li><mono>starts_with(), contains(), :</mono></li>
 <li><mono>pivot_longer(), pivot_wider()</mono></li>
 </ul>
 </li>
</ul>

]

```r
patienten_ergebnisse %>%
  
  # Convert all numeric columns to character
  mutate_if(is.numeric, as.character)
```

```
## # A tibble: 5 × 6
## id alter bedingung bed_label t_1 t_2 
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 37 1 placebo 123 135 
## 2 2 65 2 medikament 143 140 
## 3 3 57 2 medikament <NA> <NA> 
## 4 4 34 1 placebo 100 105 
## 5 5 45 2 medikament <NA> <NA>
```

]
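
`mutate_if()` still works, but since dplyr 1.0.0 the scoped verbs are superseded by `across()`. A minimal sketch of what the equivalent call could look like, with the example tibble rebuilt by hand from the printout above (the name `as_text` is only for illustration):

```r
library(dplyr)

# Rebuild the example data from the printed tibble
patienten_ergebnisse <- tibble(
  id        = c(1, 2, 3, 4, 5),
  alter     = c(37, 65, 57, 34, 45),
  bedingung = c(1, 2, 2, 1, 2),
  bed_label = c("placebo", "medikament", "medikament", "placebo", "medikament"),
  t_1       = c(123, 143, NA, 100, NA),
  t_2       = c(135, 140, NA, 105, NA)
)

# across() + where() selects the numeric columns and converts each one,
# like mutate_if(is.numeric, as.character)
as_text <- patienten_ergebnisse %>%
  mutate(across(where(is.numeric), as.character))
```

Missing values stay missing; they only change type from numeric `NA` to character `NA`.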

---

# `replace_na()`

<ul>
 <li class="m1">Transformation
 
 <ul class="level">
 <li><mono>mutate_if()</mono></li>
 <li><high><mono>replace_na()</mono></high></li>
 <li><mono>drop_na()</mono></li>
 </ul>
 </li>
 <li class="m2">Organisation
 
 <ul class="level">
 <li><mono>starts_with(), contains(), :</mono></li>
 <li><mono>pivot_longer(), pivot_wider()</mono></li>
 </ul>
 </li>
</ul>

]

```r
patienten_ergebnisse %>%
  
  # Replace missing values with 110
  mutate(t_1 = replace_na(t_1, 110))
```

```
## # A tibble: 5 × 6
## id alter bedingung bed_label t_1 t_2
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 1 37 1 placebo 123 135
## 2 2 65 2 medikament 143 140
## 3 3 57 2 medikament 110 NA
## 4 4 34 1 placebo 100 105
## 5 5 45 2 medikament 110 NA
```

]
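
`replace_na()` can also be applied to a whole tibble, with a named list giving one replacement value per column. A small sketch on a reduced copy of the example data; 110 is the value used above, while 120 for `t_2` is a made-up value for illustration:

```r
library(dplyr)
library(tidyr)

# Reduced copy of the example data (only the columns needed here)
patienten_ergebnisse <- tibble(
  id  = c(1, 2, 3, 4, 5),
  t_1 = c(123, 143, NA, 100, NA),
  t_2 = c(135, 140, NA, 105, NA)
)

# Data-frame form: each list entry names a column and its replacement value
# (120 is a hypothetical value, not taken from the slides)
filled <- patienten_ergebnisse %>%
  replace_na(list(t_1 = 110, t_2 = 120))
```

Columns that are not named in the list are left untouched.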

---

# `drop_na()`

<ul>
 <li class="m1">Transformation
 
 <ul class="level">
 <li><mono>mutate_if()</mono></li>
 <li><mono>replace_na()</mono></li>
 <li><high><mono>drop_na()</mono></high></li>
 </ul>
 </li>
 <li class="m2">Organisation
 
 <ul class="level">
 <li><mono>starts_with(), contains(), :</mono></li>
 <li><mono>pivot_longer(), pivot_wider()</mono></li>
 </ul>
 </li>
</ul>

]

```r
patienten_ergebnisse %>%
  
  # Remove rows with missing values
  drop_na()
```

```
## # A tibble: 3 × 6
## id alter bedingung bed_label t_1 t_2
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 1 37 1 placebo 123 135
## 2 2 65 2 medikament 143 140
## 3 4 34 1 placebo 100 105
```

]
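
Without arguments, `drop_na()` removes a row as soon as any column is missing. Passing column names restricts the check to those columns. A sketch with a hypothetical tibble `messungen`, in which the two measurements are missing in different rows so that the difference is visible:

```r
library(dplyr)
library(tidyr)

# Hypothetical data: t_1 and t_2 have their NAs in different rows
messungen <- tibble(
  id  = 1:4,
  t_1 = c(123, NA, 100, 98),
  t_2 = c(135, 140, NA, 101)
)

# No arguments: drop every row with an NA in any column
complete_rows <- messungen %>% drop_na()

# With a column: drop only rows where t_1 is missing
t1_present <- messungen %>% drop_na(t_1)
```

Here `complete_rows` keeps ids 1 and 4, while `t1_present` also keeps id 3, whose only missing value is in `t_2`.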

---

# `starts_with()`

<ul>
 <li class="m1">Transformation
 
 <ul class="level">
 <li><mono>mutate_if()</mono></li>
 <li><mono>replace_na()</mono></li>
 <li><mono>drop_na()</mono></li>
 </ul>
 </li>
 <li class="m2">Organisation
 
 <ul class="level">
 <li><high><mono>starts_with(), contains(), :</mono></high></li>
 <li><mono>pivot_longer(), pivot_wider()</mono></li>
 </ul>
 </li>
</ul>

]

```r
patienten_ergebnisse %>%
  
  # Select columns that start with "t"
  select(starts_with("t"))
```

```
## # A tibble: 5 × 2
## t_1 t_2
## <dbl> <dbl>
## 1 123 135
## 2 143 140
## 3 NA NA
## 4 100 105
## 5 NA NA
```

]

---

# `contains()`

<ul>
 <li class="m1">Transformation
 
 <ul class="level">
 <li><mono>mutate_if()</mono></li>
 <li><mono>replace_na()</mono></li>
 <li><mono>drop_na()</mono></li>
 </ul>
 </li>
 <li class="m2">Organisation
 
 <ul class="level">
 <li><high><mono>starts_with(), contains(), :</mono></high></li>
 <li><mono>pivot_longer(), pivot_wider()</mono></li>
 </ul>
 </li>
</ul>

]

```r
patienten_ergebnisse %>%
  
  # Select columns that contain "_"
  select(contains("_"))
```

```
## # A tibble: 5 × 3
## bed_label t_1 t_2
## <chr> <dbl> <dbl>
## 1 placebo 123 135
## 2 medikament 143 140
## 3 medikament NA NA
## 4 placebo 100 105
## 5 medikament NA NA
```

]

---

# `:`

<ul>
 <li class="m1">Transformation
 
 <ul class="level">
 <li><mono>mutate_if()</mono></li>
 <li><mono>replace_na()</mono></li>
 <li><mono>drop_na()</mono></li>
 </ul>
 </li>
 <li class="m2">Organisation
 
 <ul class="level">
 <li><high><mono>starts_with(), contains(), :</mono></high></li>
 <li><mono>pivot_longer(), pivot_wider()</mono></li>
 </ul>
 </li>
</ul>

]

```r
patienten_ergebnisse %>%
  
  # Select columns from alter to t_1
  select(alter:t_1)
```

```
## # A tibble: 5 × 4
## alter bedingung bed_label t_1
## <dbl> <dbl> <chr> <dbl>
## 1 37 1 placebo 123
## 2 65 2 medikament 143
## 3 57 2 medikament NA
## 4 34 1 placebo 100
## 5 45 2 medikament NA
```

]
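
The selection helpers can be combined with plain column names and negated with `!`. A short sketch on the example tibble, rebuilt by hand from the printout (the result names are only for illustration):

```r
library(dplyr)

# Rebuild the example data from the printed tibble
patienten_ergebnisse <- tibble(
  id        = c(1, 2, 3, 4, 5),
  alter     = c(37, 65, 57, 34, 45),
  bedingung = c(1, 2, 2, 1, 2),
  bed_label = c("placebo", "medikament", "medikament", "placebo", "medikament"),
  t_1       = c(123, 143, NA, 100, NA),
  t_2       = c(135, 140, NA, 105, NA)
)

# A column name and a helper in the same select()
ids_and_times <- patienten_ergebnisse %>%
  select(id, starts_with("t_"))

# ! negates a helper: keep everything that does NOT contain "_"
without_underscore <- patienten_ergebnisse %>%
  select(!contains("_"))
```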

---

# `pivot_*()`

<ul>
 <li class="m1">Transformation
 
 <ul class="level">
 <li><mono>mutate_if()</mono></li>
 <li><mono>replace_na()</mono></li>
 <li><mono>drop_na()</mono></li>
 </ul>
 </li>
 <li class="m2">Organisation
 
 <ul class="level">
 <li><mono>starts_with(), contains(), :</mono></li>
 <li><high><mono>pivot_longer(), pivot_wider()</mono></high></li>
 </ul>
 </li>
</ul>

]

<img src="image/tidyr-spread-gather.gif" height=420px> 
adapted from <a href="https://github.com/gadenbuie/tidyexplain">tidyexplain</a>

]

---

# `pivot_longer()`

```r
# wide to long
TIBBLE %>% 
  pivot_longer(cols = VARS,
               names_to = NAME1,
               values_to = NAME2)
```

]

```r
# Starting point: placebo rows in wide format
patienten_ergebnisse %>% 
  filter(bed_label == "placebo")
```

```
## # A tibble: 2 × 6
## id alter bedingung bed_label t_1 t_2
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 1 37 1 placebo 123 135
## 2 4 34 1 placebo 100 105
```
]

---

# `pivot_longer()`

```r
# wide to long
TIBBLE %>% 
  pivot_longer(cols = VARS,
               names_to = NAME1,
               values_to = NAME2)
```

]

```r
# wide to long
patienten_ergebnisse %>% 
  filter(bed_label == "placebo") %>%
  pivot_longer(cols = c("t_1", "t_2"),
               names_to = "zeit",
               values_to = "messung")
```

```
## # A tibble: 4 × 6
## id alter bedingung bed_label zeit messung
## <dbl> <dbl> <dbl> <chr> <chr> <dbl>
## 1 1 37 1 placebo t_1 123
## 2 1 37 1 placebo t_2 135
## 3 4 34 1 placebo t_1 100
## 4 4 34 1 placebo t_2 105
```

]

---

# `pivot_wider()`

```r
# long to wide
TIBBLE %>% 
  pivot_wider(names_from = VAR1,
              values_from = VAR2)
```
]

```r
# Starting point: data in long format
patienten_ergebnisse_lang
```

]
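
The slides do not show how `patienten_ergebnisse_lang` was created. Presumably it is the long-format result of the `pivot_longer()` call on the earlier slide; a sketch under that assumption, with the example tibble rebuilt by hand from the printout:

```r
library(dplyr)
library(tidyr)

# Rebuild the example data from the printed tibble
patienten_ergebnisse <- tibble(
  id        = c(1, 2, 3, 4, 5),
  alter     = c(37, 65, 57, 34, 45),
  bedingung = c(1, 2, 2, 1, 2),
  bed_label = c("placebo", "medikament", "medikament", "placebo", "medikament"),
  t_1       = c(123, 143, NA, 100, NA),
  t_2       = c(135, 140, NA, 105, NA)
)

# Assumption: the long tibble is the placebo subset after pivot_longer()
patienten_ergebnisse_lang <- patienten_ergebnisse %>%
  filter(bed_label == "placebo") %>%
  pivot_longer(cols = c("t_1", "t_2"),
               names_to = "zeit",
               values_to = "messung")
```

Each patient now has one row per time point, with the former column names stored in `zeit` and the measured values in `messung`.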

---

# `pivot_wider()`

```r
# long to wide
TIBBLE %>% 
  pivot_wider(names_from = VAR1,
              values_from = VAR2)
```
]

```r
# long to wide
patienten_ergebnisse_lang %>%
    pivot_wider(names_from = "zeit",
                values_from = "messung")
```

```
## # A tibble: 2 × 6
## id alter bedingung bed_label t_1 t_2
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 1 37 1 placebo 123 135
## 2 4 34 1 placebo 100 105
```

]

---

# Aggregation

<ul>
 <li class="m1"><high>Aggregation</high>
 
 <ul class="level">
 <li><mono>summarise()</mono></li>
 <li><mono>summarise_if()</mono></li>
 <li><mono>group_by(), summarise()</mono></li>
 <li><mono>n(), first(), last(), nth()</mono></li>
 <li><mono>pull()</mono></li>
 </ul>
 </li>
</ul>

]

```r
basel
```

```
## # A tibble: 10,000 × 20
## id geschlecht alter groesse gewicht
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 f 87 165 NA 
## 2 2 m 54 175. 85.6
## 3 3 f 34 147. 53.9
## 4 4 m 31 166. 105 
## 5 5 m 24 180. 102. 
## # … with 9,995 more rows, and 15 more
## # variables: einkommen <dbl>,
## # bildung <chr>, konfession <chr>,
## # kinder <dbl>, glueck <dbl>,
## # fitness <dbl>, essen <dbl>,
## # alkohol <dbl>, tattoos <dbl>,
## # rhein <dbl>, …
```

]

---

# `summarise()`

```r
TIBBLE %>%
  summarise(
    NAME1 = SUMMARY_FUN(VAR1),
    NAME2 = SUMMARY_FUN(VAR2)
  )
```

]

```r
basel %>%
  
  # Compute descriptive statistics
  summarise(
    alter_mean = mean(alter),
    groesse_median = median(groesse)
  )
```

```
## # A tibble: 1 × 2
## alter_mean groesse_median
## <dbl> <dbl>
## 1 49.4 171.
```

]

---

# `summarise_if()`

```r
TIBBLE %>%
  summarise_if(
    CONDITION,
    SUMMARY_FUN
  )
```

]

```r
basel %>%
  select(alter, groesse, konfession, einkommen)
```

```
## # A tibble: 10,000 × 4
## alter groesse konfession einkommen
## <dbl> <dbl> <chr> <dbl>
## 1 87 165 katholisch NA
## 2 54 175. konfessionslos 7500
## 3 34 147. konfessionslos 5500
## 4 31 166. katholisch NA
## 5 24 180. katholisch 3800
## # … with 9,995 more rows
```

]

---

# `summarise_if()`

```r
TIBBLE %>%
  summarise_if(
    CONDITION,
    SUMMARY_FUN
  )
```

]

```r
basel %>%
  
  # Compute descriptive statistics
  select(alter, groesse, konfession, einkommen) %>%
  summarise_if(is.numeric, mean)
```

```
## # A tibble: 1 × 3
## alter groesse einkommen
## <dbl> <dbl> <dbl>
## 1 49.4 171. NA
```

]

---

# `summarise_if()`

```r
TIBBLE %>%
  summarise_if(
    CONDITION,
    SUMMARY_FUN,
    ARGUMENTS
  )
```

]

```r
basel %>%
  
  # Compute descriptive statistics
  select(alter, groesse, konfession, einkommen) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE)
```

```
## # A tibble: 1 × 3
## alter groesse einkommen
## <dbl> <dbl> <dbl>
## 1 49.4 171. 8355.
```

]
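
As with `mutate_if()`, the same summary can be written with `across()` in dplyr 1.0.0 and later. The `basel` data are not reproduced here, so this sketch uses a five-row stand-in, `basel_mini`, with made-up values:

```r
library(dplyr)

# Hypothetical stand-in for basel: five rows, values are illustrative only
basel_mini <- tibble(
  alter      = c(87, 54, 34, 31, 24),
  groesse    = c(165, 175, 147, 166, 180),
  konfession = c("katholisch", "konfessionslos", "konfessionslos",
                 "katholisch", "katholisch"),
  einkommen  = c(NA, 7500, 5500, NA, 3800)
)

# across() equivalent of summarise_if(is.numeric, mean, na.rm = TRUE);
# extra arguments go inside the lambda
means <- basel_mini %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
```

The character column `konfession` is skipped, and `einkommen` is averaged over its three non-missing values.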

---

# Grouped aggregation

---

# `group_by()`, `summarise()`

```r
TIBBLE %>%
  group_by(GROUP_VAR) %>%
  summarise(
    NAME1 = SUMMARY_FUN(VAR1),
    NAME2 = SUMMARY_FUN(VAR2)
  )
```

]

```r
basel %>%
  
  # Group by geschlecht
  group_by(geschlecht) %>%
  
  # Compute statistics
  summarise(
    alter_mean = mean(alter),
    groesse_median = median(groesse)
  )
```

```
## # A tibble: 2 × 3
## geschlecht alter_mean groesse_median
## <chr> <dbl> <dbl>
## 1 f 49.8 164 
## 2 m 49.1 178.
```

]

---

# `n()`

```r
TIBBLE %>%
  group_by(GROUP_VAR) %>%
  summarise(
    NAME1 = SUMMARY_FUN(VAR1),
    NAME2 = SUMMARY_FUN(VAR2)
  )
```

]

```r
basel %>%
  
  # Group by geschlecht
  group_by(geschlecht) %>%
  
  # Compute statistics
  summarise(
    N = n()
  )
```

```
## # A tibble: 2 × 2
## geschlecht N
## <chr> <int>
## 1 f 5000
## 2 m 5000
```

]
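
For plain group counts, dplyr also offers `count()`, which combines the grouping and the `n()` summary in one step. A sketch comparing the two forms on a hypothetical five-row stand-in for `basel`:

```r
library(dplyr)

# Hypothetical stand-in for basel: only the grouping column is needed
basel_mini <- tibble(geschlecht = c("f", "m", "f", "m", "m"))

# Explicit form, as in the example above
by_hand <- basel_mini %>%
  group_by(geschlecht) %>%
  summarise(N = n())

# Shortcut: count() groups and counts; name sets the count column's name
shortcut <- basel_mini %>%
  count(geschlecht, name = "N")
```

Both tibbles contain one row per group with the same counts.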

---

# `first()`

```r
TIBBLE %>%
  group_by(GROUP_VAR) %>%
  summarise(
    NAME1 = SUMMARY_FUN(VAR1),
    NAME2 = SUMMARY_FUN(VAR2)
  )
```

]

```r
basel %>%
  
  # Group by geschlecht
  group_by(geschlecht) %>%
  
  # Compute statistics
  summarise(
    N = n(),
    bild_1 = first(bildung)
  )
```

```
## # A tibble: 2 × 3
## geschlecht N bild_1 
## <chr> <int> <chr> 
## 1 f 5000 obligatorisch
## 2 m 5000 sek III
```

]

---

# `last()`

```r
TIBBLE %>%
  group_by(GROUP_VAR) %>%
  summarise(
    NAME1 = SUMMARY_FUN(VAR1),
    NAME2 = SUMMARY_FUN(VAR2)
  )
```

]

```r
basel %>%
  
  # Group by geschlecht
  group_by(geschlecht) %>%
  
  # Compute statistics
  summarise(
    N = n(),
    bild_1 = first(bildung),
    bild_N = last(bildung)
  )
```

```
## # A tibble: 2 × 4
## geschlecht N bild_1 bild_N
## <chr> <int> <chr> <chr> 
## 1 f 5000 obligatorisch lehre 
## 2 m 5000 sek III lehre
```

]

---

# `nth()`

```r
TIBBLE %>%
  group_by(GROUP_VAR) %>%
  summarise(
    NAME1 = SUMMARY_FUN(VAR1),
    NAME2 = SUMMARY_FUN(VAR2)
  )
```

]

```r
basel %>%
  
  # Group by geschlecht
  group_by(geschlecht) %>%
  
  # Compute statistics
  summarise(
    N = n(),
    bild_1 = first(bildung),
    bild_N = last(bildung),
    bild_150 = nth(bildung, 150)
  )
```

```
## # A tibble: 2 × 5
## geschlecht N bild_1 bild_N bild_150 
## <chr> <int> <chr> <chr> <chr> 
## 1 f 5000 obligatorisch lehre obligator…
## 2 m 5000 sek III lehre lehre
```

]

---

# `pull()`

```r
TIBBLE %>%
  group_by(GROUP_VAR) %>%
  summarise(
    NAME = SUMMARY_FUN(VAR)
  ) %>%
  pull(NAME)
```

]

```r
basel %>%
  
  # Group by geschlecht
  group_by(geschlecht) %>%
  
  # Compute statistics
  summarise(N = n()) %>%
  
  # Extract the column as a vector
  pull(N)
```

```
## [1] 5000 5000
```

]
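
The bare vector returned by `pull()` no longer says which value belongs to which group. Since dplyr 1.0.0, the `name` argument can attach the group labels. A sketch on a hypothetical five-row stand-in for `basel`:

```r
library(dplyr)

# Hypothetical stand-in for basel: only the grouping column is needed
basel_mini <- tibble(geschlecht = c("f", "m", "f", "m", "m"))

# name = geschlecht uses the group labels as names of the returned vector
group_sizes <- basel_mini %>%
  group_by(geschlecht) %>%
  summarise(N = n()) %>%
  pull(N, name = geschlecht)
```

Individual counts can then be looked up by label, for example `group_sizes["f"]`.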

---

<h1><a href="https://therbootcamp.github.io/EDA_2022Mar/_sessions/WranglingII/WranglingII_practical.html">Practical</a></h1>