class: center, middle, inverse, title-slide

# Recap
### Exploratory Data Analysis with R
The R Bootcamp
### September 2020

---
layout: true

<div class="my-footer">
<span style="text-align:center">
<span>
<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/>
</span>
<a href="https://therbootcamp.github.io/">
<span style="padding-left:82px">
<font color="#7E7E7E">
www.therbootcamp.com
</font>
</span>
</a>
<a href="https://therbootcamp.github.io/">
<font color="#7E7E7E">
Exploratory Data Analysis with R | September 2020
</font>
</a>
</span>
</div>

---

# 3 classes of data objects

.pull-left4[

<high>`list`</high> - R's general-purpose container

- <span>Can hold any kind of data, including other `list`s</span>
- <span>Useful for complex function outputs</span>

<high>`data.frame`</high>, <high>`tibble`</high> - R's table

- <span>A special case of a `list`</span>
- <span>R's `tidy` format for data</span>

<high>`vector`</high> - R's basic data container

- <span>The primary data container</span>
- <span>Holds data of exactly one class</span>

]

.pull-right55[

<img src="image/main_objects.png"></img>

]

---

# Data types outside of R

<table class="tg" style="cellspacing:0; cellpadding:0; border:none;" width="95%">
<col width=30%>
<col width=30%>
<col width=30%>
<tr>
<td bgcolor = 'white' style='vertical-align:top'>
<ul>
<li class="m1"><span><high>Structured data</high>
<ul class="level">
<li><span>Delimiter-separated: <mono>.csv</mono>, <mono>.txt</mono>, etc.</span></li>
<li><span>Relational databases: <mono>SQL</mono></span></li>
</ul>
<br><img src="image/structured.png" height=250px>
</span></li>
</ul>
</td>
<td bgcolor = 'white' style='vertical-align:top'>
<ul>
<li class="m2"><span><high>Semi-structured data</high>
<ul class="level">
<li><span>Markup: <mono>.xml</mono>, <mono>.xls</mono>, <mono>.html</mono> etc.</span></li>
<li><span>Non-markup: <mono>JSON</mono>, <mono>MongoDB</mono></span></li>
</ul>
<br><img src="image/html.png" height=250px>
</span></li>
</ul>
</td>
<td
bgcolor = 'white' style='vertical-align:top'>
<ul>
<li class="m3"><span><high>Unstructured data</high>
<ul class="level">
<li><span>e.g. text</span></li>
</ul>
<br><br><br><br><br2><img src="image/text.png" height=250px>
</span></li>
</ul>
</td>
</tr>
</table>

---

# Delimiter-separated data

.pull-left45[

<ul>
<li class="m1"><span><high>Delimiters</high> separate the columns.</span></li>
<li class="m2"><span>Usually available as a <high>local text file</high>.</span></li>
<li class="m3"><span><high>Data classes</high> are inferred.</span></li>
</ul>

<br>

<p align="center">
<img src="image/readr.png" height=200>
</p>

]

.pull-right45[

```r
library(readr)  # provides read_csv() and read_delim()

# Read in the Basel dataset
basel <- read_csv("1_Data/basel.csv")

# Use an explicit delimiter
basel <- read_delim("1_Data/basel.csv", delim = ",")
basel
```

```
## # A tibble: 10,000 x 20
##      id geschlecht alter groesse
##   <dbl> <chr>      <dbl>   <dbl>
## 1     1 f             87    165
## 2     2 m             54    175.
## 3     3 f             34    147.
## 4     4 m             31    166.
## 5     5 m             24    180.
## # … with 9,995 more rows, and 16
## #   more variables
```

]

---

# The powerful <mono>tidyverse</mono>

<ul>
<li class="m1"><span>The <a href="https://www.tidyverse.org/"><mono>tidyverse</mono></a> is, at its core, a collection of high-performance, user-friendly packages designed specifically for more efficient data analysis.</span></li>
</ul>

<ol style="padding-left:72px">
<li><mono>ggplot2</mono> for graphics.</li>
<li><high><mono>dplyr</mono> for data manipulation</high>.</li>
<li><high><mono>tidyr</mono> for tidying and reshaping data</high>.</li>
<li><mono>readr</mono> for data I/O.</li>
<li><mono>purrr</mono> for functional programming.</li>
<li><mono>tibble</mono> for modern <mono>data.frame</mono>s.</li>
</ol>

<table style="cellspacing:0; cellpadding:0; border:none;padding-top:20px">
<col width="15%">
<col width="15%">
<col width="15%">
<col width="15%">
<col width="15%">
<col width="15%">
<tr>
<td bgcolor="white">
<img src="image/hex-ggplot2.png" height=160px
style="opacity:.2"></img>
</td>
<td bgcolor="white">
<img src="image/hex-dplyr.png" height=160px></img>
</td>
<td bgcolor="white">
<img src="image/hex-tidyr.png" height=160px></img>
</td>
<td bgcolor="white">
<img src="image/hex-readr.png" height=160px style="opacity:.2"></img>
</td>
<td bgcolor="white">
<img src="image/hex-purrr.png" height=160px style="opacity:.2"></img>
</td>
<td bgcolor="white">
<img src="image/hex-tibble.png" height=160px style="opacity:.2"></img>
</td>
</tr>
</table>

---

.pull-left45[

# What is wrangling?

<ul>
<li class="m1"><span><high>Transform</high>
<br><br>
<ul class="level">
<li><span>Change column names</span></li>
<li><span>Create new variables</span></li>
</ul></span></li>
<li class="m2"><span><high>Organise</high>
<br><br>
<ul class="level">
<li><span>Sort</span></li>
<li><span>Merge datasets</span></li>
<li><span>Pivot columns into rows</span></li>
</ul></span></li>
<li class="m3"><span><high>Aggregate</high>
<br><br>
<ul class="level">
<li><span>Form groups of data</span></li>
<li><span>Compute statistics per group</span></li>
</ul></span></li>
</ul>

]

.pull-right5[

<br>

<p align="center">
<img src="image/wrangling.png" height = "530px">
</p>

]

---

# <mono>%>%</mono>

.pull-left45[

<ul>
<li class="m1"><span>The preferred way of using <mono>dplyr</mono> involves a <high>new operator</high>, the pipe <highm>%>%</highm>.</span></li>
</ul>

<br>

<p align="center">
<img src="image/pipe.jpg" width = "300px"><br>
<font style="font-size:10px">from <a href="https://upload.wikimedia.org/wikipedia/en/thumb/b/b9/MagrittePipe.jpg">wikimedia.org</a></font>
</p>

]

.pull-right45[

<p align="center">
<img src="image/pipe.png" height = "400px">
</p>

]

---

.pull-left4[

# Transformation

<ul>
<li class="m1"><span><high>Rename</high>: Assign intuitive column names.
<br><br>
<ul class="level">
<li><span><mono>rename()</mono></span></li>
</ul>
</span></li>
<li class="m2"><span><high>Recode</high>: Assign appropriate units and data labels.
<br><br>
<ul class="level">
<li><span><mono>mutate()</mono></span></li>
<li><span><mono>case_when()</mono></span></li>
</ul>
</span></li>
<li class="m3"><span><high>Join</high>: Merge datasets.
<br><br>
<ul class="level">
<li><span><mono>left_join()</mono></span></li>
</ul>
</span></li>
</ul>

]

.pull-right45[

<br>

```r
patienten
```

```
## # A tibble: 5 x 3
##      id    X1    X2
##   <dbl> <dbl> <dbl>
## 1     1    37     1
## 2     2    65     2
## 3     3    57     2
## 4     4    34     1
## 5     5    45     2
```

```r
ergebnisse
```

```
## # A tibble: 5 x 3
##      id   t_1   t_2
##   <dbl> <dbl> <dbl>
## 1     4   100   105
## 2    92   134   150
## 3     1   123   135
## 4     2   143   140
## 5    99   102    68
```

]

---

# Organisation

.pull-left4[

<ul>
<li class="m4"><span><high>Sort</high>: Order the dataset.
<br><br>
<ul class="level">
<li><span><mono>arrange()</mono></span></li>
</ul>
</span></li>
<li class="m5"><span><high>Filter</high>: Select the relevant cases.
<br><br>
<ul class="level">
<li><span><mono>slice()</mono></span></li>
<li><span><mono>filter()</mono></span></li>
</ul>
</span></li>
<li class="m6"><span><high>Select</high>: Choose the relevant variables.
<br><br>
<ul class="level">
<li><span><mono>select()</mono></span></li>
</ul>
</span></li>
</ul>

]

.pull-right55[

```r
# Joined tibble
patienten_ergebnisse
```

```
## # A tibble: 5 x 6
##      id alter bedingung bed_label
##   <dbl> <dbl>     <dbl> <chr>
## 1     1    37         1 placebo
## 2     2    65         2 medikame…
## 3     3    57         2 medikame…
## 4     4    34         1 placebo
## 5     5    45         2 medikame…
## # … with 2 more variables
```

]

---

# Aggregation

.pull-left4[

<ul>
<li class="m1"><span><high>Aggregation</high>
<br><br>
<ul class="level">
<li><span><mono>summarise()</mono></span></li>
<li><span><mono>summarise_if()</mono></span></li>
<li><span><mono>group_by(), summarise()</mono></span></li>
<li><span><mono>n(), first(), last(), nth()</mono></span></li>
<li><span><mono>pull()</mono></span></li>
</ul>
</span></li>
</ul>

]

.pull-right5[

```r
basel
```

```
## # A tibble: 10,000 x 20
##      id geschlecht alter groesse gewicht
##   <dbl> <chr>      <dbl>   <dbl>   <dbl>
## 1     1 f             87    165     NA
## 2     2 m             54    175.    85.6
## 3     3 f             34    147.    53.9
## 4     4 m             31    166.   105
## 5     5 m             24    180.   102.
## # … with 9,995 more rows, and 15 more
## #   variables: einkommen <dbl>,
## #   bildung <chr>, konfession <chr>,
## #   kinder <dbl>, glueck <dbl>,
## #   fitness <dbl>, essen <dbl>,
## #   alkohol <dbl>, tattoos <dbl>,
## #   rhein <dbl>, …
```

]

---

class: middle, center

<h1><a href="https://therbootcamp.github.io/EDA_2020Feb/index.html">Agenda</a></h1>
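
---

# Wrangling recap: a sketch

The transformation, organisation and aggregation verbs from the slides can be chained with the pipe. What follows is only a minimal sketch under stated assumptions, not the course's own solution. The toy tibbles retype the `patienten` and `ergebnisse` values printed on the Transformation slide. The mapping `X1` to `alter` and `X2` to `bedingung` is inferred from the matching values in `patienten_ergebnisse`. The label `"medikament"` is an assumed spelling of the truncated `medikame…`, and `summary_by_condition`, `mean_t_1` and `mean_t_2` are hypothetical names.

```r
library(dplyr)
library(tibble)

# Toy data retyped from the printed tibbles on the Transformation slide
patienten <- tibble(
  id = c(1, 2, 3, 4, 5),
  X1 = c(37, 65, 57, 34, 45),
  X2 = c(1, 2, 2, 1, 2)
)
ergebnisse <- tibble(
  id  = c(4, 92, 1, 2, 99),
  t_1 = c(100, 134, 123, 143, 102),
  t_2 = c(105, 150, 135, 140, 68)
)

# Transformation: rename, recode, join
patienten_ergebnisse <- patienten %>%
  rename(alter = X1, bedingung = X2) %>%           # assumed column mapping
  mutate(bed_label = case_when(
    bedingung == 1 ~ "placebo",
    bedingung == 2 ~ "medikament"                  # assumed full label
  )) %>%
  left_join(ergebnisse, by = "id")                 # keeps all five patients

# Organisation and aggregation: filter, sort, group, summarise
summary_by_condition <- patienten_ergebnisse %>%
  filter(!is.na(t_1)) %>%                          # drop patients without results
  arrange(alter) %>%
  group_by(bed_label) %>%
  summarise(
    n        = n(),
    mean_t_1 = mean(t_1),
    mean_t_2 = mean(t_2)
  )

summary_by_condition
```

Because `left_join()` keeps every row of the left table, patients without a match in `ergebnisse` receive `NA` in `t_1` and `t_2`, which is why the sketch filters them out before aggregating.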