class: center, middle, inverse, title-slide # Daten ### Explorative Datenanalyse mit R
The R Bootcamp
### März 2021 --- layout: true <div class="my-footer"> <span style="text-align:center"> <span> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/> </span> <a href="https://therbootcamp.github.io/"> <span style="padding-left:82px"> <font color="#7E7E7E"> www.therbootcamp.com </font> </span> </a> <a href="https://therbootcamp.github.io/"> <font color="#7E7E7E"> Explorative Datenanalyse mit R | März 2021 </font> </a> </span> </div> --- <!--- .pull-left45[ # `Tidy` Daten <ul> <li class="m1"><span>Jede Spalte ist eine Variable.</span></li> <li class="m2"><span>Jede Zeile ist eine Beobachtung.</span></li> <li class="m3"><span>Jede Zelle beinhaltet nur einen Wert.</span></li> </ul> <font style="font-size:16px">...nach <a href="https://github.com/tidyverse/tidyr">Hadley Wickham</a></font> ] .pull-right45[ <br><br><br> <img src="image/data_frame.png"></img> ] ---> # 3 Klassen von Datenobjekten .pull-left4[ <high>`list`</high> - R's Mehrzweck-Container - <span>Kann alle Daten beinhalten, inkl. `list`s</span> - <span>Nützlich für komplexe Funktionsoutputs</span> <high>`data.frame`</high> - R's Tabelle - <span>Spezialfall einer `list`</span> - <span>R's `Tidy`-Format für Daten <high>`vector`</high> - R's Daten-Container - <span>Primärer Daten-Container</span> - <span>Beinhaltet Daten genau einer Klasse</span> ] .pull-right55[ <img src="image/main_objects.png"></img> ] --- .pull-left45[ # data frame <ul> <li class="m1"><span>Eine <mono>list</mono>e bestehend aus <high><mono>vector</mono>en gleicher Länge</high>.</span></li> <li class="m2"><span>Die <mono>vector</mono>en haben verschiedene <high>Klassen</high>: <mono>numeric</mono>, <mono>character</mono>, etc.</span></li> <li class="m3"><span>Verschiedene Varianten: <mono>data.frame</mono>, <mono>data.table</mono>, <highm>tibble</highm>.</span></li> </ul> ] .pull-right45[ <br><br><br> <img src="image/data_frame.png"></img> ] --- .pull-left45[ # Inhalt ansehen ```r # printe basel basel ``` ``` ## # A tibble: 10,000 x 20 ## id geschlecht alter groesse gewicht ## <dbl> <chr> <dbl> <dbl> <dbl> ## 1 1 W 44 174 NA ## 2 2 M 65 180 85.6 ## 3 3 M 31 168 53.9 ## 4 4 M 31 166. 105 ## 5 5 M 24 180. 102. ## # … with 9,995 more rows, and 15 more ## # variables ``` ] .pull-right45[ <br><br><br> <img src="image/data_frame.png"></img> ] --- .pull-left45[ # Inhalt ansehen ```r # Zeige den data frame in neuem tab View(basel) ``` <img src="image/view.png"></img> ] .pull-right45[ <br><br><br> <img src="image/data_frame.png"></img> ] --- .pull-left45[ # Zugriff mit <high>`$`</high> ```r # Extrahiere die Variable Alter basel$alter ``` ``` ## [1] 44 65 31 31 24 59 48 53 50 62 73 53 ## [13] 38 26 ## [ reached getOption("max.print") -- omitted 9986 entries ] ``` ```r # Extrahiere die Variable Bildung basel$bildung ``` ``` ## [1] "obligatorisch" "sek III" ## [3] "sek III" "lehre" ## [5] "obligatorisch" "sek III" ## [7] "obligatorisch" "lehre" ## [9] "sek III" "lehre" ## [11] "sek III" "lehre" ## [13] "lehre" "obligatorisch" ## [ reached getOption("max.print") -- omitted 9986 entries ] ``` ] .pull-right45[ <br><br><br> <img src="image/data_frame.png"></img> ] --- .pull-left45[ # Verändern mit <high>`$`</high> ```r # Teile die Variable einkommen durch 1000 basel$einkommen <- basel$einkommen / 1000 # zeige data frame basel ``` ``` ## # A tibble: 10,000 x 20 ## id geschlecht alter groesse gewicht ## <dbl> <chr> <dbl> <dbl> <dbl> ## 1 1 W 44 174 NA ## 2 2 M 65 180 85.6 ## 3 3 M 31 168 53.9 ## 4 4 M 31 166. 105 ## 5 5 M 24 180. 102. ## # … with 9,995 more rows, and 15 more ## # variables ``` ] .pull-right45[ <br><br><br> <img src="image/data_frame.png"></img> ] --- .pull-left45[ # `vector` <ul> <li class="m1"><span>R's <high>primärer Daten-Container</high></span></li> <li class="m2"><span>Kann nur eine einzelne <high>Daten-Klasse</high> beinhalten (und fehlende Werte).</span></li> <li class="m3"><span>Daten-Klassen<br><br> <ul class="level"> <li><span><highm>numeric</highm> - Alle Zahlen</span></li> <li><span><highm>character</highm> - Alle Zeichen (e.g., Namen)</span></li> <li><span><highm>logical</highm> - <mono>TRUE</mono> oder <mono>FALSE</mono></span></li> <li><span>...</span></li> <li><span><highm>NA</highm> - fehlende Werte</span></li> </ul> </span></li> </ul> ] .pull-right4[ <br><br><br> <img src="image/vector.png"></img> ] --- # `numeric` .pull-left45[ <highm>numeric</highm> Vektoren <high>beinhalten Zahlen</high> und nur Zahlen. ```r # Extrahiere die Variable Alter basel$alter ``` ``` ## [1] 44 65 31 31 24 59 48 53 50 62 ``` ```r # Zeige die Klasse von Alter class(basel$alter) ``` ``` ## [1] "numeric" ``` ```r # Ist Alter numeric is.numeric(basel$alter) ``` ``` ## [1] TRUE ``` ] .pull-right4[ <img src="image/vector_types_numeric.png"></img> ] --- # `character` .pull-left45[ <highm>character</highm> Vektoren <high>beinhalten alle Zeichen</high> um die herum <high>Anführungszeichen</high> stehen. ```r # Extrahiere die Variable Geschlecht basel$geschlecht ``` ``` ## [1] "W" "M" "M" "M" "M" "M" "W" "W" ``` ```r # Extrahiere die Variable Bildung basel$bildung ``` ``` ## [1] "obligatorisch" "sek III" ## [3] "sek III" "lehre" ## [5] "obligatorisch" "sek III" ## [7] "obligatorisch" "lehre" ``` ] .pull-right4[ <img src="image/vector_types_character.png"></img> ] --- # `character` .pull-left45[ <highm>character</highm> Vektoren <high>beinhalten alle Zeichen</high> um die herum <high>Anführungszeichen</high> stehen. ```r # Extrahiere die Variable Alter basel$alter ``` ``` ## [1] 44 65 31 31 24 59 48 53 50 62 ``` ```r # Wandle Alter in character um as.character(basel$alter) ``` ``` ## [1] "44" "65" "31" "31" "24" "59" "48" ## [8] "53" "50" "62" ``` ] .pull-right4[ <img src="image/vector_types_character.png"></img> ] --- # `logical` .pull-left45[ <highm>logical</highm> Vektoren werden typischerweise durch <high>logische Vergleiche</high> erstellt und benutzt um in Data Frames oder Vektoren bestimmte <high>Fälle auszuwählen</high>. ```r # Welche Werte in Geschlecht sind m basel$geschlecht == "M" ``` ``` ## [1] FALSE TRUE TRUE TRUE TRUE TRUE ## [7] FALSE FALSE TRUE FALSE ``` ```r # Welche Werte in Alteer sind kleiner 30 basel$alter < 50 ``` ``` ## [1] TRUE FALSE TRUE TRUE TRUE FALSE ## [7] TRUE FALSE FALSE FALSE ``` ] .pull-right4[ <img src="image/vector_types_logical.png"></img> ] --- # `logical` .pull-left45[ <highm>logical</highm> Vektoren werden typischerweise durch <high>logische Vergleiche</high> erstellt und benutzt um in Data Frames oder Vektoren bestimmte <high>Fälle auszuwählen</high>. <u>Logische Operatoren</u> <high>`==`</high> - ist gleich<br> <high>`<`</high>, <high>`>`</high> - kleiner/grösser als<br> <high>`<=`</high>, <high>`>=`</high> - kleiner/grösser als oder gleich<br> <high>`&`</high>, <high>`&&`</high> - logisches UND<br> <high>`|`</high>, <high>`||`</high> - logisches ODER<br> ] .pull-right4[ <img src="image/vector_types_logical.png"></img> ] --- .pull-left45[ # Zugriff / Ändern via `[ ]` ```r # Extrahiere die Variable Alter alter <- basel$alter alter ``` ``` ## [1] 44 65 31 31 24 59 48 53 50 62 ``` ```r # Extrahiere zweiten Wert alter[2] ``` ``` ## [1] 65 ``` ```r # Ändere zweiten Wert alter[2] <- 100 alter ``` Mehr Info [hier](http://rspatial.org/intr/rst/4-indexing.html). ] .pull-right4[ <br><br><br> <img src="image/vector.png"></img> ] --- <div class="center_text"> <span> Input/Ouput </span> </div> --- # Das <mono>tidyverse</mono> <ul> <li class="m1"><span>Das <a href="https://www.tidyverse.org/"><mono>tidyverse</mono></a> ist eine Sammlung hoch-performanter, nutzerfreundlicher Pakete für eine effiziente Datenaufbereitung und -analyse.</span></li> </ul> <ol style="padding-left:72px"> <li><high><mono>readr</mono> für Daten I/O</high>.</li> <li><high><mono>tibble</mono> für moderne <mono>data_frame</mono>'s</high>.</li> <li><mono>dplyr</mono> für Datenverarbeitung.</li> <li><mono>tidyr</mono> für Datenverarbeitung.</li> <li><mono>ggplot2</mono> für Grafiken.</li> </ol> <table style="cellspacing:0; cellpadding:0; border:none;padding-top:20px" width=90%> <col width="15%"> <col width="15%"> <col width="15%"> <col width="15%"> <col width="15%"> <tr> <td bgcolor="white"> <img src="image/hex-readr.png"height=200px></img> </td> <td bgcolor="white"> <img src="image/hex-tibble.png" height=200px></img> </td> <td bgcolor="white"> <img src="image/hex-dplyr.png"height=200px style="opacity:.2"></img> </td> <td bgcolor="white"> <img src="image/hex-tidyr.png"height=200px style="opacity:.2"></img> </td> <td bgcolor="white"> <img src="image/hex-ggplot2.png" height=200px style="opacity:.2"></img> </td> </tr> </table> --- # Datentypen ausserhalb von R <table class="tg" style="cellspacing:0; cellpadding:0; border:none;" width="95%"> <col width=30%> <col width=30%> <col width=30%> <tr> <td bgcolor = 'white' style='vertical-align:top'> <ul> <li class="m1"><span><high>Strukturierte Daten</high> <ul class="level"> <li><span>Delimiter getrennt: <mono>.csv</mono>, <mono>.txt</mono>, etc.</span></li> <li><span>Relationale Datenbanken: <mono>SQL</mono></span></li> </ul> <br><img src="image/structured.png" height=250px> </span></li> </ul> </td> <td bgcolor = 'white' style='vertical-align:top'> <ul> <li class="m2"><span><high>Semi-strukturierte Daten</high> <ul class="level"> <li><span>Markup: <mono>.xml</mono>, <mono>.xls</mono>, <mono>.html</mono> etc.</span></li> <li><span>Non markup: <mono>JSON</mono>, <mono>MongoDB</mono></span></li> </ul> <br><img src="image/html.png" height=250px> </span></li> </ul> </td> <td bgcolor = 'white' style='vertical-align:top'> <ul> <li class="m3"><span><high>Unstrukturierte Daten</high> <ul class="level"> <li><span>z.B. Text</span></li> </ul> <br><br><br><br><br2><img src="image/text.png" height=250px> </span></li> </ul> </td> </tr> </table> --- # <mono>readr</mono> .pull-left45[ <ul> <li class="m1"><span><high>Delimiter</high> separieren die Spalten.</span></li> <li class="m2"><span>Meist als <high>lokale Textdatei</high> vorliegend.</span></li> <li class="m3"><span><high>Datenklassen</high> werden inferiert.</span></li> </ul> <br> <p align="center"> <img src="image/readr.png" height=200> </p> ] .pull-right45[ <p align="center"> <img src="image/structured.png" height=400> </p> ] --- # <mono>readr</mono> .pull-left45[ <ul> <li class="m1"><span><high>Delimiter</high> separieren die Spalten.</span></li> <li class="m2"><span>Meist als <high>lokale Textdatei</high> vorliegend.</span></li> <li class="m3"><span><high>Datenklassen</high> werden inferiert.</span></li> </ul> <br> <p align="center"> <img src="image/readr.png" height=200> </p> ] .pull-right45[ ```r # Lese Basel Datensatz ein basel <- read_csv("1_Data/basel.csv") # Benutze expliziten Delimiter basel <- read_delim("1_Data/basel.csv", delim = ",") basel ``` ``` ## # A tibble: 10,000 x 20 ## id geschlecht alter groesse ## <dbl> <chr> <dbl> <dbl> ## 1 1 f 87 165 ## 2 2 m 54 175. ## 3 3 f 34 147. ## 4 4 m 31 166. ## 5 5 m 24 180. ## # … with 9,995 more rows, and 16 ## # more variables ``` ] --- # <mono>tibble</mono> .pull-left45[ <ul> <li class="m1"><span>Schlanke Version des <high>data frames</high> mit:</span></li> <ul class="level"> <li><span>Informativeren <high>Print</high></span></li> <li><span>Keinen <high>Faktoren</high></span></li> <li><span>Konsistenteres <high>Subsetting</high></span></li> </ul> </ul> <br> <p align="center"> <img src="image/hex-tibble.png" height=200> </p> ] .pull-right45[ ```r # Lese Basel Datensatz ein basel <- read_csv("1_Data/basel.csv") # Benutze expliziten Delimiter basel <- read_delim("1_Data/basel.csv", delim = ",") basel ``` ``` ## # A tibble: 10,000 x 20 ## id geschlecht alter groesse ## <dbl> <chr> <dbl> <dbl> ## 1 1 f 87 165 ## 2 2 m 54 175. ## 3 3 f 34 147. ## 4 4 m 31 166. ## 5 5 m 24 180. ## # … with 9,995 more rows, and 16 ## # more variables ``` ] --- # Den Filepath finden .pull-left45[ <ul> <li class="m1"><span>Finde den Filepath mittels RStudio's <high>Auto-Complete</high>.</span></li> <li class="m2"><span>Setze den Cursor zwischen Anführungszeichen und drücke <highm>Tab</highm>.</span></li> </ul> <br> <p align="center"> <img src="image/tab.png" height="150px" style="border-radius:10px"></img> </p> ] .pull-right45[ <p align="center"> <img src="image/auto_complete_1.png"></img> </p> ] --- # Den Filepath finden .pull-left45[ <ul> <li class="m1"><span>Finde den Filepath mittels RStudio's <high>Auto-Complete</high>.</span></li> <li class="m2"><span>Setze den Cursor zwischen Anführungszeichen und drücke <highm>Tab</highm>.</span></li> </ul> <br> <p align="center"> <img src="image/tab.png" height="150px" style="border-radius:10px"></img> </p> ] .pull-right45[ <p align="center"> <img src="image/auto_complete_2.png"></img> </p> ] --- # Inferierte Datentypen .pull-left45[ ```r # Lese Basel Datensatz ein basel <- read_csv("1_Data/basel.csv") ``` ``` ## Parsed with column specification: ## cols( ## .default = col_double(), ## geschlecht = col_character(), ## bildung = col_character(), ## konfession = col_character(), ## fasnacht = col_character(), ## sehhilfe = col_character() ## ) ``` ``` ## See spec(...) for full column specifications. ``` ] .pull-right45[ <img src="image/structured.png"></image> ] --- # Inferierte Datentypen .pull-left45[ Manchmal ist es nötig `readr` etwas auf die Sprünge zu helfen, damit <high>Datentypen korrekt inferiert</high> werden. ```r # Setze Symbol für fehlende Werte basel <- read_csv("1_Data/basel.csv", na = c('NA')) # Re-inferiere Datentypen basel <- type_convert(basel) ``` ] .pull-right45[ <img src="image/structured.png"></image> ] --- # Relationale Datenbanken <font style="font-size: 16px">siehe <a href="https://db.rstudio.com/">db.rstudio.com</a></font> .pull-left45[ <ul> <li class="m1"><span>R kann direkt <high>auf Datenbanken arbeiten</high>: <mono>MySQL</mono>, <mono>MariaDB</mono>, <mono>BigQuery</mono>, <mono>Redshift</mono>, etc.<br><font style="font-size:16px">Siehe <a href="https://db.rstudio.com/databases">hier</a>.</font></span></li> <li class="m2"><span><mono>dplyr</mono> wird z.B. <high>automatisch übersetzt.</high></span></li> </ul> ] .pull-right45[ ```r # Verbinge mit MySQL Datenbank con <- dbConnect(MySQL(), user='studiech_rbootca', password='Du*5hA+7NU:8T', dbname='studiech_rbootcamp', host='studie.ch', port = 3306) # Zeige Tabellen dbListTables(con) ``` ``` ## [1] "basel" ``` ] --- # Relationale Datenbanken <font style="font-size: 16px">siehe <a href="https://db.rstudio.com/">db.rstudio.com</a></font> .pull-left45[ <ul> <li class="m1"><span>R kann direkt <high>auf Datenbanken arbeiten</high>: <mono>MySQL</mono>, <mono>MariaDB</mono>, <mono>BigQuery</mono>, <mono>Redshift</mono>, etc.<br><font style="font-size:16px">Siehe <a href="https://db.rstudio.com/databases">hier</a>.</font></span></li> <li class="m2"><span><mono>dplyr</mono> wird z.B. <high>automatisch übersetzt.</high></span></li> </ul> ] .pull-right45[ ```r # Extrahiere Tabelle Customers basel <- tbl(con, "basel") basel ``` ``` ## # Source: table<basel> [?? x 20] ## # Database: mysql 5.7.26-log-cll-lve ## # [studiech_rbootca@studie.ch:/studiech_rbootcamp] ## id geschlecht alter_jahre groesse ## <dbl> <chr> <dbl> <dbl> ## 1 1 f 87 165 ## 2 2 m 54 175. ## 3 3 f 34 147. ## 4 4 m 31 166. ## 5 5 m 24 180. ## # … with more rows, and 16 more ## # variables ``` ] --- # Relationale Datenbanken <font style="font-size: 16px">siehe <a href="https://db.rstudio.com/">db.rstudio.com</a></font> .pull-left45[ <ul> <li class="m1"><span>R kann direkt <high>auf Datenbanken arbeiten</high>: <mono>MySQL</mono>, <mono>MariaDB</mono>, <mono>BigQuery</mono>, <mono>Redshift</mono>, etc.<br><font style="font-size:16px">Siehe <a href="https://db.rstudio.com/databases">hier</a>.</font></span></li> <li class="m2"><span><mono>dplyr</mono> wird z.B. <high>automatisch übersetzt.</high></span></li> </ul> ] .pull-right45[ ```r # Extrahiere Tabelle Customers basel <- tbl(con, "basel") # Extrahiere CompanyNamer Variable basel %>% pull(konfession) ``` ``` ## [1] "katholisch" "konfessionslos" ## [3] "konfessionslos" "katholisch" ## [5] "katholisch" NA ## [7] "konfessionslos" "katholisch" ## [9] "konfessionslos" "andere" ## [ reached getOption("max.print") -- omitted 9990 entries ] ``` ] --- # Semi-strukturierte Daten <font style="font-size: 16px">mit <a href="https://github.com/r-lib/xml2"><mono>xml2</mono></a> und <a href="https://github.com/hadley/rvest"><mono>rvest</mono></a></font> ```r # Tabelle laden von Wikipedia (mit xml2 und rvest) read_html("https://en.wikipedia.org/wiki/R_(programming_language)") %>% html_node(xpath = '//*[@id="mw-content-text"]/div/table[2]') %>% html_table() %>% as_tibble() ``` ``` ## # A tibble: 21 x 3 ## Release Date Description ## <chr> <chr> <chr> ## 1 0.16 "" "This is the last alpha version developed primarily by Ihaka and G… ## 2 0.49 "1997-04-2… "This is the oldest source release which is currently available on… ## 3 0.60 "1997-12-0… "R becomes an official part of the GNU Project. The code is hosted… ## 4 0.65.1 "1999-10-0… "First versions of update.packages and install.packages functions … ## 5 1.0 "2000-02-2… "Considered by its developers stable enough for production use.[55… ## 6 1.4 "2001-12-1… "S4 methods are introduced and the first version for Mac OS X is m… ## 7 1.8 "2003-10-0… "Introduced a flexible condition handling mechanism for signalling… ## 8 2.0 "2004-10-0… "Introduced lazy loading, which enables fast loading of data with … ## 9 2.1 "2005-04-1… "Support for UTF-8 encoding, and the beginnings of internationaliz… ## 10 2.6.2 "2008-02-0… "Last version to support Windows 95, 98, Me and NT 4.0[56]" ## # … with 11 more rows ``` --- # Other data <font style="font-size: 16px">siehe <a href="https://cran.r-project.org/web/packages/rio/vignettes/rio.html">rio</a></font> .pull-left45[ ### `readr` <img src="image/readr.png" width="50", align="right"> ```r # read fixed width files (can be fast) data <- read_fwf(file, ...) # read Apache style log files data <- read_log(file, ...) ``` ### `haven` <img src="image/haven.png" width="50" align="right"> ```r # read SAS's .sas7bat and sas7bcat files data <- read_sas(file, ...) # read SPSS's .sav files data <- read_sav(file, ...) # etc ``` ] .pull-right45[ ### `readxl` <img src="image/readxl.png" width="50" align="right"> ```r # read Excel's .xls and xlsx files data <- read_excel(file, ...) ``` <br> ### Other ```r # Read Matlab .mat files data <- R.matlab::readMat(file, ...) # Read and wrangle .xml and .html data <- XML::xmlParseParse(file, ...) # from package jsonlite: read .json files data <- jsonlite::read_json(file, ...) ``` ] --- class: middle, center <h1><a href="https://www.dropbox.com/s/cfp1iibrl5nsmi3/TheRBootcamp.zip?dl=1" download>Projekt</a></h1> --- class: middle, center <h1><a href="https://therbootcamp.github.io/EDA_2021Mar/_sessions/Data/Data_practical.html">Practical</a></h1>