class: center, middle, inverse, title-slide # Daten ### Explorative Datenanalyse mit R
The R Bootcamp
@
CSS
### Dezember 2019 --- layout: true <div class="my-footer"> <span style="text-align:center"> <span> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/> </span> <a href="https://therbootcamp.github.io/"> <span style="padding-left:82px"> <font color="#7E7E7E"> www.therbootcamp.com </font> </span> </a> <a href="https://therbootcamp.github.io/"> <font color="#7E7E7E"> Explorative Datenanalyse mit R @ CSS | Dezember 2019 </font> </a> </span> </div> --- .pull-left45[ # `Tidy` Daten <ul> <li class="m1"><span>Jede Spalte ist eine Variable.</span></li> <li class="m2"><span>Jede Zeile ist eine Beobachtung.</span></li> <li class="m3"><span>Jede Zelle beinhaltet nur einen Wert.</span></li> </ul> <font style="font-size:16px">...nach <a href="https://github.com/tidyverse/tidyr">Hadley Wickham</a></font> ] .pull-right45[ <br><br><br> <img src="image/data_frame.png"></img> ] --- # 3 Klassen von Datenobjekten .pull-left4[ <high>`list`</high> - R's Mehrzweck-Container - <span>Kann alle Daten beinhalten, inkl. `list`s</span> - <span>Nützlich für komplexe Funktionsoutputs</span> <high>`data_frame`</high> - R's Tabelle - <span>Spezialfall einer `list`</span> - <span>R's `Tidy`-Format für Daten <high>`vector`</high> - R's Daten-Container - <span>Primärer Daten-Container</span> - <span>Beinhaltet Daten genau einer Klasse</span> ] .pull-right55[ <img src="image/main_objects.png"></img> ] --- .pull-left45[ # `data_frame` <ul> <li class="m1"><span>Eine <mono>list</mono>e bestehend aus <high><mono>vector</mono>en gleicher Länge</high>.</span></li> <li class="m2"><span>Die <mono>vector</mono>en haben verschiedene <high>Klassen</high>: <mono>numeric</mono>, <mono>character</mono>, etc.</span></li> <li class="m3"><span>Verschiedene Varianten: <mono>data.frame</mono>, <mono>data.table</mono>, <highm>tibble</highm>.</span></li> </ul> ] .pull-right45[ <br><br><br> <img src="image/data_frame.png"></img> ] --- .pull-left45[ # Inhalt ansehen ```r # printe basel basel ``` ``` ## # A tibble: 10,000 x 20 ## id geschlecht alter groesse gewicht ## <dbl> <chr> <dbl> <dbl> <dbl> ## 1 1 f 87 165 NA ## 2 2 m 54 175. 85.6 ## 3 3 f 34 147. 53.9 ## 4 4 m 31 166. 105 ## 5 5 m 24 180. 102. ## # … with 9,995 more rows, and 15 more ## # variables ``` ] .pull-right45[ <br><br><br> <img src="image/data_frame.png"></img> ] --- .pull-left45[ # Inhalt ansehen ```r # Zeige den data frame in neuem tab View(basel) ``` <img src="image/view.png"></img> ] .pull-right45[ <br><br><br> <img src="image/data_frame.png"></img> ] --- .pull-left45[ # Zugriff mit <high>`$`</high> ```r # Extrahiere die Variable Alter basel$alter ``` ``` ## [1] 87 54 34 31 24 59 48 53 50 62 73 53 ## [13] 38 26 ## [ reached getOption("max.print") -- omitted 9986 entries ] ``` ```r # Extrahiere die Variable Bildung basel$bildung ``` ``` ## [1] "obligatorisch" "sek III" ## [3] "sek III" "lehre" ## [5] "obligatorisch" "sek III" ## [7] "obligatorisch" "lehre" ## [9] "sek III" "lehre" ## [11] "sek III" "lehre" ## [13] "lehre" "obligatorisch" ## [ reached getOption("max.print") -- omitted 9986 entries ] ``` ] .pull-right45[ <br><br><br> <img src="image/data_frame.png"></img> ] --- .pull-left45[ # Verändern mit <high>`$`</high> ```r # Teile die Variable einkommen durch 1000 basel$einkommen <- basel$einkommen / 1000 # zeige data frame basel ``` ``` ## # A tibble: 10,000 x 20 ## id geschlecht alter groesse gewicht ## <dbl> <chr> <dbl> <dbl> <dbl> ## 1 1 f 87 165 NA ## 2 2 m 54 175. 85.6 ## 3 3 f 34 147. 53.9 ## 4 4 m 31 166. 105 ## 5 5 m 24 180. 102. ## # … with 9,995 more rows, and 15 more ## # variables ``` ] .pull-right45[ <br><br><br> <img src="image/data_frame.png"></img> ] --- .pull-left45[ # `vector` <ul> <li class="m1"><span>R's <high>primärer Daten-Container</high></span></li> <li class="m2"><span>Kann nur eine einzelne <high>Daten-Klasse</high> beinhalten (und fehlende Werte).</span></li> <li class="m3"><span>Daten-Klassen<br><br> <ul class="level"> <li><span><highm>numeric</highm> - Alle Zahlen</span></li> <li><span><highm>character</highm> - Alle Zeichen (e.g., Namen)</span></li> <li><span><highm>logical</highm> - <mono>TRUE</mono> oder <mono>FALSE</mono></span></li> <li><span>...</span></li> <li><span><highm>NA</highm> - fehlende Werte</span></li> </ul> </span></li> </ul> ] .pull-right4[ <br><br><br> <img src="image/vector.png"></img> ] --- .pull-left45[ # Zugriff / Ändern via `[ ]` ```r # Extrahiere die Variable Alter alter <- basel$alter alter ``` ``` ## [1] 87 54 34 31 24 59 48 53 50 62 ``` ```r # Extrahiere zweiten Wert alter[2] ``` ``` ## [1] 54 ``` ```r # Ändere zweiten Wert alter[2] <- 100 alter ``` Mehr Info [hier](http://rspatial.org/intr/rst/4-indexing.html). ] .pull-right4[ <br><br><br> <img src="image/vector.png"></img> ] --- # Datentyp: `numeric` .pull-left45[ <highm>numeric</highm> Vektoren <high>beinhalten Zahlen</high> und nur Zahlen. ```r # Extrahiere die Variable Alter basel$alter ``` ``` ## [1] 87 54 34 31 24 59 48 53 50 62 ``` ```r # Zeige die Klasse von Alter class(basel$alter) ``` ``` ## [1] "numeric" ``` ```r # Ist Alter numeric is.numeric(basel$alter) ``` ``` ## [1] TRUE ``` ] .pull-right4[ <img src="image/vector_types_numeric.png"></img> ] --- # Datentyp: `character` .pull-left45[ <highm>character</highm> Vektoren <high>beinhalten alle Zeichen</high> um die herum <high>Anführungszeichen</high> stehen. ```r # Extrahiere die Variable Geschlecht basel$geschlecht ``` ``` ## [1] "f" "m" "f" "m" "m" "m" "f" "f" ``` ```r # Extrahiere die Variable Bildung basel$bildung ``` ``` ## [1] "obligatorisch" "sek III" ## [3] "sek III" "lehre" ## [5] "obligatorisch" "sek III" ## [7] "obligatorisch" "lehre" ``` ] .pull-right4[ <img src="image/vector_types_character.png"></img> ] --- # Datentyp: `character` .pull-left45[ <highm>character</highm> Vektoren <high>beinhalten alle Zeichen</high> um die herum <high>Anführungszeichen</high> stehen. ```r # Extrahiere die Variable Alter basel$alter ``` ``` ## [1] 87 54 34 31 24 59 48 53 50 62 ``` ```r # Wandle Alter in character um as.character(basel$alter) ``` ``` ## [1] "87" "54" "34" "31" "24" "59" "48" ## [8] "53" "50" "62" ``` ] .pull-right4[ <img src="image/vector_types_character.png"></img> ] --- # Datentyp: `logical` .pull-left45[ <highm>logical</highm> Vektoren werden typischerweise durch <high>logische Vergleiche</high> erstellt und benutzt um in Data Frames oder Vektoren bestimmte <high>Fälle auszuwählen</high>. ```r # Welche Werte in Geschlecht sind m basel$geschlecht == "m" ``` ``` ## [1] FALSE TRUE FALSE TRUE TRUE TRUE ## [7] FALSE FALSE TRUE FALSE ``` ```r # Welche Werte in Alteer sind kleiner 30 basel$alter < 30 ``` ``` ## [1] FALSE FALSE FALSE FALSE TRUE FALSE ## [7] FALSE FALSE FALSE FALSE ``` ] .pull-right4[ <img src="image/vector_types_logical.png"></img> ] --- # Datentyp: `logical` .pull-left45[ <highm>logical</highm> Vektoren werden typischerweise durch <high>logische Vergleiche</high> erstellt und benutzt um in Data Frames oder Vektoren bestimmte <high>Fälle auszuwählen</high>. <u>Logische Operatoren</u> <high>`==`</high> - ist gleich<br> <high>`<`</high>, <high>`>`</high> - kleiner/grösser als<br> <high>`<=`</high>, <high>`>=`</high> - kleiner/grösser als oder gleich<br> <high>`&`</high>, <high>`&&`</high> - logisches UND<br> <high>`|`</high>, <high>`||`</high> - logisches ODER<br> ] .pull-right4[ <img src="image/vector_types_logical.png"></img> ] --- <div class="center_text"> <span> Input/Ouput </span> </div> --- # Datentypen ausserhalb von R <table class="tg" style="cellspacing:0; cellpadding:0; border:none;" width="95%"> <col width=30%> <col width=30%> <col width=30%> <tr> <td bgcolor = 'white' style='vertical-align:top'> <ul> <li class="m1"><span><high>Strukturierte Daten</high> <ul class="level"> <li><span>Delimiter getrennt: <mono>.csv</mono>, <mono>.txt</mono>, etc.</span></li> <li><span>Relationale Datenbanken: <mono>SQL</mono></span></li> </ul> <br><img src="image/structured.png" height=250px> </span></li> </ul> </td> <td bgcolor = 'white' style='vertical-align:top'> <ul> <li class="m2"><span><high>Semi-strukturierte Daten</high> <ul class="level"> <li><span>Markup: <mono>.xml</mono>, <mono>.xls</mono>, <mono>.html</mono> etc.</span></li> <li><span>Non markup: <mono>JSON</mono>, <mono>MongoDB</mono></span></li> </ul> <br><img src="image/html.png" height=250px> </span></li> </ul> </td> <td bgcolor = 'white' style='vertical-align:top'> <ul> <li class="m3"><span><high>Unstrukturierte Daten</high> <ul class="level"> <li><span>z.B. Text</span></li> </ul> <br><br><br><br><br2><img src="image/text.png" height=250px> </span></li> </ul> </td> </tr> </table> --- # Delimiter getrennte Daten .pull-left45[ <ul> <li class="m1"><span><high>Delimiter</high> separieren die Spalten.</span></li> <li class="m2"><span>Meist als <high>lokale Textdatei</high> vorliegend.</span></li> <li class="m3"><span><high>Datenklassen</high> werden inferiert.</span></li> </ul> <br> <p align="center"> <img src="image/readr.png" height=200> </p> ] .pull-right45[ <p align="center"> <img src="image/structured.png" height=400> </p> ] --- # Delimiter getrennte Daten .pull-left45[ <ul> <li class="m1"><span><high>Delimiter</high> separieren die Spalten.</span></li> <li class="m2"><span>Meist als <high>lokale Textdatei</high> vorliegend.</span></li> <li class="m3"><span><high>Datenklassen</high> werden inferiert.</span></li> </ul> <br> <p align="center"> <img src="image/readr.png" height=200> </p> ] .pull-right45[ ```r # Lese Basel Datensatz ein basel <- read_csv("1_Data/basel.csv") # Benutze expliziten Delimiter basel <- read_delim("1_Data/basel.csv", delim = ",") basel ``` ``` ## # A tibble: 10,000 x 20 ## id geschlecht alter groesse ## <dbl> <chr> <dbl> <dbl> ## 1 1 f 87 165 ## 2 2 m 54 175. ## 3 3 f 34 147. ## 4 4 m 31 166. ## 5 5 m 24 180. ## # … with 9,995 more rows, and 16 ## # more variables ``` ] --- # Den Filepath finden .pull-left45[ <ul> <li class="m1"><span>Finde den Filepath mittels RStudio's <high>Auto-Complete</high>.</span></li> <li class="m2"><span>Setze den Cursor zwischen Anführungszeichen und drücke <highm>Tab</highm>.</span></li> </ul> <p align="center"> <img src="image/tab.png" height="150px"></img> </p> ] .pull-right45[ <p align="center"> <img src="image/auto_complete_1.png"></img> </p> ] --- # Den Filepath finden .pull-left45[ <ul> <li class="m1"><span>Finde den Filepath mittels RStudio's <high>Auto-Complete</high>.</span></li> <li class="m2"><span>Setze den Cursor zwischen Anführungszeichen und drücke <highm>Tab</highm>.</span></li> </ul> <p align="center"> <img src="image/tab.png" height="150px"></img> </p> ] .pull-right45[ <p align="center"> <img src="image/auto_complete_2.png"></img> </p> ] --- # Inferierte Datentypen .pull-left45[ ```r # Lese Basel Datensatz ein basel <- read_csv("1_Data/basel.csv") ``` ``` ## Parsed with column specification: ## cols( ## .default = col_double(), ## geschlecht = col_character(), ## bildung = col_character(), ## konfession = col_character(), ## fasnacht = col_character(), ## sehhilfe = col_character() ## ) ``` ``` ## See spec(...) for full column specifications. ``` ] .pull-right45[ <img src="image/structured.png"></image> ] --- # Inferierte Datentypen .pull-left45[ Manchmal ist es nötig `readr` etwas auf die Sprünge zu helfen, damit <high>Datentypen korrekt inferiert</high> werden. ```r # Setze Symbol für fehlende Werte basel <- read_csv("1_Data/basel.csv", na = c('NA')) # Re-inferiere Datentypen basel <- type_convert(basel) ``` ] .pull-right45[ <img src="image/structured.png"></image> ] --- # Relationale Datenbanken <font style="font-size: 16px">siehe <a href="https://db.rstudio.com/">db.rstudio.com</a></font> .pull-left45[ <ul> <li class="m1"><span>R kann direkt <high>auf Datenbanken arbeiten</high>: <mono>MySQL</mono>, <mono>MariaDB</mono>, <mono>BigQuery</mono>, <mono>Redshift</mono>, etc.<br><font style="font-size:16px">Siehe <a href="https://db.rstudio.com/databases">hier</a>.</font></span></li> <li class="m2"><span><mono>dplyr</mono> wird z.B. <high>automatisch übersetzt.</high></span></li> </ul> ] .pull-right45[ ```r # Verbinge mit MySQL Datenbank con <- DBI::dbConnect( RMySQL::MySQL(), dbname = "sql7314906", host = "sql7.freesqldatabase.com", user = "sql7314906", password = "eB5nBtSKRz") # Zeige Tabellen DBI::dbListTables(con) ``` ``` ## [1] "categories" "customers" ## [3] "region" "shippers" ## [5] "suppliers" "territories" ``` ] --- # Relationale Datenbanken <font style="font-size: 16px">siehe <a href="https://db.rstudio.com/">db.rstudio.com</a></font> .pull-left45[ <ul> <li class="m1"><span>R kann direkt <high>auf Datenbanken arbeiten</high>: <mono>MySQL</mono>, <mono>MariaDB</mono>, <mono>BigQuery</mono>, <mono>Redshift</mono>, etc.<br><font style="font-size:16px">Siehe <a href="https://db.rstudio.com/databases">hier</a>.</font></span></li> <li class="m2"><span><mono>dplyr</mono> wird z.B. <high>automatisch übersetzt.</high></span></li> </ul> ] .pull-right45[ ```r # Extrahiere Tabelle Customers kunden <- tbl(con, "customers") kunden ``` ``` ## # Source: table<customers> [?? x 11] ## # Database: mysql ## # 5.5.58-0ubuntu0.14.04.1 ## # [sql7314906@sql7.freesqldatabase.com:/sql7314906] ## CustomerID CompanyName ContactName ## <chr> <chr> <chr> ## 1 ALFKI Alfreds Fu… Maria Ande… ## 2 ANATR Ana Trujil… Ana Trujil… ## 3 ANTON Antonio Mo… Antonio Mo… ## 4 AROUT Around the… Thomas Har… ## 5 BERGS Berglunds … Christina … ## # … with more rows, and 8 more ## # variables ``` ] --- # Relationale Datenbanken <font style="font-size: 16px">siehe <a href="https://db.rstudio.com/">db.rstudio.com</a></font> .pull-left45[ <ul> <li class="m1"><span>R kann direkt <high>auf Datenbanken arbeiten</high>: <mono>MySQL</mono>, <mono>MariaDB</mono>, <mono>BigQuery</mono>, <mono>Redshift</mono>, etc.<br><font style="font-size:16px">Siehe <a href="https://db.rstudio.com/databases">hier</a>.</font></span></li> <li class="m2"><span><mono>dplyr</mono> wird in z.B. <high>automatisch übersetzt.</high></span></li> </ul> ] .pull-right45[ ```r # Extrahiere Tabelle Customers kunden <- tbl(con, "customers") # Extrahiere CompanyNamer Variable kunden %>% pull(CompanyName) ``` ``` ## [1] "Alfreds Futterkiste" ## [2] "Ana Trujillo Emparedados y helados" ## [3] "Antonio Moreno Taquera" ## [4] "Around the Horn" ## [5] "B's Beverages" ## [6] "Berglunds snabbkp" ## [7] "Blauer See Delikatessen" ## [8] "Blido Comidas preparadas" ## [9] "Blondesddsl pre et fils" ## [10] "Bon app'" ## [ reached getOption("max.print") -- omitted 83 entries ] ``` ] --- # Semi-strukturierte Daten <font style="font-size: 16px">mit <a href="https://github.com/r-lib/xml2"><mono>xml2</mono></a> und <a href="https://github.com/hadley/rvest"><mono>rvest</mono></a></font> ```r # Tabelle laden von Wikipedia (mit xml2 und rvest) read_html("https://en.wikipedia.org/wiki/R_(programming_language)") %>% html_node(xpath = '//*[@id="mw-content-text"]/div/table[2]') %>% html_table() %>% as_tibble() ``` ``` ## # A tibble: 17 x 3 ## Release Date Description ## <chr> <chr> <chr> ## 1 0.16 "" "This is the last alpha version developed primarily by Ihaka and Gen… ## 2 0.49 1997-04-… This is the oldest source release which is currently available on CR… ## 3 0.60 1997-12-… R becomes an official part of the GNU Project. The code is hosted an… ## 4 0.65.1 1999-10-… First versions of update.packages and install.packages functions for… ## 5 1.0 2000-02-… Considered by its developers stable enough for production use.[49] ## 6 1.4 2001-12-… S4 methods are introduced and the first version for Mac OS X is made… ## 7 1.8 2003-10-… Introduced a flexible condition handling mechanism for signalling an… ## 8 2.0 2004-10-… Introduced lazy loading, which enables fast loading of data with min… ## 9 2.1 2005-04-… Support for UTF-8 encoding, and the beginnings of internationalizati… ## 10 2.11 2010-04-… Support for Windows 64 bit systems. ## # … with 7 more rows ``` --- # Other data <font style="font-size: 16px">siehe <a href="https://cran.r-project.org/web/packages/rio/vignettes/rio.html">rio</a></font> .pull-left45[ ### `readr` <img src="image/readr.png" width="50", align="right"> ```r # read fixed width files (can be fast) data <- read_fwf(file, ...) # read Apache style log files data <- read_log(file, ...) ``` ### `haven` <img src="image/haven.png" width="50" align="right"> ```r # read SAS's .sas7bat and sas7bcat files data <- read_sas(file, ...) # read SPSS's .sav files data <- read_sav(file, ...) # etc ``` ] .pull-right45[ ### `readxl` <img src="image/readxl.png" width="50" align="right"> ```r # read Excel's .xls and xlsx files data <- read_excel(file, ...) ``` <br> ### Other ```r # Read Matlab .mat files data <- R.matlab::readMat(file, ...) # Read and wrangle .xml and .html data <- XML::xmlParseParse(file, ...) # from package jsonlite: read .json files data <- jsonlite::read_json(file, ...) ``` ] --- class: middle, center <h1><a href="https://therbootcamp.github.io/EDA_2019CSS/_sessions/Data/Data_practical.html">Practical</a></h1>