class: center, middle, inverse, title-slide # Data ### R for Data Science
Basel R Bootcamp
### February 2019 --- layout: true <div class="my-footer"> <span style="text-align:center"> <span> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/> </span> <a href="https://therbootcamp.github.io/"> <span style="padding-left:82px"> <font color="#7E7E7E"> www.therbootcamp.com </font> </span> </a> <a href="https://therbootcamp.github.io/"> <font color="#7E7E7E"> R For Data Science | February 2019 </font> </a> </span> </div> --- # Data .pull-left45[ <br2> <font size=5> In this session you will get to know... <ul> <li> R's 3 main <high>data types</high> </li><br2> <li> a little more about <high>functions</high> </li><br2> <li> R's <high>Import/Export</high> functions </li><br2> </font> ] .pull-right45[ <p align = "center"> <img src="image/bigdata.jpg" height=440px><br> <font style="font-size:10px">from <a href="https://cloudtweaks.com/">cloudtweaks.com</a></font> </p> ] --- # 3 Object types for data .pull-left4[ R has 3 main data objects... <high>`list`</high> - R's multi-purpose container - Can carry any data, incl. lists - Often used for function outputs <high>`data_frame`</high> - R's spreadsheet - Specific type of `list` - Typical data format - For multi-variable data sets <high>`vectors`</high> - R's data container - Actually carries the data - Contain data of 1 of many types ] .pull-right55[ <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/main_objects.png"></img> ] --- # `list` .pull-left45[ <br><br><br> 1 - Can <high>carry any data</high>, incl. `list`s, `data_frame`s, `vector`s, etc. <br><br> 2 - Are often used for <high>function outputs</high> <br><br> 3 - Have <high>named elements</high>. <br><br> 4 - Elements can be <high>inspect</high>ed via `names()` or `str()`. <br><br> 5 - Elements are (typically) <high>select</high>ed by `$`. ] .pull-right5[ <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/list.png"></img> ] --- # `list`: Select element using <high>`$`</high> .pull-left45[ ```r # regression reg_model <- lm(height ~ sex + age, data = baselers) reg_results <- summary(reg_model) # get element names names(reg_results) ``` ``` ## [1] "call" "terms" ## [3] "residuals" "coefficients" ## [5] "aliased" "sigma" ## [7] "df" "r.squared" ``` ```r # select element using $ reg_results$coefficients ``` ``` ## Estimate t value ## (Intercept) 164.171266 499.5339 ## sexmale 13.993699 66.4724 ## age -0.003753 -0.5819 ``` ] .pull-right5[ <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/list.png"></img> ] --- .pull-left45[ # `data_frame` <br><br> 1 - Are `list`s containing <high>`vector`s of equal length</high> representing the variables. <br><br> 2 - Contain `vector`s of different types: `numeric`, `character`, etc. <br><br> 3 - Have named elements. <br><br> 4 - Elements can be <high>inspect</high>ed via `names()`, `str()`, `print()`, `View()`, or `skimr::skim()`. <br><br> 5 - Elements are (typically) <high>select</high>ed by `$`. <br><br> 6 - Come in different flavors: `data.frame()`, `data.table()`, `tibble()`. ] .pull-right45[ <br><br><br> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/data_frame.png"></img> ] --- .pull-left45[ # Inspect content ```r # inspect baselers via print baselers ``` ``` ## # A tibble: 10,000 x 20 ## id sex age height weight income ## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 1 male 44 174. 113. 6300 ## 2 2 male 65 180. 75.2 10900 ## 3 3 fema… 31 168. 55.5 5100 ## 4 4 male 27 209 93.8 4200 ## 5 5 male 24 177. NA 4000 ## education confession children ## <chr> <chr> <dbl> ## 1 SEK_III catholic 2 ## 2 obligato… confessio… 2 ## 3 SEK_III <NA> 2 ## 4 SEK_III catholic 2 ## 5 SEK_III catholic 1 ## # … with 9,995 more rows, and 11 more ## # variables ``` ] .pull-right45[ <br><br><br> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/data_frame.png"></img> ] --- .pull-left45[ # Inspect content ```r # View dataframe in a new window View(baselers) ``` <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/view.png"></img> ] .pull-right45[ <br><br><br> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/data_frame.png"></img> ] --- .pull-left45[ # Select via <high>`$`</high> ```r # Access age column from baselers baselers$age ``` ``` ## [1] 44 65 31 27 24 63 71 41 43 31 42 31 ## [13] 38 49 39 54 78 62 88 74 ``` ```r # Access education column from baselers baselers$education ``` ``` ## [1] "SEK_III" ## [2] "obligatory_school" ## [3] "SEK_III" ## [4] "SEK_III" ## [5] "SEK_III" ## [6] "SEK_III" ## [7] "SEK_III" ## [8] "SEK_III" ## [9] "apprenticeship" ## [10] "SEK_II" ``` ] .pull-right45[ <br><br><br> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/data_frame.png"></img> ] --- .pull-left45[ # Change/Add via <high>`$`</high> ```r # Divide income by 1000 baselers$income <- baselers$income / 1000 # inspect baselers baselers ``` ``` ## # A tibble: 10,000 x 20 ## id sex age height weight income ## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 1 male 44 174. 113. 6.3 ## 2 2 male 65 180. 75.2 10.9 ## 3 3 fema… 31 168. 55.5 5.1 ## 4 4 male 27 209 93.8 4.2 ## 5 5 male 24 177. NA 4 ## education confession children ## <chr> <chr> <dbl> ## 1 SEK_III catholic 2 ## 2 obligato… confessio… 2 ## 3 SEK_III <NA> 2 ## 4 SEK_III catholic 2 ## 5 SEK_III catholic 1 ## # … with 9,995 more rows, and 11 more ## # variables ``` ] .pull-right45[ <br><br><br> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/data_frame.png"></img> ] --- # `vector` .pull-left45[ 1 - R's <high>basic and, in a way, only data container</high>. <br><br> 2 - Can contain only a <high>single type of data</high> and missing values. <br><br> 3 - Data types   <high>`numeric`</high> - All numbers<br>   <high>`character`</high> - All characters (e.g., names)<br>   <high>`logical`</high> - `TRUE` or `FALSE`<br>   ...<br>   <high>`NA`</high> - missing values<br> ] .pull-right4[ <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/vector.png"></img> ] --- # Select/Change/(Add) via `[ ]` .pull-left45[ ```r # extract vector containing age age <- baselers$age age ``` ``` ## [1] 44 65 31 27 24 63 71 41 43 ``` ```r # select value age[2] ``` ``` ## [1] 65 ``` ```r # change value age[2] <- 100 age ``` ``` ## [1] 44 100 31 27 24 63 71 41 43 ``` <br> Find more info on indexing [here](http://rspatial.org/intr/rst/4-indexing.html). ] .pull-right4[ <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/vector.png"></img> ] --- # Data types: `numeric` .pull-left45[ `numeric` vectors are used to store numbers and only numbers. ```r baselers$age ``` ``` ## [1] 44 65 31 27 24 63 71 41 43 ``` ```r # evaluate class class(baselers$age) ``` ``` ## [1] "numeric" ``` ```r # is age numeric? is.numeric(baselers$age) ``` ``` ## [1] TRUE ``` ] .pull-right4[ <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/vector_types.png"></img> ] --- # Data types: `character` .pull-left45[ `character` vectors are used to store data represented by <high>letters and symbols, and all other data</high>. You can always recognise character vectors by <high>quotation marks " "</high> ```r baselers$sex ``` ``` ## [1] "male" "male" "female" "male" ## [5] "male" "male" "male" "female" ``` ```r baselers$education ``` ``` ## [1] "SEK_III" ## [2] "obligatory_school" ## [3] "SEK_III" ## [4] "SEK_III" ``` ] .pull-right4[ <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/vector_types.png"></img> ] --- # Data types: `character` .pull-left45[ `character` vectors are used to store data represented by <high>letters and symbols, and all other data</high>. You can always recognise character vectors by <high>quotation marks " "</high> ```r baselers$age ``` ``` ## [1] 44 65 31 27 24 63 71 41 ``` ```r # convert age to character as.character(baselers$age) ``` ``` ## [1] "44" "65" "31" "27" "24" "63" "71" ## [8] "41" "43" ``` ] .pull-right4[ <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/vector_types.png"></img> ] --- # Data types: `logical` .pull-left45[ `logical` vector are used to <high>*slice* data</high> aka to select elements or rows. `logical` are typically created from other vectors via <high>logical comparisons</high>. ```r # which sex values are male? baselers$sex == "male" ``` ``` ## [1] TRUE TRUE FALSE TRUE TRUE TRUE ## [7] TRUE FALSE ``` ```r # which ages are less than 30? baselers$age < 30 ``` ``` ## [1] FALSE FALSE FALSE TRUE TRUE FALSE ## [7] FALSE FALSE FALSE ``` ] .pull-right4[ <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/vector_types.png"></img> ] --- # Data types: `logical` .pull-left45[ `logical` vector are used to <high>*slice* data</high> aka to select elements or rows. `logical` are typically created from other vectors via <high>logical comparisons</high>. <u>Logical operators</u> <high>`==`</high> - is equal to<br> <high>`<`</high>, <high>`>`</high> - smaller/greater than<br> <high>`<=`</high>, <high>`>=`</high> - smaller/greater than or equal<br> <high>`&`</high>, <high>`&&`</high> - logical AND<br> <high>`|`</high>, <high>`||`</high> - logical OR<br> ] .pull-right4[ <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/vector_types.png"></img> ] --- <div class="center_text"> <span> Data I/O </span> </div> --- # Raw (structured) Data .pull-left45[ <high>delim-separated data</high> *.csv, .txt, etc.* <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/baselers_raw.png"> ] .pull-right45[ <high>markup data</high> *.xml, .xls, .html, (.json), etc.* <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/rbootcamp_raw.png"> ] --- # Delim-separated data .pull-left45[ 1 - Most typical file format. 2 - Requires <high>delimiter</high> to separate entries. <br> <p align="center"> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/PNG/readr.png" height=200> </p> ] .pull-right45[ <high>delim-separated data</high> *.csv, .txt, etc.* <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/baselers_raw.png"> ] --- # `readr` `readr` is a `tidyverse` package that provides convenient functions to **read in** *flat* (non-nested) data files into data frames (`tibble`s to be precise): .pull-left3[ <br> <p align="center"> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/PNG/readr.png" height=200> </p> ] .pull-right65[ <br> ```r # Importing data from a file data <- read_csv(file, ...) # comma-delimited data <- read_csv2(file, ...) # semicolon-delimeted data <- read_delim(file, ...) # arbitrary-delimited # Writing a data frame to a file write_csv(data_object, path, ...) # comma-delimited write_delim(data_object, path, ...) # arbitrary-delimited ``` ] --- # Finding the file path .pull-left4[ 1 - Identify the file path using the <high>auto-complete</high>. 2 - Initiate auto-complete and browse through the folder structure by placing the cursor between two quotation marks and using the <high>tab key</high>. <p align="center"> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/tab.png" height="80px"></img> </p> 3 - Auto-complete begins with the project folder - <high>place your data inside your project folder!</high> ] .pull-right55[ <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/load_baselers_ss.jpg"></img> ] --- # Identifying the delimiter .pull-left5[ 1 - <high>Find the file</high> on your hard drive. Should be in your data folder inside your project. 2 - <high>Open the file</high> in RStudio (right-click on the file in the *Files* pane) a text viewer, e.g., *TextEdit* (Mac), *TextWrangler* (Mac), *WordPad* (Windows). <br><br><br> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/find_data.png"> ] .pull-right45[ <center>`baselers.csv` <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/baselers_raw.png"> ] --- # Identifying the delimiter .pull-left5[ 1 - <high>Find the file</high> on your hard drive. Should be in your data folder inside your project. 2 - <high>Open the file</high> in RStudio (right-click on the file in the *Files* pane) a text viewer, e.g., *TextEdit* (Mac), *TextWranger* (Mac), *WordPad* (Windows). <br><br><br> ```r # Read with explicit column names baselers <-read_delim(file = ".../baselers.csv", delim = c(",")) ``` ] .pull-right45[ <center>`baselers.csv` <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/baselers_raw.png"> ] --- # Handling headers .pull-left5[ 1 - `readr`- functions typically expect the <high>column names</high> in the first line. 2 - If no column names are available, use the <high>`col_names`-argument</high> to provide them. <br><br><br> ```r # Read with explicit column names baselers <- read_csv(file = ".../baselers.csv", col_names = c("id", "age", ...)) ``` ] .pull-right45[ <center>`baselers.csv` <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/baselers_raw.png"> ] --- # Handling data types .pull-left5[ Reading in data, <high> `readr` infers the type of data </high> for each column. ```r # Read baselers read_csv(file = "1_Data/baselers.csv") ``` ``` ## Parsed with column specification: ## cols( ## .default = col_double(), ## sex = col_character(), ## education = col_character(), ## confession = col_character(), ## fasnacht = col_character(), ## eyecor = col_character() ## ) ``` ``` ## See spec(...) for full column specifications. ``` ``` ## # A tibble: 10,000 x 20 ## id sex age height weight income ## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 1 male 44 174. 113. 6300 ## 2 2 male 65 180. 75.2 10900 ## 3 3 fema… 31 168. 55.5 5100 ## 4 4 male 27 209 93.8 4200 ## 5 5 male 24 177. NA 4000 ## education confession children ## <chr> <chr> <dbl> ## 1 SEK_III catholic 2 ## 2 obligato… confessio… 2 ## 3 SEK_III <NA> 2 ## 4 SEK_III catholic 2 ## 5 SEK_III catholic 1 ## # … with 9,995 more rows, and 11 more ## # variables ``` ] .pull-right45[ <center>`baselers.csv` <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/baselers_raw.png"> ] --- # Handling data types .pull-left5[ Incorrect data types can be fixed. Typically this involves: 1 - <high>removing character elements</high> from otherwise numeric variables.<br><br2> 2 - Setting <high>explicit `NA` strings</high> using the `na`-argument.<br><br2> 3 - Re-running <high>`type_convert`</high>.<br><br> ```r # Read baselers baseslers <- read_csv(file = ".../baselers.csv", na = c('NA')) # Try to fix incorrect data types baselers <- type_convert(baselers) ``` ] .pull-right45[ <center> `baselers.csv` <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/baselers_raw.png"> ] --- # Other data R provides <high>read and write functions</high> for practically all data file formats. See [rio](https://cran.r-project.org/web/packages/rio/vignettes/rio.html). .pull-left45[ ### `readr` <img src="http://d33wubrfki0l68.cloudfront.net/66d3133b4a19949d0b9ddb95fc48da074b69fb07/7dfb6/images/hex-readr.png" width="50", align="right"> ```r # read fixed width files (can be fast) data <- read_fwf(file, ...) # read Apache style log files data <- read_log(file, ...) ``` ### `haven` <img src="http://haven.tidyverse.org/logo.png" width="50" align="right"> ```r # read SAS's .sas7bat and sas7bcat files data <- read_sas(file, ...) # read SPSS's .sav files data <- read_sav(file, ...) # etc ``` ] .pull-right45[ ### `readxl` <img src="https://www.rstudio.com/wp-content/uploads/2017/05/readxl-259x300.png" width="50" align="right"> ```r # read Excel's .xls and xlsx files data <- read_excel(file, ...) ``` <br> ### Other ```r # Read Matlab .mat files data <- R.matlab::readMat(file, ...) # Read and wrangle .xml and .html data <- XML::xmlParseParse(file, ...) # from package jsonlite: read .json files data <- jsonlite::read_json(file, ...) ``` ] --- class: middle, center <h1><a href="https://therbootcamp.github.io/R4DS_2019Feb/_sessions/Data/Data_practical.html">Practical</a></h1>