class: center, middle, inverse, title-slide

# Data
### Introduction to Data Science with R
www.therbootcamp.com
@therbootcamp
### October 2018

---

layout: true

<div class="my-footer"><span> <a href="https://therbootcamp.github.io/"><font color="#7E7E7E">Introduction to Data Science with R, October 2018</font></a>                           <a href="https://therbootcamp.github.io/"><font color="#7E7E7E">www.therbootcamp.com</font></a> </span></div>

---

# Data

.pull-left45[

<font size=5>
In this session you will get to know...
<ul>
  <li> R's 3 main <high>data types</high> </li><br2>
  <li> a little more about <high>functions</high> </li><br2>
  <li> R's <high>Import/Export</high> functions </li><br2>
</ul>
</font>

]

.pull-right45[

<img src="http://blog.datasift.com/wp-content/uploads/2014/10/ms-files-3.jpg">

]

---

# 3 Object types for data

.pull-left4[

R has 3 main data objects...

<high>`list`</high> - R's multi-purpose container

- Can carry any data, incl. lists
- Often used for function outputs

<high>`data_frame`</high> - R's spreadsheet

- Specific type of `list`
- Typical data format
- For multi-variable data sets

<high>`vectors`</high> - R's data container

- Actually carries the data
- Contain data of 1 of many types

]

.pull-right55[

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/main_objects.png"></img>

]

---

# `list`

.pull-left45[

<br><br><br>

1 - Can <high>carry any data</high>, incl. `list`s, `data_frame`s, `vector`s, etc.

<br><br>

2 - Are often used for <high>function outputs</high>

<br><br>

3 - Have <high>named elements</high>.

<br><br>

4 - Elements can be <high>inspect</high>ed via `names()` or `str()`.

<br><br>

5 - Elements are (typically) <high>select</high>ed by `$`.
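
A minimal sketch of the points above, using a made-up list. The object `participant` and its values are illustrative assumptions and are not part of the `baselers` data.

```r
# Hypothetical list with three named elements (illustrative values only)
participant <- list(
  id     = 1,
  name   = "Anna",
  scores = c(4, 8, 15)
)

# inspect the element names
names(participant)

# select an element using $ and work with it
mean_score <- mean(participant$scores)
mean_score
```
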
]

.pull-right5[

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/list.png"></img>

]

---

# `list`: Select element using <high>`$`</high>

.pull-left45[

```r
# regression
reg_model <- lm(height ~ sex + age, data = baselers)
reg_results <- summary(reg_model)

# get element names
names(reg_results)
```

```
## [1] "call"         "terms"
## [3] "residuals"    "coefficients"
## [5] "aliased"      "sigma"
## [7] "df"           "r.squared"
```

```r
# select element using $
reg_results$coefficients
```

```
##               Estimate  t value
## (Intercept) 164.171266 499.5339
## sexmale      13.993699  66.4724
## age          -0.003753  -0.5819
```

]

.pull-right5[

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/list.png"></img>

]

---

.pull-left45[

# `data_frame`

<br><br>

1 - Are `list`s containing <high>`vector`s of equal length</high> representing the variables.

<br><br>

2 - Contain `vector`s of different types: `numeric`, `character`, etc.

<br><br>

3 - Have named elements.

<br><br>

4 - Elements can be <high>inspect</high>ed via `names()`, `str()`, `print()`, `View()`, or `skimr::skim()`.

<br><br>

5 - Elements are (typically) <high>select</high>ed by `$`.

<br><br>

6 - Come in different flavors: `data.frame()`, `data.table()`, `tibble()`.

]

.pull-right45[

<br><br><br>

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/data_frame.png"></img>

]

---

.pull-left45[

# Inspect content

```r
# inspect baselers via print
baselers
```

```
## # A tibble: 10,000 x 20
##      id sex     age height weight income
##   <int> <chr> <int>  <dbl>  <dbl>  <dbl>
## 1     1 male     44   174.  113.    6300
## 2     2 male     65   180.   75.2  10900
## 3     3 fema…    31   168.   55.5   5100
## 4     4 male     27   209    93.8   4200
## 5     5 male     24   177.   NA     4000
##   education confession children
##   <chr>     <chr>         <int>
## 1 SEK_III   catholic          2
## 2 obligato… confessio…        2
## 3 SEK_III   <NA>              2
## 4 SEK_III   catholic          2
## 5 SEK_III   catholic          1
## # ... 
with 9,995 more rows, and 11 more
## #   variables
```

]

.pull-right45[

<br><br><br>

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/data_frame.png"></img>

]

---

.pull-left45[

# Inspect content

```r
# inspect baselers via View
View(baselers)
```

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/view.png"></img>

]

.pull-right45[

<br><br><br>

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/data_frame.png"></img>

]

---

.pull-left45[

# Select via <high>`$`</high>

```r
# select age variable
baselers$age
```

```
##  [1] 44 65 31 27 24 63 71 41 43 31 42 31
## [13] 38 49 39 54 78 62 88 74
```

```r
# select education variable
baselers$education
```

```
##  [1] "SEK_III"
##  [2] "obligatory_school"
##  [3] "SEK_III"
##  [4] "SEK_III"
##  [5] "SEK_III"
##  [6] "SEK_III"
##  [7] "SEK_III"
##  [8] "SEK_III"
##  [9] "apprenticeship"
## [10] "SEK_II"
```

]

.pull-right45[

<br><br><br>

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/data_frame.png"></img>

]

---

.pull-left45[

# Change/Add via <high>`$`</high>

```r
# multiply age by 2 (age in months would need * 12)
baselers$age <- baselers$age * 2

# inspect baselers
baselers
```

```
## # A tibble: 10,000 x 20
##      id sex     age height weight income
##   <int> <chr> <dbl>  <dbl>  <dbl>  <dbl>
## 1     1 male     88   174.  113.    6300
## 2     2 male    130   180.   75.2  10900
## 3     3 fema…    62   168.   55.5   5100
## 4     4 male     54   209    93.8   4200
## 5     5 male     48   177.   NA     4000
##   education confession children
##   <chr>     <chr>         <int>
## 1 SEK_III   catholic          2
## 2 obligato… confessio…        2
## 3 SEK_III   <NA>              2
## 4 SEK_III   catholic          2
## 5 SEK_III   catholic          1
## # ... 
with 9,995 more rows, and 11 more
## #   variables
```

]

.pull-right45[

<br><br><br>

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/data_frame.png"></img>

]

---

.pull-left45[

# Tidy data

1 - Each variable you measure should be in one column.

2 - Each different observation of that variable should be in a different row.

3 - There should be one table for each "kind" of variable.

4 - If you have multiple tables, they should include a column in the table that allows them to be linked.

<br><br>

see <a href="http://worldpece.org/sites/default/files/datastyle.pdf">The Elements of Data Analytic Style</a> by Jeff Leek

]

.pull-right45[

<br><br><br>

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/data_frame.png"></img>

]

---

# `vector`

.pull-left45[

1 - R's <high>basic and, in a way, only data container</high>.

<br><br>

2 - Can contain only a <high>single type of data</high> and missing values.

<br><br>

3 - Data types
   <high>`numeric`</high> - All numbers<br>
   <high>`character`</high> - All characters (e.g., names)<br>
   <high>`logical`</high> - `TRUE` or `FALSE`<br>
   ...<br>
   <high>`NA`</high> - missing values<br>

]

.pull-right4[

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/vector.png"></img>

]

---

# Select/Change/(Add) via `[ ]`

.pull-left45[

```r
# extract vector containing age
age <- baselers$age
age
```

```
## [1]  88 130  62  54  48 126 142  82  86
```

```r
# select value
age[2]
```

```
## [1] 130
```

```r
# change value
age[2] <- 2
age
```

```
## [1]  88   2  62  54  48 126 142  82  86
```

<br>

Find more info on indexing [here](http://rspatial.org/intr/rst/4-indexing.html).
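
The title also mentions adding values, which the code above does not show. A short sketch, assuming a made-up vector `ages` (illustrative values, not taken from `baselers`):

```r
# Hypothetical ages (illustrative values only)
ages <- c(44, 65, 31, 27, 24)

# select several values at once by position
first_and_third <- ages[c(1, 3)]

# change a value
ages[2] <- 66

# add a value at a new position
ages[6] <- 63
ages
```
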
]

.pull-right4[

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/vector.png"></img>

]

---

# Data types: `numeric`

.pull-left45[

`numeric` vectors are used to store numbers and only numbers.

```r
baselers$age
```

```
## [1]  88 130  62  54  48 126 142  82  86
```

```r
# evaluate type
typeof(baselers$age)
```

```
## [1] "double"
```

```r
is.numeric(baselers$age)
```

```
## [1] TRUE
```

]

.pull-right4[

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/vector_types.png"></img>

]

---

# Data types: `character`

.pull-left45[

`character` vectors are used to store data represented by <high>letters and symbols, and all other data</high>.

```r
baselers$sex
```

```
## [1] "male"   "male"   "female" "male"
## [5] "male"   "male"   "male"   "female"
```

```r
# convert numbers to character
as.character(baselers$age)
```

```
## [1] "88"  "130" "62"  "54"  "48"  "126"
## [7] "142" "82"  "86"
```

]

.pull-right4[

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/vector_types.png"></img>

]

---

# Data types: `logical`

.pull-left45[

`logical` vectors are used to <high>*slice* data</high>, i.e., to select elements or rows. `logical` vectors are typically created from other vectors via <high>logical comparisons</high>.

```r
baselers$sex == "male"
```

```
## [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## [7]  TRUE FALSE
```

```r
# compare age to a threshold
baselers$age < 30
```

```
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [8] FALSE FALSE
```

]

.pull-right4[

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/vector_types.png"></img>

]

---

# Data types: `logical`

.pull-left45[

`logical` vectors are used to <high>*slice* data</high>, i.e., to select elements or rows. `logical` vectors are typically created from other vectors via <high>logical comparisons</high>.
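
A minimal sketch of slicing with a `logical` vector. The data frame `people` and its values are made up for illustration and are not the `baselers` data.

```r
# Hypothetical mini data frame (illustrative values only)
people <- data.frame(
  sex = c("male", "male", "female", "male"),
  age = c(44, 65, 31, 27)
)

# logical vector: TRUE wherever the comparison holds
is_young <- people$age < 40

# slice a vector: keep only the elements where is_young is TRUE
people$sex[is_young]

# slice a data frame: the logical vector goes before the comma to select rows
young_people <- people[is_young, ]
young_people
```
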
<u>Logical operators</u>

<high>`==`</high> - is equal to<br>
<high>`<`</high>, <high>`>`</high> - smaller/greater than<br>
<high>`<=`</high>, <high>`>=`</high> - smaller/greater than or equal<br>
<high>`&`</high>, <high>`&&`</high> - logical AND<br>
<high>`|`</high>, <high>`||`</high> - logical OR<br>

]

.pull-right4[

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/vector_types.png"></img>

]

---

.pull-left45[

# Object Classes

<br><br>

1 - R's objects have <high>content and attributes</high>.

<br><br>

2 - Attributes typically include <high>names</high>, <high>dimensions</high>, and the <high>class</high> (or type) of the object.

<br2>

3 - <high>Classes</high> are critical because they determine <high>when and how objects can be used in functions</high>!

]

.pull-right45[

<br><br><br>

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/object_class.png"></img>

]

---

.pull-left4[

# Functions

Functions have 3 elements:

1 - <high>Name</high>: Used to refer to the function and call (execute) it.

2 - <high>Arguments</high>: Used to provide (data) inputs and to control what the function does. Arguments with default values (e.g., `use = "everything"`) need not be specified. Arguments without default values (e.g., `x`) must be specified. <high>Inputs must have the appropriate class!</high>

3 - <high>Body</high>: The code that uses the inputs (arguments) to produce the desired output. The code in the function's body works on <high>copies of the inputs</high>, which are named according to the argument names.

]

.pull-right55[

<br><br>

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/function.png"></img>

]

---

# Documentation

.pull-left5[

R documentation (<high>help files</high> and <high>vignettes</high>) will become very easy to use once you are familiar with the basic R vocabulary. Pay attention to...
<high>Usage</high> - shows how to use the function, its arguments, and their defaults.<br>
<high>Arguments</high> - describes the arguments and their `class`.<br>
<high>Value</high> - describes what the function returns.<br>
<high>Examples</high> - provide working R code.

```r
# To access help files
?name_of_function

# search help files
??name_of_function
```

]

.pull-right5[

```r
?cor
```

<p align="center"><img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/help_cor.png" width="500"></p>

]

---

# Raw (structured) Data

.pull-left45[

<high>delim-separated data</high>

*.csv, .txt, etc.*

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/baselers_raw.png">

]

.pull-right45[

<high>markup data</high>

*.xml, .xls, .html, (.json), etc.*

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/rbootcamp_raw.png">

]

---

# Delim-separated data

.pull-left45[

1 - Most typical file format.

2 - Requires a <high>delimiter</high> to separate entries.

<br>

<p align="center">
<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/PNG/readr.png" height=200>
</p>

]

.pull-right45[

<high>delim-separated data</high>

*.csv, .txt, etc.*

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/baselers_raw.png">

]

---

# `readr`

`readr` is a `tidyverse` package that provides convenient functions to **read in** *flat* (non-nested) data files into data frames (`tibble`s to be precise):

.pull-left3[

<br>

<p align="center">
<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/PNG/readr.png" height=200>
</p>

]

.pull-right65[

<br>

```r
# Importing data from a file
data <- read_csv(file, ...) # comma-delimited
data <- read_csv2(file, ...) # semicolon-delimited
data <- read_delim(file, ...) 
# arbitrary-delimited

# Writing a data frame to a file
write_csv(data_object, file, ...) # comma-delimited
write_delim(data_object, file, ...) # arbitrary-delimited
```

]

---

# Finding the file path

.pull-left4[

1 - Identify the file path using the <high>auto-complete</high>.

2 - Initiate auto-complete and browse through the folder structure by placing the cursor between two quotation marks and using the <high>tab key</high>.

<p align="center">
<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/tab.png" height="80px"></img>
</p>

3 - Auto-complete begins with the project folder - <high>place your data inside your project folder!</high>

]

.pull-right55[

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/load_baselers_ss.jpg"></img>

]

---

# Identifying the delimiter

.pull-left5[

1 - <high>Find the file</high> on your hard drive. It should be in your data folder inside your project.

2 - <high>Open the file</high> in RStudio (right-click on the file in the *Files* pane) or in a text viewer, e.g., *TextEdit* (Mac), *TextWrangler* (Mac), *WordPad* (Windows).

<br><br><br>

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/find_data.png">

]

.pull-right45[

<center>`baselers.csv`

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/baselers_raw.png">

]

---

# Identifying the delimiter

.pull-left5[

1 - <high>Find the file</high> on your hard drive. It should be in your data folder inside your project.

2 - <high>Open the file</high> in RStudio (right-click on the file in the *Files* pane) or in a text viewer, e.g., *TextEdit* (Mac), *TextWrangler* (Mac), *WordPad* (Windows).
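
The first raw lines of a file can also be printed from within R. A minimal sketch using base R's `readLines()`; the temporary file and its content are stand-ins for illustration, not the real `baselers.csv`.

```r
# Write a small stand-in file (hypothetical content, not the real baselers data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,sex,age", "1,male,44", "2,female,31"), tmp)

# Print the first raw lines to see which character separates the entries
first_lines <- readLines(tmp, n = 2)
first_lines
```
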
<br><br><br>

```r
# Read with explicit delimiter
baselers <- read_delim(file = ".../baselers.csv",
                       delim = ",")
```

]

.pull-right45[

<center>`baselers.csv`

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/baselers_raw.png">

]

---

# Handling headers

.pull-left5[

1 - `readr` functions typically expect the <high>column names</high> in the first line.

2 - If no column names are available, use the <high>`col_names` argument</high> to provide them.

<br><br><br>

```r
# Read with explicit column names
baselers <- read_csv(file = ".../baselers.csv",
                     col_names = c("id", "age", ...))
```

]

.pull-right45[

<center>`baselers.csv`

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/baselers_raw.png">

]

---

# Handling data types

.pull-left5[

When reading in data, <high>`readr` infers the type of data</high> for each column.

```r
# Read baselers
read_csv(file = "1_Data/baselers.csv")
```

```
## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   sex = col_character(),
##   height = col_double(),
##   weight = col_double(),
##   income = col_double(),
##   education = col_character(),
##   confession = col_character(),
##   food = col_double(),
##   fasnacht = col_character(),
##   eyecor = col_character()
## )
```

```
## See spec(...) for full column specifications.
```

```
## # A tibble: 10,000 x 20
##      id sex     age height weight income
##   <int> <chr> <int>  <dbl>  <dbl>  <dbl>
## 1     1 male     44   174.  113.    6300
## 2     2 male     65   180.   75.2  10900
## 3     3 fema…    31   168.   55.5   5100
## 4     4 male     27   209    93.8   4200
## 5     5 male     24   177.   NA     4000
##   education confession children
##   <chr>     <chr>         <int>
## 1 SEK_III   catholic          2
## 2 obligato… confessio…        2
## 3 SEK_III   <NA>              2
## 4 SEK_III   catholic          2
## 5 SEK_III   catholic          1
## # ... 
with 9,995 more rows, and 11 more
## #   variables
```

]

.pull-right45[

<center>`baselers.csv`

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/baselers_raw.png">

]

---

# Handling data types

.pull-left5[

Incorrect data types can be fixed. Typically this involves:

1 - <high>Removing character elements</high> from otherwise numeric variables.<br><br2>
2 - Setting <high>explicit `NA` strings</high> using the `na` argument.<br><br2>
3 - Re-running <high>`type_convert`</high>.<br><br>

```r
# Read baselers
baselers <- read_csv(file = ".../baselers.csv",
                     na = c('NA'))

# Try to fix incorrect data types
baselers <- type_convert(baselers)
```

]

.pull-right45[

<center> `baselers.csv`

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/baselers_raw.png">

]

---

# Other data

R provides <high>read and write functions</high> for practically all data file formats. See [rio](https://cran.r-project.org/web/packages/rio/vignettes/rio.html).

.pull-left45[

### `readr`

<img src="http://d33wubrfki0l68.cloudfront.net/66d3133b4a19949d0b9ddb95fc48da074b69fb07/7dfb6/images/hex-readr.png" width="50" align="right">

```r
# read fixed width files (can be fast)
data <- read_fwf(file, ...)

# read Apache style log files
data <- read_log(file, ...)
```

### `haven`

<img src="http://haven.tidyverse.org/logo.png" width="50" align="right">

```r
# read SAS's .sas7bdat and .sas7bcat files
data <- read_sas(file, ...)

# read SPSS's .sav files
data <- read_sav(file, ...)

# etc
```

]

.pull-right45[

### `readxl`

<img src="https://www.rstudio.com/wp-content/uploads/2017/05/readxl-259x300.png" width="50" align="right">

```r
# read Excel's .xls and .xlsx files
data <- read_excel(file, ...)
```

<br>

### Other

```r
# Read Matlab .mat files
data <- R.matlab::readMat(file, ...)

# Read and wrangle .xml and .html
data <- XML::xmlParse(file, ...)
# from package jsonlite: read .json files
data <- jsonlite::read_json(file, ...)
```

]

---

# Remote databases

R provides <high>all necessary tools to pull data from or directly work with</high> remote databases, such as an `SQL` database. Find out more at:

<br><br>

<div class="center_text_2">
<span>
<a href="https://db.rstudio.com/">db.rstudio.com</a>
</span>
</div>

---

# Practical

<p>
<font size=6>
<a href="https://therbootcamp.github.io/Intro2DataScience_2018Oct/_sessions/Data/Data_practical.html"><b>Link to practical</b></a>
</font>
</p>