class: center, middle, inverse, title-slide # Data I/O ### The R Bootcamp
www.therbootcamp.com
@therbootcamp
### July 2018 --- layout: true <div class="my-footer"><span> <a href="https://therbootcamp.github.io/"><font color="#7E7E7E">Basel, July 2018</font></a>                                           <a href="https://therbootcamp.github.io/"><font color="#7E7E7E">www.therbootcamp.com</font></a> </span></div> --- # Data Input/Output .pull-left45[ Raw data can come in many shapes and sizes, but <high>R's got you covered</high>. <br><br> .pull-left65[ | Package | Description| |:----------------|:-------------------------------------------------------| | `readr`| `.csv`, `.txt`, etc. | | `haven`| `.sav`, `.sas7bdat`, `.dta` | | `readxl` | `.xls`, `.xlsx` | | `R.matlab` | `.mat` | | `jsonlite` | `.json` | | `rvest` | `.html` | | `XML`, `xml2` | `.xml` | ] ] .pull-right45[ <img src="http://blog.datasift.com/wp-content/uploads/2014/10/ms-files-3.jpg"> ] --- # Raw (structured) Data .pull-left45[ <high>delim-separated data</high> *.csv, .txt, etc.* <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/baselers_raw.png"> ] .pull-right45[ <high>markup data</high> *.xml, .xls, .html, (.json), etc.* <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/rbootcamp_raw.png"> ] --- # Delim-separated data .pull-left45[ 1 - Most typical file format. 2 - Requires <high>delimiter</high> to separate entries. <br> <p align="center"> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/PNG/readr.png" height=200> </p> ] .pull-right45[ <high>delim-separated data</high> *.csv, .txt, etc.* <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/baselers_raw.png"> ] --- # `readr` `readr` is a `tidyverse` package that provides convenient functions to **read in** *flat* (non-nested) data files into data frames (`tibble`s to be precise): .pull-left3[ <br> <p align="center"> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/PNG/readr.png" height=200> </p> ] .pull-right65[ <br> ```r # Importing data from a file data <- read_csv(file, ...) # comma-delimited data <- read_csv2(file, ...) # semicolon-delimeted data <- read_delim(file, ...) # arbitrary-delimited # Writing a data frame to a file write_csv(data_object, file, ...) # comma-delimited write_delim(data_object, file, ...) # arbitrary-delimited ``` ] --- # Finding the file path .pull-left4[ 1 - Identify the file path using the <high>auto-complete</high>. 2 - Initiate auto-complete and browse through the folder structure by placing the cursor between two quotation marks and using the <high>tab key</high>. <p align="center"> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/tab.png" height="80px"></img> </p> 3 - Auto-complete begins with the project folder - <high>place your data inside your project folder!</high> ] .pull-right55[ <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/load_baselers_ss.jpg"></img> ] --- # Identifying the delimiter .pull-left5[ 1 - <high>Find the file</high> on your hard drive. Should be in your data folder inside your project. 2 - <high>Open the file</high> in RStudio (right-click on the file in the *Files* pane) a text viewer, e.g., *TextEdit* (Mac), *TextWranger* (Mac), *WordPad* (Windows). <br><br><br> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/find_data.png"> ] .pull-right45[ <center>`baselers.csv` <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/baselers_raw.png"> ] --- # Identifying the delimiter .pull-left5[ 1 - <high>Find the file</high> on your hard drive. Should be in your data folder inside your project. 2 - <high>Open the file</high> in RStudio (right-click on the file in the *Files* pane) a text viewer, e.g., *TextEdit* (Mac), *TextWranger* (Mac), *WordPad* (Windows). <br><br><br> ```r # Read with explicit column names baselers <- read_delim(file = ".../baselers.csv", delim = c(",") ``` ] .pull-right45[ <center>`baselers.csv` <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/baselers_raw.png"> ] --- # Handling headers .pull-left5[ 1 - `readr`- functions typically expect the <high>column names</high> in the first line. 2 - If no column names are available, use the <high>`col_names`-argument</high> to provide them. <br><br><br> ```r # Read with explicit column names baselers <- read_csv(file = ".../baselers.csv", col_names = c("id", "age", ...)) ``` ] .pull-right45[ <center>`baselers.csv` <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/baselers_raw.png"> ] --- # Handling data types .pull-left5[ Reading in data, <high>`readr` infers the type of data</high> for each column. ```r # Read baselers read_csv(file = "1_Data/baselers.csv") ``` ``` ## Parsed with column specification: ## cols( ## .default = col_integer(), ## sex = col_character(), ## height = col_double(), ## weight = col_double(), ## income = col_double(), ## education = col_character(), ## confession = col_character(), ## food = col_double(), ## fasnacht = col_character(), ## eyecor = col_character() ## ) ``` ``` ## See spec(...) for full column specifications. ``` ``` ## # A tibble: 10,000 x 20 ## id sex age height weight income education confession children happiness fitness food alcohol ## <int> <chr> <int> <dbl> <dbl> <dbl> <chr> <chr> <int> <int> <int> <dbl> <int> ## 1 1 male 44 174. 113. 6300 SEK_III catholic 2 5 7 610 40 ## 2 2 male 65 180. 75.2 10900 obligato… confessio… 2 7 8 1550 0 ## 3 3 fema… 31 168. 55.5 5100 SEK_III <NA> 2 7 6 720 14 ## 4 4 male 27 209 93.8 4200 SEK_III catholic 2 7 8 680 39 ## 5 5 male 24 177. NA 4000 SEK_III catholic 1 5 4 260 19 ## 6 6 male 63 187. 67.4 11400 SEK_III evangelic… 0 7 1 1240 0 ## 7 7 male 71 152. 83.3 12000 SEK_III evangelic… 2 8 5 1160 38 ## 8 8 fema… 41 156. 67.8 7600 SEK_III confessio… 1 7 2 1350 0 ## 9 9 male 43 176. 69.3 8500 apprenti… catholic 2 7 5 150 50 ## 10 10 fema… 31 166. 66.3 6100 SEK_II catholic 1 6 7 700 0 ## # ... with 9,990 more rows, and 7 more variables: tattoos <int>, rhine <int>, datause <int>, ## # consultations <int>, hiking <int>, fasnacht <chr>, eyecor <chr> ``` ] .pull-right45[ <center>`baselers.csv` <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/baselers_raw.png"> ] --- # Handling data types .pull-left5[ Incorrect data types can be fixed. Typically this involves: 1 - <high>removing character elements</high> from otherwise numeric variables.<br><br2> 2 - Setting <high>explicit `NA` strings</high> using the `na`-argument.<br><br2> 3 - Re-running <high>`type_convert`</high>.<br><br> ```r # Read baselers baseslers <- read_csv(file = ".../baselers.csv", na = c('NA')) # Try to fix incorrect data types baselers <- type_convert(baselers) ``` ] .pull-right45[ <center>`baselers.csv`</center> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/baselers_raw.png"> ] --- # Other data R provides <high>read and write functions</high> for practically all data file formats. See [rio](https://cran.r-project.org/web/packages/rio/vignettes/rio.html). .pull-left45[ ### `readr` <img src="http://d33wubrfki0l68.cloudfront.net/66d3133b4a19949d0b9ddb95fc48da074b69fb07/7dfb6/images/hex-readr.png" width="50", align="right"> ```r # read fixed width files (can be fast) data <- read_fwf(file, ...) # read Apache style log files data <- read_log(file, ...) ``` ### `haven` <img src="http://haven.tidyverse.org/logo.png" width="50" align="right"> ```r # read SAS's .sas7bat and sas7bcat files data <- read_sas(file, ...) # read SPSS's .sav files data <- read_sav(file, ...) # etc ``` ] .pull-right45[ ### `readxl` <img src="https://www.rstudio.com/wp-content/uploads/2017/05/readxl-259x300.png" width="50" align="right"> ```r # read Excel's .xls and xlsx files data <- read_excel(file, ...) ``` <br> ### Other ```r # Read Matlab .mat files data <- R.matlab::readMat(file, ...) # Read and wrangle .xml and .html data <- XML::xmlParseParse(file, ...) # from package jsonlite: read .json files data <- jsonlite::read_json(file, ...) ``` ] --- # Websites R provides <high>read and write functions</high> for practically all data file formats. See [rio](https://cran.r-project.org/web/packages/rio/vignettes/rio.html). .pull-left4[ ```r # load package library(rvest) library(xml2) # get html page (abbreviated) url <- '.../R_(programming_language)' page <- read_html(u) # get xpath (abbreviated) xpath <- '.../div/table[2]' # get table using XPath table <- page %>% html_node( xpath = xpath) %>% html_table() ``` ] .pull-right55[ ``` ## # A tibble: 15 x 3 ## Release Date Description ## <chr> <chr> <chr> ## 1 0.16 "" This is the last alpha version ## 2 0.49 1997-04-23 This is the oldest source rele ## 3 0.60 1997-12-05 "R becomes an official part of " ## 4 0.65.1 1999-10-07 First versions of update.packa ## 5 1.0 2000-02-29 Considered by its developers s ## 6 1.4 2001-12-19 "S4 methods are introduced and " ## 7 2.0 2004-10-04 Introduced lazy loading, which ## 8 2.1 2005-04-18 Support for UTF-8 encoding, an ## 9 2.11 2010-04-22 Support for Windows 64 bit sys ## 10 2.13 2011-04-14 Adding a new compiler function ## 11 2.14 2011-10-31 Added mandatory namespaces for ## 12 2.15 2012-03-30 "New load balancing functions. " ## 13 3.0 2013-04-03 Support for numeric index valu ## 14 3.4 2017-04-21 Just-in-time compilation (JIT) ## 15 3.5 2018-04-23 Packages byte-compiled on inst ``` ] --- # Remote databases R provides <high>all necessary tools to pull data from or directly work with</high> remote databases such as, e.g., a `SQL` database. Find out more at: <br><br> <div class="center_text_2"> <span> <a href="https://db.rstudio.com/">db.rstudio.com</a> </span> </div> --- # R's data formats R's own formats provide the possibility to store <high>data as R objects</high> as well as substantial <high>compression</high>. .pull-left45[ ### `.RData` 1 - Bundles <high>several R objects</high>.<br> 2 - Loads objects <high>directly into workspace</high>. <br><br><br> ```r # save data as .RData save(baselers, zuerichers, ..., file = "my_data.RData") # load data from .RData load("my_data.RData") ``` ] .pull-right45[ ### `.RDS` 1 - Stores <high>single R objects</high>.<br> 2 - Import is <high>assigned to object</high>. <br><br><br> ```r # save data as .RDS saveRDS(baselers, file = "baselers.rds") # load data from .RDS baselers <- readRDS("baselers.rds") ``` ] --- # Practical <font size=6><b><a href="https://therbootcamp.github.io/BaselRBootcamp_2018July/_sessions/DataIO/Data_IO_practical.html">Link to practical</a>