Data I/O

class: center, middle, inverse, title-slide

# Data I/O
### The R Bootcamp <a href='https://therbootcamp.github.io'>www.therbootcamp.com</a> <a href='https://twitter.com/therbootcamp'>@therbootcamp</a>
### July 2018

---

layout: true

<div class="my-footer">
<a href="https://therbootcamp.github.io/">Basel, July 2018</a>
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;
<a href="https://therbootcamp.github.io/">www.therbootcamp.com</a>
</div>

---

# Data Input/Output

.pull-left45[

Raw data can come in many shapes and sizes, but <high>R's got you covered</high>.
 
.pull-left65[

| Package     |   Description|
|:----------------|:-------------------------------------------------------|
| `readr`| `.csv`, `.txt`, etc. |
| `haven`| `.sav`, `.sas7bdat`, `.dta`          |
| `readxl`  | `.xls`, `.xlsx` |
| `R.matlab` | `.mat` |
| `jsonlite` | `.json` |
| `rvest` | `.html` |
| `XML`, `xml2` | `.xml` |

]

.pull-right45[

]

---

# Raw (structured) Data

.pull-left45[

<high>delim-separated data</high>
*.csv, .txt, etc.*

]

.pull-right45[

<high>markup data</high>
*.xml, .xls, .html, (.json), etc.*

]

---

# Delim-separated data

.pull-left45[

1 - Most typical file format.

2 - Requires <high>delimiter</high> to separate entries.

]

.pull-right45[

<high>delim-separated data</high>
*.csv, .txt, etc.*

]

---

# `readr`

`readr` is a `tidyverse` package that provides convenient functions to **read in** *flat* (non-nested) data files into data frames (`tibble`s to be precise):

.pull-left3[
 

 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/PNG/readr.png" height=200>

]

.pull-right65[

```r
# Importing data from a file

data <- read_csv(file, ...) # comma-delimited
data <- read_csv2(file, ...) # semicolon-delimeted
data <- read_delim(file, ...) # arbitrary-delimited

# Writing a data frame to a file

write_csv(data_object, file, ...)    # comma-delimited
write_delim(data_object, file, ...)  # arbitrary-delimited
```
]

---

# Finding the file path

.pull-left4[

1 - Identify the file path using the <high>auto-complete</high>.

2 - Initiate auto-complete and browse through the folder structure by placing the cursor between two quotation marks and using the <high>tab key</high>.

3 - Auto-complete begins with the project folder - <high>place your data inside your project folder!</high>

]

.pull-right55[

]

---

# Identifying the delimiter

.pull-left5[

1 - <high>Find the file</high> on your hard drive. Should be in your data folder inside your project.

2 - <high>Open the file</high> in RStudio (right-click on the file in the *Files* pane) a text viewer, e.g., *TextEdit* (Mac), *TextWranger* (Mac), *WordPad* (Windows).
 
<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/find_data.png">

]

.pull-right45[

<center>`baselers.csv`

]

---

# Identifying the delimiter

.pull-left5[

1 - <high>Find the file</high> on your hard drive. Should be in your data folder inside your project.

2 - <high>Open the file</high> in RStudio (right-click on the file in the *Files* pane) a text viewer, e.g., *TextEdit* (Mac), *TextWranger* (Mac), *WordPad* (Windows).

```r
# Read with explicit column names
baselers <- read_delim(file = ".../baselers.csv",
 delim = c(",")
```

]

.pull-right45[

<center>`baselers.csv`

]

---

# Handling headers

.pull-left5[

1 - `readr`- functions typically expect the <high>column names</high> in the first line.

2 - If no column names are available, use the <high>`col_names`-argument</high> to provide them.

```r
# Read with explicit column names
baselers <- read_csv(file = ".../baselers.csv",
 col_names = c("id",
 "age",
 ...))
```

]

.pull-right45[

<center>`baselers.csv`

]

---

# Handling data types

.pull-left5[

Reading in data, <high>`readr` infers the type of data</high> for each column.

```r
# Read baselers
read_csv(file = "1_Data/baselers.csv")
```

```
## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   sex = col_character(),
##   height = col_double(),
##   weight = col_double(),
##   income = col_double(),
##   education = col_character(),
##   confession = col_character(),
##   food = col_double(),
##   fasnacht = col_character(),
##   eyecor = col_character()
## )
```

```
## See spec(...) for full column specifications.
```

```
## # A tibble: 10,000 x 20
## id sex age height weight income education confession children happiness fitness food alcohol
## <int> <chr> <int> <dbl> <dbl> <dbl> <chr> <chr> <int> <int> <int> <dbl> <int>
## 1 1 male 44 174. 113. 6300 SEK_III catholic 2 5 7 610 40
## 2 2 male 65 180. 75.2 10900 obligato… confessio… 2 7 8 1550 0
## 3 3 fema… 31 168. 55.5 5100 SEK_III <NA> 2 7 6 720 14
## 4 4 male 27 209 93.8 4200 SEK_III catholic 2 7 8 680 39
## 5 5 male 24 177. NA 4000 SEK_III catholic 1 5 4 260 19
## 6 6 male 63 187. 67.4 11400 SEK_III evangelic… 0 7 1 1240 0
## 7 7 male 71 152. 83.3 12000 SEK_III evangelic… 2 8 5 1160 38
## 8 8 fema… 41 156. 67.8 7600 SEK_III confessio… 1 7 2 1350 0
## 9 9 male 43 176. 69.3 8500 apprenti… catholic 2 7 5 150 50
## 10 10 fema… 31 166. 66.3 6100 SEK_II catholic 1 6 7 700 0
## # ... with 9,990 more rows, and 7 more variables: tattoos <int>, rhine <int>, datause <int>,
## # consultations <int>, hiking <int>, fasnacht <chr>, eyecor <chr>
```

]

.pull-right45[

<center>`baselers.csv`

]

---

# Handling data types

.pull-left5[

Incorrect data types can be fixed. Typically this involves:

1 - <high>removing character elements</high> from otherwise numeric variables. <br2>
2 - Setting <high>explicit `NA` strings</high> using the `na`-argument. <br2>
3 - Re-running <high>`type_convert`</high>.

```r
# Read baselers
baseslers <- read_csv(file = ".../baselers.csv",
 na = c('NA'))

# Try to fix incorrect data types
baselers <- type_convert(baselers)
```

]

.pull-right45[

<center>`baselers.csv`</center>

]

---

# Other data

R provides <high>read and write functions</high> for practically all data file formats. See [rio](https://cran.r-project.org/web/packages/rio/vignettes/rio.html).

.pull-left45[
### `readr` <img src="http://d33wubrfki0l68.cloudfront.net/66d3133b4a19949d0b9ddb95fc48da074b69fb07/7dfb6/images/hex-readr.png" width="50", align="right">

```r
# read fixed width files (can be fast)
data <- read_fwf(file, ...)

# read Apache style log files
data <- read_log(file, ...)
```

### `haven` <img src="http://haven.tidyverse.org/logo.png" width="50" align="right">

```r
# read SAS's .sas7bat and sas7bcat files
data <- read_sas(file, ...)

# read SPSS's .sav files
data <- read_sav(file, ...)

# etc
```
]

.pull-right45[
### `readxl` <img src="https://www.rstudio.com/wp-content/uploads/2017/05/readxl-259x300.png" width="50" align="right">

```r
# read Excel's .xls and xlsx files
data <- read_excel(file, ...)
```
 
### Other

```r
# Read Matlab .mat files
data <- R.matlab::readMat(file, ...)

# Read and wrangle .xml and .html
data <- XML::xmlParseParse(file, ...)

# from package jsonlite: read .json files
data <- jsonlite::read_json(file, ...)
```
]

---

# Websites

R provides <high>read and write functions</high> for practically all data file formats. See [rio](https://cran.r-project.org/web/packages/rio/vignettes/rio.html).

.pull-left4[

```r
# load package
library(rvest)
library(xml2)

# get html page (abbreviated)
url <- '.../R_(programming_language)' 
page <- read_html(u)

# get xpath (abbreviated)
xpath <- '.../div/table[2]'

# get table using XPath
table <- page %>% 
 html_node(
 xpath = xpath) %>% 
 html_table()
```
]

.pull-right55[

```
## # A tibble: 15 x 3
## Release Date Description 
## <chr> <chr> <chr> 
## 1 0.16 "" This is the last alpha version 
## 2 0.49 1997-04-23 This is the oldest source rele 
## 3 0.60 1997-12-05 "R becomes an official part of "
## 4 0.65.1 1999-10-07 First versions of update.packa 
## 5 1.0 2000-02-29 Considered by its developers s 
## 6 1.4 2001-12-19 "S4 methods are introduced and "
## 7 2.0 2004-10-04 Introduced lazy loading, which 
## 8 2.1 2005-04-18 Support for UTF-8 encoding, an 
## 9 2.11 2010-04-22 Support for Windows 64 bit sys 
## 10 2.13 2011-04-14 Adding a new compiler function 
## 11 2.14 2011-10-31 Added mandatory namespaces for 
## 12 2.15 2012-03-30 "New load balancing functions. "
## 13 3.0 2013-04-03 Support for numeric index valu 
## 14 3.4 2017-04-21 Just-in-time compilation (JIT) 
## 15 3.5 2018-04-23 Packages byte-compiled on inst
```
]

---

# Remote databases

R provides <high>all necessary tools to pull data from or directly work with</high> remote databases such as, e.g., a `SQL` database. Find out more at:

<div class="center_text_2">
 
 <a href="https://db.rstudio.com/">db.rstudio.com</a>
 
</div>

---

# R's data formats

R's own formats provide the possibility to store <high>data as R objects</high> as well as substantial <high>compression</high>.

.pull-left45[
### `.RData` 
1 - Bundles <high>several R objects</high>. 
2 - Loads objects <high>directly into workspace</high>.

```r
# save data as .RData
save(baselers, zuerichers, ...,
     file = "my_data.RData")

# load data from .RData
load("my_data.RData")
```
]

.pull-right45[
### `.RDS` 
1 - Stores <high>single R objects</high>. 
2 - Import is <high>assigned to object</high>.

```r
# save data as .RDS
saveRDS(baselers,
        file = "baselers.rds")

# load data from .RDS
baselers <- readRDS("baselers.rds")
```
]

---

# Practical

<a href="https://therbootcamp.github.io/BaselRBootcamp_2018July/_sessions/DataIO/Data_IO_practical.html">Link to practical</a>