# Data
### R for Data Science <a href='https://therbootcamp.github.io'> Basel R Bootcamp </a> <a href='https://therbootcamp.github.io/R4DS_2019Feb/'> </a>  <a href='https://therbootcamp.github.io'> </a>  <a href='mailto:therbootcamp@gmail.com'> </a>  <a href='https://www.linkedin.com/company/basel-r-bootcamp/'> </a>
### February 2019

---

<div class="my-footer">
 
 
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/>
 
 <a href="https://therbootcamp.github.io/">
 
 
 www.therbootcamp.com
 
 
 </a>
 <a href="https://therbootcamp.github.io/">
 
 R For Data Science | February 2019
 
 </a>
 
 </div>

---

# Data

.pull-left45[
<br2>

In this session you will get to know...
<ul>
 <li> R's 3 main <high>data types</high> </li><br2>
 <li> a little more about <high>functions</high> </li><br2>
 <li> R's <high>Import/Export</high> functions </li><br2>
</ul>

]

<img src="image/bigdata.jpg" height=440px> 
from <a href="https://cloudtweaks.com/">cloudtweaks.com</a>

]
---

# 3 Object types for data

R has 3 main data objects...

<high>`list`</high> - R's multi-purpose container
- Can carry any data, incl. lists
- Often used for function outputs

<high>`data_frame`</high> - R's spreadsheet
- Specific type of `list`
- Typical data format
- For multi-variable data sets

<high>`vector`</high> - R's data container
- Actually carries the data
- Contains data of one of several types
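
A minimal sketch of how the three objects relate, using made-up values (the names `age`, `sex`, `people`, and `results` are illustrative and not part of the `baselers` data):

```r
# two vectors of equal length, each holding a single data type
age <- c(44, 65, 31)
sex <- c("male", "male", "female")

# a data frame bundles equal-length vectors as columns
people <- data.frame(age = age, sex = sex, stringsAsFactors = FALSE)

# a list can carry anything, including the data frame and a single number
results <- list(data = people, mean_age = mean(age))

is.list(people)      # a data frame is a specific type of list
class(results$data)  # the data frame is stored unchanged inside the list
```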

]

.pull-right55[
<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/main_objects.png"></img>
]

---

# `list`

.pull-left45[
 
1 - Can <high>carry any data</high>, incl. `list`s, `data_frame`s, `vector`s, etc.
 
2 - Are often used for <high>function outputs</high>
 
3 - Have <high>named elements</high>.
 
4 - Elements can be <high>inspect</high>ed via `names()` or `str()`.
 
5 - Elements are (typically) <high>select</high>ed by `$`.
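
The points above can be tried on a small, hypothetical list (`study` and its element names are invented for illustration):

```r
# a list with three named elements of different kinds
study <- list(title = "Height study",
              ages = c(44, 65, 31),
              settings = list(alpha = 0.05))

names(study)   # element names
str(study)     # compact overview of the structure

study$ages             # select an element by name with $
study[["ages"]]        # equivalent selection with [[ ]]
study$settings$alpha   # $ can be chained for nested lists
```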

]

.pull-right5[
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/list.png"></img>
 ]

---

# `list`: Select element using <high>`$`</high>

```r
# regression
reg_model <- lm(height ~ sex + age,
 data = baselers)
reg_results <- summary(reg_model)

# get element names
names(reg_results)
```

```
## [1] "call"         "terms"       
## [3] "residuals"    "coefficients"
## [5] "aliased"      "sigma"       
## [7] "df"           "r.squared"
```

```r
# select element using $
reg_results$coefficients
```

```
##               Estimate  t value
## (Intercept) 164.171266 499.5339
## sexmale      13.993699  66.4724
## age          -0.003753  -0.5819
```

]

.pull-right5[
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/list.png"></img>
 ]

---

# `data_frame`

1 - Are `list`s containing <high>`vector`s of equal length</high> representing the variables.
 
2 - Contain `vector`s of different types: `numeric`, `character`, etc.
 
3 - Have named elements.
 
4 - Elements can be <high>inspect</high>ed via `names()`, `str()`, `print()`, `View()`, or `skimr::skim()`.
 
5 - Elements are (typically) <high>select</high>ed by `$`.
 
6 - Come in different flavors: `data.frame()`, `data.table()`, `tibble()`.
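
As a rough illustration of points 1 and 2, the sketch below builds a small base R `data.frame()` from invented values and checks the type and length of each column:

```r
# hypothetical data: three columns of different types
people <- data.frame(id = 1:3,
                     sex = c("male", "male", "female"),
                     tall = c(TRUE, TRUE, FALSE),
                     stringsAsFactors = FALSE)

sapply(people, class)   # one data type per column
lengths(people)         # every column has the same length

# columns of unequal length are rejected with an error
msg <- tryCatch(data.frame(x = 1:3, y = 1:2),
                error = function(e) conditionMessage(e))
msg
```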

]

.pull-right45[
 
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/data_frame.png"></img>
 ]

---

# Inspect content

```r
# inspect baselers via print
baselers
```

```
## # A tibble: 10,000 x 20
## id sex age height weight income
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 male 44 174. 113. 6300
## 2 2 male 65 180. 75.2 10900
## 3 3 fema… 31 168. 55.5 5100
## 4 4 male 27 209 93.8 4200
## 5 5 male 24 177. NA 4000
## education confession children
## <chr> <chr> <dbl>
## 1 SEK_III catholic 2
## 2 obligato… confessio… 2
## 3 SEK_III <NA> 2
## 4 SEK_III catholic 2
## 5 SEK_III catholic 1
## # … with 9,995 more rows, and 11 more
## # variables
```
]

.pull-right45[
 
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/data_frame.png"></img>
 ]

---

# Inspect content

```r
# View dataframe in a new window
View(baselers)
```

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/view.png"></img>
]

.pull-right45[
 
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/data_frame.png"></img>
 ]

---

# Select via <high>`$`</high>

```r
# Access age column from baselers
baselers$age
```

```
##  [1] 44 65 31 27 24 63 71 41 43 31 42 31
## [13] 38 49 39 54 78 62 88 74
```

```r
# Access education column from baselers
baselers$education
```

```
##  [1] "SEK_III"          
##  [2] "obligatory_school"
##  [3] "SEK_III"          
##  [4] "SEK_III"          
##  [5] "SEK_III"          
##  [6] "SEK_III"          
##  [7] "SEK_III"          
##  [8] "SEK_III"          
##  [9] "apprenticeship"   
## [10] "SEK_II"
```

]

.pull-right45[
 
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/data_frame.png"></img>
 ]

---

# Change/Add via <high>`$`</high>

```r
# Divide income by 1000
baselers$income <- baselers$income / 1000

# inspect baselers
baselers
```

```
## # A tibble: 10,000 x 20
## id sex age height weight income
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 male 44 174. 113. 6.3
## 2 2 male 65 180. 75.2 10.9
## 3 3 fema… 31 168. 55.5 5.1
## 4 4 male 27 209 93.8 4.2
## 5 5 male 24 177. NA 4 
## education confession children
## <chr> <chr> <dbl>
## 1 SEK_III catholic 2
## 2 obligato… confessio… 2
## 3 SEK_III <NA> 2
## 4 SEK_III catholic 2
## 5 SEK_III catholic 1
## # … with 9,995 more rows, and 11 more
## # variables
```
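
Adding a column works the same way as changing one. The sketch below uses a small stand-in data frame rather than `baselers`; the column name `bmi` is an assumption made for illustration:

```r
# stand-in data with made-up values
people <- data.frame(height = c(174, 180, 168),
                     weight = c(113, 75.2, 55.5))

# assigning to a name that does not exist yet adds a new column
people$bmi <- people$weight / (people$height / 100)^2

names(people)
```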

]

.pull-right45[
 
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/data_frame.png"></img>
 ]

---

# `vector`

1 - R's <high>basic and, in a way, only data container</high>. 
 
2 - Can contain only a <high>single type of data</high> and missing values. 
 
3 - Data types

&emsp; <high>`numeric`</high> - All numbers 
&emsp; <high>`character`</high> - All characters (e.g., names) 
&emsp; <high>`logical`</high> - `TRUE` or `FALSE` 
&emsp; ... 
&emsp; <high>`NA`</high> - missing values
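
A short sketch with invented values shows how `class()` reports these types, and what happens when types are mixed or values are missing:

```r
class(c(1.5, 2, 3))      # "numeric"
class(c("a", "b"))       # "character"
class(c(TRUE, FALSE))    # "logical"

# mixing types forces everything into a single type
mixed <- c(1, "a", TRUE)
class(mixed)             # "character"

# NA marks a missing value and keeps the type of the vector
ages <- c(44, NA, 31)
class(ages)              # "numeric"
is.na(ages)
mean(ages, na.rm = TRUE) # ignore the missing value
```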

]

.pull-right4[
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/vector.png"></img>
 ]

---

# Select/Change/(Add) via `[ ]`

```r
# extract vector containing age
age <- baselers$age
age
```

```
## [1] 44 65 31 27 24 63 71 41 43
```

```r
# select value
age[2]
```

```
## [1] 65
```

```r
# change value
age[2] <- 100
age
```

```
## [1]  44 100  31  27  24  63  71  41  43
```

Find more info on indexing [here](http://rspatial.org/intr/rst/4-indexing.html).
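
A few more indexing patterns, sketched on a short made-up vector, including adding a value by assigning to a new position:

```r
age <- c(44, 65, 31, 27, 24)

first_and_third <- age[c(1, 3)]  # several positions at once
without_first <- age[-1]         # negative index drops a position
over_30 <- age[age > 30]         # select by logical comparison

# add a value by assigning to the next free position
age[6] <- 63
length(age)
```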

]

.pull-right4[
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/vector.png"></img>
 ]

---

# Data types: `numeric`

`numeric` vectors are used to store numbers and only numbers.

```r
baselers$age
```

```
## [1] 44 65 31 27 24 63 71 41 43
```

```r
# evaluate class
class(baselers$age)
```

```
## [1] "numeric"
```

```r
# is age numeric?
is.numeric(baselers$age)
```

```
## [1] TRUE
```

]

.pull-right4[
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/vector_types.png"></img>
 ]

---

# Data types: `character`

`character` vectors are used to store data represented by <high>letters and symbols, and all other data</high>.

You can always recognise character vectors by <high>quotation marks " "</high>

```r
baselers$sex
```

```
## [1] "male"   "male"   "female" "male"  
## [5] "male"   "male"   "male"   "female"
```

```r
baselers$education
```

```
## [1] "SEK_III"          
## [2] "obligatory_school"
## [3] "SEK_III"          
## [4] "SEK_III"
```

]

.pull-right4[
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/vector_types.png"></img>
 ]

---

# Data types: `character`

`character` vectors are used to store data represented by <high>letters and symbols, and all other data</high>.

You can always recognise character vectors by <high>quotation marks " "</high>

```r
baselers$age
```

```
## [1] 44 65 31 27 24 63 71 41
```

```r
# convert age to character
as.character(baselers$age)
```

```
## [1] "44" "65" "31" "27" "24" "63" "71"
## [8] "41" "43"
```

]

.pull-right4[
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/vector_types.png"></img>
 ]

---

# Data types: `logical`

`logical` vectors are used to <high>*slice* data</high>, that is, to select elements or rows. `logical` vectors are typically created from other vectors via <high>logical comparisons</high>.

```r
# which sex values are male?
baselers$sex == "male"
```

```
## [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## [7]  TRUE FALSE
```

```r
# which ages are less than 30?
baselers$age < 30
```

```
## [1] FALSE FALSE FALSE  TRUE  TRUE FALSE
## [7] FALSE FALSE FALSE
```

]

.pull-right4[
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/vector_types.png"></img>
 ]

---

# Data types: `logical`

`logical` vectors are used to <high>*slice* data</high>, that is, to select elements or rows. `logical` vectors are typically created from other vectors via <high>logical comparisons</high>.

Logical operators

<high>`==`</high> - is equal to 
<high>`<`</high>, <high>`>`</high> - smaller/greater than 
<high>`<=`</high>, <high>`>=`</high> - smaller/greater than or equal 
<high>`&`</high>, <high>`&&`</high> - logical AND 
<high>`|`</high>, <high>`||`</high> - logical OR
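
The sketch below combines two comparisons on made-up vectors and uses the result to slice. It also hints at the difference between the operator pairs: `&` and `|` work element by element, whereas `&&` and `||` are meant for single values, for example inside `if()`.

```r
age <- c(44, 65, 31, 27, 24)
sex <- c("male", "male", "female", "male", "male")

# element-wise AND of two logical vectors
young_male <- age < 30 & sex == "male"
young_male

age[young_male]   # slice: keep only elements where the test is TRUE
sum(young_male)   # TRUE counts as 1, so this counts the matches

# && combines two single values
if (length(age) > 0 && is.numeric(age)) {
  message("age is a non-empty numeric vector")
}
```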

]

.pull-right4[
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/vector_types.png"></img>
 ]

---

# Raw (structured) Data

<high>delim-separated data</high>
*.csv, .txt, etc.*

]

<high>markup data</high>
*.xml, .xls, .html, (.json), etc.*

]

---

# Delim-separated data

1 - Most typical file format.

2 - Requires <high>delimiter</high> to separate entries.

]

<high>delim-separated data</high>
*.csv, .txt, etc.*

]

---

# `readr`

`readr` is a `tidyverse` package that provides convenient functions to **read in** *flat* (non-nested) data files into data frames (`tibble`s to be precise):

.pull-left3[
 

 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/PNG/readr.png" height=200>

]

```r
# Importing data from a file

data <- read_csv(file, ...) # comma-delimited
data <- read_csv2(file, ...) # semicolon-delimited
data <- read_delim(file, ...) # arbitrary-delimited

# Writing a data frame to a file

write_csv(data_object, path, ...)    # comma-delimited
write_delim(data_object, path, ...)  # arbitrary-delimited
```
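
A minimal round trip, assuming `readr` is installed. The data and the temporary file are invented for illustration; `col_types = cols()` is used here only to silence the column specification message.

```r
library(readr)

# hypothetical example data and a temporary file
people <- data.frame(id = 1:3,
                     sex = c("male", "male", "female"),
                     stringsAsFactors = FALSE)
path <- tempfile(fileext = ".csv")

# write a comma-delimited file, then read it back in as a tibble
write_csv(people, path)
people_in <- read_csv(path, col_types = cols())
people_in
```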
]

---

# Finding the file path

1 - Identify the file path using the <high>auto-complete</high>.

2 - Initiate auto-complete and browse through the folder structure by placing the cursor between two quotation marks and using the <high>tab key</high>.

3 - Auto-complete begins with the project folder - <high>place your data inside your project folder!</high>
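
Paths can also be built and checked in code. The sketch assumes a data folder called `1_Data` inside the project folder; adjust the name to your own setup.

```r
# build a path relative to the project folder
path <- file.path("1_Data", "baselers.csv")
path

getwd()               # the folder that relative paths start from
file.exists(path)     # TRUE only if the file is really there
list.files("1_Data")  # files in the data folder (empty if it does not exist)
```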

]

]

---

# Identifying the delimiter

1 - <high>Find the file</high> on your hard drive. Should be in your data folder inside your project.

2 - <high>Open the file</high> in RStudio (right-click on the file in the *Files* pane) or in a text viewer, e.g., *TextEdit* (Mac), *TextWrangler* (Mac), *WordPad* (Windows).
 
<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/find_data.png">

]

<center>`baselers.csv`</center>

]

---

# Identifying the delimiter

1 - <high>Find the file</high> on your hard drive. Should be in your data folder inside your project.

2 - <high>Open the file</high> in RStudio (right-click on the file in the *Files* pane) or in a text viewer, e.g., *TextEdit* (Mac), *TextWrangler* (Mac), *WordPad* (Windows).

```r
# Read with explicit delimiter
baselers <- read_delim(file = ".../baselers.csv",
 delim = ",")
```
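
The delimiter can also be checked from within R by printing the first lines of the file as plain text. The sketch writes a small semicolon-delimited file to a temporary location so that it runs on its own; with real data, `path` would point to your file.

```r
# create a small example file (contents are made up)
path <- tempfile(fileext = ".csv")
writeLines(c("id;sex;age", "1;male;44", "2;female;31"), path)

# print the first two lines as raw text
first_lines <- readLines(path, n = 2)
first_lines

# check whether the header line contains a semicolon
grepl(";", first_lines[1], fixed = TRUE)
```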

]

<center>`baselers.csv`</center>

]

---

# Handling headers

1 - `readr` functions typically expect the <high>column names</high> in the first line.

2 - If no column names are available, use the <high>`col_names` argument</high> to provide them.

```r
# Read with explicit column names
baselers <- read_csv(file = ".../baselers.csv",
 col_names = c("id",
 "age",
 ...))
```
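
A runnable sketch of the same idea, assuming `readr` is installed and using a made-up file without a header line:

```r
library(readr)

# example file without column names (contents are invented)
path <- tempfile(fileext = ".csv")
writeLines(c("1,44", "2,65"), path)

# without col_names the first data row is used up as the header
wrong <- read_csv(path, col_types = cols())
nrow(wrong)

# with col_names all rows are kept as data
right <- read_csv(path, col_names = c("id", "age"), col_types = cols())
right
```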

]

<center>`baselers.csv`</center>

]

---

# Handling data types

When reading in data, <high>`readr` infers the type of data</high> for each column.

```r
# Read baselers
read_csv(file = "1_Data/baselers.csv")
```

```
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   sex = col_character(),
##   education = col_character(),
##   confession = col_character(),
##   fasnacht = col_character(),
##   eyecor = col_character()
## )
```

```
## See spec(...) for full column specifications.
```

]

<center>`baselers.csv`</center>

]

---

# Handling data types

Incorrect data types can be fixed. Typically this involves:

1 - <high>Removing character elements</high> from otherwise numeric variables. <br2>
2 - Setting <high>explicit `NA` strings</high> using the `na` argument. <br2>
3 - Re-running <high>`type_convert()`</high>.

```r
# Read baselers
baselers <- read_csv(file = ".../baselers.csv",
 na = c('NA'))

# Try to fix incorrect data types
baselers <- type_convert(baselers)
```
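
Both routes can be sketched on a small made-up file in which the string `"unknown"` turns a numeric column into `character` (assuming `readr` is installed; the file contents and the `"unknown"` marker are invented):

```r
library(readr)

# example file with a stray text entry in a numeric column
path <- tempfile(fileext = ".csv")
writeLines(c("id,income", "1,6300", "2,unknown", "3,5100"), path)

raw <- read_csv(path, col_types = cols())
class(raw$income)    # "character" because of "unknown"

# route 1: declare the offending string as NA while reading
fixed <- read_csv(path, na = c("NA", "unknown"), col_types = cols())
class(fixed$income)  # "numeric"

# route 2: remove the offending entries, then re-run the type inference
raw$income[raw$income == "unknown"] <- NA
raw <- type_convert(raw)
class(raw$income)    # "numeric"
```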

]

<center>`baselers.csv`</center>

]

---

# Other data

R provides <high>read and write functions</high> for practically all data file formats. See [rio](https://cran.r-project.org/web/packages/rio/vignettes/rio.html).

.pull-left45[
### `readr` <img src="http://d33wubrfki0l68.cloudfront.net/66d3133b4a19949d0b9ddb95fc48da074b69fb07/7dfb6/images/hex-readr.png" width="50" align="right">

```r
# read fixed width files (can be fast)
data <- read_fwf(file, ...)

# read Apache style log files
data <- read_log(file, ...)
```

### `haven` <img src="http://haven.tidyverse.org/logo.png" width="50" align="right">

```r
# read SAS's .sas7bdat and .sas7bcat files
data <- read_sas(file, ...)

# read SPSS's .sav files
data <- read_sav(file, ...)

# etc
```
]

.pull-right45[
### `readxl` <img src="https://www.rstudio.com/wp-content/uploads/2017/05/readxl-259x300.png" width="50" align="right">

```r
# read Excel's .xls and .xlsx files
data <- read_excel(file, ...)
```
 
### Other

```r
# Read Matlab .mat files
data <- R.matlab::readMat(file, ...)

# Read and wrangle .xml and .html
data <- XML::xmlParse(file, ...)

# from package jsonlite: read .json files
data <- jsonlite::read_json(file, ...)
```
]

---

<h1><a href="https://therbootcamp.github.io/R4DS_2019Feb/_sessions/Data/Data_practical.html">Practical</a></h1>