Data

class: center, middle, inverse, title-slide

# Data
### Introduction to Data Science with R <a href='https://therbootcamp.github.io'>www.therbootcamp.com</a> <a href='https://twitter.com/therbootcamp'>@therbootcamp</a>
### October 2018

---

layout: true

<div class="my-footer">
<a href="https://therbootcamp.github.io/">Introduction to Data Science with R, October 2018</a>
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;
<a href="https://therbootcamp.github.io/">www.therbootcamp.com</a>
</div>

---

# Data

.pull-left45[

In this session you will get to know...
<ul>
 <li> R's 3 main <high>data types</high> </li><br2>
 <li> a little more about <high>functions</high> </li><br2>
 <li> R's <high>Import/Export</high> functions </li><br2>
</ul>

]

.pull-right45[

]

---

# 3 Object types for data

.pull-left4[

R has 3 main data objects...

<high>`list`</high> - R's multi-purpose container
- Can carry any data, incl. lists
- Often used for function outputs

<high>`data_frame`</high> - R's spreadsheet
- Specific type of `list`
- Typical data format
- For multi-variable data sets

<high>`vectors`</high> - R's data container
- Actually carries the data
- Each contains data of a single type

]

.pull-right55[
<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/main_objects.png"></img>
]

---

# `list`

.pull-left45[
 
1 - Can <high>carry any data</high>, incl. `list`s, `data_frame`s, `vector`s, etc.
 
2 - Are often used for <high>function outputs</high>
 
3 - Have <high>named elements</high>.
 
4 - Elements can be <high>inspect</high>ed via `names()` or `str()`.
 
5 - Elements are (typically) <high>select</high>ed by `$`.

]

.pull-right5[
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/list.png"></img>
 ]
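
The points above can be tried out on a small list. This is a minimal sketch; the object `my_list` and its contents are made up for illustration and are not part of the course data.

```r
# Hypothetical list mixing a vector, a data frame, and a nested list
my_list <- list(
  ids    = c(1, 2, 3),
  people = data.frame(name = c("Ann", "Ben"), age = c(31, 44)),
  meta   = list(source = "example", year = 2018)
)

# Inspect the element names and the structure
names(my_list)
str(my_list)

# Select elements with $ (can be chained for nested lists)
my_list$ids
my_list$meta$year
```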

---

# `list`: Select element using <high>`$`</high>

.pull-left45[

```r
# regression
reg_model <- lm(height ~ sex + age,
 data = baselers)
reg_results <- summary(reg_model)

# get element names
names(reg_results)
```

```
## [1] "call"         "terms"       
## [3] "residuals"    "coefficients"
## [5] "aliased"      "sigma"       
## [7] "df"           "r.squared"
```

```r
# select element using $
reg_results$coefficients
```

```
##               Estimate  t value
## (Intercept) 164.171266 499.5339
## sexmale      13.993699  66.4724
## age          -0.003753  -0.5819
```

]

.pull-right5[
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/list.png"></img>
 ]

---

.pull-left45[

# `data_frame`

1 - Are `list`s containing <high>`vector`s of equal length</high> representing the variables.
 
2 - Contain `vector`s of different types: `numeric`, `character`, etc.
 
3 - Have named elements.
 
4 - Elements can be <high>inspect</high>ed via `names()`, `str()`, `print()`, `View()`, or `skimr::skim()`.
 
5 - Elements are (typically) <high>select</high>ed by `$`.
 
6 - Come in different flavors: `data.frame()`, `data.table()`, `tibble()`.

]

.pull-right45[
 
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/data_frame.png"></img>
 ]
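
As a small illustration of these properties, the sketch below builds a base R `data.frame` from three equal-length vectors. The object `people` and its values are invented for this example.

```r
# Hypothetical data: three equal-length vectors of different types
people <- data.frame(
  id   = 1:3,
  name = c("Ann", "Ben", "Cem"),
  tall = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

names(people)   # column names
str(people)     # type and first values of each column
people$name     # select one column as a vector

# A data frame is a list of columns
is.list(people)
```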

---

.pull-left45[

# Inspect content

```r
# inspect baselers via print
baselers
```

```
## # A tibble: 10,000 x 20
## id sex age height weight income
## <int> <chr> <int> <dbl> <dbl> <dbl>
## 1 1 male 44 174. 113. 6300
## 2 2 male 65 180. 75.2 10900
## 3 3 fema… 31 168. 55.5 5100
## 4 4 male 27 209 93.8 4200
## 5 5 male 24 177. NA 4000
## education confession children
## <chr> <chr> <int>
## 1 SEK_III catholic 2
## 2 obligato… confessio… 2
## 3 SEK_III <NA> 2
## 4 SEK_III catholic 2
## 5 SEK_III catholic 1
## # ... with 9,995 more rows, and 11 more
## # variables
```
]

.pull-right45[
 
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/data_frame.png"></img>
 ]

---

.pull-left45[

# Inspect content

```r
# inspect baselers via View
View(baselers)
```

<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/view.png"></img>
]

.pull-right45[
 
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/data_frame.png"></img>
 ]

---

.pull-left45[

# Select via <high>`$`</high>

```r
# select age variable
baselers$age
```

```
##  [1] 44 65 31 27 24 63 71 41 43 31 42 31
## [13] 38 49 39 54 78 62 88 74
```

```r
# select education variable
baselers$education
```

```
##  [1] "SEK_III"          
##  [2] "obligatory_school"
##  [3] "SEK_III"          
##  [4] "SEK_III"          
##  [5] "SEK_III"          
##  [6] "SEK_III"          
##  [7] "SEK_III"          
##  [8] "SEK_III"          
##  [9] "apprenticeship"   
## [10] "SEK_II"
```

]

.pull-right45[
 
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/data_frame.png"></img>
 ]

---

.pull-left45[

# Change/Add via <high>`$`</high>

```r
# double the age values (overwrites age)
baselers$age <- baselers$age * 2

# inspect baselers
baselers
```

```
## # A tibble: 10,000 x 20
## id sex age height weight income
## <int> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 male 88 174. 113. 6300
## 2 2 male 130 180. 75.2 10900
## 3 3 fema… 62 168. 55.5 5100
## 4 4 male 54 209 93.8 4200
## 5 5 male 48 177. NA 4000
## education confession children
## <chr> <chr> <int>
## 1 SEK_III catholic 2
## 2 obligato… confessio… 2
## 3 SEK_III <NA> 2
## 4 SEK_III catholic 2
## 5 SEK_III catholic 1
## # ... with 9,995 more rows, and 11 more
## # variables
```

]

.pull-right45[
 
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/data_frame.png"></img>
 ]
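
Adding a variable works the same way as changing one. The sketch below uses a small made-up data frame `people` as a stand-in for `baselers`; the column name `age_months` is an assumption chosen for illustration.

```r
# Hypothetical stand-in for baselers with two columns
people <- data.frame(id = 1:3, age = c(44, 65, 31))

# Assigning to a name that does not exist yet adds a new column
people$age_months <- people$age * 12

people
```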

---

.pull-left45[

# Tidy data

1 - Each variable you measure should be in one column.

2 - Each different observation of that variable should be in a different row.

3 - There should be one table for each "kind" of variable.

4 - If you have multiple tables, they should include a column in the table that allows them to be linked.

see <a href="http://worldpece.org/sites/default/files/datastyle.pdf">The Elements of Data Analytic Style</a> by Jeff Leek

]

.pull-right45[
 
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/data_frame.png"></img>
 ]
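
Rule 4 can be sketched with two small tables that share an `id` column. Both tables and their values are invented for this example; `merge()` is base R's join function.

```r
# Hypothetical table 1: one row per person
persons <- data.frame(id = c(1, 2, 3),
                      sex = c("male", "male", "female"))

# Hypothetical table 2: one row per measurement, linked to persons via id
weights <- data.frame(id = c(1, 1, 2, 3),
                      year = c(2017, 2018, 2018, 2018),
                      weight = c(113, 111, 75.2, 55.5))

# The shared id column allows the tables to be joined
merged <- merge(persons, weights, by = "id")
merged
```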

---

# `vector`

.pull-left45[

1 - R's <high>basic and, in a way, only data container</high>. 
 
2 - Can contain only a <high>single type of data</high> and missing values. 
 
3 - Data types

&emsp; <high>`numeric`</high> - All numbers 
&emsp; <high>`character`</high> - All characters (e.g., names) 
&emsp; <high>`logical`</high> - `TRUE` or `FALSE` 
&emsp; ... 
&emsp; <high>`NA`</high> - missing values

]

.pull-right4[
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/vector.png"></img>
 ]
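
A short sketch of the single-type rule, using made-up values. Note how combining different types in one vector silently converts everything to a common type.

```r
numbers <- c(44, 65, NA, 27)     # numeric with a missing value
names_  <- c("Ann", "Ben", NA)   # character with a missing value
flags   <- c(TRUE, FALSE, NA)    # logical with a missing value

class(numbers)
class(names_)
class(flags)

# Mixing types forces a single common type (here: character)
mixed <- c(1, "a", TRUE)
class(mixed)
mixed

# is.na() finds the missing values
is.na(numbers)
```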

---

# Select/Change/(Add) via `[ ]`

.pull-left45[

```r
# extract vector containing age
age <- baselers$age
age
```

```
## [1]  88 130  62  54  48 126 142  82  86
```

```r
# select value
age[2]
```

```
## [1] 130
```

```r
# change value
age[2] <- 2
age
```

```
## [1]  88   2  62  54  48 126 142  82  86
```

Find more info on indexing [here](http://rspatial.org/intr/rst/4-indexing.html).

]

.pull-right4[
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/vector.png"></img>
 ]

---

# Data types: `numeric`

.pull-left45[

`numeric` vectors are used to store numbers and only numbers.

```r
baselers$age
```

```
## [1]  88 130  62  54  48 126 142  82  86
```

```r
# evaluate type
typeof(baselers$age)
```

```
## [1] "double"
```

```r
is.numeric(baselers$age)
```

```
## [1] TRUE
```

]

.pull-right4[
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/vector_types.png"></img>
 ]

---

# Data types: `character`

.pull-left45[

`character` vectors are used to store data represented by <high>letters and symbols, and all other data</high>.

```r
baselers$sex
```

```
## [1] "male"   "male"   "female" "male"  
## [5] "male"   "male"   "male"   "female"
```

```r
# convert numeric to character
as.character(baselers$age)
```

```
## [1] "88"  "130" "62"  "54"  "48"  "126"
## [7] "142" "82"  "86"
```

]

.pull-right4[
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/vector_types.png"></img>
 ]

---

# Data types: `logical`

.pull-left45[

`logical` vectors are used to <high>*slice* data</high>, i.e., to select elements or rows. They are typically created from other vectors via <high>logical comparisons</high>.

```r
baselers$sex == "male"
```

```
## [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## [7]  TRUE FALSE
```

```r
# compare numeric values
baselers$age < 30
```

```
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
## [7] FALSE FALSE FALSE
```

]

.pull-right4[
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/vector_types.png"></img>
 ]

---

# Data types: `logical`

.pull-left45[

`logical` vectors are used to <high>*slice* data</high>, i.e., to select elements or rows. They are typically created from other vectors via <high>logical comparisons</high>.

Logical operators

<high>`==`</high> - is equal to 
<high>`<`</high>, <high>`>`</high> - smaller/greater than 
<high>`<=`</high>, <high>`>=`</high> - smaller/greater than or equal 
<high>`&`</high>, <high>`&&`</high> - logical AND 
<high>`|`</high>, <high>`||`</high> - logical OR

]

.pull-right4[
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/vector_types.png"></img>
 ]
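
The following sketch shows how comparisons produce logical vectors and how those vectors slice data. The data frame `people` and its values are hypothetical.

```r
# Hypothetical data, values chosen for illustration
people <- data.frame(
  sex = c("male", "male", "female", "male", "female"),
  age = c(44, 65, 31, 27, 24)
)

# A comparison returns one TRUE/FALSE per element
is_young <- people$age < 30
is_young

# Use a logical vector inside [ ] to keep only the TRUE elements
people$age[is_young]

# Combine comparisons element-wise with & (AND) and | (OR)
people[people$sex == "male" & people$age >= 40, ]
people[people$age < 25 | people$age > 60, ]
```

For slicing, use the element-wise `&` and `|`. The doubled forms `&&` and `||` compare single values and are meant for conditions such as `if (...)`.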

---

.pull-left45[

# Object Classes

1 - R's objects have <high>content and attributes</high>.
 
2 - Common attributes include <high>names</high>, <high>dimensions</high>, and the <high>class</high> (or type) of the object. 
<br2>
3 - <high>Classes</high> are critical because they determine <high>when and how objects can be used in functions</high>!

]
.pull-right45[
 
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/object_class.png"></img>
]
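
To see why the class matters, the sketch below passes the same (made-up) values to `mean()` once as `numeric` and once as `character`. The object names are chosen for illustration only.

```r
age_num <- c(44, 65, 31)
age_chr <- as.character(age_num)

class(age_num)
class(age_chr)

# mean() works on the numeric vector
mean(age_num)

# On a character vector, mean() warns and returns NA
result <- suppressWarnings(mean(age_chr))
result

# Attributes of a data frame: names, class, row.names
attributes(data.frame(x = 1:2, y = c("a", "b")))
```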

---

.pull-left4[

# Functions

Functions have 3 elements:

1 - <high>Name</high>: Used to refer to the function and call (execute) it.

2 - <high>Arguments</high>: Used to provide (data) inputs and to control what the function does. Arguments with default values (e.g., `use = "everything"`) need not be specified. Arguments without default values (e.g., `x`) must be specified. <high>Inputs must have the appropriate class!</high>

3 - <high>Body</high>: The code that uses the inputs (arguments) to produce the desired output. The code in the function's body works on <high>copies of the inputs</high>, which are named after the arguments.

]

.pull-right55[
 
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/function.png"></img>
]
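
The three elements can be seen in a small self-written function. This is only a sketch: `rounded_mean()` is a hypothetical helper, not a function from the course or from any package.

```r
# Hypothetical function: x has no default, digits defaults to 1
rounded_mean <- function(x, digits = 1) {
  # Changing x here only changes the function's copy
  x <- x[!is.na(x)]
  round(mean(x), digits)
}

ages <- c(44, 65, NA, 27)

rounded_mean(ages)              # uses the default digits = 1
rounded_mean(ages, digits = 0)  # overrides the default

# The original vector is unchanged, including its NA
ages

# args() shows the arguments and defaults of an existing function
args(cor)
```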

---

# Documentation

.pull-left5[

R documentation (<high>help files</high> and <high>vignettes</high>) will become very easy to use once you are familiar with the basic R vocabulary.

Pay attention to...

<high>Usage</high> - shows how to use the function, its arguments, and their defaults.  
<high>Arguments</high> - describes the arguments and their `class`.  
<high>Value</high> - describes what the function returns.  
<high>Examples</high> - provides working R code.

```r
# To access help files
?name_of_function

# search help files
??name_of_function
```

]

.pull-right5[

```r
?cor
```
<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/help_cor.png" width="500">
]

---

# Raw (structured) Data

.pull-left45[

<high>delim-separated data</high>
*.csv, .txt, etc.*

]

.pull-right45[

<high>markup data</high>
*.xml, .xls, .html, (.json), etc.*

]

---

# Delim-separated data

.pull-left45[

1 - Most typical file format.

2 - Requires <high>delimiter</high> to separate entries.

]

.pull-right45[

<high>delim-separated data</high>
*.csv, .txt, etc.*

]

---

# `readr`

`readr` is a `tidyverse` package that provides convenient functions to **read** *flat* (non-nested) data files into data frames (`tibble`s, to be precise):

.pull-left3[
 

 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/PNG/readr.png" height=200>

]

.pull-right65[

```r
# Importing data from a file

data <- read_csv(file, ...) # comma-delimited
data <- read_csv2(file, ...) # semicolon-delimited
data <- read_delim(file, ...) # arbitrary-delimited

# Writing a data frame to a file

write_csv(data_object, file, ...)    # comma-delimited
write_delim(data_object, file, ...)  # arbitrary-delimited
```
]
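
A minimal round trip, assuming the `readr` package is installed. The data frame `people` and the temporary file are made up for this sketch; `col_types = cols()` is used here only to silence the column specification message.

```r
library(readr)

# Hypothetical data frame and temporary file path
people <- data.frame(id = 1:3, name = c("Ann", "Ben", "Cem"))
path <- tempfile(fileext = ".csv")

# Write a comma-delimited file, then read it back in
write_csv(people, path)
people_in <- read_csv(path, col_types = cols())

people_in
```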

---

# Finding the file path

.pull-left4[

1 - Identify the file path using the <high>auto-complete</high>.

2 - Initiate auto-complete and browse through the folder structure by placing the cursor between two quotation marks and using the <high>tab key</high>.

3 - Auto-complete begins with the project folder - <high>place your data inside your project folder!</high>

]

.pull-right55[

]

---

# Identifying the delimiter

.pull-left5[

1 - <high>Find the file</high> on your hard drive. It should be in your data folder inside your project.

2 - <high>Open the file</high> in RStudio (right-click on the file in the *Files* pane) or in a text viewer, e.g., *TextEdit* (Mac), *TextWrangler* (Mac), or *WordPad* (Windows).
 
<img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/find_data.png">

]

.pull-right45[

<center>`baselers.csv`</center>

]

---

# Identifying the delimiter

.pull-left5[

1 - <high>Find the file</high> on your hard drive. It should be in your data folder inside your project.

2 - <high>Open the file</high> in RStudio (right-click on the file in the *Files* pane) or in a text viewer, e.g., *TextEdit* (Mac), *TextWrangler* (Mac), or *WordPad* (Windows).

```r
# Read with an explicit delimiter
baselers <- read_delim(file = ".../baselers.csv",
 delim = ",")
```

]

.pull-right45[

<center>`baselers.csv`</center>

]
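
The delimiter can also be checked from within R by printing the first lines of the file as raw text. The sketch below assumes `readr` is installed and uses a made-up semicolon-delimited file written to a temporary path.

```r
# Hypothetical semicolon-delimited file written to a temporary path
path <- tempfile(fileext = ".csv")
writeLines(c("id;sex;age", "1;male;44", "2;female;31"), path)

# Print the first lines as raw text to see the delimiter
first_lines <- readLines(path, n = 2)
first_lines

# Read the file with the delimiter identified above
people <- readr::read_delim(path, delim = ";",
                            col_types = readr::cols())
people
```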

---

# Handling headers

.pull-left5[

1 - `readr` functions typically expect the <high>column names</high> in the first line.

2 - If no column names are available, use the <high>`col_names` argument</high> to provide them.

```r
# Read with explicit column names
baselers <- read_csv(file = ".../baselers.csv",
 col_names = c("id",
 "age",
 ...))
```

]

.pull-right45[

<center>`baselers.csv`</center>

]
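
A runnable sketch of the `col_names` argument, assuming `readr` is installed. The headerless file and the column names are invented for illustration.

```r
library(readr)

# Hypothetical file without a header line
path <- tempfile(fileext = ".csv")
writeLines(c("1,male,44", "2,female,31"), path)

# Without col_names, the first data row would be used as the header
people <- read_csv(path,
                   col_names = c("id", "sex", "age"),
                   col_types = cols())
people
```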

---

# Handling data types

.pull-left5[

When reading in data, <high>`readr` infers the type of data</high> for each column.

```r
# Read baselers
read_csv(file = "1_Data/baselers.csv")
```

```
## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   sex = col_character(),
##   height = col_double(),
##   weight = col_double(),
##   income = col_double(),
##   education = col_character(),
##   confession = col_character(),
##   food = col_double(),
##   fasnacht = col_character(),
##   eyecor = col_character()
## )
```

```
## See spec(...) for full column specifications.
```

]

.pull-right45[

<center>`baselers.csv`</center>

]

---

# Handling data types

.pull-left5[

Incorrect data types can be fixed. Typically this involves:

1 - <high>Removing character elements</high> from otherwise numeric variables. <br2>
2 - Setting <high>explicit `NA` strings</high> using the `na`-argument. <br2>
3 - Re-running <high>`type_convert`</high>.

```r
# Read baselers with explicit NA strings
baselers <- read_csv(file = ".../baselers.csv",
 na = c('NA'))

# Try to fix incorrect data types
baselers <- type_convert(baselers)
```

]

.pull-right45[

<center> `baselers.csv` </center>

]
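
The steps above can be sketched on a small made-up file in which missing ages are coded as `"unknown"`. This assumes `readr` is installed; the file content and the `"unknown"` code are hypothetical.

```r
library(readr)

# Hypothetical file in which missing ages are coded as "unknown"
path <- tempfile(fileext = ".csv")
writeLines(c("id,age", "1,44", "2,unknown", "3,31"), path)

# Read everything as character first, so nothing is guessed
raw <- read_csv(path, col_types = cols(.default = col_character()))
class(raw$age)

# Declare "unknown" as a missing value and re-run type inference
fixed <- suppressMessages(type_convert(raw, na = c("NA", "unknown")))
class(fixed$age)
fixed$age
```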

---

# Other data

R provides <high>read and write functions</high> for practically all data file formats. See [rio](https://cran.r-project.org/web/packages/rio/vignettes/rio.html).

.pull-left45[
### `readr` <img src="http://d33wubrfki0l68.cloudfront.net/66d3133b4a19949d0b9ddb95fc48da074b69fb07/7dfb6/images/hex-readr.png" width="50" align="right">

```r
# read fixed width files (can be fast)
data <- read_fwf(file, ...)

# read Apache style log files
data <- read_log(file, ...)
```

### `haven` <img src="http://haven.tidyverse.org/logo.png" width="50" align="right">

```r
# read SAS's .sas7bdat and .sas7bcat files
data <- read_sas(file, ...)

# read SPSS's .sav files
data <- read_sav(file, ...)

# etc
```
]

.pull-right45[
### `readxl` <img src="https://www.rstudio.com/wp-content/uploads/2017/05/readxl-259x300.png" width="50" align="right">

```r
# read Excel's .xls and .xlsx files
data <- read_excel(file, ...)
```
 
### Other

```r
# Read Matlab .mat files
data <- R.matlab::readMat(file, ...)

# Read and wrangle .xml and .html
data <- XML::xmlParse(file, ...)

# from package jsonlite: read .json files
data <- jsonlite::read_json(file, ...)
```
]

---

# Remote databases

R provides <high>all necessary tools to pull data from or directly work with</high> remote databases, e.g., an `SQL` database. Find out more at:

<div class="center_text_2">
 
 <a href="https://db.rstudio.com/">db.rstudio.com</a>
 
</div>

---

# Practical

<a href="https://therbootcamp.github.io/Intro2DataScience_2018Oct/_sessions/Data/Data_practical.html">Link to practical</a>