
Data

R for Data Science
Basel R Bootcamp

February 2019

1 / 29

Data

In this session you will get to know...
  • R's 3 main data types
  • a little more about functions
  • R's Import/Export functions
2 / 29

3 Object types for data

R has 3 main data objects...

list - R's multi-purpose container

  • Can carry any data, incl. lists
  • Often used for function outputs

data_frame - R's spreadsheet

  • Specific type of list
  • Typical data format
  • For multi-variable data sets

vectors - R's data container

  • Actually carries the data
  • Each vector contains data of exactly one type

3 / 29

list




1 - Can carry any data, incl. lists, data_frames, vectors, etc.

2 - Are often used for function outputs

3 - Usually have named elements.

4 - Elements can be inspected via names() or str().

5 - Elements are (typically) selected by $.
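The points above can be tried on a small hand-made list. This is only a sketch: the object `person` and all of its values are invented for illustration and are not part of the course data.

```r
# A small list mixing a string, a numeric vector, and a nested list.
# All names and values are made up for illustration.
person <- list(
  name   = "Anna",
  scores = c(4.5, 5.0, 5.5),
  home   = list(city = "Basel", zip = 4051)
)

names(person)       # element names
str(person)         # compact overview of the structure

person$scores       # select an element with $
person$home$city    # $ can be chained for nested lists
person[["name"]]    # [[ ]] also selects a single element
```

Selecting with `$` returns the element itself, so `person$scores` is a plain numeric vector that can be passed on to other functions.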

4 / 29

list: Select element using $

# regression
reg_model <- lm(height ~ sex + age,
                data = baselers)
reg_results <- summary(reg_model)
# get element names
names(reg_results)
## [1] "call" "terms"
## [3] "residuals" "coefficients"
## [5] "aliased" "sigma"
## [7] "df" "r.squared"
# select element using $
reg_results$coefficients
## Estimate t value
## (Intercept) 164.171266 499.5339
## sexmale 13.993699 66.4724
## age -0.003753 -0.5819

5 / 29

data_frame



1 - Are lists containing vectors of equal length representing the variables.

2 - Contain vectors of different types: numeric, character, etc.

3 - Have named elements.

4 - Elements can be inspected via names(), str(), print(), View(), or skimr::skim().

5 - Elements are (typically) selected by $.

6 - Come in different flavors: data.frame(), data.table(), tibble().
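A minimal sketch of these properties, using base R's `data.frame()` so that no package is needed. The data set `people` is hypothetical and only mimics a few baselers columns.

```r
# Hypothetical mini data set with three variables of equal length
people <- data.frame(
  id  = 1:3,
  sex = c("male", "female", "male"),
  age = c(44, 31, 27),
  stringsAsFactors = FALSE
)

is.list(people)   # a data frame is a list of columns
lengths(people)   # every column has the same length
names(people)     # column names
str(people)       # type and first values of each column
people$age        # select one column with $
```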




6 / 29

Inspect content

# inspect baselers via print
baselers
## # A tibble: 10,000 x 20
## id sex age height weight income
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 male 44 174. 113. 6300
## 2 2 male 65 180. 75.2 10900
## 3 3 fema… 31 168. 55.5 5100
## 4 4 male 27 209 93.8 4200
## 5 5 male 24 177. NA 4000
## education confession children
## <chr> <chr> <dbl>
## 1 SEK_III catholic 2
## 2 obligato… confessio… 2
## 3 SEK_III <NA> 2
## 4 SEK_III catholic 2
## 5 SEK_III catholic 1
## # … with 9,995 more rows, and 11 more
## # variables




7 / 29

Inspect content

# View dataframe in a new window
View(baselers)




8 / 29

Select via $

# Access age column from baselers
baselers$age
## [1] 44 65 31 27 24 63 71 41 43 31 42 31
## [13] 38 49 39 54 78 62 88 74
# Access education column from baselers
baselers$education
## [1] "SEK_III"
## [2] "obligatory_school"
## [3] "SEK_III"
## [4] "SEK_III"
## [5] "SEK_III"
## [6] "SEK_III"
## [7] "SEK_III"
## [8] "SEK_III"
## [9] "apprenticeship"
## [10] "SEK_II"




9 / 29

Change/Add via $

# Divide income by 1000
baselers$income <- baselers$income / 1000
# inspect baselers
baselers
## # A tibble: 10,000 x 20
## id sex age height weight income
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 male 44 174. 113. 6.3
## 2 2 male 65 180. 75.2 10.9
## 3 3 fema… 31 168. 55.5 5.1
## 4 4 male 27 209 93.8 4.2
## 5 5 male 24 177. NA 4
## education confession children
## <chr> <chr> <dbl>
## 1 SEK_III catholic 2
## 2 obligato… confessio… 2
## 3 SEK_III <NA> 2
## 4 SEK_III catholic 2
## 5 SEK_III catholic 1
## # … with 9,995 more rows, and 11 more
## # variables
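The example above changes an existing column. Adding a column works the same way, as this sketch suggests. It assumes a hypothetical two-column stand-in for baselers, and the column name `income_k` is invented.

```r
# Hypothetical stand-in for baselers with just two columns
people <- data.frame(id = 1:3, income = c(6300, 10900, 5100))

# Assigning to a name that does not exist yet adds a new column
people$income_k <- people$income / 1000

# Assigning to an existing name overwrites that column
people$income <- people$income * 12

people
```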




10 / 29

vector

1 - R's basic and, in a way, only data container.

2 - Can contain only a single type of data and missing values.

3 - Data types

numeric - All numbers
character - All characters (e.g., names)
logical - TRUE or FALSE
  ...
NA - missing values
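A short sketch of the single-type rule, with made-up values: each vector below has one type, `NA` can appear in any of them, and mixing types in `c()` converts everything to the most general type.

```r
age   <- c(44, 65, NA)              # numeric with one missing value
sex   <- c("male", NA, "female")    # character with one missing value
adult <- c(TRUE, FALSE, NA)         # logical with one missing value

class(age)
class(sex)
class(adult)
is.na(age)                          # flags the missing entry

# Mixing types: all elements are converted to character
mixed <- c(44, "male", TRUE)
class(mixed)
mixed
```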

11 / 29

Select/Change/(Add) via [ ]

# extract vector containing age
age <- baselers$age
age
## [1] 44 65 31 27 24 63 71 41 43
# select value
age[2]
## [1] 65
# change value
age[2] <- 100
age
## [1] 44 100 31 27 24 63 71 41 43



12 / 29

Data types: numeric

numeric vectors are used to store numbers and only numbers.

baselers$age
## [1] 44 65 31 27 24 63 71 41 43
# evaluate class
class(baselers$age)
## [1] "numeric"
# is age numeric?
is.numeric(baselers$age)
## [1] TRUE

13 / 29

Data types: character

character vectors store text, i.e., data represented by letters and symbols, as well as any data that fits no other type.

When a character vector is printed, its values appear in quotation marks " ".

baselers$sex
## [1] "male" "male" "female" "male"
## [5] "male" "male" "male" "female"
baselers$education
## [1] "SEK_III"
## [2] "obligatory_school"
## [3] "SEK_III"
## [4] "SEK_III"

14 / 29

Data types: character

baselers$age
## [1] 44 65 31 27 24 63 71 41
# convert age to character
as.character(baselers$age)
## [1] "44" "65" "31" "27" "24" "63" "71"
## [8] "41" "43"

15 / 29

Data types: logical

logical vectors are used to slice data, i.e., to select elements or rows. They are typically created from other vectors via logical comparisons.

# which sex values are male?
baselers$sex == "male"
## [1] TRUE TRUE FALSE TRUE TRUE TRUE
## [7] TRUE FALSE
# which ages are less than 30?
baselers$age < 30
## [1] FALSE FALSE FALSE TRUE TRUE FALSE
## [7] FALSE FALSE FALSE

16 / 29

Data types: logical

Logical operators

== - is equal to

<, > - smaller/greater than

<=, >= - smaller/greater than or equal

&, && - logical AND

|, || - logical OR
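The operators above can be combined with `[ ]` to slice a vector. The following is a sketch on two short, made-up vectors standing in for baselers columns.

```r
# Hypothetical stand-ins for two baselers columns
age <- c(44, 65, 31, 27, 24)
sex <- c("male", "male", "female", "male", "male")

# A comparison returns one TRUE/FALSE per element
young <- age < 30
young

# A logical vector inside [ ] keeps the elements that are TRUE
age[young]

# & combines two logical vectors element by element
young_male <- age < 30 & sex == "male"
sum(young_male)    # TRUE counts as 1, so this counts the matches

# && works on single values only, e.g., for checks
length(age) > 0 && is.numeric(age)
```

For slicing, `&` and `|` are the ones to use, since they return one result per element. `&&` and `||` return a single TRUE or FALSE.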

17 / 29

Raw (structured) Data

delim-separated data .csv, .txt, etc.

markup data .xml, .xls, .html, (.json), etc.

19 / 29

Delim-separated data

1 - Most typical file format.

2 - Requires delimiter to separate entries.


delim-separated data .csv, .txt, etc.

20 / 29

readr

readr is a tidyverse package that provides convenient functions for reading flat (non-nested) data files into data frames (tibbles, to be precise):



# Importing data from a file
data <- read_csv(file, ...) # comma-delimited
data <- read_csv2(file, ...) # semicolon-delimited
data <- read_delim(file, ...) # arbitrary-delimited
# Writing a data frame to a file
write_csv(data_object, file, ...) # comma-delimited
write_delim(data_object, file, ...) # arbitrary-delimited
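As a sketch of how these functions fit together, the following writes a small made-up data frame to temporary files and reads it back. It assumes readr is installed. The object `people` and the temporary file names are hypothetical.

```r
library(readr)

# Hypothetical mini data set
people <- data.frame(
  id  = 1:3,
  sex = c("male", "female", "male"),
  age = c(44, 31, 27)
)

# Comma-delimited: write, then read back as a tibble
csv_file <- tempfile(fileext = ".csv")
write_csv(people, csv_file)
from_csv <- read_csv(csv_file)

# Tab-delimited: the delimiter is passed via delim
tsv_file <- tempfile(fileext = ".txt")
write_delim(people, tsv_file, delim = "\t")
from_tsv <- read_delim(tsv_file, delim = "\t")

from_csv
```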
21 / 29

Finding the file path

1 - Identify the file path using auto-complete.

2 - Initiate auto-complete and browse through the folder structure by placing the cursor between two quotation marks and using the tab key.

3 - Auto-complete begins with the project folder - place your data inside your project folder!

22 / 29

Identifying the delimiter

1 - Find the file on your hard drive. Should be in your data folder inside your project.

2 - Open the file in RStudio (right-click on the file in the Files pane) or in a text viewer, e.g., TextEdit (Mac), TextWrangler (Mac), WordPad (Windows).
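The first lines of a file can also be inspected from within R. This sketch uses base R's `readLines()` on a made-up, semicolon-separated file written to a temporary location. The file contents are hypothetical.

```r
# Hypothetical example file in a temporary location
example_file <- tempfile(fileext = ".csv")
writeLines(c("id;sex;age", "1;male;44", "2;female;31"), example_file)

# Print the first lines as raw text to see which character separates entries
first_lines <- readLines(example_file, n = 2)
first_lines
```

Here the entries are separated by `;`, so `read_csv2()` or `read_delim()` with `delim = ";"` would be the matching readr call.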


baselers.csv

23 / 29

Identifying the delimiter


# Read with an explicit delimiter
baselers <- read_delim(file = ".../baselers.csv",
                       delim = ",")
baselers.csv

24 / 29

Handling headers

1 - readr functions typically expect the column names in the first line.

2 - If no column names are available, use the col_names argument to provide them.


# Read with explicit column names
baselers <- read_csv(file = ".../baselers.csv",
                     col_names = c("id",
                                   "age",
                                   ...))
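
A runnable variant of the same idea, assuming readr is installed. The two-column file without a header line is made up and written to a temporary location.

```r
library(readr)

# Hypothetical file without a header line
no_header <- tempfile(fileext = ".csv")
writeLines(c("1,44", "2,65"), no_header)

# Without col_names, the first data row would be used as column names
with_names <- read_csv(no_header, col_names = c("id", "age"))
with_names
```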
baselers.csv

25 / 29

Handling data types

When reading data, readr infers the data type of each column.

# Read baselers
read_csv(file = "1_Data/baselers.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## sex = col_character(),
## education = col_character(),
## confession = col_character(),
## fasnacht = col_character(),
## eyecor = col_character()
## )
## See spec(...) for full column specifications.
## # A tibble: 10,000 x 20
## id sex age height weight income
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 male 44 174. 113. 6300
## 2 2 male 65 180. 75.2 10900
## 3 3 fema… 31 168. 55.5 5100
## 4 4 male 27 209 93.8 4200
## 5 5 male 24 177. NA 4000
## education confession children
## <chr> <chr> <dbl>
## 1 SEK_III catholic 2
## 2 obligato… confessio… 2
## 3 SEK_III <NA> 2
## 4 SEK_III catholic 2
## 5 SEK_III catholic 1
## # … with 9,995 more rows, and 11 more
## # variables
baselers.csv

26 / 29

Handling data types

Incorrect data types can be fixed. Typically this involves:

1 - Removing character elements from otherwise numeric variables.
2 - Setting explicit NA strings using the na argument.
3 - Re-running type_convert().

# Read baselers
baselers <- read_csv(file = ".../baselers.csv",
                     na = c('NA'))
# Try to fix incorrect data types
baselers <- type_convert(baselers)
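
A self-contained sketch of this kind of fix, assuming readr is installed. The data frame `people` and the placeholder string "unknown" are invented to mimic a numeric column that was read as character.

```r
library(readr)

# Hypothetical data: age is character because of the entry "unknown"
people <- data.frame(
  id  = c("1", "2", "3"),
  age = c("44", "unknown", "27"),
  stringsAsFactors = FALSE
)
class(people$age)

# Declare "unknown" as a missing-value string and re-infer the column types
fixed <- type_convert(people, na = c("", "NA", "unknown"))
class(fixed$age)
fixed$age
```

`type_convert()` only re-parses character columns, so columns that already have the right type are left alone.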
baselers.csv

27 / 29

Other data

R provides read and write functions for practically all data file formats. See rio.

readr

# read fixed width files (can be fast)
data <- read_fwf(file, ...)
# read Apache style log files
data <- read_log(file, ...)

haven

# read SAS's .sas7bdat and .sas7bcat files
data <- read_sas(file, ...)
# read SPSS's .sav files
data <- read_sav(file, ...)
# etc

readxl

# read Excel's .xls and .xlsx files
data <- read_excel(file, ...)


Other

# Read Matlab .mat files
data <- R.matlab::readMat(file, ...)
# Read and wrangle .xml and .html
data <- XML::xmlParse(file, ...)
# from package jsonlite: read .json files
data <- jsonlite::read_json(file, ...)
28 / 29
