Here is a link to the lecture slides for this session: LINK
In this practical you’ll learn how to read and save data. By the end of this practical you will know how to read data from and write data to csv, SAS, SPSS, and RDS files, and how to work with file connections.
Here are the main read-in functions:
Function | Description |
---|---|
read_csv() | Read flat csv file |
read_sas() | Read SAS file |
read_sav() | Read SPSS file |
readRDS() | Read RDS file |
file(..., 'r'), readLines() | Read from file connection |
Here are the main export functions:
Function | Description |
---|---|
write_csv() | Write flat csv file |
write_sas() | Write SAS file |
write_sav() | Write SPSS file |
saveRDS() | Save RDS file |
file(..., 'w'), writeLines() | Write to file connection |
This tutorial begins with the titanic data set, which contains records of 1313 passengers, including their name (Name), their passenger class (PClass), their age (Age), their sex (Sex), whether they survived (Survived), and a numeric coding of their sex (SexCode).
Later you will be working with a randomly generated artificial data set, which will help demonstrate the relationship between file size and speed.
One way to handle file paths is to set the working directory (see ?setwd()) or, even better, to use projects. You will learn about projects later. For now we will work with the full file path, that is, beginning from your hard drive or user account. Fortunately, this isn’t difficult either when using RStudio’s auto-complete functionality. To find a file path, simply write two quotation marks, i.e., "", place the cursor between them, and press the Tab key: RStudio will list the contents of the current working directory (see getwd()), which will usually be the level of your user account. Another way to begin at this location is to start the path with ~/. Try it out: write "~/" into an R script, place the cursor after the slash (/), and press Tab. Navigate to the titanic.csv file and assign its path to an object called titanic_csv_path (e.g., titanic_csv_path <- "~/folder_1/folder_2/…/titanic.csv").
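For instance, the result could look like the following sketch (the folder names are placeholders, not the actual location on your computer):
# Hypothetical path -- replace folder_1/folder_2 with your own folders
titanic_csv_path <- "~/folder_1/folder_2/titanic.csv"
file.exists(titanic_csv_path)   # TRUE only if the path points to an existing file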
OK, let’s read some data sets. First you need to load the packages readr and haven using library(), i.e., library(readr) and library(haven).
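If the packages are not installed yet, install them once with install.packages(); a minimal setup block (the install line is commented out because it only needs to run once):
# install.packages(c("readr", "haven"))  # only needed once
library(readr)   # read_csv(), write_csv()
library(haven)   # read_sas(), read_sav(), write_sas(), write_sav()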
Now, read in the three titanic data files “titanic.csv”, “titanic.sav”, and “titanic.sas7bdat” using read_csv(), read_sas(), and read_sav() by passing the respective file paths (i.e., data/filename.csv) and assigning the functions’ outputs to df_csv, df_sas, and df_sav, respectively. Inspect each of the imported objects using print(), str(), and typeof().
df_csv <- read_csv('data/titanic.csv')
Warning: Missing column names filled in: 'X1' [1]
Parsed with column specification:
cols(
X1 = col_integer(),
Name = col_character(),
PClass = col_character(),
Age = col_double(),
Sex = col_character(),
Survived = col_integer(),
SexCode = col_integer()
)
df_sas <- read_sas('data/titanic.sas7bdat')
df_sav <- read_sav('data/titanic.sav')
print(df_csv) ; print(df_sas) ; print(df_sav)
# A tibble: 1,313 x 7
X1 Name PClass Age Sex Survived SexCode
<int> <chr> <chr> <dbl> <chr> <int> <int>
1 1 Allen, Miss Elisabeth Walton 1st 29.0 fema… 1 1
2 2 Allison, Miss Helen Loraine 1st 2.00 fema… 0 1
3 3 Allison, Mr Hudson Joshua C… 1st 30.0 male 0 0
4 4 Allison, Mrs Hudson JC (Bes… 1st 25.0 fema… 0 1
5 5 Allison, Master Hudson Trev… 1st 0.920 male 1 0
6 6 Anderson, Mr Harry 1st 47.0 male 1 0
7 7 Andrews, Miss Kornelia Theo… 1st 63.0 fema… 1 1
8 8 Andrews, Mr Thomas, jr 1st 39.0 male 0 0
9 9 Appleton, Mrs Edward Dale (… 1st 58.0 fema… 1 1
10 10 Artagaveytia, Mr Ramon 1st 71.0 male 0 0
# ... with 1,303 more rows
# A tibble: 1,313 x 7
X1 Name PClass Age Sex Survived SexCode
<dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl>
1 1. Allen, Miss Elisabeth Walton 1st 29.0 fema… 1. 1.
2 2. Allison, Miss Helen Loraine 1st 2.00 fema… 0. 1.
3 3. Allison, Mr Hudson Joshua C… 1st 30.0 male 0. 0.
4 4. Allison, Mrs Hudson JC (Bes… 1st 25.0 fema… 0. 1.
5 5. Allison, Master Hudson Trev… 1st 0.920 male 1. 0.
6 6. Anderson, Mr Harry 1st 47.0 male 1. 0.
7 7. Andrews, Miss Kornelia Theo… 1st 63.0 fema… 1. 1.
8 8. Andrews, Mr Thomas, jr 1st 39.0 male 0. 0.
9 9. Appleton, Mrs Edward Dale (… 1st 58.0 fema… 1. 1.
10 10. Artagaveytia, Mr Ramon 1st 71.0 male 0. 0.
# ... with 1,303 more rows
# A tibble: 1,313 x 7
X1 Name PClass Age Sex Survived SexCode
<dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl>
1 1. Allen, Miss Elisabeth Walton 1st 29.0 fema… 1. 1.
2 2. Allison, Miss Helen Loraine 1st 2.00 fema… 0. 1.
3 3. Allison, Mr Hudson Joshua C… 1st 30.0 male 0. 0.
4 4. Allison, Mrs Hudson JC (Bes… 1st 25.0 fema… 0. 1.
5 5. Allison, Master Hudson Trev… 1st 0.920 male 1. 0.
6 6. Anderson, Mr Harry 1st 47.0 male 1. 0.
7 7. Andrews, Miss Kornelia Theo… 1st 63.0 fema… 1. 1.
8 8. Andrews, Mr Thomas, jr 1st 39.0 male 0. 0.
9 9. Appleton, Mrs Edward Dale (… 1st 58.0 fema… 1. 1.
10 10. Artagaveytia, Mr Ramon 1st 71.0 male 0. 0.
# ... with 1,303 more rows
str(df_csv) ; str(df_sas) ; str(df_sav)
Classes 'tbl_df', 'tbl' and 'data.frame': 1313 obs. of 7 variables:
$ X1 : int 1 2 3 4 5 6 7 8 9 10 ...
$ Name : chr "Allen, Miss Elisabeth Walton" "Allison, Miss Helen Loraine" "Allison, Mr Hudson Joshua Creighton" "Allison, Mrs Hudson JC (Bessie Waldo Daniels)" ...
$ PClass : chr "1st" "1st" "1st" "1st" ...
$ Age : num 29 2 30 25 0.92 47 63 39 58 71 ...
$ Sex : chr "female" "female" "male" "female" ...
$ Survived: int 1 0 0 0 1 1 1 0 1 0 ...
$ SexCode : int 1 1 0 1 0 0 1 0 1 0 ...
- attr(*, "spec")=List of 2
..$ cols :List of 7
.. ..$ X1 : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ Name : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ PClass : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ Age : list()
.. .. ..- attr(*, "class")= chr "collector_double" "collector"
.. ..$ Sex : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ Survived: list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ SexCode : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
..$ default: list()
.. ..- attr(*, "class")= chr "collector_guess" "collector"
..- attr(*, "class")= chr "col_spec"
Classes 'tbl_df', 'tbl' and 'data.frame': 1313 obs. of 7 variables:
$ X1 : num 1 2 3 4 5 6 7 8 9 10 ...
$ Name : chr "Allen, Miss Elisabeth Walton" "Allison, Miss Helen Loraine" "Allison, Mr Hudson Joshua Creighton" "Allison, Mrs Hudson JC (Bessie Waldo Daniels)" ...
$ PClass : chr "1st" "1st" "1st" "1st" ...
$ Age : num 29 2 30 25 0.92 47 63 39 58 71 ...
$ Sex : chr "female" "female" "male" "female" ...
$ Survived: num 1 0 0 0 1 1 1 0 1 0 ...
$ SexCode : num 1 1 0 1 0 0 1 0 1 0 ...
Classes 'tbl_df', 'tbl' and 'data.frame': 1313 obs. of 7 variables:
$ X1 : atomic 1 2 3 4 5 6 7 8 9 10 ...
..- attr(*, "format.spss")= chr "F8.0"
$ Name : atomic Allen, Miss Elisabeth Walton Allison, Miss Helen Loraine Allison, Mr Hudson Joshua Creighton Allison, Mrs Hudson | __truncated__ ...
..- attr(*, "format.spss")= chr "A62"
$ PClass : atomic 1st 1st 1st 1st ...
..- attr(*, "format.spss")= chr "A3"
$ Age : atomic 29 2 30 25 0.92 47 63 39 58 71 ...
..- attr(*, "format.spss")= chr "F8.2"
$ Sex : atomic female female male female ...
..- attr(*, "format.spss")= chr "A6"
$ Survived: atomic 1 0 0 0 1 1 1 0 1 0 ...
..- attr(*, "format.spss")= chr "F8.0"
$ SexCode : atomic 1 1 0 1 0 0 1 0 1 0 ...
..- attr(*, "format.spss")= chr "F8.0"
Now test whether the three imported data frames contain the same data, using == (introduced in the last tutorial) and all().
all(df_csv == df_sas)
[1] NA
all(df_csv == df_sav)
[1] NA
all(df_sas == df_sav)
[1] NA
The comparisons all return NA. This is because the data contain NA’s, that is, missing values. One way to see this is via the summary function summary(). Try it out for each of the data sets.
summary(df_csv)
X1 Name PClass Age
Min. : 1 Length:1313 Length:1313 Min. : 0.17
1st Qu.: 329 Class :character Class :character 1st Qu.:21.00
Median : 657 Mode :character Mode :character Median :28.00
Mean : 657 Mean :30.40
3rd Qu.: 985 3rd Qu.:39.00
Max. :1313 Max. :71.00
NA's :557
Sex Survived SexCode
Length:1313 Min. :0.0000 Min. :0.0000
Class :character 1st Qu.:0.0000 1st Qu.:0.0000
Mode :character Median :0.0000 Median :0.0000
Mean :0.3427 Mean :0.3519
3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000
summary(df_sas)
X1 Name PClass Age
Min. : 1 Length:1313 Length:1313 Min. : 0.17
1st Qu.: 329 Class :character Class :character 1st Qu.:21.00
Median : 657 Mode :character Mode :character Median :28.00
Mean : 657 Mean :30.40
3rd Qu.: 985 3rd Qu.:39.00
Max. :1313 Max. :71.00
NA's :557
Sex Survived SexCode
Length:1313 Min. :0.0000 Min. :0.0000
Class :character 1st Qu.:0.0000 1st Qu.:0.0000
Mode :character Median :0.0000 Median :0.0000
Mean :0.3427 Mean :0.3519
3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000
summary(df_sav)
X1 Name PClass Age
Min. : 1 Length:1313 Length:1313 Min. : 0.17
1st Qu.: 329 Class :character Class :character 1st Qu.:21.00
Median : 657 Mode :character Mode :character Median :28.00
Mean : 657 Mean :30.40
3rd Qu.: 985 3rd Qu.:39.00
Max. :1313 Max. :71.00
NA's :557
Sex Survived SexCode
Length:1313 Min. :0.0000 Min. :0.0000
Class :character 1st Qu.:0.0000 1st Qu.:0.0000
Mode :character Median :0.0000 Median :0.0000
Mean :0.3427 Mean :0.3519
3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000
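Beyond summary(), a quick way to count the missing values directly (not part of the original exercise, but handy) is to combine is.na() and colSums():
# Number of NA's per column; here only Age should contain missing values
colSums(is.na(df_csv))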
Fortunately, NA’s aren’t much of a problem. When you read ?all (i.e., the help file), you can see that all() has a second argument called na.rm, whose default is FALSE. Using this argument, you can tell R to ignore NA’s. Because the default is FALSE, this feature is deactivated. To activate it, pass na.rm = T to all() (as the second argument). Try it. Are the data frames now equal?
all(df_csv == df_sas, na.rm = T)
[1] TRUE
all(df_csv == df_sav, na.rm = T)
[1] TRUE
all(df_sas == df_sav, na.rm = T)
[1] TRUE
Every read function in R has a corresponding write function. In this section, you will write data to the three previous formats and to R’s own .RData and .RDS formats. Before turning to writing, however, you want to take care of an aspect associated with older read and write functions: rownames.
df_csv contains row numbers. Older write functions by default add a row number column to the data when writing it to disk. Newer functions, however, do not exhibit this behavior. Overall, rownames have gone a bit out of fashion. Thus, we want to eliminate the first column from df_csv using deletion by negative indices. Negative indices in subsetting functions such as [] or [[]] mean omit rather than select. E.g., c(1, 2, 3)[-2] returns the vector without the second value. Try to apply this to df_csv and assign the result to df_csv_no_rownames. You are dealing with a data frame. Should you use single [] or double [[]] brackets? Try both.
df_csv_no_rownames = df_csv[-1]
# df_csv_no_rownames = df_csv[[-1]]  # errors: [[ ]] extracts a single element and cannot take a negative index
Once you have removed the first column, write the reduced data frame to disk using write_csv(). Try using the exact same file path. Does R give you a warning?
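A minimal sketch of this step, assuming the same data/titanic.csv path used for reading above:
# Overwrite the original csv with the version without the row-number column
write_csv(df_csv_no_rownames, "data/titanic.csv")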
Now, let’s try some other formats using the other data set. First, read in my_data.csv (which you downloaded at the beginning of this practical) and assign it to my_data. Then, write my_data to disk using write_sas(), write_sav(), and also saveRDS(). Just like write_csv(), each of these functions takes the data frame as the first argument and the file path as the second. To obtain the latter, adapt the file path by hand and make sure it has the correct file ending, i.e., .sas7bdat, .sav, and .RDS, respectively. While you write the individual data sets, pay attention to the speed of execution. Which one was fastest? If you didn’t notice a difference, try again.
my_data = read_csv("data_files/my_data.csv")
Parsed with column specification:
cols(
id = col_character(),
var_1 = col_double(),
var_2 = col_double()
)
write_sas(my_data,"data_files/my_data.sas7bdat")
write_sav(my_data,"data_files/my_data.sav")
saveRDS(my_data,"data_files/my_data.RDS")
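If the speed difference is hard to notice, you can time the calls explicitly with system.time(); a small sketch reusing the paths from above:
# Compare the "elapsed" entries of the timings (in seconds)
system.time(write_sav(my_data, "data_files/my_data.sav"))
system.time(saveRDS(my_data, "data_files/my_data.RDS"))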
You probably noticed that saveRDS() was the slowest. This is because the .RDS format compresses the data much more heavily and, thus, also results in the smallest file size. To verify this, use the file.info() function on each of the four files, that is, on their file paths. Look out for the size element, which gives the size in bytes.
file.info("data_files/my_data.csv")
size isdir mode mtime
data_files/my_data.csv 86932096 FALSE 644 2018-04-04 18:14:22
ctime atime uid gid
data_files/my_data.csv 2018-04-04 18:14:30 2018-04-12 14:11:46 502 20
uname grname
data_files/my_data.csv dwulff staff
file.info("data_files/my_data.sas7bdat")
size isdir mode mtime
data_files/my_data.sas7bdat 40559616 FALSE 644 2018-04-12 14:11:42
ctime atime uid
data_files/my_data.sas7bdat 2018-04-12 14:11:42 2018-04-12 14:11:44 502
gid uname grname
data_files/my_data.sas7bdat 20 dwulff staff
file.info("data_files/my_data.sav")
size isdir mode mtime
data_files/my_data.sav 48000465 FALSE 644 2018-04-12 14:11:43
ctime atime uid gid
data_files/my_data.sav 2018-04-12 14:11:43 2018-04-12 14:11:45 502 20
uname grname
data_files/my_data.sav dwulff staff
file.info("data_files/my_data.RDS")
size isdir mode mtime
data_files/my_data.RDS 37462136 FALSE 644 2018-04-12 14:11:46
ctime atime uid gid
data_files/my_data.RDS 2018-04-12 14:11:46 2018-04-12 14:11:46 502 20
uname grname
data_files/my_data.RDS dwulff staff
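To compare all four sizes at once instead of reading each printout separately, note that file.info() is vectorized over paths; a small sketch:
# File sizes in megabytes, same paths as above
paths <- c("data_files/my_data.csv", "data_files/my_data.sas7bdat",
           "data_files/my_data.sav", "data_files/my_data.RDS")
file.info(paths)$size / 1e6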
Most read functions are not limited to reading from the hard drive. Often you can, for instance, also read directly from a URL. Try reading the titanic data straight from the web using read_csv().
my_data = read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/titanic.csv")
Warning: Missing column names filled in: 'X1' [1]
Parsed with column specification:
cols(
X1 = col_integer(),
Name = col_character(),
PClass = col_character(),
Age = col_double(),
Sex = col_character(),
Survived = col_integer(),
SexCode = col_integer()
)
The most basic way to handle files is via file connections. In this more advanced section we take a look at how this works.
File connections are created using file(). The main arguments to file() are a location (a file path, URL, etc.) and a mode indicator, e.g., 'r' for "open for reading in text mode". Try opening a connection to the titanic.csv data set. To do this, you must assign the output of file() to an object, which then becomes the connection (e.g., con = file("my_path", "r")). If you want to use the URL from above, use url() instead of file().
con = file('data_files/titanic.csv', 'r')
To read from the connection, use readLines(). readLines() iterates through the file line by line and returns each line as a character string. Try reading the file using readLines() and store the output in an object. When done, close the connection using close(my_con) and inspect the data.
lin = readLines(con)
Warning in readLines(con): incomplete final line found on 'data_files/
titanic.csv'
close(con)
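If you want to try the url() variant mentioned above, a minimal sketch using the GitHub URL from the earlier read_csv() example:
# Same idea as file(), but the connection points to a webpage
con_url <- url("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/titanic.csv", "r")
lin_url <- readLines(con_url)
close(con_url)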
You will notice that the raw lines in lin look a lot messier than the output of read_csv(). This is because read_csv() really does a lot of editing and interpreting for you. If you’re up for the challenge, you can try to process the raw data yourself. One approach is to first split each line using the str_split() function (package stringr from the tidyverse) with the comma , as the pattern argument. Applying str_split() will return a list of vectors, where each vector represents one row of the future data frame. Next, bind the rows to a matrix using do.call(rbind, my_list). The only thing then left to do is transform the matrix to a data frame (using as_data_frame() or, better, as_tibble()) and take care of column types and names.
require(stringi)
Loading required package: stringi
spl <- stri_split_fixed(lin,',')
mat <- do.call(rbind,spl)
Warning in (function (..., deparse.level = 1) : number of columns of result
is not a multiple of vector length (arg 1)
df <- as.tibble(mat)
# take care of types
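If you prefer the stringr route described in the text (the solution above uses stringi instead), a minimal sketch could look as follows. Note that naive comma splitting breaks quoted Name fields such as "Allen, Miss Elisabeth Walton", which is exactly what the rbind warning above reflects:
library(stringr)
library(tibble)
# Split each raw line at commas; quoted fields containing commas get split too
spl <- str_split(lin, pattern = ",")
# Bind the rows into a character matrix (ragged rows trigger the recycling warning)
mat <- do.call(rbind, spl)
# Convert to a tibble; column names and types still need to be fixed by hand
df <- as_tibble(as.data.frame(mat, stringsAsFactors = FALSE))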
Using url() you can establish a connection to a webpage, which you can then use to retrieve information from that page. One package that conveniently streamlines extracting information from webpages is the rvest package. To illustrate how easy it can be to scrape (aka extract) information from a webpage using R and rvest, run the following code. It downloads and parses the Milestones table from R’s Wikipedia page.
# load package
install.packages('rvest', repos = "https://stat.ethz.ch/CRAN/")
The downloaded binary packages are in
/var/folders/4j/gkx0z2kn1b5djq50kwgl2wdc0000gp/T//Rtmpzm1XKN/downloaded_packages
library(rvest)
Loading required package: xml2
Attaching package: 'rvest'
The following object is masked from 'package:purrr':
pluck
The following object is masked from 'package:readr':
guess_encoding
library(tibble)
library(magrittr)
Attaching package: 'magrittr'
The following object is masked from 'package:purrr':
set_names
The following object is masked from 'package:tidyr':
extract
# get html
url = 'https://en.wikipedia.org/wiki/R_(programming_language)'
page = read_html(url)
# get table
# use XPath from inspect page (e.g., in Chrome)
table = page %>% html_node(xpath = '//*[@id="mw-content-text"]/div/table[2]') %>% html_table()
# create tibble
as.tibble(table)
# A tibble: 13 x 3
Release Date Description
<chr> <chr> <chr>
1 0.16 "" "This is the last alpha version developed primarily…
2 0.49 1997-04-23 This is the oldest source release which is currentl…
3 0.60 1997-12-05 R becomes an official part of the GNU Project. The …
4 0.65.1 1999-10-07 First versions of update.packages and install.packa…
5 1.0 2000-02-29 Considered by its developers stable enough for prod…
6 1.4 2001-12-19 S4 methods are introduced and the first version for…
7 2.0 2004-10-04 Introduced lazy loading, which enables fast loading…
8 2.1 2005-04-18 Support for UTF-8 encoding, and the beginnings of i…
9 2.11 2010-04-22 Support for Windows 64 bit systems.
10 2.13 2011-04-14 Adding a new compiler function that allows speeding…
11 2.14 2011-10-31 Added mandatory namespaces for packages. Added a ne…
12 2.15 2012-03-30 New load balancing functions. Improved serializatio…
13 3.0 2013-04-03 Support for numeric index values 231 and larger on …
For more details on all steps of data analysis check out Hadley Wickham’s R for Data Science.
For more advanced content on objects check out Hadley Wickham’s Advanced R.
For more on pirates and data analysis check out the respective chapters in YaRrr! The Pirate’s Guide to R: LINK