Slides

Here is a link to the lecture slides for this session: LINK

Overview

In this practical you’ll learn how to read and save data. By the end of this practical you will know how to:

  1. Identify the location of a file (on your hard drive)
  2. Read in data of various types
  3. Use R’s file connections
  4. Scrape the internet (advanced)

Functions

Here are the main read-in functions:

Function Description
read_csv() Read flat csv file
read_sas() Read SAS file
read_sav() Read SPSS file
readRDS() Read RDS file
file(...,'r'), readLines Read from file connection

Here are the main export functions:

Function Description
write_csv() Write flat csv file
write_sas() Write SAS file
write_sav() Write SPSS file
saveRDS() Save RDS file
file(...,'w'), writeLines Write to file connection
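
To see how the read and write pairs line up in practice, here is a quick sketch, assuming a tibble df and a data/ folder (both placeholders):

library(readr)
library(haven)

# each export function has a matching read-in function
write_csv(df, "data/df.csv");      df_csv <- read_csv("data/df.csv")
write_sav(df, "data/df.sav");      df_sav <- read_sav("data/df.sav")
write_sas(df, "data/df.sas7bdat"); df_sas <- read_sas("data/df.sas7bdat")
saveRDS(df, "data/df.RDS");        df_rds <- readRDS("data/df.RDS")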

Tasks

This tutorial begins with the titanic data set, which contains records of 1,313 passengers: their name (Name), passenger class (PClass), age (Age), sex (Sex), whether they survived (Survived), and a numeric coding of their sex (SexCode).

Later you will be working with a randomly generated artificial data set, which will help demonstrate the relationship between file size and read/write speed.

Identify the (file)path

  1. Every read or write function requires that you provide a file path or source (e.g., a URL) specifying the location of the file. Two ways to make finding the right file path easier are to set the working directory (see ?setwd()) or, even better, to use projects; you will learn about projects later. For now we will work with the full file path, that is, beginning from your hard drive or user account. Fortunately, this isn't difficult either when using RStudio's auto-complete functionality. To find a file path, simply write two quotation marks, i.e., "", place the cursor between them, and press Tab. This opens an auto-complete dialog that lets you search level by level through your folder structure and identify the correct file path. The auto-complete always begins at the current working directory (see ?getwd()), which will usually be the level of your user account. Another way to begin at this location is to start the path with ~/. Try it out: write "~/" into an R script, place the cursor after the slash (/), and press Tab. Now, try to find the location of each of the titanic data sets on your hard drive and assign the path (which is a character string) to an object, e.g., titanic_csv_path (titanic_path <- "~/folder_1/folder_2/…/titanic.csv"), as sketched below.
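
A quick sketch of that last step, with a placeholder path that will differ on your machine:

# placeholder path - replace it with the path you found via auto-complete
titanic_csv_path <- "~/folder_1/folder_2/titanic.csv"

# check that the path actually points to an existing file
file.exists(titanic_csv_path)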

Read to tibble

  1. OK, let’s read some data sets. First you need to load the packages readr and haven using library(), i.e., library(readr) and library(haven).

  2. Now, read in the three titanic data files “titanic.csv”, “titanic.sas7bdat”, and “titanic.sav” using read_csv(), read_sas(), and read_sav() by passing the respective file paths (e.g., data/filename.csv) and assigning the functions’ outputs to df_csv, df_sas, and df_sav, respectively. Inspect each of the imported objects using print(), str(), and typeof().

df_csv <- read_csv('data/titanic.csv')
Warning: Missing column names filled in: 'X1' [1]
Parsed with column specification:
cols(
  X1 = col_integer(),
  Name = col_character(),
  PClass = col_character(),
  Age = col_double(),
  Sex = col_character(),
  Survived = col_integer(),
  SexCode = col_integer()
)
df_sas <- read_sas('data/titanic.sas7bdat')
df_sav <- read_sav('data/titanic.sav')
print(df_csv) ; print(df_sas) ; print(df_sav)
# A tibble: 1,313 x 7
      X1 Name                         PClass    Age Sex   Survived SexCode
   <int> <chr>                        <chr>   <dbl> <chr>    <int>   <int>
 1     1 Allen, Miss Elisabeth Walton 1st    29.0   fema…        1       1
 2     2 Allison, Miss Helen Loraine  1st     2.00  fema…        0       1
 3     3 Allison, Mr Hudson Joshua C… 1st    30.0   male         0       0
 4     4 Allison, Mrs Hudson JC (Bes… 1st    25.0   fema…        0       1
 5     5 Allison, Master Hudson Trev… 1st     0.920 male         1       0
 6     6 Anderson, Mr Harry           1st    47.0   male         1       0
 7     7 Andrews, Miss Kornelia Theo… 1st    63.0   fema…        1       1
 8     8 Andrews, Mr Thomas, jr       1st    39.0   male         0       0
 9     9 Appleton, Mrs Edward Dale (… 1st    58.0   fema…        1       1
10    10 Artagaveytia, Mr Ramon       1st    71.0   male         0       0
# ... with 1,303 more rows
# A tibble: 1,313 x 7
      X1 Name                         PClass    Age Sex   Survived SexCode
   <dbl> <chr>                        <chr>   <dbl> <chr>    <dbl>   <dbl>
 1    1. Allen, Miss Elisabeth Walton 1st    29.0   fema…       1.      1.
 2    2. Allison, Miss Helen Loraine  1st     2.00  fema…       0.      1.
 3    3. Allison, Mr Hudson Joshua C… 1st    30.0   male        0.      0.
 4    4. Allison, Mrs Hudson JC (Bes… 1st    25.0   fema…       0.      1.
 5    5. Allison, Master Hudson Trev… 1st     0.920 male        1.      0.
 6    6. Anderson, Mr Harry           1st    47.0   male        1.      0.
 7    7. Andrews, Miss Kornelia Theo… 1st    63.0   fema…       1.      1.
 8    8. Andrews, Mr Thomas, jr       1st    39.0   male        0.      0.
 9    9. Appleton, Mrs Edward Dale (… 1st    58.0   fema…       1.      1.
10   10. Artagaveytia, Mr Ramon       1st    71.0   male        0.      0.
# ... with 1,303 more rows
# A tibble: 1,313 x 7
      X1 Name                         PClass    Age Sex   Survived SexCode
   <dbl> <chr>                        <chr>   <dbl> <chr>    <dbl>   <dbl>
 1    1. Allen, Miss Elisabeth Walton 1st    29.0   fema…       1.      1.
 2    2. Allison, Miss Helen Loraine  1st     2.00  fema…       0.      1.
 3    3. Allison, Mr Hudson Joshua C… 1st    30.0   male        0.      0.
 4    4. Allison, Mrs Hudson JC (Bes… 1st    25.0   fema…       0.      1.
 5    5. Allison, Master Hudson Trev… 1st     0.920 male        1.      0.
 6    6. Anderson, Mr Harry           1st    47.0   male        1.      0.
 7    7. Andrews, Miss Kornelia Theo… 1st    63.0   fema…       1.      1.
 8    8. Andrews, Mr Thomas, jr       1st    39.0   male        0.      0.
 9    9. Appleton, Mrs Edward Dale (… 1st    58.0   fema…       1.      1.
10   10. Artagaveytia, Mr Ramon       1st    71.0   male        0.      0.
# ... with 1,303 more rows
str(df_csv) ; str(df_sas) ; str(df_sav)
Classes 'tbl_df', 'tbl' and 'data.frame':   1313 obs. of  7 variables:
 $ X1      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Name    : chr  "Allen, Miss Elisabeth Walton" "Allison, Miss Helen Loraine" "Allison, Mr Hudson Joshua Creighton" "Allison, Mrs Hudson JC (Bessie Waldo Daniels)" ...
 $ PClass  : chr  "1st" "1st" "1st" "1st" ...
 $ Age     : num  29 2 30 25 0.92 47 63 39 58 71 ...
 $ Sex     : chr  "female" "female" "male" "female" ...
 $ Survived: int  1 0 0 0 1 1 1 0 1 0 ...
 $ SexCode : int  1 1 0 1 0 0 1 0 1 0 ...
 - attr(*, "spec")=List of 2
  ..$ cols   :List of 7
  .. ..$ X1      : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ Name    : list()
  .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
  .. ..$ PClass  : list()
  .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
  .. ..$ Age     : list()
  .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
  .. ..$ Sex     : list()
  .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
  .. ..$ Survived: list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ SexCode : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  ..$ default: list()
  .. ..- attr(*, "class")= chr  "collector_guess" "collector"
  ..- attr(*, "class")= chr "col_spec"
Classes 'tbl_df', 'tbl' and 'data.frame':   1313 obs. of  7 variables:
 $ X1      : num  1 2 3 4 5 6 7 8 9 10 ...
 $ Name    : chr  "Allen, Miss Elisabeth Walton" "Allison, Miss Helen Loraine" "Allison, Mr Hudson Joshua Creighton" "Allison, Mrs Hudson JC (Bessie Waldo Daniels)" ...
 $ PClass  : chr  "1st" "1st" "1st" "1st" ...
 $ Age     : num  29 2 30 25 0.92 47 63 39 58 71 ...
 $ Sex     : chr  "female" "female" "male" "female" ...
 $ Survived: num  1 0 0 0 1 1 1 0 1 0 ...
 $ SexCode : num  1 1 0 1 0 0 1 0 1 0 ...
Classes 'tbl_df', 'tbl' and 'data.frame':   1313 obs. of  7 variables:
 $ X1      : atomic  1 2 3 4 5 6 7 8 9 10 ...
  ..- attr(*, "format.spss")= chr "F8.0"
 $ Name    : atomic  Allen, Miss Elisabeth Walton Allison, Miss Helen Loraine Allison, Mr Hudson Joshua Creighton Allison, Mrs Hudson | __truncated__ ...
  ..- attr(*, "format.spss")= chr "A62"
 $ PClass  : atomic  1st 1st 1st 1st ...
  ..- attr(*, "format.spss")= chr "A3"
 $ Age     : atomic  29 2 30 25 0.92 47 63 39 58 71 ...
  ..- attr(*, "format.spss")= chr "F8.2"
 $ Sex     : atomic  female female male female ...
  ..- attr(*, "format.spss")= chr "A6"
 $ Survived: atomic  1 0 0 0 1 1 1 0 1 0 ...
  ..- attr(*, "format.spss")= chr "F8.0"
 $ SexCode : atomic  1 1 0 1 0 0 1 0 1 0 ...
  ..- attr(*, "format.spss")= chr "F8.0"
  3. Verify that the three data sets are identical. To do this, use the is-equal-to operator == introduced in the last tutorial together with all().
all(df_csv == df_sas)
[1] NA
all(df_csv == df_sav)
[1] NA
all(df_sas == df_sav)
[1] NA
  4. Received NA instead of TRUE or FALSE? This is because the data sets contain NAs, that is, missing values. One way to see this is via the summary function summary(). Try it out for each of the data sets; a sketch for counting the NAs directly follows the summaries below.
summary(df_csv)
       X1           Name              PClass               Age       
 Min.   :   1   Length:1313        Length:1313        Min.   : 0.17  
 1st Qu.: 329   Class :character   Class :character   1st Qu.:21.00  
 Median : 657   Mode  :character   Mode  :character   Median :28.00  
 Mean   : 657                                         Mean   :30.40  
 3rd Qu.: 985                                         3rd Qu.:39.00  
 Max.   :1313                                         Max.   :71.00  
                                                      NA's   :557    
     Sex               Survived         SexCode      
 Length:1313        Min.   :0.0000   Min.   :0.0000  
 Class :character   1st Qu.:0.0000   1st Qu.:0.0000  
 Mode  :character   Median :0.0000   Median :0.0000  
                    Mean   :0.3427   Mean   :0.3519  
                    3rd Qu.:1.0000   3rd Qu.:1.0000  
                    Max.   :1.0000   Max.   :1.0000  
                                                     
summary(df_sas)
       X1           Name              PClass               Age       
 Min.   :   1   Length:1313        Length:1313        Min.   : 0.17  
 1st Qu.: 329   Class :character   Class :character   1st Qu.:21.00  
 Median : 657   Mode  :character   Mode  :character   Median :28.00  
 Mean   : 657                                         Mean   :30.40  
 3rd Qu.: 985                                         3rd Qu.:39.00  
 Max.   :1313                                         Max.   :71.00  
                                                      NA's   :557    
     Sex               Survived         SexCode      
 Length:1313        Min.   :0.0000   Min.   :0.0000  
 Class :character   1st Qu.:0.0000   1st Qu.:0.0000  
 Mode  :character   Median :0.0000   Median :0.0000  
                    Mean   :0.3427   Mean   :0.3519  
                    3rd Qu.:1.0000   3rd Qu.:1.0000  
                    Max.   :1.0000   Max.   :1.0000  
                                                     
summary(df_sav)
       X1           Name              PClass               Age       
 Min.   :   1   Length:1313        Length:1313        Min.   : 0.17  
 1st Qu.: 329   Class :character   Class :character   1st Qu.:21.00  
 Median : 657   Mode  :character   Mode  :character   Median :28.00  
 Mean   : 657                                         Mean   :30.40  
 3rd Qu.: 985                                         3rd Qu.:39.00  
 Max.   :1313                                         Max.   :71.00  
                                                      NA's   :557    
     Sex               Survived         SexCode      
 Length:1313        Min.   :0.0000   Min.   :0.0000  
 Class :character   1st Qu.:0.0000   1st Qu.:0.0000  
 Mode  :character   Median :0.0000   Median :0.0000  
                    Mean   :0.3427   Mean   :0.3519  
                    3rd Qu.:1.0000   3rd Qu.:1.0000  
                    Max.   :1.0000   Max.   :1.0000  
                                                     
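One way to count the missing values directly, as a small sketch using base R's is.na():

# number of NAs per column of the csv version
colSums(is.na(df_csv))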
  5. Fortunately, NAs aren’t much of a problem. When you read ?all (i.e., the help file), you can see that all() has a second argument called na.rm = FALSE. Using this argument, you can tell R to ignore NAs. The default is FALSE, so this feature is deactivated; to activate it, you must pass na.rm = T to all() (as the second argument). Try it. Are the data frames now equal?
all(df_csv == df_sas, na.rm = T)
[1] TRUE
all(df_csv == df_sav, na.rm = T)
[1] TRUE
all(df_sas == df_sav, na.rm = T)
[1] TRUE

Write to disk

Every read function in R has a corresponding write function. In this section, you will write data to the three previous formats and to R’s own .RDS format. Before turning to writing, however, you want to take care of an aspect associated with older read and write functions: row names.

  1. As you can see, the first column in, e.g., df_csv contains row numbers. Older write functions add, by default, a row-number column to the data when writing to disk. Newer functions do not exhibit this behavior; overall, row names have fallen somewhat out of fashion. Thus, we want to eliminate the first column from df_csv using deletion by negative indices. A negative index in subsetting functions such as [] or [[]] means omit rather than select; e.g., c(1, 2, 3)[-2] returns the vector without the second value. Try to apply this to df_csv and assign the result to df_csv_no_rownames. You are dealing with a data frame: should you use single [] or double [[]] brackets? Try both.
# single brackets drop the first column but keep the data frame
df_csv_no_rownames = df_csv[-1]
# double brackets try to extract a single column, so a negative index fails
#df_csv_no_rownames = df_csv[[-1]]
  2. Once you have removed the first column, write the reduced data frame to disk using write_csv(). Try using the exact same file path as before (see the sketch below). Does R give you a warning?
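
A minimal sketch of that call, assuming the same data/ folder as above:

# write the reduced data frame back to the original path and watch whether R complains
write_csv(df_csv_no_rownames, 'data/titanic.csv')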

  3. Now, let’s try some other formats using the other data set. First, read in my_data.csv (which you downloaded at the beginning of this practical) and assign it to my_data. Then, write my_data to disk using write_sas(), write_sav(), and also saveRDS(). Just like write_csv(), each of these functions takes the data frame as the first argument and the file path as the second. For the latter, adapt the file path by hand so that it has the correct file ending, i.e., .sas7bdat, .sav, and .RDS, respectively. While you write the individual data sets, pay attention to the speed of execution (a timing sketch follows the code below). Which one was fastest? If you didn’t notice a difference, try again.

my_data = read_csv("data_files/my_data.csv")
Parsed with column specification:
cols(
  id = col_character(),
  var_1 = col_double(),
  var_2 = col_double()
)
write_sas(my_data,"data_files/my_data.sas7bdat")
write_sav(my_data,"data_files/my_data.sav")
saveRDS(my_data,"data_files/my_data.RDS")
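If the speed difference is hard to judge by eye, one option is to wrap each call in system.time(); a sketch with the same paths as above:

# elapsed time (in seconds) of each write
system.time(write_sas(my_data, "data_files/my_data.sas7bdat"))
system.time(write_sav(my_data, "data_files/my_data.sav"))
system.time(saveRDS(my_data, "data_files/my_data.RDS"))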
  4. You may have noticed that the saveRDS() function was the slowest. This is because .RDS compresses the data much more heavily and, thus, also results in the smallest file sizes. To verify this, use the file.info() function on each of the four files, that is, on their file paths. Look out for the size element, which gives the size in bytes; a sketch for comparing the sizes in one call follows the outputs below.
file.info("data_files/my_data.csv")
                           size isdir mode               mtime
data_files/my_data.csv 86932096 FALSE  644 2018-04-04 18:14:22
                                     ctime               atime uid gid
data_files/my_data.csv 2018-04-04 18:14:30 2018-04-12 14:11:46 502  20
                        uname grname
data_files/my_data.csv dwulff  staff
file.info("data_files/my_data.sas7bdat")
                                size isdir mode               mtime
data_files/my_data.sas7bdat 40559616 FALSE  644 2018-04-12 14:11:42
                                          ctime               atime uid
data_files/my_data.sas7bdat 2018-04-12 14:11:42 2018-04-12 14:11:44 502
                            gid  uname grname
data_files/my_data.sas7bdat  20 dwulff  staff
file.info("data_files/my_data.sav")
                           size isdir mode               mtime
data_files/my_data.sav 48000465 FALSE  644 2018-04-12 14:11:43
                                     ctime               atime uid gid
data_files/my_data.sav 2018-04-12 14:11:43 2018-04-12 14:11:45 502  20
                        uname grname
data_files/my_data.sav dwulff  staff
file.info("data_files/my_data.RDS")
                           size isdir mode               mtime
data_files/my_data.RDS 37462136 FALSE  644 2018-04-12 14:11:46
                                     ctime               atime uid gid
data_files/my_data.RDS 2018-04-12 14:11:46 2018-04-12 14:11:46 502  20
                        uname grname
data_files/my_data.RDS dwulff  staff
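To compare the sizes more directly, note that file.info() also accepts a vector of paths; a small sketch:

# file sizes in bytes, in the order csv, sas7bdat, sav, RDS
paths <- c("data_files/my_data.csv", "data_files/my_data.sas7bdat",
           "data_files/my_data.sav", "data_files/my_data.RDS")
file.info(paths)$size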

Read from url

Most read functions are not limited to reading from the hard drive. Often you can, for instance, also read directly from a URL.

  1. Read the titanic data set from this link https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/titanic.csv using read_csv().
my_data = read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/titanic.csv")
Warning: Missing column names filled in: 'X1' [1]
Parsed with column specification:
cols(
  X1 = col_integer(),
  Name = col_character(),
  PClass = col_character(),
  Age = col_double(),
  Sex = col_character(),
  Survived = col_integer(),
  SexCode = col_integer()
)

File connections

The most basic way to handle files is via file connections. In this more advanced section we take a look at how this works.

  1. The first step in working with file connections is to establish a connection using file(). The main arguments to file() are a location (a file path, URL, etc.) and a mode indicator, e.g., 'r' for "open for reading in text mode". Try opening a connection to the titanic.csv data set. To do this, you must assign the output of file() to an object, which then becomes the connection (e.g., con = file("my_path", "r")). If you want to use the URL from above, use url() instead of file() (a sketch follows the code below).
con = file('data_files/titanic.csv', 'r')
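If you prefer the URL route, the pattern is the same, only with url(); a sketch using the link from the previous section:

# open a read connection to the remote csv instead of a local file
con_url <- url("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/titanic.csv", "r")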
  2. Now that the connection has been established, you can read the file using readLines(). readLines() iterates through the file line by line and returns each line as a character string. Try reading the file using readLines() and store the output in an object. When done, close the connection using close(my_con) and inspect the data.
lin = readLines(con)
Warning in readLines(con): incomplete final line found on 'data_files/
titanic.csv'
close(con)
  3. As you can see, the read-in data looks much messier than when using read_csv(). This is because read_csv() does a lot of editing and interpreting for you. If you’re up for the challenge, you can try to process the raw data yourself. One approach is to first split each line with str_split() (package stringr from the tidyverse), using the comma , as the pattern argument. Applying str_split() returns a list of vectors, where each vector represents one row of the future data frame. Next, bind the rows into a matrix using do.call(rbind, my_list). The only thing then left for you to do is to transform the matrix into a data frame (using as_data_frame() or, better, as_tibble()) and to take care of column types and names (a sketch for that step follows the code below).
require(stringi)
Loading required package: stringi
# stri_split_fixed() from stringi serves the same purpose as stringr's str_split()
spl <- stri_split_fixed(lin,',')
mat <- do.call(rbind,spl)
Warning in (function (..., deparse.level = 1) : number of columns of result
is not a multiple of vector length (arg 1)
df  <- as.tibble(mat)
# take care of types
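The "# take care of types" step is left open above. One possible sketch for finishing it, assuming the split produced the expected number of columns with the header in the first row (which the rbind warning suggests may not quite hold here), using readr's type_convert() to guess column types:

# use the header row for column names, then drop it
names(df) <- mat[1, ]
df <- df[-1, ]

# let readr guess sensible types for the remaining all-character columns
df <- readr::type_convert(df)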

Scraping the internet

  1. Finally, on to something completely new. As mentioned, connections can also be established to data located outside your own hard drive, such as, for instance, the world wide web. Specifically, using url() you can establish a connection to a webpage, which you can then use to retrieve information from that webpage. One package that conveniently streamlines extracting information from webpages is the rvest package. To illustrate how easy it can be to scrape (aka extract) information from a webpage using R and rvest, run the following code. It downloads and parses the Milestones table from R’s Wikipedia page.
# install and load package
install.packages('rvest', repos = "https://stat.ethz.ch/CRAN/")

The downloaded binary packages are in
    /var/folders/4j/gkx0z2kn1b5djq50kwgl2wdc0000gp/T//Rtmpzm1XKN/downloaded_packages
library(rvest)
Loading required package: xml2

Attaching package: 'rvest'
The following object is masked from 'package:purrr':

    pluck
The following object is masked from 'package:readr':

    guess_encoding
library(tibble)
library(magrittr)

Attaching package: 'magrittr'
The following object is masked from 'package:purrr':

    set_names
The following object is masked from 'package:tidyr':

    extract
# get html
url = 'https://en.wikipedia.org/wiki/R_(programming_language)' 
page = read_html(url)

# get table
# use XPath from inspect page (e.g., in Chrome)
table = page %>% html_node(xpath = '//*[@id="mw-content-text"]/div/table[2]') %>% html_table()

# create tibble
as.tibble(table)
# A tibble: 13 x 3
   Release Date       Description                                         
   <chr>   <chr>      <chr>                                               
 1 0.16    ""         "This is the last alpha version developed primarily…
 2 0.49    1997-04-23 This is the oldest source release which is currentl…
 3 0.60    1997-12-05 R becomes an official part of the GNU Project. The …
 4 0.65.1  1999-10-07 First versions of update.packages and install.packa…
 5 1.0     2000-02-29 Considered by its developers stable enough for prod…
 6 1.4     2001-12-19 S4 methods are introduced and the first version for…
 7 2.0     2004-10-04 Introduced lazy loading, which enables fast loading…
 8 2.1     2005-04-18 Support for UTF-8 encoding, and the beginnings of i…
 9 2.11    2010-04-22 Support for Windows 64 bit systems.                 
10 2.13    2011-04-14 Adding a new compiler function that allows speeding…
11 2.14    2011-10-31 Added mandatory namespaces for packages. Added a ne…
12 2.15    2012-03-30 New load balancing functions. Improved serializatio…
13 3.0     2013-04-03 Support for numeric index values 2^31 and larger on …

Additional reading