Slides

Here a link to the lecture slides for this session: LINK

Overview

In this practical you’ll learn how to read and save data. By the end of this practical you will know how to:

  1. Identify the location of a file (on your hard drive)
  2. Read in data of various types
  3. Use R’s file connections
  4. Scrape the internet (advanced)

Functions

Here are the main read-in functions:

Function Description
read_csv() Read flat csv file
read_sas() Read SAS file
read_sav() Read SPSS file
readRDS() Read RDS file
file(...,'r'), readLines Read from file connection

Here are the main export functions:

Function Description
write_csv() Write flat csv file
write_sas() Write SAS file
write_sav() Write SPSS file
saveRDS() Save RDS file
file(...,'w'), readLines Write to file connection

Tasks

This tutorial begins with the titanic data set, which contains records of 1313 passengers of their name (Name), their passenger class (PClass), their age (Age), their sex (Sex), whether they survived (Survived), and a numeric coding of their sex (SexCode).

Later you will be working with a randomly generated artificial data set, which will help demonstrate the relationship of file size and speed.

Identify the (file)path

  1. Every read or write function requires that you provide a file path or source (e.g., an URL) specifying the location of the file. Ways to facilitate finding the right file path is to either set the working directory (see ?setwd()) or, even better, to use projects. You will learn about projects. For now we will try to work with the full file path, that is beginning from your hard drive or user account. Fortunately, this isn’t difficult either, when using RStudio’s auto-complete functionality. To find a file path simply write to quotation marks, i.e., "", and press with the cursor being in-between the quotation marks. This will open an auto-complete dialog that allows you to level-by-level search through your folder structure and to identify the correct file path. The auto-complete will always begin at the current working directory (see ?getwd()), which will usually be the level of your user account. Another way to begin at this location is to begin by writing ~/. Try it out write "~/" into an R script, place the cursor after the slash (/), and press . Now, try to find the location of each of the titanic data sets on your hard drive and assign the path (which is a character string) to an object, e.g., titanic_csv_path (titanic_path <- “~/folder_1/folder_2/…/titanic.csv”).

Read to tibble

  1. OK, let’s read some data sets. First you need to load the packages readr and haven using library(), i.e., library(readr) and library(haven).

  2. Now, read in the three titanic data files “titanic.csv”, “titanic.sav”, and “titanic.sas7bdat” using read_csv(), read_sas(), and read_sav() by passing on the respective file paths (i.e., data/filename.csv) and assigning the functions’ outcomes to df_csv, df_sas, and df_sav, respectively. Inspect each of the imported objects using print() and str() and typeof.

  3. Verify that the three data sets are identical. To do this use the is-equal-to operator == introduced in the last tutorial and all().

  4. Received an error? This is because the data sets contain NA’s, that is, missing values. One way to see this is via the summary function summary(). Try it out for each of the data sets.

  5. Fortunately, NA’s aren’t much of a problem. When you read ?all (i.e., the help file), you can see that all() has a second argument called na.rm = FALSE. Using this argument, you can tell R to ignore NA’s. The default is set to FALSE, thus, this feature is deactivated. To activate it you must pass na.rm = T to all() (as the second argument). Try it. Are the data frames now equal?

Write to disk

Every read function in R has a corresponding write function. In this section, you will write data to the three previous formats and R’s own .RData and .RDS formats. Before turning to writing, however, you want to take care of a aspect associated with older read and write functions: rownames.

  1. As you can see, the first column in, e.g., df_csv contains row numbers. Older write functions by default add a row number column to the data, when it writes data to disk. Newer functions, however, do not exhibit this behavior. Overall, rownames have come a bit out of fashion. Thus, we want to eliminate the first column from df_csv using deletion by negative indices. Negative indices in subsetting functions such as [] or [[]] means omit rather than select. E.g., c(1, 2, 3)[-2] returns the vector without the second value. Try to apply this to df_csv and assign the result to df_csv_no_rownames. You are dealing with a data frame. Should you use single [] or double [[]] brackets? Try both.

  2. Once you removed the first column, write the reduced data frame to disk using write_csv(). Try using the exact same file path. Does R give you a warning?

  3. Now, let’s try some other formats using the other data set. First, read in my_data.csv (which you downloaded in the beginning of this practical) as assign to my_data. Then, write my_data to disk using write_sas, write_sav, and also saveRDS. Just as write_csv, each of these functions take the data frame as the first argument and the file path as the second. To obtain the latter adapt the file path by hand to make sure that the file path has the correct file ending, i.e., .sas7bdat, .sav, and .RDS, respectively. While you write the individual data sets pay attention to the speed of execution. Which one was fastest? If you didn’t make out a difference, try again.

  4. You may have noticed that the saveRDS function was the slowest. This is because .RDS results compresses the data much more heavily and, thus, also resulrs in the smallest file sizes. To verify this use the file.info()-function on each of the four files, that is, on their file paths. Look out for the size element, which gives the size in bytes.

Read from url

Most read functions are not limited to reading from the hard drive. Often you can, for instance, also directly read from url.

  1. Read the the titanic data set from this link *https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/titanic.csv* using read_csv().

File connections

The most basic way to handle files is via file connections. In this more advanced section we take a look at how this works.

  1. The first step of working with file connections is to establish a connections using file(). The main arguments to file() are a location (a file path, URL, etc.) and a mode indicator, e.g., r for Open for reading in text mode. Try opening a connection to the titanic.csv data set. To do this, you must assign the output of file() to an object, which then becomes the connection (e.g., con = file("my_path","r")). If you want to use the URL from above use url() instead of file().

  2. Now that the connection has been established, you can read the file using readLines(). readLines() iterates through the file line by line and returns each line as a character string. Try reading the file using readLines() and store the output in an object. When done close the connection using close(my_con) and inspect the data.

  3. As you can see the read-in data looks much messier than when using read_csv(). This is because read_csv() really does a lot of editing and interpreting for you. If you’re up for the challenge you can try to process the raw data. One approach is to first split each line using the str_split (package stringr from the tidyverse) function using the comma , as the pattern argument. Applying str_split will return a list of vectors, where each vector represents one row of the future data frame. Next, bind the rows to a matrix using do.call(rbind, my_list). The only thing then left for you to do is transform the matrix to a data frame (using as_data_frame or, better, as_tibble) and to take care of column types and names.

Scraping the internet

  1. Finally on to something completely new. As mentioned, connections can also be established to data located outside one’s own hard drive, such as, for instance, the world wide web. Specifically, using url() you can establish a connection to a webpage, which you then can use to retrieve information from the webpage. One package that conveniently streamlines extracting information from webpages based is the rvest package. To illustrate how easy it can be to scrape (aka extract) information from an webpage using R and rvest, run the following code. It downloads and parses the Milestones table from R’s Wikipedia page.
# load package
install.packages('rvest', repos = "https://stat.ethz.ch/CRAN/")
library(rvest)
library(tibble)
library(magrittr)

# get html
url = 'https://en.wikipedia.org/wiki/R_(programming_language)' 
page = read_html(url)

# get table
# use XPath from inspect page (e.g., in Chrome)
table = page %>% html_node(xpath = '//*[@id="mw-content-text"]/div/table[2]') %>% html_table()

# create tibble
as.tibble(table)

Additional reading