Here a link to the lecture slides for this session: LINK
In this practical you’ll learn how to read and save data. By the end of this practical you will know how to:
Here are the main read-in functions:
Function | Description |
---|---|
read_csv() |
Read flat csv file |
read_sas() |
Read SAS file |
read_sav() |
Read SPSS file |
readRDS() |
Read RDS file |
file(...,'r'), readLines |
Read from file connection |
Here are the main export functions:
Function | Description |
---|---|
write_csv() |
Write flat csv file |
write_sas() |
Write SAS file |
write_sav() |
Write SPSS file |
saveRDS() |
Save RDS file |
file(...,'w'), readLines |
Write to file connection |
This tutorial begins with the titanic data set, which contains records of 1313 passengers of their name (Name
), their passenger class (PClass
), their age (Age
), their sex (Sex
), whether they survived (Survived
), and a numeric coding of their sex (SexCode
).
Later you will be working with a randomly generated artificial data set, which will help demonstrate the relationship of file size and speed.
?setwd()
) or, even better, to use projects. You will learn about projects. For now we will try to work with the full file path, that is beginning from your hard drive or user account. Fortunately, this isn’t difficult either, when using RStudio’s auto-complete functionality. To find a file path simply write to quotation marks, i.e., ""
, and press getwd()
), which will usually be the level of your user account. Another way to begin at this location is to begin by writing ~/
. Try it out write "~/"
into an R script, place the cursor after the slash (/
), and press titanic_csv_path
(titanic_path <- “~/folder_1/folder_2/…/titanic.csv”).tibble
OK, let’s read some data sets. First you need to load the packages readr
and haven
using library()
, i.e., library(readr)
and library(haven)
.
Now, read in the three titanic data files “titanic.csv”, “titanic.sav”, and “titanic.sas7bdat” using read_csv()
, read_sas()
, and read_sav()
by passing on the respective file paths (i.e., data/filename.csv
) and assigning the functions’ outcomes to df_csv
, df_sas
, and df_sav
, respectively. Inspect each of the imported objects using print()
and str()
and typeof
.
Verify that the three data sets are identical. To do this use the is-equal-to operator ==
introduced in the last tutorial and all()
.
Received an error? This is because the data sets contain NA
’s, that is, missing values. One way to see this is via the summary function summary()
. Try it out for each of the data sets.
Fortunately, NA
’s aren’t much of a problem. When you read ?all
(i.e., the help file), you can see that all()
has a second argument called na.rm = FALSE
. Using this argument, you can tell R to ignore NA
’s. The default is set to FALSE
, thus, this feature is deactivated. To activate it you must pass na.rm = T
to all()
(as the second argument). Try it. Are the data frames now equal?
Every read function in R has a corresponding write function. In this section, you will write data to the three previous formats and R’s own .RData
and .RDS
formats. Before turning to writing, however, you want to take care of a aspect associated with older read and write functions: rownames
.
As you can see, the first column in, e.g., df_csv
contains row numbers. Older write functions by default add a row number column to the data, when it writes data to disk. Newer functions, however, do not exhibit this behavior. Overall, rownames
have come a bit out of fashion. Thus, we want to eliminate the first column from df_csv
using deletion by negative indices. Negative indices in subsetting functions such as []
or [[]]
means omit rather than select. E.g., c(1, 2, 3)[-2]
returns the vector without the second value. Try to apply this to df_csv
and assign the result to df_csv_no_rownames
. You are dealing with a data frame. Should you use single []
or double [[]]
brackets? Try both.
Once you removed the first column, write the reduced data frame to disk using write_csv()
. Try using the exact same file path. Does R give you a warning?
Now, let’s try some other formats using the other data set. First, read in my_data.csv
(which you downloaded in the beginning of this practical) as assign to my_data
. Then, write my_data
to disk using write_sas
, write_sav
, and also saveRDS
. Just as write_csv
, each of these functions take the data frame as the first argument and the file path as the second. To obtain the latter adapt the file path by hand to make sure that the file path has the correct file ending, i.e., .sas7bdat
, .sav
, and .RDS
, respectively. While you write the individual data sets pay attention to the speed of execution. Which one was fastest? If you didn’t make out a difference, try again.
You may have noticed that the saveRDS
function was the slowest. This is because .RDS
results compresses the data much more heavily and, thus, also resulrs in the smallest file sizes. To verify this use the file.info()
-function on each of the four files, that is, on their file paths. Look out for the size
element, which gives the size in bytes
.
Most read functions are not limited to reading from the hard drive. Often you can, for instance, also directly read from url.
read_csv()
.The most basic way to handle files is via file connections. In this more advanced section we take a look at how this works.
The first step of working with file connections is to establish a connections using file()
. The main arguments to file()
are a location (a file path, URL, etc.) and a mode indicator, e.g., r
for Open for reading in text mode
. Try opening a connection to the titanic.csv
data set. To do this, you must assign the output of file()
to an object, which then becomes the connection (e.g., con = file("my_path","r")
). If you want to use the URL from above use url()
instead of file()
.
Now that the connection has been established, you can read the file using readLines()
. readLines()
iterates through the file line by line and returns each line as a character string. Try reading the file using readLines()
and store the output in an object. When done close the connection using close(my_con)
and inspect the data.
As you can see the read-in data looks much messier than when using read_csv()
. This is because read_csv()
really does a lot of editing and interpreting for you. If you’re up for the challenge you can try to process the raw data. One approach is to first split each line using the str_split
(package stringr
from the tidyverse
) function using the comma ,
as the pattern
argument. Applying str_split
will return a list
of vector
s, where each vector represents one row of the future data frame. Next, bind the rows to a matrix using do.call(rbind, my_list)
. The only thing then left for you to do is transform the matrix to a data frame (using as_data_frame
or, better, as_tibble
) and to take care of column types and names.
url()
you can establish a connection to a webpage, which you then can use to retrieve information from the webpage. One package that conveniently streamlines extracting information from webpages based is the rvest
package. To illustrate how easy it can be to scrape (aka extract) information from an webpage using R and rvest
, run the following code. It downloads and parses the Milestones table from R’s Wikipedia page.# load package
install.packages('rvest', repos = "https://stat.ethz.ch/CRAN/")
library(rvest)
library(tibble)
library(magrittr)
# get html
url = 'https://en.wikipedia.org/wiki/R_(programming_language)'
page = read_html(url)
# get table
# use XPath from inspect page (e.g., in Chrome)
table = page %>% html_node(xpath = '//*[@id="mw-content-text"]/div/table[2]') %>% html_table()
# create tibble
as.tibble(table)
For more details on all steps of data analysis check out Hadley Wickham’s R for Data Science.
For more advanced content on objects check out Hadley Wickham’s Advanced R.
For more on pirates and data analysis check out the respective chapters in YaRrr! The Pirate’s Guide to R YaRrr! Chapter Link