Overview

In this practical you’ll learn how to read and save data. By the end of this practical you will know how to:

Identify the location of a file (on your hard drive)
Read in data of various types
Handle header, type, and missing data issues

Datasets

File	Rows	Columns	Description
diamonds.csv	100	7	Subset of the well-known diamonds data set containing specifications and prices of a large number of recorded diamonds.
titanic.xls	1309	14	Information on the survival of titanic passengers.
sleep.sav	271	55	Survey on sleeping behavior completed by staff at the University of Melbourne.
airbnb_zuerich.sas7bdat	2392	20	Data on AirBnB listings in Zürich, Switzerland

Packages

Package	Installation
`tidyverse`	`install.packages("tidyverse")`
`readr`	`install.packages("readr")`
`haven`	`install.packages("haven")`
`readxl`	`install.packages("readxl")`
`xml2`	`install.packages("xml2")`
`rvest`	`install.packages("rvest")`

Glossary

Reading/writing text data

Extension	File Type	Read	Write
`.csv`	Comma-separated text	`read_csv(file)`	`write_csv(x, file)`
`.csv`	Semi-colon separated text	`read_csv2(file)`	not available
`.txt`	Other text	`read_delim(file)`	`write_delim(x, file)`

Reading/writing other data formats

Extension	File Type	Read	Write
`.xls`, `.xlsx`	Excel	`read_excel(file)`	`xlsx::write.xlsx()`
`.sav`	SPSS	`read_spss(file)`	`write_spss(x, file)`
`.sas7bdat`	SAS	`read_sas(file)`	`write_sas(x, file)`

Processing websites

Function	Description
`read_html(file)`, `write_html(x, file)`	Read/write websites files with extension `.html`
`hmlt_node(html)`	Select specific element in `.html`-file
`html_table`	Transform `.html`-table to data frame

Examples

# load packages
library(tidyverse)
library(readxl)
library(haven)

# delim-separated -------------------

# read chickens data
chickens <- read_csv(file = "1_Data/chickens.csv")

# fix header of chickens_nohead.csv with known column names
chickens <- read_csv(file = "1_Data/chickens_nohead.csv",
                     col_names = c("weight", "time", "chick", "diet"))

# fix NA values of chickens_na.csv
chickens <- read_csv(file = "1_Data/chickens_na.csv",
                     na = c('NA', 'NULL'))

# write clean data to disc
write_csv(x = chickens, 
          path = "1_Data/chickens_clean.csv")

# fix types -------------------
# Note: the survey data is fictional!

# remove character from rating
survey$rating[survey$rating == "2,1"] <- 2.1

# rerun type convert
survey <- type_convert(survey)

# other formats -------------------

# .xlsx (Excel)
chickens <- read_excel("1_Data/chickens.xlsx")

# .sav (SPSS)
chickens <- read_spss("1_Data/chickens.sav")

# .sad7bdata (SAS)
chickens <- read_sas("1_Data/chickens.sas7bdat")

# scraping the web ----------------

# load packages for web scraping
library(rvest)
library(xml2)

# get html
url <- 'https://en.wikipedia.org/wiki/R_(programming_language)' 
page <- read_html(url)

# get raw html table
# use XPath from inspect page (e.g., in Chrome)
table_raw <- html_node(page, xpath = '//*[@id="mw-content-text"]/div/table[2]')

# transform to data frame
table <- html_table(table_raw)

# create tibble
as_tibble(table)

Tasks

A - Getting setup

Open your baselrbootcamp R project. It should already have the folders 1_Data and 2_Code.
Open a new R script and save it as a new file called dataio_practical.R in the 2_Code folder. At the top of the script, using comments, write your name and the date. Then, load all package(s) listed in the Packages section above with library().

B - Read delim-separated text files

In this section, you will read in a subset of the well known diamonds data set and prepare it for data analysis.

Identify the file path to the diamonds.csv dataset by using the "" (quotation marks) auto-complete trick. Place the cursor between two quotation marks, hit ⇥ (tab-key), and browse through the folders. Save the file path in an object called diamonds_path.

# place cursor in-between "" and hit tab
diamonds_path <- ""

Using read_csv() and the diamonds_path object, read the diamonds.csv dataset as a new object called diamonds.

# read diamonds data
diamonds <- read_csv(file = XX)

Print the diamonds data and inspect the column names in the header line. Something’s wrong!
Fix the header by reading in the data again using the col_names-argument. Assign to col_names a character vector containing the correct column names: carat, cut, color, clarity, depth, table, price.

# read diamonds data while specifying col_names
diamonds <- read_csv(file = "XX", 
                     col_names = XX)  # Vector of column names

Re-inspect the header by printing the data. Has the header been fixed?
Now pay attention to the classes of the individuals columns (variables). Have all classes been identified correctly? What about the carat column? It should be numeric, right?
Let’s see what went wrong. Select and print the carat variable to identify the one entry that caused the variable to become a character vector (Hint: look for a comma).
Change the carat variable’s element at position XX to the same value YY but with a period instead of a comma. Use the code below.

# Change the value at position XX to YY
diamonds$carat[XX] <- YY

Ok you fixed the value but carat is still character. We can fix it with the type_convert() function. Apply the type_convert() function to the diamonds data to have R fix all the data types. Make sure to assign the result back to diamonds so that you change the object!
Print the diamonds object and look at the column types. Has the type of carat changed to double?
The data is now ready for analysis. Let’s calculate a few statistics. What is the average carat or price (use mean())? What cut and clarity levels exist and how often do they occur (use table())? You can learn more about the variable values from the help file ?diamonds.

C - Write delim-separated text files

Write the, now, properly formatted diamonds data to your data folder as a .csv file using the name diamonds_clean.csv. Don’t forget to include both the file name and the folder (separated by /) in the character string specifying the path argument.

# write clean diamonds data to disc
write_csv(x = XX, path = "XX")

Read diamonds_clean.csv back into R as a new object called diamonds_clean. Then, print the object and verify that this time the types been correctly identified from the start.
Using write_delim, write the data to disc with a different delimiter calling it diamonds_delim.csv. Specifically, set the delim-argument to "@".

# write with "@" as delimiter
write_delim(x = XX, 
            path = "XX", 
            delim = "XX")

Read diamonds_delim.csv back in using read_csv(). What happens?
Fix the delimiter issue by using read_delim() instead of read_csv() and specifying the delim-argument. Verify that everything is in order again by printing the data.

D - Read other file formats

Excel

Using read_excel(), read in the titanic.xls dataset as a new object called titanic (Make sure you have alredy loaded the readxl package at the beginning of your script).

titanic <- read_excel(path = "XX")

Print titanic and evaluate its dimensions using dim().
Using table(), how many people survived (variable survived) in each cabin class (variable pclass)?

# determine survival rate by cabin class
table(titanic$XX, 
      titanic$XX)

Using write_csv(), write the titanic dataframe as a new comma separated text file called titanic.csv in your 1_Data folder. Now you have the data saved as a text file any software can use!

SPSS

Using read_spss() read in the sleep data set sleep.sav of staff at he University of Melbourne as a new object called sleep. (Make sure that you have first loaded the haven package).

XX <- read_spss(file = "XX")

Print your sleep object and evaluate its dimensions using dim().
How many drinks do staff at the University of Melbourne consumer per day (variable alcohol). To do this, use the mean() function, while taking care of missing values using the na.rm argument.
Using the write_csv() function, write the sleep data to a new file called sleep.csv in your 1_Data folder. Now you have the sleep data stored as a text file any software can use!

SAS

Using read_sas(), read in airbnb_zuerich.sas7bdat containing AirBnB listings in Zürich, Switzerland and call the object airbnb_zuerich.

# read sas data
XX <- read_sas(data_file = "XX")

Print airbnb_zuerich and then evaluate its dimensions using dim().
How many AirBnB listings were there of each room_type in Zürich? (Hint: Using table())
Using write_csv() write your airbnb_zuerich data frame to as new comma-separated text file called airbnb_zuerich.csv in your 1_Data folder.

E - Use R’s data format

R has two data formats of its own - .Rdata and .RDS. We recommend to use only .RDS as its use follows the same logic as with other read and write functions. A benefit of using R’s data formats is compression, leading to often substantially smaller file sizes.

First, using read_csv() read back in the diamonds_clean.csv dataset as an object called diamonds_clean.
Using saveRDS() write diamonds_clean to an .RDS-file called diamonds_clean.RDS into your 1_Data folder. The function works just like the other write functions.

# write to RDS
saveRDS(object = XX, 
        file = "XX")

Using readRDS(), read the diamonds_clean.RDS back into R as a new object called diamonds_clean_rds. Then, print the diamonds_clean_rds object and see that it looks the same as the original diamonds_clean object.
Using the file.size() function, compare the file size of your diamonds_clean.csv and diamonds_clean.RDS files. Which file is larger?

# Check the file sizes of diamonds_clean.csv and  diamonds_clean.RDS
file.size("1_Data/XX")
file.size("1_Data/XX")

X - Scraping the internet

Finally on to something completely new. These days a lot of data is available not come in a ready-made, spreadsheet-like form, but must be extracted and processed in order to be used. One such data source is the world wide web. The code below based on the rvest and xml2 packages shows a short example of how to extract a table from R’s Wikipedia page. Run the code and experience how easy it is to extract data from a webpage.

# load package
library(tidyverse)
library(rvest)
library(xml2)

# get html
url <- 'https://en.wikipedia.org/wiki/R_(programming_language)' 
page <- read_html(url)

# get raw html table
# use XPath from inspect page (e.g., in Chrome)
table_raw <- html_node(page, xpath = '//*[@id="mw-content-text"]/div/table[2]')

# transform to data frame
table <- html_table(table_raw)

# create tibble
as_tibble(table)

If you like, tweak the code to extract tables from other websites. The only things you need to change is the url and the xpath location. To extract the latter, use the feature of Chrome or Firefox to show you the html code for a given webpage element. To access the html code right-click on the website element of interest and select Inspect or Inspect element depending on whether you use Chrome or Firefox. Next right-click on the currently marked element in the html code pane and select Copy/Copy XPath or Copy/XPath again depending on the browser. Now, you can insert the XPath location into your code using cmd + v (Mac) or ctrl + v. If the code does not work, you probably selected the wrong element. Look out for an element called, e.g., <table class="wikitable">, in the case of Wikipedia.

Additional Resources

For more information on reading and writing see Grolemund`s and Wickham’s R for Data Science.

Data I/O

Basel R Bootcamp, July 2018
www.therbootcamp.com
@therbootcamp

Overview

Datasets

Packages

Glossary

Examples

Tasks

A - Getting setup

B - Read delim-separated text files

C - Write delim-separated text files

D - Read other file formats

Excel

SPSS

SAS

E - Use R’s data format

X - Scraping the internet

Additional Resources

Data I/O

Basel R Bootcamp, July 2018www.therbootcamp.com@therbootcamp

Overview

Datasets

Packages

Glossary

Examples

Tasks

A - Getting setup

B - Read delim-separated text files

C - Write delim-separated text files

D - Read other file formats

Excel

SPSS

SAS

E - Use R’s data format

X - Scraping the internet

Additional Resources

Basel R Bootcamp, July 2018
www.therbootcamp.com
@therbootcamp