In this practical you’ll learn how to read and save data. By the end of this practical you will know how to:
File | Rows | Columns | Description |
---|---|---|---|
diamonds.csv | 100 | 7 | Subset of the well-known diamonds data set containing specifications and prices of a large number of recorded diamonds. |
titanic.xls | 1309 | 14 | Information on the survival of titanic passengers. |
sleep.sav | 271 | 55 | Survey on sleeping behavior completed by staff at the University of Melbourne. |
airbnb_zuerich.sas7bdat | 2392 | 20 | Data on AirBnB listings in Zürich, Switzerland |
Package | Installation |
---|---|
tidyverse |
install.packages("tidyverse") |
readr |
install.packages("readr") |
haven |
install.packages("haven") |
readxl |
install.packages("readxl") |
xml2 |
install.packages("xml2") |
rvest |
install.packages("rvest") |
Reading/writing text data
Extension | File Type | Read | Write |
---|---|---|---|
.csv |
Comma-separated text | read_csv(file) |
write_csv(x, file) |
.csv |
Semi-colon separated text | read_csv2(file) |
not available |
.txt |
Other text | read_delim(file) |
write_delim(x, file) |
Reading/writing other data formats
Extension | File Type | Read | Write |
---|---|---|---|
.xls , .xlsx |
Excel | read_excel(file) |
xlsx::write.xlsx() |
.sav |
SPSS | read_spss(file) |
write_spss(x, file) |
.sas7bdat |
SAS | read_sas(file) |
write_sas(x, file) |
Processing websites
Function | Description |
---|---|
read_html(file) , write_html(x, file) |
Read/write websites files with extension .html |
hmlt_node(html) |
Select specific element in .html -file |
html_table |
Transform .html -table to data frame |
# load packages
library(tidyverse)
library(readxl)
library(haven)
# delim-separated -------------------
# read chickens data
chickens <- read_csv(file = "1_Data/chickens.csv")
# fix header of chickens_nohead.csv with known column names
chickens <- read_csv(file = "1_Data/chickens_nohead.csv",
col_names = c("weight", "time", "chick", "diet"))
# fix NA values of chickens_na.csv
chickens <- read_csv(file = "1_Data/chickens_na.csv",
na = c('NA', 'NULL'))
# write clean data to disc
write_csv(x = chickens,
path = "1_Data/chickens_clean.csv")
# fix types -------------------
# Note: the survey data is fictional!
# remove character from rating
survey$rating[survey$rating == "2,1"] <- 2.1
# rerun type convert
survey <- type_convert(survey)
# other formats -------------------
# .xlsx (Excel)
chickens <- read_excel("1_Data/chickens.xlsx")
# .sav (SPSS)
chickens <- read_spss("1_Data/chickens.sav")
# .sad7bdata (SAS)
chickens <- read_sas("1_Data/chickens.sas7bdat")
# scraping the web ----------------
# load packages for web scraping
library(rvest)
library(xml2)
# get html
url <- 'https://en.wikipedia.org/wiki/R_(programming_language)'
page <- read_html(url)
# get raw html table
# use XPath from inspect page (e.g., in Chrome)
table_raw <- html_node(page, xpath = '//*[@id="mw-content-text"]/div/table[2]')
# transform to data frame
table <- html_table(table_raw)
# create tibble
as_tibble(table)
Open your baselrbootcamp
R project. It should already have the folders 1_Data
and 2_Code
.
Open a new R script and save it as a new file called dataio_practical.R
in the 2_Code
folder. At the top of the script, using comments, write your name and the date. Then, load all package(s) listed in the Packages section above with library()
.
In this section, you will read in a subset of the well known diamonds data set and prepare it for data analysis.
diamonds.csv
dataset by using the ""
(quotation marks) auto-complete trick. Place the cursor between two quotation marks, hit ⇥ (tab-key), and browse through the folders. Save the file path in an object called diamonds_path
.# place cursor in-between "" and hit tab
diamonds_path <- ""
read_csv()
and the diamonds_path
object, read the diamonds.csv
dataset as a new object called diamonds
.# read diamonds data
diamonds <- read_csv(file = XX)
Print the diamonds
data and inspect the column names in the header line. Something’s wrong!
Fix the header by reading in the data again using the col_names
-argument. Assign to col_names
a character vector containing the correct column names: carat
, cut
, color
, clarity
, depth
, table
, price
.
# read diamonds data while specifying col_names
diamonds <- read_csv(file = "XX",
col_names = XX) # Vector of column names
Re-inspect the header by printing the data. Has the header been fixed?
Now pay attention to the classes of the individuals columns (variables). Have all classes been identified correctly? What about the carat
column? It should be numeric
, right?
Let’s see what went wrong. Select and print the carat
variable to identify the one entry that caused the variable to become a character
vector (Hint: look for a comma).
Change the carat
variable’s element at position XX
to the same value YY
but with a period instead of a comma. Use the code below.
# Change the value at position XX to YY
diamonds$carat[XX] <- YY
Ok you fixed the value but carat
is still character
. We can fix it with the type_convert()
function. Apply the type_convert()
function to the diamonds
data to have R fix all the data types. Make sure to assign the result back to diamonds
so that you change the object!
Print the diamonds
object and look at the column types. Has the type of carat
changed to double
?
The data is now ready for analysis. Let’s calculate a few statistics. What is the average carat
or price
(use mean()
)? What cut
and clarity
levels exist and how often do they occur (use table()
)? You can learn more about the variable values from the help file ?diamonds
.
.csv
file using the name diamonds_clean.csv
. Don’t forget to include both the file name and the folder (separated by /
) in the character string specifying the path
argument.# write clean diamonds data to disc
write_csv(x = XX, path = "XX")
Read diamonds_clean.csv
back into R as a new object called diamonds_clean
. Then, print the object and verify that this time the types been correctly identified from the start.
Using write_delim
, write the data to disc with a different delimiter calling it diamonds_delim.csv
. Specifically, set the delim
-argument to "@"
.
# write with "@" as delimiter
write_delim(x = XX,
path = "XX",
delim = "XX")
Read diamonds_delim.csv
back in using read_csv()
. What happens?
Fix the delimiter issue by using read_delim()
instead of read_csv()
and specifying the delim
-argument. Verify that everything is in order again by printing the data.
read_excel()
, read in the titanic.xls
dataset as a new object called titanic
(Make sure you have alredy loaded the readxl
package at the beginning of your script).titanic <- read_excel(path = "XX")
Print titanic
and evaluate its dimensions using dim()
.
Using table()
, how many people survived (variable survived
) in each cabin class (variable pclass
)?
# determine survival rate by cabin class
table(titanic$XX,
titanic$XX)
write_csv()
, write the titanic
dataframe as a new comma separated text file called titanic.csv
in your 1_Data
folder. Now you have the data saved as a text file any software can use!read_spss()
read in the sleep data set sleep.sav
of staff at he University of Melbourne as a new object called sleep
. (Make sure that you have first loaded the haven
package).XX <- read_spss(file = "XX")
Print your sleep
object and evaluate its dimensions using dim()
.
How many drinks do staff at the University of Melbourne consumer per day (variable alcohol
). To do this, use the mean()
function, while taking care of missing values using the na.rm
argument.
Using the write_csv()
function, write the sleep
data to a new file called sleep.csv
in your 1_Data
folder. Now you have the sleep
data stored as a text file any software can use!
read_sas()
, read in airbnb_zuerich.sas7bdat
containing AirBnB listings in Zürich, Switzerland and call the object airbnb_zuerich
.# read sas data
XX <- read_sas(data_file = "XX")
Print airbnb_zuerich
and then evaluate its dimensions using dim()
.
How many AirBnB listings were there of each room_type
in Zürich? (Hint: Using table()
)
Using write_csv()
write your airbnb_zuerich
data frame to as new comma-separated text file called airbnb_zuerich.csv
in your 1_Data
folder.
R has two data formats of its own - .Rdata
and .RDS
. We recommend to use only .RDS
as its use follows the same logic as with other read and write functions. A benefit of using R’s data formats is compression, leading to often substantially smaller file sizes.
First, using read_csv()
read back in the diamonds_clean.csv
dataset as an object called diamonds_clean
.
Using saveRDS()
write diamonds_clean
to an .RDS
-file called diamonds_clean.RDS
into your 1_Data
folder. The function works just like the other write functions.
# write to RDS
saveRDS(object = XX,
file = "XX")
Using readRDS()
, read the diamonds_clean.RDS
back into R as a new object called diamonds_clean_rds
. Then, print the diamonds_clean_rds
object and see that it looks the same as the original diamonds_clean
object.
Using the file.size()
function, compare the file size of your diamonds_clean.csv
and diamonds_clean.RDS
files. Which file is larger?
# Check the file sizes of diamonds_clean.csv and diamonds_clean.RDS
file.size("1_Data/XX")
file.size("1_Data/XX")
rvest
and xml2
packages shows a short example of how to extract a table from R’s Wikipedia page. Run the code and experience how easy it is to extract data from a webpage.# load package
library(tidyverse)
library(rvest)
library(xml2)
# get html
url <- 'https://en.wikipedia.org/wiki/R_(programming_language)'
page <- read_html(url)
# get raw html table
# use XPath from inspect page (e.g., in Chrome)
table_raw <- html_node(page, xpath = '//*[@id="mw-content-text"]/div/table[2]')
# transform to data frame
table <- html_table(table_raw)
# create tibble
as_tibble(table)
url
and the xpath
location. To extract the latter, use the feature of Chrome or Firefox to show you the html code for a given webpage element. To access the html code right-click on the website element of interest and select Inspect
or Inspect element
depending on whether you use Chrome or Firefox. Next right-click on the currently marked element in the html code pane and select Copy/Copy XPath
or Copy/XPath
again depending on the browser. Now, you can insert the XPath
location into your code using cmd + v
(Mac) or ctrl + v
. If the code does not work, you probably selected the wrong element. Look out for an element called, e.g., <table class="wikitable">
, in the case of Wikipedia.