Overview

In this practical you’ll learn how to work with basic data objects and functions. By the end of this practical you will know how to:

Create vectors of different types using c()
Understand the three main vector classes numeric, character, and logical using class()
Calculate basic descriptive statistics using mean(), median(), table() (and more!)
Read and write data of various data formats using read_csv() and others
Access and change vectors from data frames using $
Create data frames and tibbles using data.frame() and tibble()

Datasets

library(tidyverse)
library(haven)

File	Rows	Columns	Description
diamonds.csv	100	7	Subset of the well-known diamonds data set containing specifications and prices of a large number of recorded diamonds.
titanic.xls	1309	14	Information on the survival of Titanic passengers.
sleep.sav	271	55	Survey on sleeping behavior completed by staff at the University of Melbourne.
airbnb_zuerich.sas7bdat	2392	20	Data on AirBnB listings in Zürich, Switzerland

Packages

Package	Installation
`tidyverse`	`install.packages("tidyverse")`
`readr`	`install.packages("readr")`
`haven`	`install.packages("haven")`
`readxl`	`install.packages("readxl")`

Glossary

Creating vectors

Function	Description
`c("a", "b", "c")`	Create a character vector
`c(1, 2, 3)`	Create a numeric vector
`c(TRUE, FALSE, TRUE)`	Create a logical vector

Vector functions

Function	Description
`mean(x), median(x), sd(x), sum(x)`	Mean, median, standard deviation, sum
`max(x), min(x)`	Maximum, minimum
`table(x)`	Table of frequency counts
`round(x, digits)`	Round a numeric vector x to `digits` decimal places

Accessing vectors from data frames

Function	Description
`df$name`	Access vector `name` from a data frame `df`

Reading/writing text data

Extension	File Type	Read	Write
`.csv`	Comma-separated text	`read_csv(file)`	`write_csv(x, file)`
`.csv`	Semicolon-separated text	`read_csv2(file)`	`write_csv2(x, file)`
`.txt`	Other text	`read_delim(file)`	`write_delim(x, file)`

Reading/writing other data formats

Extension	File Type	Read	Write
`.xls`, `.xlsx`	Excel	`read_excel(file)`	`xlsx::write.xlsx()`
`.sav`	SPSS	`read_spss(file)`	`write_sav(x, file)`
`.sas7bdat`	SAS	`read_sas(file)`	`write_sas(x, file)`

Creating data frames from vectors

Function	Description
`data.frame(a, b, c)`	Create a data frame from vectors a, b, c
`tibble(a, b, c)`	Create a tibble from vectors a, b, c

Examples

library(tidyverse)
library(readr)
library(readxl)
library(haven)

# Create vectors of (fake) stock data
name      <- c("apple", "microsoft", "dell", "google", "twitter")
yesterday <- c(100, 89, 65, 54, 89)
today     <- c(102, 85, 72, 60, 95)

# Summary statistics
mean(today)
mean(yesterday)

# Show classes
class(name)
class(yesterday)

# Operations of vectors
change <- today - yesterday
change # Print result

# Create a logical vector from two numerics
increase <- today > yesterday
increase # Print result

# Create a tibble combining multiple vectors
stocks <- tibble(name, yesterday, today, change, increase)

# Get column names
names(stocks)

# Access columns by name
stocks$name
stocks$today

# Calculate descriptives on columns
mean(stocks$yesterday)
median(stocks$today)
table(stocks$increase)
max(stocks$change)


# read/write delim-separated -------------------

# read chickens data
chickens <- read_csv(file = "1_Data/chickens.csv")

# fix header of chickens_nohead.csv with known column names
chickens <- read_csv(file = "1_Data/chickens_nohead.csv",
                     col_names = c("weight", "time", "chick", "diet"))

# fix NA values of chickens_na.csv
chickens <- read_csv(file = "1_Data/chickens_na.csv",
                     na = c('NA', 'NULL'))

# write clean data to disc
write_csv(x = chickens, 
          path = "1_Data/chickens_clean.csv")

# fix types -------------------
# Note: the survey data is fictional!

# remove character from rating
survey$rating[survey$rating == "2,1"] <- 2.1

# rerun type convert
survey <- type_convert(survey)

# other formats -------------------

# .xlsx (Excel)
chickens <- read_excel("1_Data/chickens.xlsx")

# .sav (SPSS)
chickens <- read_spss("1_Data/chickens.sav")

# .sas7bdat (SAS)
chickens <- read_sas("1_Data/chickens.sas7bdat")

Tasks

A - Getting setup

Open your baselrbootcamp R project. It should already have the folders 1_Data and 2_Code.
Open a new R script and save it as a new file called data_practical.R in the 2_Code folder. At the top of the script, using comments, write your name and the date. Then, load all packages listed in the Packages section above with library(). Make sure that all of the datasets listed above are in your 1_Data folder.

B - Creating vectors

The table below shows results from a (fictional) survey of 10 Baselers. In the first part of this practical, you will convert this table to R objects and then analyse them!

id	sex	age	height	weight
1	male	44	174.3	113.4
2	male	65	180.3	75.2
3	female	31	168.3	55.5
4	male	27	209	93.8
5	male	24	176.7
6	male	63	186.6	67.4
7	male	71	151.6	83.3
8	female	41	155.7	67.8
9	male	43	176.1	69.3
10	female	31	166.1	66.3

Create a numeric vector called id that shows the id values. When you finish, print the vector object to see it!

# Create a vector id
XX <- c(XX, XX, ...)

# Print the vector id
XX

# Create an id vector 
id <- 1:10 # shortcut to creating the sequence from 1 to 10

# Print the vector
id

##  [1]  1  2  3  4  5  6  7  8  9 10

Using the class() function, check the class of your id vector. Is it "numeric"? If you used the 1:10 shortcut, you will see "integer" instead.

# Show the class of an object XX
class(XX)

# Show the class of the id vector
class(id)

## [1] "integer"
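Which class you see depends on how the vector was built. The following sketch illustrates this; the object names id_seq and id_c are made up for this example. The : shortcut produces an integer vector, whereas c() with plain numbers produces a vector of class "numeric". For calculations the distinction rarely matters, since is.numeric() is TRUE for both.

```r
# Two ways of creating the same ten id values
id_seq <- 1:10                              # sequence shortcut
id_c   <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)  # typed out with c()

class(id_seq)        # "integer"
class(id_c)          # "numeric"

# Both count as numeric for calculations
is.numeric(id_seq)   # TRUE
is.numeric(id_c)     # TRUE

# The values themselves are equal
all(id_seq == id_c)  # TRUE
```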

Using the length() function, find out the length of your id vector. Does it have length 10? If not, make sure you defined it correctly!

# Show the length of the id vector
length(XX)

# Show the length of the id vector
length(id)

## [1] 10

Create a character vector called sex that shows the sex values. Make sure to use quotation marks “” to enclose each element to tell R that the data are of type "character"! When you finish, print the object to see it!

# Create a character vector sex
XX <- c("XX", "XX", "...")

# Create a sex vector 
sex <- c("male", "male", "female", "male", "male", "male", "male", "female", "male", "female")

# Print the vector
sex

##  [1] "male"   "male"   "female" "male"   "male"   "male"   "male"  
##  [8] "female" "male"   "female"

Using the class() function, check the class of your sex vector. Is it "character"?

# Show the class of the sex vector
class(sex)

## [1] "character"

Using the length() function, find out the length of your sex object. Does it have length 10? If not, make sure you defined it correctly!

# Show the length of the sex vector
length(sex)

## [1] 10

Using the same steps as before, create age and height vectors.

# Create an age vector 
age <- c(44, 65, 31, 27, 24, 63, 71, 41, 43, 31)

# Print the age vector
age

##  [1] 44 65 31 27 24 63 71 41 43 31

# Show the class of the age vector
class(age)

## [1] "numeric"

# Show the length of the age vector
length(age)

## [1] 10

# Create a height vector 
height <- c(174.3, 180.3, 168.3, 209, 176.7, 186.6, 151.6, 155.7, 176.1, 166.1)

# Print the height vector
height

##  [1] 174 180 168 209 177 187 152 156 176 166

# Show the class of the height vector
class(height)

## [1] "numeric"

# Show the length of the height vector
length(height)

## [1] 10

Look at the weight data: you’ll notice it contains a missing value. Create a vector called weight containing these data, following the same steps as before, making sure to specify the missing value as NA (no quotation marks).

# Create a weight vector 
weight <- c(113.4, 75.2, 55.5, 93.8, NA, 67.4, 83.3, 67.8, 69.3, 66.3)

# Print the weight vector
weight

##  [1] 113.4  75.2  55.5  93.8    NA  67.4  83.3  67.8  69.3  66.3

# Show the class of the weight vector
class(weight)

## [1] "numeric"

# Show the length of the weight vector
length(weight)

## [1] 10

C - Functions

Using the table() function, find out how many males and females are in the data. You should find 7 males and 3 females!

# Count types in sex
table(sex)

## sex
## female   male 
##      3      7

Using the mean() function, calculate the mean age. It should be 44!

# Compute mean of age
mean(age)

## [1] 44

Try calculating the mean value of sex. What happens? Why?

# Compute mean of sex
mean(sex)

## [1] NA
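R returns NA (together with a warning) because a mean is only defined for numeric or logical data. As a small sketch, using a made-up vector called answers, the following shows the failed mean() and the kind of summary that does suit character data, namely frequency counts with table():

```r
# A made-up character vector for illustration
answers <- c("yes", "no", "yes")

# mean() needs numeric or logical input; for characters it warns and returns NA
result <- suppressWarnings(mean(answers))
result               # NA
is.numeric(answers)  # FALSE

# Frequency counts are the sensible summary for character data
table(answers)
```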

Try calculating the mean weight. You should get an NA value. Why?

# Compute mean of weight
mean(weight)

## [1] NA

Look at the help menu for the mean() function (using ?mean) to look for an argument that will help you with your problem.

# Inspect help for mean
?mean

Using the correct argument for the mean function, calculate the mean weight ignoring NA values. It should be about 76.89!

# Compute mean weight, ignoring NAs 
mean(weight, na.rm = TRUE)

## [1] 76.9
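The na.rm argument is not specific to mean(). As a brief sketch, the same pattern works for several other summary functions in base R, and is.na() lets you count the missing entries directly. The weight vector is repeated here so the example runs on its own.

```r
# Weight values from the survey table, with one missing entry
weight <- c(113.4, 75.2, 55.5, 93.8, NA, 67.4, 83.3, 67.8, 69.3, 66.3)

# Without na.rm, a single NA makes the result NA
sum(weight)
max(weight)

# With na.rm = TRUE, missing values are dropped before calculating
sum(weight, na.rm = TRUE)
max(weight, na.rm = TRUE)
median(weight, na.rm = TRUE)

# Count how many values are missing
sum(is.na(weight))
```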

D - Read & write delim-separated text files

In this section, you will read in a subset of the well known diamonds data set and prepare it for data analysis.

Identify the file path to the diamonds.csv dataset using the "" (quotation marks) auto-complete trick. Place the cursor between two quotation marks, hit ⇥ (tab-key), and browse through the folders. Save the file path, for now, in an object called diamonds_path.

# place cursor in-between "" and hit tab
diamonds_path <- ""

# place cursor in-between and hit tab
diamonds_path <- "1_Data/diamonds.csv"
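Before reading a file it can help to confirm that the path actually points at something. The sketch below assumes the 1_Data folder sits in your project's working directory and uses base R's getwd(), file.exists() and file.path() for the check.

```r
# Path to the data file, relative to the project folder
diamonds_path <- "1_Data/diamonds.csv"

# Where is R currently looking? Relative paths start here.
getwd()

# TRUE if the file can be found at that path, FALSE otherwise
file.exists(diamonds_path)

# file.path() builds the same string from its parts
file.path("1_Data", "diamonds.csv") == diamonds_path
```

If file.exists() returns FALSE, the usual culprits are a typo in the path or a working directory that is not the project folder.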

Now use diamonds_path inside the read_csv() function to read in the diamonds.csv dataset. Store it as a new object called diamonds.

# read diamonds data
diamonds <- read_csv(file = XX)

# read diamonds data
diamonds <- read_csv(file = diamonds_path)

Print the diamonds data and inspect the column names in the header line. Something’s wrong!

# print diamonds
diamonds

## # A tibble: 99 x 7
##    `0.8` `Very Good` H     VVS1  `62.9`  `58` `4468`
##    <chr> <chr>       <chr> <chr>  <dbl> <dbl>  <int>
##  1 0.74  Ideal       H     IF      60.9    57   3760
##  2 2.03  Premium     I     SI1     61.4    58  15683
##  3 0.41  Ideal       G     VVS1    62.1    55   1151
##  4 1.54  Premium     G     VS1     61.1    56  14438
##  5 0.3   Ideal       E     VS2     61.8    55    795
##  6 0.3   Ideal       H     VVS2    61.5    56    605
##  7 1.2   Ideal       D     SI1     61.8    58   7508
##  8 0.58  Ideal       E     VS2     62.3    54   1809
##  9 0.31  Ideal       H     VS2     62.6    57    489
## 10 1.24  Very Good   F     VS1     59      60   9885
## # ... with 89 more rows

Fix the header by reading in the data again using the col_names-argument. Assign to col_names a character vector containing the correct column names: carat, cut, color, clarity, depth, table, price.

# read diamonds data with specified col_names
diamonds <- read_csv(file = "XX", 
                     col_names = c('name_1','name_2','...'))  # Vector of column names

# read diamonds data with specified col_names
diamonds <- read_csv(file = diamonds_path,
                     col_names = c("carat", "cut", "color", "clarity", "depth", "table", "price"))
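To see what col_names does in isolation, here is a small self-contained sketch. It writes a made-up, header-less two-column file to a temporary location (the file contents and object names are invented for illustration) and reads it twice with read_csv(): once with the default, which treats the first line as the header, and once with the names supplied.

```r
library(readr)

# Write a tiny header-less file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("0.8,Very Good", "0.74,Ideal", "2.03,Premium"), tmp)

# Default: the first data row is used up as the header
with_default <- read_csv(tmp)
nrow(with_default)

# Supplying col_names keeps every row as data
with_names <- read_csv(tmp, col_names = c("carat", "cut"))
nrow(with_names)
names(with_names)
```

The default read comes back one row short, which is the same effect that makes a header-less file with 100 records show up as a tibble with 99 rows.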

Re-inspect the header by printing the data. Has the header been fixed?

# print diamonds
diamonds

## # A tibble: 100 x 7
##    carat cut       color clarity depth table price
##    <chr> <chr>     <chr> <chr>   <dbl> <dbl> <int>
##  1 0.8   Very Good H     VVS1     62.9    58  4468
##  2 0.74  Ideal     H     IF       60.9    57  3760
##  3 2.03  Premium   I     SI1      61.4    58 15683
##  4 0.41  Ideal     G     VVS1     62.1    55  1151
##  5 1.54  Premium   G     VS1      61.1    56 14438
##  6 0.3   Ideal     E     VS2      61.8    55   795
##  7 0.3   Ideal     H     VVS2     61.5    56   605
##  8 1.2   Ideal     D     SI1      61.8    58  7508
##  9 0.58  Ideal     E     VS2      62.3    54  1809
## 10 0.31  Ideal     H     VS2      62.6    57   489
## # ... with 90 more rows

Now pay attention to the classes of the individual columns (variables). Have all classes been identified correctly? What about the carat column? It should be numeric, right?

# print diamonds
diamonds

## # A tibble: 100 x 7
##    carat cut       color clarity depth table price
##    <chr> <chr>     <chr> <chr>   <dbl> <dbl> <int>
##  1 0.8   Very Good H     VVS1     62.9    58  4468
##  2 0.74  Ideal     H     IF       60.9    57  3760
##  3 2.03  Premium   I     SI1      61.4    58 15683
##  4 0.41  Ideal     G     VVS1     62.1    55  1151
##  5 1.54  Premium   G     VS1      61.1    56 14438
##  6 0.3   Ideal     E     VS2      61.8    55   795
##  7 0.3   Ideal     H     VVS2     61.5    56   605
##  8 1.2   Ideal     D     SI1      61.8    58  7508
##  9 0.58  Ideal     E     VS2      62.3    54  1809
## 10 0.31  Ideal     H     VS2      62.6    57   489
## # ... with 90 more rows

Let’s see what went wrong. Select and print the carat variable to identify the one entry that caused the variable to become a character vector (Hint: look for a comma between entries 10 and 20).

# print the carat column
diamonds$carat

##   [1] "0.8"  "0.74" "2.03" "0.41" "1.54" "0.3"  "0.3"  "1.2"  "0.58" "0.31"
##  [11] "1.24" "0.91" "1.28" "0.31" "1.02" "1"    "0,37" "0.55" "0.54" "0.34"
##  [21] "0.91" "0.9"  "0.5"  "0.31" "1.66" "0.47" "0.3"  "0.7"  "1.72" "0.41"
##  [31] "1.06" "0.32" "0.4"  "0.71" "0.3"  "1.31" "1.08" "0.45" "0.3"  "0.62"
##  [41] "1.01" "2"    "0.38" "2.03" "1"    "0.38" "0.41" "0.49" "0.71" "1.51"
##  [51] "1.02" "1.3"  "0.32" "1.52" "0.59" "1.31" "1.05" "1.08" "0.43" "1.08"
##  [61] "0.3"  "0.4"  "0.52" "0.41" "1"    "0.33" "0.75" "0.26" "0.34" "1.49"
##  [71] "0.3"  "0.4"  "0.71" "0.92" "0.7"  "0.55" "1.47" "0.42" "0.58" "0.44"
##  [81] "0.31" "0.3"  "0.55" "0.41" "0.31" "0.33" "0.32" "2.67" "0.88" "0.57"
##  [91] "0.36" "0.53" "0.79" "0.9"  "0.31" "1.03" "0.39" "0.51" "0.34" "0.25"

Change the incorrectly formatted entry in carat by replacing XX with the index of the incorrect value (i.e., the correct number between 10 and 20) and YY with the correct entry with a period (.) instead of a comma (,) in the code below.

# Change the value at position XX to YY
diamonds$carat[XX] <- YY

# Change the value at position 17 to 0.37
diamonds$carat[17] <- 0.37

OK, you fixed the value, but carat is still a character vector. We can fix it with the type_convert() function. Apply the type_convert() function to the diamonds data to have R fix all the data types. Make sure to assign the result back to diamonds so that you change the object!

# re-infer data types
diamonds <- type_convert(diamonds)
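type_convert() re-guesses the types of all character columns at once. For a single column, the same repair can be done by hand in base R. The sketch below uses a short made-up vector carat_raw as a stand-in for diamonds$carat; it locates comma entries with grepl() and which() rather than by eye, fixes them with sub(), and converts with as.numeric(). Treat it as one possible alternative, not the approach this practical requires.

```r
# A made-up stand-in for diamonds$carat with one comma entry
carat_raw <- c("0.8", "0.74", "0,37", "0.55")

# Which positions contain a comma?
bad <- which(grepl(",", carat_raw, fixed = TRUE))
bad

# Replace the comma with a period at those positions only
carat_raw[bad] <- sub(",", ".", carat_raw[bad], fixed = TRUE)

# The vector is still character; convert it to numeric explicitly
carat_num <- as.numeric(carat_raw)
class(carat_num)
```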

Print the diamonds object and look at the column types. Has the type of carat changed to double?

# print diamonds data set
diamonds

## # A tibble: 100 x 7
##    carat cut       color clarity depth table price
##    <dbl> <chr>     <chr> <chr>   <dbl> <dbl> <int>
##  1 0.8   Very Good H     VVS1     62.9    58  4468
##  2 0.74  Ideal     H     IF       60.9    57  3760
##  3 2.03  Premium   I     SI1      61.4    58 15683
##  4 0.41  Ideal     G     VVS1     62.1    55  1151
##  5 1.54  Premium   G     VS1      61.1    56 14438
##  6 0.3   Ideal     E     VS2      61.8    55   795
##  7 0.3   Ideal     H     VVS2     61.5    56   605
##  8 1.2   Ideal     D     SI1      61.8    58  7508
##  9 0.580 Ideal     E     VS2      62.3    54  1809
## 10 0.31  Ideal     H     VS2      62.6    57   489
## # ... with 90 more rows

Write the now properly formatted diamonds data to your data folder as a .csv file using the name diamonds_clean.csv. Don’t forget to include both the file name and the folder (separated by /) in the character string specifying the path argument.

# write clean diamonds data to disc
write_csv(x = XX, path = "XX")

# write clean diamonds data to disc
write_csv(x = diamonds, "1_Data/diamonds_clean.csv")

Read diamonds_clean.csv back into R as a new object called diamonds_clean. Then, print the object and verify that this time the types have been correctly identified from the start.

# read clean diamonds data from disc
diamonds_clean <- read_csv(file = "1_Data/diamonds_clean.csv")
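One quick way to check that a write/read round trip preserved the types is to compare column classes before and after. Below is a small sketch with a made-up two-row tibble written to a temporary file; write_csv() and read_csv() come from readr, tibble() from tibble, and the rest is base R.

```r
library(readr)
library(tibble)

# Made-up stand-in for the cleaned diamonds data
clean <- tibble(carat = c(0.8, 0.37), cut = c("Very Good", "Ideal"))

# Write to a temporary file and read it back
tmp <- tempfile(fileext = ".csv")
write_csv(clean, tmp)
back <- read_csv(tmp)

# Column classes before and after the round trip
sapply(clean, class)
sapply(back, class)
```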

The data is now ready for analysis. Explore it a bit by calculating a few statistics. For instance, what is the average carat or price (use mean())? What cut and clarity levels exist and how often do they occur (use table() on both variables)? You can learn more about the variable values from the help file ?diamonds.

# simple stats of diamonds
mean(diamonds$carat)

## [1] 0.742

mean(diamonds$price)

## [1] 3543

table(diamonds$cut)

## 
##      Fair      Good     Ideal   Premium Very Good 
##         2         5        46        31        16

table(diamonds$clarity)

## 
##   I1   IF  SI1  SI2  VS1  VS2 VVS1 VVS2 
##    1    4   17   19   15   24   10   10

E - Logical Vectors and `$`

Logical vectors contain only the values TRUE and FALSE (and NA). Create a new logical vector called expensive indicating which diamonds are more expensive than $10000. To do this, select the price variable from the data frame using $ and apply the > (greater than) operator, à la vector > value.

# Create a logical vector expensive indicating
# which diamonds cost more than 10,000

ZZ <- diamonds$XX > YY

# Create a logical vector expensive indicating
# which diamonds cost more than 10,000

expensive <- diamonds$price > 10000

Print your expensive vector to the console. Do you see only TRUE and FALSE values? If so, do the first few values match those in the price variable?

# print expensive
expensive

##   [1] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
##  [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [23] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE
##  [45] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
##  [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [67] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
##  [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE

Add your expensive vector to the diamonds data frame using data_frame$variable_name <- variable. See the template below.

# add vector to data frame
XX$YY <- ZZ

# add vector to data frame
diamonds$expensive <- expensive

Using the table() function, create a table showing how many of the diamonds are expensive and how many are not. Select the variable directly from the data frame using $.

# count expensive diamonds
table(diamonds$expensive)

## 
## FALSE  TRUE 
##    92     8

Using the mean() function, determine the percentage of the diamonds that are expensive, i.e., mean(expensive). Should this have worked?

# percentage of expensive diamonds
mean(diamonds$expensive)

## [1] 0.08
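This works because R treats TRUE as 1 and FALSE as 0 in arithmetic, so the sum of a logical vector counts the TRUE values and its mean is the proportion of TRUE values. A minimal sketch, using five prices taken from the printout of the diamonds data above (the object name price is made up here):

```r
# Five prices from the diamonds printout
price <- c(4468, 15683, 1151, 14438, 795)

# Logical vector: TRUE where the price exceeds 10000
expensive <- price > 10000

sum(expensive)         # number of TRUE values
mean(expensive)        # proportion of TRUE values
mean(expensive) * 100  # the same as a percentage
```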

What percentage of diamonds weigh more than 1 carat (i.e., more than 0.2 grams)?

# percentage of diamonds heavier than 1 carat
mean(diamonds$carat > 1)

## [1] 0.26

F - Read other file formats

Excel

Using read_excel(), read in the titanic.xls dataset as a new object called titanic. (Make sure you have already loaded the readxl package at the beginning of your script.)

titanic <- read_excel(path = "XX")

# Read titanic data
titanic <- read_excel("1_Data/titanic.xls")

Print titanic and evaluate its dimensions using dim().

# print and show dimensions
titanic

## # A tibble: 1,309 x 14
##    pclass survived name  sex      age sibsp parch ticket  fare cabin
##     <dbl>    <dbl> <chr> <chr>  <dbl> <dbl> <dbl> <chr>  <dbl> <chr>
##  1      1        1 Alle… fema… 29         0     0 24160  211.  B5   
##  2      1        1 Alli… male   0.917     1     2 113781 152.  C22 …
##  3      1        0 Alli… fema…  2         1     2 113781 152.  C22 …
##  4      1        0 Alli… male  30         1     2 113781 152.  C22 …
##  5      1        0 Alli… fema… 25         1     2 113781 152.  C22 …
##  6      1        1 Ande… male  48         0     0 19952   26.6 E12  
##  7      1        1 Andr… fema… 63         1     0 13502   78.0 D7   
##  8      1        0 Andr… male  39         0     0 112050   0   A36  
##  9      1        1 Appl… fema… 53         2     0 11769   51.5 C101 
## 10      1        0 Arta… male  71         0     0 PC 17…  49.5 <NA> 
## # ... with 1,299 more rows, and 4 more variables: embarked <chr>,
## #   boat <chr>, body <dbl>, home.dest <chr>

dim(titanic)

## [1] 1309   14

Using table(), how many people survived (variable survived) in each cabin class (variable pclass)?

# determine survival rate by cabin class
table(titanic$XX, 
      titanic$XX)

# determine survival rate by cabin class
table(titanic$pclass, titanic$survived)

##    
##       0   1
##   1 123 200
##   2 158 119
##   3 528 181
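The table gives counts; to turn them into survival rates per class, each row can be divided by its total. The sketch below rebuilds the counts shown above as a matrix so that it runs without the data file (the object names counts and rates are made up) and applies base R's prop.table() with margin = 1.

```r
# Counts from the table above: rows = pclass, columns = survived (0 = no, 1 = yes)
counts <- matrix(c(123, 200,
                   158, 119,
                   528, 181),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(pclass = c("1", "2", "3"),
                                 survived = c("0", "1")))

# Row-wise proportions: each row sums to 1
rates <- prop.table(counts, margin = 1)
round(rates, 2)

# With the real data the equivalent call would be:
# prop.table(table(titanic$pclass, titanic$survived), margin = 1)
```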

Using write_csv(), write the titanic dataframe as a new comma separated text file called titanic.csv in your 1_Data folder. Now you have the data saved as a text file any software can use!

# write data to .csv
write_csv(x = titanic, 
          path = "1_Data/titanic.csv")

SPSS

Using read_spss(), read in the sleep data set sleep.sav of staff at the University of Melbourne as a new object called sleep. (Make sure that you have first loaded the haven package.)

XX <- read_spss(file = "XX")

# Read sleep data
sleep <- read_spss(file = "1_Data/sleep.sav")

Print your sleep object and evaluate its dimensions using dim().

# print and show dimensions
sleep

## # A tibble: 271 x 55
##       id sex     age marital edlevel weight height healthrate fitrate
##    <dbl> <dbl> <dbl> <dbl+l> <dbl+l>  <dbl>  <dbl> <dbl+lbl>  <dbl+l>
##  1    83 0        42 2       2           52    162 10         7      
##  2   294 0        54 2       5           65    174 " 8"       7      
##  3   425 1        NA 2       2           89    170 " 6"       5      
##  4    64 0        41 2       5           66    178 " 9"       7      
##  5   536 0        39 2       5           62    160 " 9"       5      
##  6    57 0        66 2       4           62    165 " 8"       8      
##  7   251 0        36 1       3           62    165 " 9"       7      
##  8   255 0        35 2       5           75    174 " 6"       6      
##  9   265 1        NA 2       5           90    180 " 6"       6      
## 10   290 1        41 2       5           75    187 " 9"       9      
## # ... with 261 more rows, and 46 more variables: weightrate <dbl+lbl>,
## #   smoke <dbl+lbl>, smokenum <dbl>, alchohol <dbl>, caffeine <dbl>,
## #   hourwnit <dbl>, hourwend <dbl>, hourneed <dbl>, trubslep <dbl+lbl>,
## #   trubstay <dbl+lbl>, wakenite <dbl+lbl>, niteshft <dbl+lbl>,
## #   liteslp <dbl+lbl>, refreshd <dbl+lbl>, satsleep <dbl+lbl>,
## #   qualslp <dbl+lbl>, stressmo <dbl+lbl>, medhelp <dbl+lbl>,
## #   problem <dbl+lbl>, impact1 <dbl+lbl>, impact2 <dbl+lbl>,
## #   impact3 <dbl+lbl>, impact4 <dbl+lbl>, impact5 <dbl+lbl>,
## #   impact6 <dbl+lbl>, impact7 <dbl+lbl>, stopb <dbl+lbl>,
## #   restlss <dbl+lbl>, drvsleep <dbl+lbl>, drvresul <dbl+lbl>, ess <dbl>,
## #   anxiety <dbl>, depress <dbl>, fatigue <dbl>, lethargy <dbl>,
## #   tired <dbl>, sleepy <dbl>, energy <dbl>, stayslprec <dbl+lbl>,
## #   getsleprec <dbl+lbl>, qualsleeprec <dbl+lbl>, totsas <dbl>,
## #   cigsgp3 <dbl+lbl>, agegp3 <dbl+lbl>, probsleeprec <dbl+lbl>,
## #   drvslprec <dbl+lbl>

dim(sleep)

## [1] 271  55

How many drinks do staff at the University of Melbourne consume per day (variable alchohol; note that the name is misspelled in the data)? To find out, use the mean() function, taking care of missing values with the na.rm argument.

# compute mean number of drinks
mean(x = sleep$alchohol, na.rm = TRUE)

Using the write_csv() function, write the sleep data to a new file called sleep.csv in your 1_Data folder. Now you have the sleep data stored as a text file any software can use!

# write data to .csv
write_csv(x = sleep, 
          path = "1_Data/sleep.csv")

SAS

Using read_sas(), read in airbnb_zuerich.sas7bdat containing AirBnB listings in Zürich, Switzerland and call the object airbnb_zuerich.

# read sas data
XX <- read_sas(data_file = "XX")

# read airbnb_zuerich.sas7bdat
airbnb_zuerich <- read_sas(data_file = "1_Data/airbnb_zuerich.sas7bdat")

Print airbnb_zuerich and then evaluate its dimensions using dim().

# print and show dimensions
airbnb_zuerich

## # A tibble: 2,392 x 20
##    room_id survey_id host_id room_type country city  borough neighborhood
##      <dbl>     <dbl>   <dbl> <chr>     <chr>   <chr> <chr>   <chr>       
##  1  1.37e7      1363  5.63e7 Entire h… ""      Zuri… ""      Kreis 12    
##  2  8.00e6      1363  1.65e7 Entire h… ""      Zuri… ""      Kreis 7     
##  3  1.52e7      1363  5.03e7 Entire h… ""      Zuri… ""      Kreis 4     
##  4  7.56e6      1363  4.92e6 Entire h… ""      Zuri… ""      Kreis 1     
##  5  1.86e7      1363  2.04e7 Entire h… ""      Zuri… ""      Kreis 12    
##  6  6.44e6      1363  1.24e7 Entire h… ""      Zuri… ""      Kreis 2     
##  7  1.88e6      1363  1.60e6 Entire h… ""      Zuri… ""      Kreis 7     
##  8  3.63e6      1363  1.83e7 Entire h… ""      Zuri… ""      Kreis 8     
##  9  1.44e7      1363  5.19e7 Entire h… ""      Zuri… ""      Kreis 2     
## 10  1.28e7      1363  2.64e5 Entire h… ""      Zuri… ""      Kreis 2     
## # ... with 2,382 more rows, and 12 more variables: reviews <dbl>,
## #   overall_satisfaction <dbl>, accommodates <dbl>, bedrooms <dbl>,
## #   bathrooms <chr>, price <dbl>, minstay <chr>, name <chr>,
## #   last_modified <dttm>, latitude <dbl>, longitude <dbl>, location <chr>

dim(airbnb_zuerich)

## [1] 2392   20

How many AirBnB listings were there of each room_type in Zürich? (Hint: use table())

# table room type
table(airbnb_zuerich$room_type)

## 
## Entire home/apt    Private room     Shared room 
##            1386             975              31

Using write_csv(), write your airbnb_zuerich data frame to a new comma-separated text file called airbnb_zuerich.csv in your 1_Data folder.

# write data to .csv
write_csv(x = airbnb_zuerich, 
          path = "1_Data/airbnb_zuerich.csv")

G - Creating data frames

Using the data.frame() function, create a data frame called ten_df that contains each of the vectors you created in section B: id, age, sex, height, weight.

# Create data frame ten_df containing vectors id, age, sex, height, and weight.
XX <- data.frame(XX, XX, XX, XX, XX)

# Create ten_df data frame from vectors
ten_df <- data.frame(id, age, sex, height, weight)

Print your ten_df object to see how it looks! Does it contain all of the vectors?

# Print ten_df
ten_df

##    id age    sex height weight
## 1   1  44   male    174  113.4
## 2   2  65   male    180   75.2
## 3   3  31 female    168   55.5
## 4   4  27   male    209   93.8
## 5   5  24   male    177     NA
## 6   6  63   male    187   67.4
## 7   7  71   male    152   83.3
## 8   8  41 female    156   67.8
## 9   9  43   male    176   69.3
## 10 10  31 female    166   66.3

Using the dim() function, print the number of rows and columns in your data frame. Do you get 10 rows and 5 columns?

# Inspect dimensions
dim(ten_df)

## [1] 10  5

What is the class of your ten_df object? Use the class() function to find out!

# Inspect class
class(ten_df)

## [1] "data.frame"

Use the summary() function to print descriptive statistics from each column of ten_df

# Print descriptive statistics
summary(ten_df)

##        id             age           sex        height        weight     
##  Min.   : 1.00   Min.   :24.0   female:3   Min.   :152   Min.   : 55.5  
##  1st Qu.: 3.25   1st Qu.:31.0   male  :7   1st Qu.:167   1st Qu.: 67.4  
##  Median : 5.50   Median :42.0              Median :175   Median : 69.3  
##  Mean   : 5.50   Mean   :44.0              Mean   :174   Mean   : 76.9  
##  3rd Qu.: 7.75   3rd Qu.:58.2              3rd Qu.:179   3rd Qu.: 83.3  
##  Max.   :10.00   Max.   :71.0              Max.   :209   Max.   :113.4  
##                                                          NA's   :1

Using the $ operator, print the age column from the ten_df data frame.

# Inspect age
ten_df$age

##  [1] 44 65 31 27 24 63 71 41 43 31

Calculate the maximum age value from the ten_df data frame using max(). Do you get the same result as when you calculate it from the original vector age?

# Get max
max(ten_df$age)

## [1] 71

Instead of creating a data frame with the data.frame() function, try creating a tibble called ten_tibble using the tibble() function. Tibbles are a more modern, leaner variant of data frames that we prefer over classic data.frames. You can use the exact same arguments you used before.

# create tibble
ten_tibble <- tibble(id, age, sex, height, weight)

Print your new ten_tibble object. How does it look different from ten_df? Try calculating the maximum age from this object. Is it different from what you got before?

# print tibble
ten_tibble

## # A tibble: 10 x 5
##       id   age sex    height weight
##    <int> <dbl> <chr>   <dbl>  <dbl>
##  1     1    44 male     174.  113. 
##  2     2    65 male     180.   75.2
##  3     3    31 female   168.   55.5
##  4     4    27 male     209    93.8
##  5     5    24 male     177.   NA  
##  6     6    63 male     187.   67.4
##  7     7    71 male     152.   83.3
##  8     8    41 female   156.   67.8
##  9     9    43 male     176.   69.3
## 10    10    31 female   166.   66.3

max(ten_tibble$age) == max(ten_df$age)

## [1] TRUE
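A tibble is built on top of a data frame, so everything you have learned about $ carries over. The sketch below illustrates this with two short made-up vectors (small_df and small_tibble are names invented for the example); it assumes the tibble package, which is loaded as part of the tidyverse.

```r
library(tibble)

# Two short vectors for illustration
id  <- 1:3
age <- c(44, 65, 31)

small_df     <- data.frame(id, age)
small_tibble <- tibble(id, age)

# A tibble carries extra classes on top of "data.frame"
class(small_df)
class(small_tibble)

# It is still a data frame, so $ works the same way
is.data.frame(small_tibble)
max(small_tibble$age) == max(small_df$age)
```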

X - Challenges

If you take the sum() of a logical vector, R will return the number of cases that are TRUE. Using this, find out how many of the ten Baselers are male while using the is-equal-to operator ==.

# Determine the frequency of a case in a vector
sum(XX == XX)

# Determine the frequency of a case in a vector
sum(ten_tibble$sex == "male")

## [1] 7

You can use logical vectors to select rows from a data frame based on certain criteria. Using the following template, get the id values of Baselers who are younger than 30:

# Create a logical vector indicating which baselers are younger than 30
young_30 <- XX$XX < 30

# Print the ids of baselers younger than 30
XX$XX[young_30]

# Create a logical vector indicating which baselers are younger than 30
young_30 <- ten_df$age < 30

# Print the ids of baselers younger than 30
ten_df$id[young_30]

## [1] 4 5

Use a combination of logical vectors and the mean() function to answer the question: “What is the mean age of Baselers who are heavier than 80kg?”

# Mean age of baselers heavier than 80kg
# (na.rm = TRUE drops the person whose weight is missing)
mean(ten_df$age[ten_df$weight > 80], na.rm = TRUE)

## [1] 47.3

What are the id values of Baselers who are male and are shorter than 165cm? (Hint: You will need to use the logical AND operator & to combine multiple logical vectors)

# ids of male baselers shorter than 165cm
ten_tibble$id[ten_tibble$sex == "male" & ten_tibble$height < 165]

## [1] 7
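Missing values need some care when indexing with logical vectors: a comparison involving NA yields NA, and indexing with NA returns an NA element rather than dropping the case. The sketch below repeats the age and weight vectors from section B so that it runs on its own, and shows which() as one possible way of keeping only the positions that are TRUE.

```r
# Vectors from section B
age    <- c(44, 65, 31, 27, 24, 63, 71, 41, 43, 31)
weight <- c(113.4, 75.2, 55.5, 93.8, NA, 67.4, 83.3, 67.8, 69.3, 66.3)

# Logical vector with one NA, for the missing weight
heavy <- weight > 80
heavy

# Indexing with NA produces an NA element
age[heavy]

# which() keeps only the positions that are TRUE
age[which(heavy)]
mean(age[which(heavy)])
```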

Additional Resources

For more information on the fundamentals of objects and functions in R, see the R Core team’s introduction to R; for more advanced object- and function-related topics, see Hadley Wickham’s Advanced R.
For more information on reading and writing data (and everything else), see Grolemund and Wickham’s R for Data Science.

Data

Introduction to Data Science with R, www.therbootcamp.com, @therbootcamp