R for Data Science Basel R Bootcamp
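In this practical you’ll learn how to work with basic data objects and functions. By the end of this practical you will know how to:
- Create vectors using c()
- Check the class of an object using class()
- Calculate basic descriptive statistics using mean(), median(), table() (and more!)
- Read data from files using read_csv() and others
- Access vectors from data frames using $
- Create data.frames and tibbles using data.frame() and tibble()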
Open your BaselRBootcamp R project. It should already have the folders 1_Data and 2_Code. Make sure that each of the datasets listed in the Datasets section lies in your 1_Data folder.
Open a new R script and save it as a new file called data_practical.R in the 2_Code folder. At the top of the script, using comments, write your name and the date.
Load the following packages with library(). If any of them are not installed on your computer, look at the Functions section for installation instructions.
# Load packages
library(tidyverse)
library(haven)
library(readxl)
The table below shows results from a (fictional) survey of 5 Baselers. In the first part of this practical, you will convert this table to R objects and then analyse them!
5 Baselers Table
id | sex | age | height | weight |
---|---|---|---|---|
1 | m | 44 | 174 | 113 |
2 | m | 65 | 180 | 75 |
3 | f | 31 | 168 | 56 |
4 | m | 27 | 209 | 94 |
5 | m | 24 | 177 | |
Create a vector called id that shows the id values from the 5 Baselers table. When you finish, print the vector object to see it!
# Create a vector id
XX <- c(XX, XX, ...)
# Print the vector id
XX
# Create an id vector
id <- 1:5 # shortcut to creating the sequence from 1 to 5
# Print the vector
id
## [1] 1 2 3 4 5
Using the class() function, check the class of your id vector. Is it "numeric"?
# Show the class of an object XX
class(XX)
# Show the class of the id vector
class(id)
## [1] "integer"
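The class comes back as "integer" rather than "numeric" because the shortcut 1:5 stores whole numbers, whereas c() with plain numbers stores doubles. A small sketch of the difference (the object names id_seq and id_c are illustrative and not part of the practical):

```r
# Two ways of building the same id values
id_seq <- 1:5               # the colon shortcut stores integers
id_c   <- c(1, 2, 3, 4, 5)  # c() with plain numbers stores doubles

class(id_seq)  # "integer"
class(id_c)    # "numeric"

# Both count as numeric data and hold equal values
is.numeric(id_seq)
all(id_seq == id_c)
```

For the calculations in this practical either version behaves the same way.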
Using the length() function, find out the length of your id vector. Does it have length 5? If not, make sure you defined it correctly!
# Show the length of the id vector
length(XX)
# Show the length of the id vector
length(id)
## [1] 5
Create a vector called sex that shows the sex values from the 5 Baselers table. Make sure to use quotation marks "" to enclose each element to tell R that the data are of type "character"! When you finish, print the object to see it!
# Create a character vector sex
XX <- c("XX", "XX", "...")
# Create a sex vector
sex <- c("m", "m", "f", "m", "m")
# Print the vector
sex
## [1] "m" "m" "f" "m" "m"
Using the class() function, check the class of your sex vector. Is it "character"?
# Show the class of the sex vector
class(sex)
## [1] "character"
Using the length() function, find out the length of your sex object. Does it have length 5? If not, make sure you defined it correctly!
# Show the length of the sex vector
length(sex)
## [1] 5
Following the same steps, create the age and height vectors from the 5 Baselers table.
# Create an age vector
age <- c(44, 65, 31, 27, 24)
# Print the age vector
age
## [1] 44 65 31 27 24
# Show the class of the age vector
class(age)
## [1] "numeric"
# Show the length of the age vector
length(age)
## [1] 5
# Create a height vector
height <- c(174, 180, 168, 209, 177)
# Print the height vector
height
## [1] 174 180 168 209 177
# Show the class of the height vector
class(height)
## [1] "numeric"
# Show the length of the height vector
length(height)
## [1] 5
The weight of the fifth Baseler is missing from the table. Create a vector called weight containing these data, following the same steps as before, making sure to specify the missing value as NA (no quotation marks).
# Create a weight vector
weight <- c(113, 75, 56, 94, NA)
# Print the weight vector
weight
## [1] 113 75 56 94 NA
# Show the class of the weight vector
class(weight)
## [1] "numeric"
# Show the length of the weight vector
length(weight)
## [1] 5
Using the table() function, find out how many males and females are in your sex object. You should find 4 males and 1 female!
# Count types in sex
table(sex)
## sex
## f m
## 1 4
Using the mean() function, calculate the mean age. It should be 38.2!
# Compute mean of age
mean(age)
## [1] 38.2
Now try to calculate the mean of sex. What happens? Why?
# Compute mean of sex
mean(sex)
## [1] NA
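mean() only works on numeric or logical vectors, so for a character vector such as sex it returns NA and issues a warning. As a sketch of what does work, compare sex against a value first and then summarise the resulting logical vector (the name is_male is illustrative):

```r
sex <- c("m", "m", "f", "m", "m")

# mean() of a character vector is NA (R also issues a warning)
suppressWarnings(mean(sex))

# Comparing gives a logical vector, which sum() and mean() can handle
is_male <- sex == "m"
sum(is_male)   # number of males: 4
mean(is_male)  # proportion of males: 0.8
```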
Now calculate the mean of weight. You should get an NA value. Why?
# Compute mean of weight
mean(weight)
## [1] NA
Look at the help file of the mean() function (using ?mean) to look for an argument that will help you with your problem.
# Inspect help for mean
?mean
Now calculate the mean weight again, this time ignoring NA values. It should be 84.5!
# Compute mean weight, ignoring NAs
mean(weight, na.rm = TRUE)
## [1] 84.5
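A short sketch of how the missing value behaves, using the weights from the 5 Baselers table. is.na() is a base R function that flags missing entries:

```r
# Weights from the 5 Baselers table; the fifth value is missing
weight <- c(113, 75, 56, 94, NA)

is.na(weight)       # TRUE marks the missing entry
sum(is.na(weight))  # number of missing values: 1

mean(weight)                # NA, because one value is unknown
mean(weight, na.rm = TRUE)  # 84.5, the mean of the four known weights
```

The same na.rm argument is available in median(), sd(), sum(), max() and min().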
In this section, you will read in a subset of the well known diamonds data set and prepare it for data analysis.
Identify the file path to the diamonds.csv dataset using the "" (quotation marks) auto-complete trick. Place the cursor between two quotation marks, hit ⇥ (tab key), and browse through the folders. Save the file path, for now, in an object called diamonds_path.
# place cursor in-between "" and hit tab
diamonds_path <- ""
# place cursor in-between and hit tab
diamonds_path <- "1_Data/diamonds.csv"
Use diamonds_path inside the read_csv() function to read in the diamonds.csv dataset. Store it as a new object called diamonds.
# read diamonds data
diamonds <- read_csv(file = XX)
# read diamonds data
diamonds <- read_csv(file = diamonds_path)
Print the diamonds data and inspect the column names in the header line. Something’s wrong!
# print diamonds
diamonds
## # A tibble: 99 x 7
## `0.8` `Very Good` H VVS1 `62.9` `58` `4468`
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 0.74 Ideal H IF 60.9 57 3760
## 2 2.03 Premium I SI1 61.4 58 15683
## 3 0.41 Ideal G VVS1 62.1 55 1151
## 4 1.54 Premium G VS1 61.1 56 14438
## 5 0.3 Ideal E VS2 61.8 55 795
## 6 0.3 Ideal H VVS2 61.5 56 605
## 7 1.2 Ideal D SI1 61.8 58 7508
## 8 0.58 Ideal E VS2 62.3 54 1809
## 9 0.31 Ideal H VS2 62.6 57 489
## 10 1.24 Very Good F VS1 59 60 9885
## # … with 89 more rows
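The file has no header line, so read_csv() used the first data row as column names. Fix this by reading the data again with the col_names argument. Assign to col_names a character vector containing the correct column names: carat, cut, color, clarity, depth, table, price.
# read diamonds data with specified col_names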
diamonds <- read_csv(file = "XX",
col_names = c('name_1','name_2','...')) # Vector of column names
# read diamonds data with specified col_names
diamonds <- read_csv(file = diamonds_path,
col_names = c("carat", "cut", "color", "clarity", "depth", "table", "price"))
# print diamonds
diamonds
## # A tibble: 100 x 7
## carat cut color clarity depth table price
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 0.8 Very Good H VVS1 62.9 58 4468
## 2 0.74 Ideal H IF 60.9 57 3760
## 3 2.03 Premium I SI1 61.4 58 15683
## 4 0.41 Ideal G VVS1 62.1 55 1151
## 5 1.54 Premium G VS1 61.1 56 14438
## 6 0.3 Ideal E VS2 61.8 55 795
## 7 0.3 Ideal H VVS2 61.5 56 605
## 8 1.2 Ideal D SI1 61.8 58 7508
## 9 0.58 Ideal E VS2 62.3 54 1809
## 10 0.31 Ideal H VS2 62.6 57 489
## # … with 90 more rows
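The effect of col_names can be reproduced on a tiny made-up file. The sketch below assumes the readr package is installed; the temporary file and its three rows are illustrative and not part of the practical's data:

```r
library(readr)

# Write a tiny header-less file to a temporary location (illustrative data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("0.8,Very Good", "0.74,Ideal", "2.03,Premium"), tmp)

# Default: the first data row is consumed as the header
no_names <- suppressMessages(read_csv(tmp))
nrow(no_names)    # 2

# With col_names, all three rows are kept as data
with_names <- suppressMessages(read_csv(tmp, col_names = c("carat", "cut")))
nrow(with_names)  # 3
names(with_names)
```

This mirrors the jump from 99 to 100 rows in the diamonds output above.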
Now pay attention to the classes of the individual columns (variables). Have all classes been identified correctly? What about the carat column? It should be numeric, right?
Let’s see what went wrong. Select and print the carat variable to identify the one entry that caused the variable to become a character vector (Hint: look for a comma among entries 10 to 20).
# print the carat column
diamonds$carat
## [1] "0.8" "0.74" "2.03" "0.41" "1.54" "0.3" "0.3" "1.2" "0.58" "0.31"
## [11] "1.24" "0.91" "1.28" "0.31" "1.02" "1" "0,37" "0.55" "0.54" "0.34"
## [21] "0.91" "0.9" "0.5" "0.31" "1.66" "0.47" "0.3" "0.7" "1.72" "0.41"
## [31] "1.06" "0.32" "0.4" "0.71" "0.3" "1.31" "1.08" "0.45" "0.3" "0.62"
## [41] "1.01" "2" "0.38" "2.03" "1" "0.38" "0.41" "0.49" "0.71" "1.51"
## [51] "1.02" "1.3" "0.32" "1.52" "0.59" "1.31" "1.05" "1.08" "0.43" "1.08"
## [61] "0.3" "0.4" "0.52" "0.41" "1" "0.33" "0.75" "0.26" "0.34" "1.49"
## [71] "0.3" "0.4" "0.71" "0.92" "0.7" "0.55" "1.47" "0.42" "0.58" "0.44"
## [81] "0.31" "0.3" "0.55" "0.41" "0.31" "0.33" "0.32" "2.67" "0.88" "0.57"
## [91] "0.36" "0.53" "0.79" "0.9" "0.31" "1.03" "0.39" "0.51" "0.34" "0.25"
Correct the value in carat by replacing XX with the index of the incorrect value (i.e., the correct number between 10 and 20) and YY with the correct entry, written with a period (.) instead of a comma (,), in the code below.
# Change the value at position XX to YY
diamonds$carat[XX] <- YY
# Change the value at position XX to YY
diamonds$carat[17] <- 0.37
Even after this correction, carat is still character. We can fix it with the type_convert() function. Apply the type_convert() function to the diamonds data to have R re-infer all the data types. Make sure to assign the result back to diamonds so that you change the object!
# re-infer data types
diamonds <- type_convert(diamonds)
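type_convert() re-guesses the type of every character column at once. To see the mechanics on a single vector, here is a base R sketch on a small illustrative vector (not the full data); sub() and as.numeric() are base functions, and the names carat_chr and carat_num are made up for this example:

```r
# A character vector with one decimal comma, as in the carat column
carat_chr <- c("0.8", "0.74", "0,37", "1.2")

# Converting directly turns the malformed entry into NA (with a warning)
suppressWarnings(as.numeric(carat_chr))

# Replace the comma with a period first, then convert
carat_num <- as.numeric(sub(",", ".", carat_chr, fixed = TRUE))
carat_num
class(carat_num)  # "numeric"
```

This is why a single stray comma is enough to keep a whole column from being read as numeric.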
Print the diamonds object and look at the column types. Has the type of carat changed to double?
# print diamonds data set
diamonds
## # A tibble: 100 x 7
## carat cut color clarity depth table price
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 0.8 Very Good H VVS1 62.9 58 4468
## 2 0.74 Ideal H IF 60.9 57 3760
## 3 2.03 Premium I SI1 61.4 58 15683
## 4 0.41 Ideal G VVS1 62.1 55 1151
## 5 1.54 Premium G VS1 61.1 56 14438
## 6 0.3 Ideal E VS2 61.8 55 795
## 7 0.3 Ideal H VVS2 61.5 56 605
## 8 1.2 Ideal D SI1 61.8 58 7508
## 9 0.580 Ideal E VS2 62.3 54 1809
## 10 0.31 Ideal H VS2 62.6 57 489
## # … with 90 more rows
Write the clean data to a new .csv file using the name diamonds_clean.csv. Don’t forget to include both the file name and the folder (separated by /) in the character string specifying the path argument (recent readr versions call this argument file).
# write clean diamonds data to disc
write_csv(x = XX, path = "XX")
# write clean diamonds data to disc
write_csv(x = diamonds, "1_Data/diamonds_clean.csv")
Read diamonds_clean.csv back into R as a new object called diamonds_clean. Then, print the object and verify that this time the types have been correctly identified from the start.
# read clean diamonds data from disc
diamonds_clean <- read_csv(file = "1_Data/diamonds_clean.csv")
Now explore the data. What is the mean carat or price (use mean())? What cut and clarity levels exist and how often do they occur (use table() on both variables)? You can learn more about the variable values from the help file ?diamonds.
# simple stats of diamonds
mean(diamonds$carat)
## [1] 0.742
mean(diamonds$price)
## [1] 3543
table(diamonds$cut)
##
## Fair Good Ideal Premium Very Good
## 2 5 46 31 16
table(diamonds$clarity)
##
## I1 IF SI1 SI2 VS1 VS2 VVS1 VVS2
## 1 4 17 19 15 24 10 10
Logical vectors contain only the values TRUE and FALSE (and NAs). Create a new logical vector called expensive indicating which diamonds are more expensive than $10,000. To do this, select the price variable from the data frame using $ and use the > (greater than) operator à la vector > value.
# Create a logical vector expensive indicating
# which diamonds cost more than 10,000
ZZ <- diamonds$XX > YY
# Create a logical vector expensive indicating
# which diamonds cost more than 10,000
expensive <- diamonds$price > 10000
Print the expensive vector to the console. Do you see only TRUE and FALSE values? If so, do the first few values match those in the price variable?
# print expensive
expensive
## [1] FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
## [45] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE
Add the expensive vector to the diamonds data frame using data_frame$variable_name <- variable, following the template below.
# add vector to data frame
XX$YY <- ZZ
# add vector to data frame
diamonds$expensive <- expensive
Using the table() function, create a table showing how many of the diamonds are expensive and how many are not. Select the variable directly from the data frame using $.
# count expensive diamonds
table(diamonds$expensive)
##
## FALSE TRUE
## 92 8
Using the mean() function, determine the proportion of the diamonds that are expensive, i.e., mean(expensive); multiplied by 100, this gives the percentage. Should this have worked?
# percentage of expensive diamonds
mean(diamonds$expensive)
## [1] 0.08
What percentage of the diamonds weigh more than 1 carat (i.e., more than 0.2 gram)?
# percentage of diamonds heavier than 1 carat
mean(diamonds$carat > 1)
## [1] 0.26
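Taking the mean() of a logical vector works because R counts TRUE as 1 and FALSE as 0. A minimal sketch using the prices of the first five diamonds printed earlier:

```r
# Prices of the first five diamonds shown above
price <- c(4468, 3760, 15683, 1151, 14438)

expensive <- price > 10000
expensive              # FALSE FALSE TRUE FALSE TRUE

sum(expensive)         # count of TRUE values: 2
mean(expensive)        # proportion of TRUE values: 0.4
100 * mean(expensive)  # the same as a percentage: 40
```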
Using read_excel(), read in the titanic.xls dataset as a new object called titanic (make sure you have already loaded the readxl package at the beginning of your script).
titanic <- read_excel(path = "XX")
# Read titanic data
titanic <- read_excel("1_Data/titanic.xls")
Print titanic and evaluate its dimensions using dim().
# print and show dimensions
titanic
## # A tibble: 1,309 x 14
## pclass survived name sex age sibsp parch ticket fare cabin
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
## 1 1 1 Alle… fema… 29 0 0 24160 211. B5
## 2 1 1 Alli… male 0.917 1 2 113781 152. C22 …
## 3 1 0 Alli… fema… 2 1 2 113781 152. C22 …
## 4 1 0 Alli… male 30 1 2 113781 152. C22 …
## 5 1 0 Alli… fema… 25 1 2 113781 152. C22 …
## 6 1 1 Ande… male 48 0 0 19952 26.6 E12
## 7 1 1 Andr… fema… 63 1 0 13502 78.0 D7
## 8 1 0 Andr… male 39 0 0 112050 0 A36
## 9 1 1 Appl… fema… 53 2 0 11769 51.5 C101
## 10 1 0 Arta… male 71 0 0 PC 17… 49.5 <NA>
## # … with 1,299 more rows, and 4 more variables: embarked <chr>,
## # boat <chr>, body <dbl>, home.dest <chr>
dim(titanic)
## [1] 1309 14
Using table(), find out how many people survived (variable survived) in each cabin class (variable pclass).
# determine survival rate by cabin class
table(titanic$XX,
titanic$XX)
# determine survival rate by cabin class
table(titanic$pclass, titanic$survived)
##
## 0 1
## 1 123 200
## 2 158 119
## 3 528 181
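The table above holds counts. To turn them into survival rates per class, one option is base R's prop.table(). The sketch below rebuilds the counts by hand so that it runs without the data file; the object names counts and rates are illustrative:

```r
# Counts copied from the table above (rows: pclass, columns: survived)
counts <- matrix(c(123, 200,
                   158, 119,
                   528, 181),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(pclass = c("1", "2", "3"),
                                 survived = c("0", "1")))

# Share of each class that did not / did survive (each row sums to 1)
rates <- prop.table(counts, margin = 1)
round(rates, 2)
```

With the data loaded, prop.table(table(titanic$pclass, titanic$survived), margin = 1) should give the same proportions.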
Using write_csv(), write the titanic data frame as a new comma-separated text file called titanic.csv in your 1_Data folder. Now you have the data saved as a text file any software can use!
# write data to .csv
write_csv(x = titanic,
path = "1_Data/titanic.csv")
Using read_spss(), read in the sleep data set sleep.sav, a sleep survey of staff at the University of Melbourne, as a new object called sleep (make sure that you have first loaded the haven package).
XX <- read_spss(file = "XX")
# Read sleep data
sleep <- read_spss(file = "1_Data/sleep.sav")
Print your sleep object and evaluate its dimensions using dim().
# print and show dimensions
sleep
## # A tibble: 271 x 55
## id sex age marital edlevel weight height healthrate fitrate
## <dbl> <dbl> <dbl> <dbl+l> <dbl+l> <dbl> <dbl> <dbl+lbl> <dbl+l>
## 1 83 0 42 2 2 52 162 10 7
## 2 294 0 54 2 5 65 174 8 7
## 3 425 1 NA 2 2 89 170 6 5
## 4 64 0 41 2 5 66 178 9 7
## 5 536 0 39 2 5 62 160 9 5
## 6 57 0 66 2 4 62 165 8 8
## 7 251 0 36 1 3 62 165 9 7
## 8 255 0 35 2 5 75 174 6 6
## 9 265 1 NA 2 5 90 180 6 6
## 10 290 1 41 2 5 75 187 9 9
## # … with 261 more rows, and 46 more variables: weightrate <dbl+lbl>,
## # smoke <dbl+lbl>, smokenum <dbl>, alchohol <dbl>, caffeine <dbl>,
## # hourwnit <dbl>, hourwend <dbl>, hourneed <dbl>, trubslep <dbl+lbl>,
## # trubstay <dbl+lbl>, wakenite <dbl+lbl>, niteshft <dbl+lbl>,
## # liteslp <dbl+lbl>, refreshd <dbl+lbl>, satsleep <dbl+lbl>,
## # qualslp <dbl+lbl>, stressmo <dbl+lbl>, medhelp <dbl+lbl>,
## # problem <dbl+lbl>, impact1 <dbl+lbl>, impact2 <dbl+lbl>,
## # impact3 <dbl+lbl>, impact4 <dbl+lbl>, impact5 <dbl+lbl>,
## # impact6 <dbl+lbl>, impact7 <dbl+lbl>, stopb <dbl+lbl>,
## # restlss <dbl+lbl>, drvsleep <dbl+lbl>, drvresul <dbl+lbl>, ess <dbl>,
## # anxiety <dbl>, depress <dbl>, fatigue <dbl>, lethargy <dbl>,
## # tired <dbl>, sleepy <dbl>, energy <dbl>, stayslprec <dbl+lbl>,
## # getsleprec <dbl+lbl>, qualsleeprec <dbl+lbl>, totsas <dbl>,
## # cigsgp3 <dbl+lbl>, agegp3 <dbl+lbl>, probsleeprec <dbl+lbl>,
## # drvslprec <dbl+lbl>
dim(sleep)
## [1] 271 55
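Compute the mean number of alcoholic drinks reported by the staff (the variable is spelled alchohol in the data). To do this, select the variable with $ and use the mean() function, while taking care of missing values using the na.rm argument.
# compute mean number of drinks
mean(x = sleep$alchohol, na.rm = TRUE)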
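Note that mean() needs a single vector. When it is handed a whole data frame it returns NA with a warning, regardless of na.rm. A sketch with a small made-up data frame (toy and its columns are illustrative, not the sleep data):

```r
# Toy data with a missing value (illustrative, not the sleep data)
toy <- data.frame(id = 1:4, drinks = c(0, 2, NA, 1))

# Passing the whole data frame does not work: mean() returns NA with a warning
suppressWarnings(mean(toy, na.rm = TRUE))

# Select the column with $ first, then remove NAs
mean(toy$drinks, na.rm = TRUE)  # 1
```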
Using the write_csv() function, write the sleep data to a new file called sleep.csv in your 1_Data folder. Now you have the sleep data stored as a text file any software can use!
# write data to .csv
write_csv(x = sleep,
path = "1_Data/sleep.csv")
Using read_sas(), read in airbnb_zuerich.sas7bdat, containing AirBnB listings in Zürich, Switzerland, and call the object airbnb_zuerich.
# read sas data
XX <- read_sas(data_file = "XX")
# read airbnb_zuerich.sas7bdat
airbnb_zuerich <- read_sas(data_file = "1_Data/airbnb_zuerich.sas7bdat")
Print airbnb_zuerich and then evaluate its dimensions using dim().
# print and show dimensions
airbnb_zuerich
## # A tibble: 2,392 x 20
## room_id survey_id host_id room_type country city borough neighborhood
## <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 1.37e7 1363 5.63e7 Entire h… "" Zuri… "" Kreis 12
## 2 8.00e6 1363 1.65e7 Entire h… "" Zuri… "" Kreis 7
## 3 1.52e7 1363 5.03e7 Entire h… "" Zuri… "" Kreis 4
## 4 7.56e6 1363 4.92e6 Entire h… "" Zuri… "" Kreis 1
## 5 1.86e7 1363 2.04e7 Entire h… "" Zuri… "" Kreis 12
## 6 6.44e6 1363 1.24e7 Entire h… "" Zuri… "" Kreis 2
## 7 1.88e6 1363 1.60e6 Entire h… "" Zuri… "" Kreis 7
## 8 3.63e6 1363 1.83e7 Entire h… "" Zuri… "" Kreis 8
## 9 1.44e7 1363 5.19e7 Entire h… "" Zuri… "" Kreis 2
## 10 1.28e7 1363 2.64e5 Entire h… "" Zuri… "" Kreis 2
## # … with 2,382 more rows, and 12 more variables: reviews <dbl>,
## # overall_satisfaction <dbl>, accommodates <dbl>, bedrooms <dbl>,
## # bathrooms <chr>, price <dbl>, minstay <chr>, name <chr>,
## # last_modified <dttm>, latitude <dbl>, longitude <dbl>, location <chr>
dim(airbnb_zuerich)
## [1] 2392 20
How often does each room_type occur in Zürich? (Hint: use table())
# table room type
table(airbnb_zuerich$room_type)
##
## Entire home/apt Private room Shared room
## 1386 975 31
Using write_csv(), write your airbnb_zuerich data frame to a new comma-separated text file called airbnb_zuerich.csv in your 1_Data folder.
# write data to .csv
write_csv(x = airbnb_zuerich,
path = "1_Data/airbnb_zuerich.csv")
Using the data.frame() function, create a data frame called ten_df that contains each of the vectors you created earlier: id, age, sex, height, weight.
# Create data frame ten_df containing vectors id, age, sex, height, and weight.
XX <- data.frame(XX, XX, XX, XX, XX)
# Create ten_df data frame from vectors
ten_df <- data.frame(id, age, sex, height, weight)
Print your ten_df object to see how it looks! Does it contain all of the vectors?
# Print ten_df
ten_df
## id age sex height weight
## 1 1 44 m 174 113
## 2 2 65 m 180 75
## 3 3 31 f 168 56
## 4 4 27 m 209 94
## 5 5 24 m 177 NA
Using the dim() function, print the number of rows and columns in your data frame. Do you get 5 rows and 5 columns?
# Inspect dimensions
dim(ten_df)
## [1] 5 5
What is the class of your ten_df object? Use the class() function to find out!
# Inspect class
class(ten_df)
## [1] "data.frame"
Use the summary() function to print descriptive statistics from each column of ten_df.
# Print descriptive statistics
summary(ten_df)
## id age sex height weight
## Min. :1 Min. :24.0 f:1 Min. :168 Min. : 56
## 1st Qu.:2 1st Qu.:27.0 m:4 1st Qu.:174 1st Qu.: 70
## Median :3 Median :31.0 Median :177 Median : 84
## Mean :3 Mean :38.2 Mean :182 Mean : 84
## 3rd Qu.:4 3rd Qu.:44.0 3rd Qu.:180 3rd Qu.: 99
## Max. :5 Max. :65.0 Max. :209 Max. :113
## NA's :1
Using the $ operator, print the age column from the ten_df data frame.
# Inspect age
ten_df$age
## [1] 44 65 31 27 24
Calculate the maximum age value from the ten_df data frame using max(). Do you get the same result as when you calculate it from the original vector age?
# Get max
max(ten_df$age)
## [1] 65
Instead of using the data.frame() function, try creating a tibble called ten_tibble using the tibble() function. Tibbles are a more modern, leaner variant of the data frame that we prefer over the classic data.frame. You can use the exact same arguments you used before.
# create tibble
ten_tibble <- tibble(id, age, sex, height, weight)
Print your ten_tibble object. How does it look different from ten_df? Try calculating the maximum age from this object. Is it different from what you got before?
# print tibble
ten_tibble
## # A tibble: 5 x 5
## id age sex height weight
## <int> <dbl> <chr> <dbl> <dbl>
## 1 1 44 m 174 113
## 2 2 65 m 180 75
## 3 3 31 f 168 56
## 4 4 27 m 209 94
## 5 5 24 m 177 NA
max(ten_tibble$age) == max(ten_df$age)
## [1] TRUE
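Besides printing differently, tibbles are stricter about column names. The sketch below assumes the tibble package (part of the tidyverse) is installed; df and tb are illustrative names. A classic data frame completes an abbreviated column name after $, while a tibble does not:

```r
library(tibble)

age <- c(44, 65, 31, 27, 24)
df  <- data.frame(age = age)
tb  <- tibble(age = age)

# data.frame: $ silently completes the abbreviated name "ag" to "age"
df$ag

# tibble: no partial matching, so the result is NULL (plus a warning)
suppressWarnings(tb$ag)

# A tibble is still a data frame, with extra classes on top
class(df)
class(tb)
```

This strictness helps catch typos in column names early.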
When you take the sum() of a logical vector, R will return the number of cases that are TRUE. Using this, find out how many of the five Baselers are male, using the is-equal-to operator ==.
# Determine the frequency of a case in a vector
sum(XX == XX)
# Determine the frequency of a case in a vector
sum(ten_tibble$sex == "m")
## [1] 4
# Create a logical vector indicating which baselers are younger than 30
young_30 <- XX$XX < 30
# Print the ids of baselers younger than 30
XX$XX[young_30]
# Create a logical vector indicating which baselers are younger than 30
young_30 <- ten_tibble$age < 30
# Print the ids of baselers younger than 30
ten_tibble$id[young_30]
## [1] 4 5
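Use logical indexing and the mean() function to answer the question: “What is the mean age of Baselers who are heavier than 80kg?” Because one weight is missing, remember the na.rm argument.
# Mean age of baselers heavier than 80kg
mean(ten_tibble$age[ten_tibble$weight > 80], na.rm = TRUE)
## [1] 35.5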
Find the ids of male Baselers who are shorter than 165cm (Hint: use & to combine multiple logical vectors).
# Ids of male baselers shorter than 165cm
ten_tibble$id[ten_tibble$sex == "m" & ten_tibble$height < 165]
## integer(0)
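Logical indexing with a missing value can be surprising, so here is a compact sketch using the 5 Baselers vectors. which() is a base R function that returns the positions of TRUE values; the name heavy is illustrative:

```r
id     <- 1:5
sex    <- c("m", "m", "f", "m", "m")
age    <- c(44, 65, 31, 27, 24)
weight <- c(113, 75, 56, 94, NA)

# Combine two conditions with &: males younger than 30
id[sex == "m" & age < 30]   # 4 5

# A comparison with NA is NA, and indexing with NA yields an NA element
heavy <- weight > 80
age[heavy]                  # 44 27 NA

# which() keeps only the positions that are TRUE, dropping the NA
age[which(heavy)]           # 44 27
mean(age[which(heavy)])     # 35.5
```

Using which() or na.rm = TRUE are two ways to reach the same mean here.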
library(tidyverse)
library(readxl)
library(haven)
# Create vectors of (fake) stock data
name <- c("apple", "microsoft", "dell", "google", "twitter")
yesterday <- c(100, 89, 65, 54, 89)
today <- c(102, 85, 72, 60, 95)
# Summary statistics
mean(today)
mean(yesterday)
# Show classes
class(name)
class(yesterday)
# Operations of vectors
change <- today - yesterday
change # Print result
# Create a logical vector from two numerics
increase <- today > yesterday
increase # Print result
# Create a tibble combining multiple vectors
stocks <- tibble(name, yesterday, today, change, increase)
# Get column names
names(stocks)
# Access columns by name
stocks$name
stocks$today
# Calculate descriptives on columns
mean(stocks$yesterday)
median(stocks$today)
table(stocks$increase)
max(stocks$increase)
# read/write delim-separated -------------------
# read chickens data
chickens <- read_csv(file = "1_Data/chickens.csv")
# fix header of chickens_nohead.csv with known column names
chickens <- read_csv(file = "1_Data/chickens_nohead.csv",
col_names = c("weight", "time", "chick", "diet"))
# fix NA values of chickens_na.csv
chickens <- read_csv(file = "1_Data/chickens_na.csv",
na = c('NA', 'NULL'))
# write clean data to disc
write_csv(x = chickens,
path = "1_Data/chickens_clean.csv")
# fix types -------------------
# Note: the survey data is fictional!
# remove character from rating
survey$rating[survey$rating == "2,1"] <- 2.1
# rerun type convert
survey <- type_convert(survey)
# other formats -------------------
# .xlsx (Excel)
chickens <- read_excel("1_Data/chickens.xlsx")
# .sav (SPSS)
chickens <- read_spss("1_Data/chickens.sav")
# .sas7bdat (SAS)
chickens <- read_sas("1_Data/chickens.sas7bdat")
File | Rows | Columns | Description |
---|---|---|---|
diamonds.csv | 100 | 7 | Subset of the well-known diamonds data set containing specifications and prices of a large number of recorded diamonds. |
titanic.xls | 1309 | 14 | Information on the survival of titanic passengers. |
sleep.sav | 271 | 55 | Survey on sleeping behavior completed by staff at the University of Melbourne. |
airbnb_zuerich.sas7bdat | 2392 | 20 | Data on AirBnB listings in Zürich, Switzerland |
Package | Installation |
---|---|
tidyverse | install.packages("tidyverse") |
haven | install.packages("haven") |
readxl | install.packages("readxl") |
Creating vectors
Function | Description |
---|---|
c("a", "b", "c") | Create a character vector |
c(1, 2, 3) | Create a numeric vector |
c(TRUE, FALSE, TRUE) | Create a logical vector |
Vector functions
Function | Description |
---|---|
mean(x), median(x), sd(x), sum(x) | Mean, median, standard deviation, sum |
max(x), min(x) | Maximum, minimum |
table(x) | Table of frequency counts |
round(x, digits) | Round a numeric vector x to digits decimal places |
Accessing vectors from data frames
Function | Description |
---|---|
df$name | Access vector name from a data frame df |
Reading/writing text data
Extension | File Type | Read | Write |
---|---|---|---|
.csv | Comma-separated text | read_csv(file) | write_csv(x, file) |
.csv | Semi-colon separated text | read_csv2(file) | not available |
.txt | Other text | read_delim(file) | write_delim(x, file) |
Reading/writing other data formats
Extension | File Type | Read | Write |
---|---|---|---|
.xls, .xlsx | Excel | read_excel(file) | xlsx::write.xlsx() |
.sav | SPSS | read_spss(file) | write_sav(x, file) |
.sas7bdat | SAS | read_sas(file) | write_sas(x, file) |
Creating data frames from vectors
Function | Description |
---|---|
data.frame(a, b, c) | Create a data frame from vectors a, b, c |
tibble(a, b, c) | Create a tibble from vectors a, b, c |