Here a link to the lecture slides for this session: LINK
In this practical you’ll learn how to use R’s basic data objects and how to use and write functions. By the end of this practical you will know how to:
Here are the main functions for object creation.
Function | Description |
---|---|
c(), rep(), seq(), numeric(), etc. |
Create a vector |
matrix(), cbind() |
Create a matrix |
data.frame() |
Create a data.frame |
list() |
Create a list |
Here are the main functions for object inspection.
Function | Description |
---|---|
head(), tail() |
Inspect first or last elements of object |
str() |
Inspect the structure of the data object |
View() |
To access an Excel like data interface |
Here are the main functions for object selection.
Function | Description |
---|---|
[ ] |
Single brackets: Select individual elements from vector |
[[ ]] |
Double brackets: Select element/variable in list or data_frame |
$ |
Dollar: Select named element/variable from list or data_frame (preferred) |
Here is the function to create a function.
Function | Description |
---|---|
function() |
Creates a function |
c()
and assign them as dbl_vec
, log_vec
, and chr_vec
(using the assignment operator, i.e., name <- c())
. For example to save the three numbers 1, 2, and 3 in a vector called three_numbers execute three_numbers <- c(1, 2, 3)
.dbl_vec <- c(1,2,3,4,5,6,7,8,9,10)
chr_vec <- c('a','b','c','d','e','f','g','h','i','j')
log_vec <- c(TRUE,TRUE,FALSE,TRUE,FALSE,FALSE,TRUE,TRUE,TRUE,TRUE)
typeof()
and length()
. To run the functions simply place the name of the individual vectors in the parenthesis and then send to the console to execute. Remember you can send any line in your editor to the console using cmd + enter
(MAC) or cntrl + enter
(Windows).typeof(dbl_vec) ; length(dbl_vec)
[1] "double"
[1] 10
typeof(chr_vec) ; length(chr_vec)
[1] "character"
[1] 10
typeof(log_vec) ; length(log_vec)
[1] "logical"
[1] 10
rep()
to two times their length. rep()
is a general purpose repeat function for vectors. It takes (at least) two arguments: x
the vector and times
, which can be a single integer, e.g., 2
. Take a look at the functions help file using ?rep
. To expand the vectors, you need to apply rep()
and assign the result back to your vector object, i.e., object <- rep(object, argument)
.dbl_vec <- rep(dbl_vec, 2)
chr_vec <- rep(chr_vec, 2)
log_vec <- rep(log_vec, 2)
data_frame
, where N is the length of each of the vectors. To do this, first load the tibble
package, which is part of the tidyverse
. To create a data frame use the data_frame()
function. Look at the help file (?data_frame
& ?data_frame
). The first argument to data_frame()
is ...
which is a placeholder for as many objects as you want to include in the data frame. To create a data_._frame
from your three vectors pass on the three vectors separated by comma, e.g., data_frame(vector_1, vector_2, vector_3)
, and assign the result to, e.g., my_df
. Now that you created your data.frame, look at the data frame using head()
. Observe that the variables are included in the order that you provided them in. Next, test its type (typeof
) and dimension (dim()
). Do they make sense?library(tibble)
my_df <- data_frame(dbl_vec,chr_vec,log_vec)
head(my_df)
# A tibble: 6 x 3
dbl_vec chr_vec log_vec
<dbl> <chr> <lgl>
1 1. a TRUE
2 2. b TRUE
3 3. c FALSE
4 4. d TRUE
5 5. e FALSE
6 6. f FALSE
typeof(my_df)
[1] "list"
dim(my_df)
[1] 20 3
[[1]]
, (2) using double brackets and name [['dbl_vec']]
, and our preferred way using the dollar operator and name, e.g., $dbl_vec
. To inspect the names of the elements use names(my_df)
or str(my_df)
, which provides a richer overview over your data object.names(my_df)
[1] "dbl_vec" "chr_vec" "log_vec"
str(my_df)
Classes 'tbl_df', 'tbl' and 'data.frame': 20 obs. of 3 variables:
$ dbl_vec: num 1 2 3 4 5 6 7 8 9 10 ...
$ chr_vec: chr "a" "b" "c" "d" ...
$ log_vec: logi TRUE TRUE FALSE TRUE FALSE FALSE ...
my_df[[1]]
[1] 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
my_df[['chr_vec']]
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "a" "b" "c" "d" "e" "f" "g"
[18] "h" "i" "j"
my_df$dbl_vec
[1] 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
my_df["dbl_vec"]
, and name, e.g., my_df[["dbl_vec"]]
and test their types using typeof()
. What kind of object do you get in the first and second case? Understand why we prefer the $
-operator?typeof(my_df[["dbl_vec"]])
[1] "double"
typeof(my_df["dbl_vec"])
[1] "list"
as.numeric()
, as.character()
, as.logical()
. Can every type be coerced into every other? Which coercion does always work?as.numeric(chr_vec)
Warning: NAs introduced by coercion
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
as.numeric(log_vec)
[1] 1 1 0 1 0 0 1 1 1 1 1 1 0 1 0 0 1 1 1 1
as.logical(dbl_vec)
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[15] TRUE TRUE TRUE TRUE TRUE TRUE
as.logical(chr_vec)
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
as.character(dbl_vec)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "1" "2" "3" "4"
[15] "5" "6" "7" "8" "9" "10"
as.character(log_vec)
[1] "TRUE" "TRUE" "FALSE" "TRUE" "FALSE" "FALSE" "TRUE" "TRUE"
[9] "TRUE" "TRUE" "TRUE" "TRUE" "FALSE" "TRUE" "FALSE" "FALSE"
[17] "TRUE" "TRUE" "TRUE" "TRUE"
my_df
are equal c("dbl_vec", "chr_vec", "log_vec")
. This can be done using the is-equal-to operator ==
, e.g., vec_1 == vec_2
. In the case of vectors, logical comparisons iterate through the vectors one by one and evaluate whether the elements at the present location are equal. The result is a logical vector of the same length as the two objects involved in the logical comparison. To evaluate whether all elements are equal one can conveniently use the all()
function. Try compare the names of my_df
to c(dbl_vec", "chr_vec", "log_vec")
and test whether all comparisons resulted in TRUE
.all(names(my_df) == c("dbl_vec", "chr_vec", "log_vec"))
[1] TRUE
[]
or [[]]
) can take logical vectors as an argument (provided that the dimensions match to the to be accessed data object). E.g., c(1, 2, 3)[c(TRUE, FALSE, TRUE)]
returns the first and the third element of the vector, that is it returns those elements for which the logical vector is TRUE
. Try to use this now on your three vectors. Use log_vec
to subset the elements in dbl_vec
and chr_vec
.dbl_vec[log_vec]
[1] 1 2 4 7 8 9 10 1 2 4 7 8 9 10
chr_vec[log_vec]
[1] "a" "b" "d" "g" "h" "i" "j" "a" "b" "d" "g" "h" "i" "j"
chr_vec
for which dbl_vec
is larger than some value. This can be easily accomplished by coercing dbl_vec
to a logical vector using >
. Consider c(1, 4, 7, 2) > 3
. Try this for chr_vec
and dbl_vec
with an appropriate cut-off value.chr_vec[dbl_vec > 2]
[1] "c" "d" "e" "f" "g" "h" "i" "j" "c" "d" "e" "f" "g" "h" "i" "j"
&
and the logical OR
operator |
. Run c(1, 4, 7, 2) > 3 & c(1, 4, 7, 2) < 6
and c(1, 4, 7, 2) > 3 | c(1, 4, 7, 2) < 6
and observe the results. Try now combining the logical vector that results from coercing dbl_vec
to logical and log_vec
to subset chr_vec
.chr_vec[dbl_vec > 2 & log_vec]
[1] "d" "g" "h" "i" "j" "d" "g" "h" "i" "j"
list()
. list()
works exactly as c()
or data_frame()
. Call the object my_list
and inspect it using str()
. Compare the output of str()
to the output for str(my_df)
. What is different, and why?my_list <- list(dbl_vec,chr_vec,log_vec)
str(my_list)
List of 3
$ : num [1:20] 1 2 3 4 5 6 7 8 9 10 ...
$ : chr [1:20] "a" "b" "c" "d" ...
$ : logi [1:20] TRUE TRUE FALSE TRUE FALSE FALSE ...
str(my_df)
Classes 'tbl_df', 'tbl' and 'data.frame': 20 obs. of 3 variables:
$ dbl_vec: num 1 2 3 4 5 6 7 8 9 10 ...
$ chr_vec: chr "a" "b" "c" "d" ...
$ log_vec: logi TRUE TRUE FALSE TRUE FALSE FALSE ...
my_list
into a data frame using as_data_frame()
and assign it to my_df_from_list
. Inspect again using str()
. Rename the columns to the original names of the vectors. To do this assign names(my_df)
to names(my_df_from_list)
. Confirm that it worked using either names()
or str()
.my_df_from_list <- as.data.frame(my_list)
str(my_df_from_list)
'data.frame': 20 obs. of 3 variables:
$ c.1..2..3..4..5..6..7..8..9..10..1..2..3..4..5..6..7..8..9..10 : num 1 2 3 4 5 6 7 8 9 10 ...
$ c..a....b....c....d....e....f....g....h....i....j....a....b... : Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
$ c.TRUE..TRUE..FALSE..TRUE..FALSE..FALSE..TRUE..TRUE..TRUE..TRUE..: logi TRUE TRUE FALSE TRUE FALSE FALSE ...
names(my_df_from_list) <- names(my_df)
names(my_df_from_list)
[1] "dbl_vec" "chr_vec" "log_vec"
str(my_df_from_list)
'data.frame': 20 obs. of 3 variables:
$ dbl_vec: num 1 2 3 4 5 6 7 8 9 10 ...
$ chr_vec: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
$ log_vec: logi TRUE TRUE FALSE TRUE FALSE FALSE ...
c()
, typeof()
, etc. Now, you are going to practice writing your own little functions. Let’s start with programming a function that calculates the arithmetic mean using the functions sum()
, which calculates the sum of all values in a vector and length()
, which determines the length of vectors (and other objects). The basic form of a function is: my_fun <- function(argument) {expression}
. The challenge in writing a function is to determine the argument(s), which are the the function’s input, and the expression, which defines the function’s job. In this case, both are relatively straightforward to define. The functions argument is any numeric
vector
and the functions expression should be the vectors sum divided by its length, i.e., sum(vector)/length(vector)
. Together: my_mean <- function(vector) {sum(vector) / length(vector)}
. You may ask now how does R know that vector
is supposed to be of type numeric
? It does not! R will attempt to execute the function with anything you through at it. Try it out! Define your my_mean
function and try it out for your double
and character
vectors from above.my_mean <- function(vector) {sum(vector) / length(vector)}
my_mean(dbl_vec)
[1] 5.5
#my_mean(chr_vec)
character
vector produced an error, right? The reason is that R does not interpret the names you use for the arguments. That is, R does not understand that an argument named vector is supposed to be a vector. For R these names are merely placeholders. In the background, R will use these names to pass on the object provided at the respective location. I.e., in the above case, R will take your vector, may it be dbl_vec
or chr_vec
, assign it to the name vector
, and pass this new object on to the expression. This means that you can choose anything as the argument names. The only thing you need to ensure is that you work with the names of the arguments in the expression, rather than the object’s real names. Try redefining the your my_mean
function using a different argument name, e.g., x
.my_mean <- function(x) {sum(x) / length(x)}
my_mean(dbl_vec)
[1] 5.5
#my_mean(chr_vec)
Note: R literally creates new objects by assigning the object you provide to a function to its argument names. This means that the objects used inside the function are duplicates of the objects that you provide and not the objects themselves. This is why functions never change an object unless you assign the function’s output back to the object or you explicitly tell R to change the object inside the function using the double arrow assignment operator <<-
.
In working with R, you often encounter objects that have a non-flat, nested structure. This means that there is more than one level of elements. Such objects are always list
s. List
s are the only object type that can store any other object, including themselves. That is, you can have a list of lists or a list of data.frames. Working with non-flat, nested objects is easier than it may seem. To access elements from such an object simply combine several dollar $
operators (or double bracket [[]]
) to descend into the objects structure. Let’s take a look.
my_df
and your three vectors (for a total of four elements). Try to verify (using the list structure) that each of the columns of my_df
is equal the respective vector. To access the columns in my_df
you have to combine selectors e.g., my_list$name$name
(the preferred way).my_list_2 <- list(my_df, dbl_vec, chr_vec, log_vec)
all(my_list_2[[1]]$dbl_vec == dbl_vec)
[1] TRUE
all(my_list_2[[1]]$chr_vec == chr_vec)
[1] TRUE
all(my_list_2[[1]]$log_vec == log_vec)
[1] TRUE
For more details on all steps of data analysis check out Hadley Wickham’s R for Data Science.
For more advanced content on objects check out Hadley Wickham’s Advanced R.
For more on pirates and data analysis check out the respective chapters in YaRrr! The Pirate’s Guide to R YaRrr! Chapter Link