Here is a link to the lecture slides for this session: LINK
In this practical you'll learn how to write efficient code. By the end of this practical you will know how to profile your code, benchmark alternative implementations, and speed up computations, for instance by using parallel execution.
Functions to profile your code include:

Function | Package | Description |
---|---|---|
proc.time() | base | Returns the user, system, and elapsed time of the current R process. |
system.time() | base | Runs one expression once and returns the elapsed CPU time. |
microbenchmark() | microbenchmark | Runs one or many expressions multiple times and returns statistics on the elapsed time. |
lineprof(), shine() | lineprof | Profiles entire scripts line by line (from Hadley Wickham's GitHub). |
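proc.time() is not used in the examples below; as a minimal sketch, it can time an arbitrary chunk of code by taking the difference between two calls (the chunk timed here is purely illustrative):
# record the time before the chunk
start <- proc.time()
# an arbitrary chunk of code to be timed (illustrative)
x <- sort(rnorm(1e6))
# user, system, and elapsed time spent on the chunk
proc.time() - start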
Small (minimal) chunks of code can conveniently be tested using microbenchmark().
# load packages
library(microbenchmark)
library(tibble)
# get data
df <- data.frame('var_1' = rnorm(1000, 1),
                 'var_2' = rnorm(1000, 1))
tbl <- as_tibble(df)
# microbenchmark pt. 1
microbenchmark(df[['var_1']], df$var_1, tbl$var_1)
Larger code chunks or even entire scripts can conveniently be tested using system.time() and lineprof() from the lineprof package.
# ---- install and load package
install.packages('devtools')
devtools::install_github("hadley/lineprof")
library(lineprof)
library(readr)
library(dplyr)
# ---- define code chunk as function
my_chunkfun <- function(){
  # load data
  data <- read_csv('http://tinyurl.com/titanic-dataset-web')
  # remove first column
  data <- data[,-1]
  # mutate
  data <- data %>%
    mutate(months = Age * 12)
  # select
  test_data <- data %>%
    select(Sex, Age, Survived)
  # logistic regression
  # Survival predicted by Sex, Age, and their interaction
  model <- glm(Survived ~ Sex * Age,
               data = test_data,
               family = 'binomial')
  # evaluate model
  summary(model)
}
# ---- profiling
# profile using system.time
system.time(my_chunkfun())
# profile using lineprof
#profile <- lineprof(my_chunkfun())
#shine(profile)
Compare extracting a column from a data.frame and from a tibble using microbenchmark(). Are tibbles fast or slow?
# load packages
library(microbenchmark)
library(tibble)
# get data
df <- data.frame('var_1' = rnorm(1000, 1),
                 'var_2' = rnorm(1000, 1))
tbl <- as_tibble(df)
# microbenchmark
microbenchmark(df[['var_1']], df$var_1, tbl$var_1)
Unit: microseconds
expr min lq mean median uq max neval cld
df[["var_1"]] 3.134 4.228 5.74518 4.9390 5.9190 41.958 100 a
df$var_1 4.072 5.404 6.60436 6.3495 7.1760 14.947 100 a
tbl$var_1 34.547 38.098 42.09851 39.3750 40.9395 176.698 100 b
Repeat the comparison of tibbles and basic data.frames from the first exercise, but now also include the .subset2() function (don't forget the dot) for both data frame types. The function takes two arguments: the first argument is the data frame, the second argument is the column identifier (index or name). What do you find?
# load packages
library(microbenchmark)
library(tibble)
# get data
df <- data.frame('var_1' = rnorm(1000, 1),
                 'var_2' = rnorm(1000, 1))
tbl <- as_tibble(df)
# microbenchmark
microbenchmark(df[['var_1']], df$var_1, tbl$var_1,
               .subset(df, 'var_1'), .subset(tbl, 'var_1'))
Unit: nanoseconds
                  expr   min      lq     mean  median      uq    max neval cld
         df[["var_1"]]  3361  4194.0  5043.88  4979.5  5940.5   7789   100   b
              df$var_1  4193  5021.0  6354.11  5794.0  7090.5  37153   100   b
             tbl$var_1 34665 38137.0 40737.62 39271.0 41687.0 119972   100   c
  .subset(df, "var_1")   364   503.0   595.56   586.0   674.5   1162   100   a
 .subset(tbl, "var_1")   348   484.5   713.75   589.0   698.5  12412   100   a
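Note that the solution above uses .subset() rather than the .subset2() mentioned in the exercise. As a minimal sketch (reusing df and tbl from above), .subset2() can be benchmarked in the same way; like [[, it extracts a single column while skipping S3 method dispatch:
# benchmark low-level single-column extraction
microbenchmark(.subset2(df, 'var_1'), .subset2(tbl, 'var_1'))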
Compare mean() to the operation composed of its basic ingredients sum() and length(), i.e., sum(my_vec) / length(my_vec). To do this, first create a vector of random numbers using runif() (see ?runif). Then test both approaches with microbenchmark(). What do you find?
# define vector
my_vec <- runif(10000)
# microbenchmark
microbenchmark(mean(my_vec), sum(my_vec)/length(my_vec))
Unit: microseconds
                       expr    min      lq     mean median      uq    max neval cld
               mean(my_vec) 17.526 24.3470 23.83141 24.418 24.5330 37.777   100   b
 sum(my_vec)/length(my_vec)  8.034 11.2105 11.16223 11.288 11.3335 18.872   100   a
Check the types of mean(), sum(), length(), and .subset2() using typeof(). Which is the fast type?
# test type
typeof(mean); typeof(sum); typeof(length); typeof(.subset2)
[1] "closure"
[1] "builtin"
[1] "builtin"
[1] "builtin"
Install and load the devtools and lineprof packages, run the code under 'define code chunk as function', and then test the function by running it, i.e., execute my_chunkfun(). You just defined and executed your first self-created function. Continue by profiling the function using system.time() and lineprof(). What do you find? Which parts of the code are most computationally expensive? Repeat the analysis; remember that R compiles functions after their first use.
# ---- install and load package
install.packages('devtools', repos = "https://stat.ethz.ch/CRAN/")
devtools::install_github("hadley/lineprof")
library(lineprof)
library(readr)
library(dplyr)
# ---- define code chunk as function
my_chunkfun <- function(){
  # load data
  data <- read_csv('http://tinyurl.com/titanic-dataset-web')
  # remove first column
  data <- data[,-1]
  # mutate
  data <- data %>%
    mutate(months = Age * 12)
  # select
  test_data <- data %>%
    select(Sex, months, Survived)
  # logistic regression
  # Survival predicted by Sex, months, and their interaction
  model <- glm(Survived ~ Sex * months,
               data = test_data,
               family = 'binomial')
  # evaluate model
  summary(model)
}
# ---- profiling
# profile using system.time
system.time(my_chunkfun())
user system elapsed
0.119 0.008 1.212
# profile using lineprof
#profile <- lineprof(my_chunkfun())
#shine(profile)
Get the data.table package (that means: install and load it) and use its fread() function. Try defining a new function that uses fread() rather than read_csv(). Then compare the performance of the two. How much faster is the new function relative to the old one (use system.time())?
# ---- install and load package
install.packages('data.table', repos = "https://stat.ethz.ch/CRAN/")
library(data.table)
# ---- define code chunk as function
my_chunkfun_fast <- function(){
  # load data
  data <- fread('http://tinyurl.com/titanic-dataset-web')
  # remove first column
  data <- data[,-1]
  # mutate
  data <- data %>%
    mutate(months = Age * 12)
  # select
  test_data <- data %>%
    select(Sex, months, Survived)
  # logistic regression
  # Survival predicted by Sex, months, and their interaction
  model <- glm(Survived ~ Sex * months,
               data = test_data,
               family = 'binomial')
  # evaluate model
  summary(model)
}
# ---- profiling
# profile using system.time
system.time(my_chunkfun())
user system elapsed
0.148 0.006 1.204
system.time(my_chunkfun_fast())
user system elapsed
0.027 0.004 1.560
# profile using lineprof
#profile <- lineprof(my_chunkfun_fast())
#shine(profile)
# ---- define code chunk as function
my_chunkfun_fast2 <- function(){
  # load data
  data <- fread('http://tinyurl.com/titanic-dataset-web')
  # remove first column
  data <- data[,-1]
  # mutate
  data <- data %>%
    mutate(months = Age * 12)
  # logistic regression
  # Survival predicted by Sex, months, and their interaction
  model <- glm(Survived ~ Sex * months,
               data = data,
               family = 'binomial')
  # evaluate model
  summary(model)
}
# ---- profiling
# profile using lineprof
#profile <- lineprof(my_chunkfun_fast2())
#shine(profile)
# profile using system.time
system.time(my_chunkfun())
user system elapsed
0.025 0.002 1.153
system.time(my_chunkfun_fast())
user system elapsed
0.028 0.005 1.364
system.time(my_chunkfun_fast2())
user system elapsed
0.014 0.004 1.296
# ---- define code chunk as function
my_chunkfun_fast3 <- function(){
  # load data
  data <- fread('http://tinyurl.com/titanic-dataset-web')
  # remove first column
  data <- data[,-1]
  # mutate
  data[['months']] <- data$Age * 12
  # logistic regression
  # Survival predicted by Sex, months, and their interaction
  model <- glm(Survived ~ Sex * months,
               data = data,
               family = 'binomial')
  # evaluate model
  summary(model)
}
# ---- profiling
# profile using lineprof
#profile <- lineprof(my_chunkfun_fast3())
#shine(profile)
# profile using system.time
system.time(my_chunkfun())
user system elapsed
0.026 0.002 1.135
system.time(my_chunkfun_fast())
user system elapsed
0.021 0.004 1.597
system.time(my_chunkfun_fast2())
user system elapsed
0.018 0.004 1.535
system.time(my_chunkfun_fast3())
user system elapsed
0.010 0.004 1.090
Define a function that finds the smallest value of a vector using sort() and selecting the first element of the sorted vector, i.e., sort(my_vector)[1]. Then feed it random vectors (using runif()) of length 1e+5, 1e+6, 1e+7, 1e+8, and 1e+9 and evaluate the computation time (using system.time()). Does it increase by more or less than 10 times at each step? You may want to repeat this a couple of times.

Excursion: How to program functions? Functions are always defined like this: my_fun <- function(){}. Within the parentheses you define the names of the arguments, e.g., function(variable_1, variable_2). Within the curly brackets you define the function's expression, i.e., what it is supposed to do. This could be, for instance, variable_1 + variable_2, when the goal is to compute the element-wise sum of two vectors. When the function is called (executed), the argument names inside the function's expression, i.e., variable_1 and variable_2, are replaced by the objects that were provided (passed on) as arguments. That is, if the function is provided with two vectors my_vec_1 and my_vec_2, i.e., my_fun(my_vec_1, my_vec_2), then the function will compute the sum of these two vectors. This requires, of course, that the provided arguments fit whatever is done with them inside the function; in this case, the objects are required to be numeric and cannot be, e.g., of type character. The full function definition in this case is my_fun <- function(variable_1, variable_2) {variable_1 + variable_2}. After it has been defined, you can call it using my_fun(object_1, object_2).
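As a minimal sketch of the excursion above (the vectors are purely illustrative):
# define a function that returns the element-wise sum of its two arguments
my_fun <- function(variable_1, variable_2) {variable_1 + variable_2}
# call the function with two numeric vectors
my_vec_1 <- c(1, 2, 3)
my_vec_2 <- c(4, 5, 6)
my_fun(my_vec_1, my_vec_2)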
# define function
my_fun <- function(x) sort(x)[1]
# profile using system.time
system.time(my_fun(runif(1e+5)))
user system elapsed
0.008 0.000 0.008
system.time(my_fun(runif(1e+6)))
user system elapsed
0.101 0.012 0.114
system.time(my_fun(runif(1e+7)))
user system elapsed
1.628 0.150 1.781
system.time(my_fun(runif(1e+8)))
user system elapsed
13.990 1.668 15.738
Now parallelize the computation using the parallel package, which has recently been included in the standard R library. To use parallel execution, four things need to be done. (1) The data need to be split into separate jobs; for instance, a vector may be split into a list containing 100 separate pieces. (2) A function needs to be defined that performs the desired operations on a single job (one piece of the vector). (3) A cluster of workers needs to be created using, e.g., makeCluster(). (4) The jobs and the function need to be combined in one of parallel's functions. Those functions manage the passing of jobs to individual workers in the cluster and the retrieval of the results of their computations. This particular style of programming is also known as functional programming. Now try to run the code below, which implements the function from above in this 'divide and conquer' manner. If it runs and you understood what it does, compare its execution time to its non-parallel twin above.
library(parallel)
# define data
my_vec <- runif(1e+8)
# create jobs
# matrix splits data in 100 columns
# as.data.frame and as.list transform it to a list with 100 vectors
jobs <- as.list(as.data.frame(matrix(my_vec, ncol = 100)))
# define function
my_fun <- function(x) sort(x)[1]
# create a cluster with as many workers as cores
cl <- makeCluster(detectCores())
# apply function to jobs using a load balanced (LB) handler
result <- clusterApplyLB(cl, jobs, my_fun)
# stop the cluster to free the workers
stopCluster(cl)
# flatten result and apply my_fun one last time
my_fun(unlist(result))
# load package
library(parallel)
# define parallel fun
my_parallel <- expression({
  # create jobs
  # matrix splits data in 100 columns
  # as.data.frame and as.list transform it to a list with 100 vectors
  jobs <- as.list(as.data.frame(matrix(my_vec, ncol = 100)))
  # define function
  my_fun <- function(x) sort(x)[1]
  # create a cluster with as many workers as cores
  cl <- makeCluster(detectCores())
  # apply function to jobs using a load balanced (LB) handler
  result <- clusterApplyLB(cl, jobs, my_fun)
  # stop cluster
  stopCluster(cl)
  # flatten result and apply my_fun one last time
  my_fun(unlist(result))
})
# define data
my_vec <- runif(1e+8)
# test timing
system.time(sort(my_vec)[1])
user system elapsed
10.111 1.284 11.416
system.time(eval(my_parallel))
user system elapsed
2.273 0.830 5.864
For more details check out Hadley Wickham's Advanced R.
For more on parallel computing see Parallel R by McCallum and Weston.