In this practical you'll learn how to create efficient programs in R. By the end of this practical you will know how to profile your code, speed up critical chunks, run code in parallel, and move bottlenecks to C++.
Functions to profile your code are:
Function | Package | Description |
---|---|---|
proc.time() | base | Returns the current CPU time. |
system.time() | base | Runs a single expression once and returns the elapsed CPU time. |
microbenchmark() | microbenchmark | Runs one or more expressions many times and returns statistics on the elapsed times. |
profvis() | profvis | Profiles large expressions or entire scripts. |
Minimal chunks of code can conveniently be tested using microbenchmark().
# load packages
library(microbenchmark)
library(tibble)
# get data
df <- data.frame('var_1' = rnorm(10000, 1),
                 'var_2' = rnorm(10000, 1))
tbl <- as_tibble(df)
# microbenchmark pt. 1
microbenchmark(df[['var_1']], df$var_1, tbl$var_1)
Larger code chunks or even entire scripts can conveniently be profiled using system.time() (base) and profvis() from the profvis package. The example below is based on the happiness data set from the wooldridge package.
# ---- install and load packages
install.packages('profvis')
library(profvis)
library(readr)
library(dplyr)
# ---- code to profile
# load data
data <- read_csv('https://tinyurl.com/happy-csv')
# remove rownames
data <- data[-1]
# mutate
data$attend_num = data$attend
data$attend_num[data$attend == 'never'] <- 0
data$attend_num[data$attend == 'lt once a year'] <- 1
data$attend_num[data$attend == 'once a year'] <- 2
data$attend_num[data$attend == 'sevrl times a yr'] <- 3
data$attend_num[data$attend == 'once a month'] <- 4
data$attend_num[data$attend == '2-3x a month'] <- 5
data$attend_num[data$attend == 'nrly every week'] <- 6
data$attend_num[data$attend == 'every week'] <- 7
data$attend_num[data$attend == 'more thn once wk'] <- 8
data$attend_num = as.numeric(data$attend_num)
# select
north <- data %>% filter(region %in% c('new england', 'e. nor. central','w. nor. central'))
south <- data %>% filter(region %in% c('w. sou. central', 'e. sou. central', 'south atlantic'))
pacific <- data %>% filter(region == 'pacific')
# multiple regression
# happiness predicted by hours watching tv,
# attendance at religious services, and gun ownership
north_atlantic_model <- glm(vhappy ~ tvhours + attend_num + owngun,
                            data = north,
                            family = 'binomial')
south_central_model <- glm(vhappy ~ tvhours + attend_num + owngun,
                           data = south,
                           family = 'binomial')
pacific_model <- glm(vhappy ~ tvhours + attend_num + owngun,
                     data = pacific,
                     family = 'binomial')
# evaluate model
summary(north_atlantic_model)
summary(south_central_model)
summary(pacific_model)
# ---- profiling
# profile using profvis
profvis({
# load data
data <- read_csv('https://tinyurl.com/happy-csv')
# remove rownames
data <- data[-1]
# mutate
data$attend_num = data$attend
data$attend_num[data$attend == 'never'] <- 0
data$attend_num[data$attend == 'lt once a year'] <- 1
data$attend_num[data$attend == 'once a year'] <- 2
data$attend_num[data$attend == 'sevrl times a yr'] <- 3
data$attend_num[data$attend == 'once a month'] <- 4
data$attend_num[data$attend == '2-3x a month'] <- 5
data$attend_num[data$attend == 'nrly every week'] <- 6
data$attend_num[data$attend == 'every week'] <- 7
data$attend_num[data$attend == 'more thn once wk'] <- 8
data$attend_num = as.numeric(data$attend_num)
# select
north <- data %>% filter(region %in% c('new england', 'e. nor. central','w. nor. central'))
south <- data %>% filter(region %in% c('w. sou. central', 'e. sou. central', 'south atlantic'))
pacific <- data %>% filter(region == 'pacific')
# multiple regression
# happiness predicted by hours watching tv,
# attendance at religious services, and gun ownership
north_atlantic_model <- glm(vhappy ~ tvhours + attend_num + owngun,
                            data = north,
                            family = 'binomial')
south_central_model <- glm(vhappy ~ tvhours + attend_num + owngun,
                           data = south,
                           family = 'binomial')
pacific_model <- glm(vhappy ~ tvhours + attend_num + owngun,
                     data = pacific,
                     family = 'binomial')
# evaluate model
summary(north_atlantic_model)
summary(south_central_model)
summary(pacific_model)
})
Is selecting a variable from a tibble rather than a data.frame fast or slow? Pay attention to the unit. That is, how different are the durations in absolute and relative terms? Do these differences matter? Consider: how often would you have to select a variable from either object in order to take up more than one second?
# load packages
library(microbenchmark)
library(tibble)
# get data
df <- data.frame('var_1' = rnorm(10000, 1),
                 'var_2' = rnorm(10000, 1))
tbl <- as_tibble(df)
# microbenchmark
microbenchmark(df[['var_1']], df$var_1, tbl$var_1)
## Unit: microseconds
## expr min lq mean median uq max neval
## df[["var_1"]] 3.848 4.2075 4.54733 4.3655 4.4905 21.767 100
## df$var_1 4.621 5.0625 5.46516 5.3025 5.4650 19.734 100
## tbl$var_1 1.172 1.2950 2.07495 1.3960 1.4890 65.521 100
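To put these numbers in perspective, here is a quick back-of-the-envelope calculation based on the median timings above:
# at roughly 5.3 microseconds per df$var_1 access, the number of
# selections needed to accumulate one full second:
1 / 5.3e-6  # roughly 190,000 selections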
Now include the function .subset2() in the comparison. That is, include .subset2(df, 'var_1') and .subset2(tbl, 'var_1') as additional arguments in microbenchmark(). What do you find?
# load packages
library(microbenchmark)
library(tibble)
# get data
df <- data.frame('var_1' = rnorm(1000, 1),
                 'var_2' = rnorm(1000, 1))
tbl <- as_tibble(df)
# microbenchmark
microbenchmark(df[['var_1']], df$var_1, tbl$var_1,
               .subset2(df, 'var_1'), .subset2(tbl, 'var_1'))
## Unit: nanoseconds
## expr min lq mean median uq max neval
## df[["var_1"]] 3097 3232.0 3473.25 3377.0 3510.0 5289 100
## df$var_1 3802 4008.0 4429.09 4155.0 4339.5 21197 100
## tbl$var_1 930 1026.5 1277.52 1100.0 1189.0 14171 100
## .subset2(df, "var_1") 149 167.0 205.24 181.5 193.5 2166 100
## .subset2(tbl, "var_1") 148 168.5 188.72 184.5 196.5 330 100
Compare the speed of R's own mean() to a function that uses its basic ingredients sum() and length(), i.e., my_mean_fun <- function(my_vec) sum(my_vec) / length(my_vec). To do this, first create a vector of random numbers using runif() (see ?runif). Make sure to use a long enough vector, e.g., my_vec <- runif(1000000). Then test the original and your own mean function using microbenchmark(). What do you find? Which function is faster, yours or R's?
# define vector
my_vec <- runif(1000000)
# define my mean fun
my_mean_fun <- function(my_vec) sum(my_vec) / length(my_vec)
# microbenchmark
microbenchmark(mean(my_vec), my_mean_fun(my_vec))
## Unit: microseconds
## expr min lq mean median uq
## mean(my_vec) 1552.516 1572.177 1625.1688 1596.487 1634.542
## my_mean_fun(my_vec) 778.201 780.473 983.0044 804.022 830.401
## max neval
## 2248.554 100
## 17467.716 100
For comparison, we can also call the internal C implementation of mean() directly via .Internal():
# microbenchmark
microbenchmark(mean(my_vec), my_mean_fun(my_vec), .Internal(mean(my_vec)))
## Unit: microseconds
## expr min lq mean median uq
## mean(my_vec) 1594.226 1669.794 1740.7203 1708.118 1765.457
## my_mean_fun(my_vec) 781.397 839.697 878.4002 849.331 894.541
## .Internal(mean(my_vec)) 1580.258 1668.694 1704.3733 1685.894 1726.849
## max neval
## 2351.056 100
## 1242.571 100
## 2133.222 100
Test the type of mean(), sum(), and length() using typeof(). Notice a difference between mean() and the rest? Now test each of these functions again using is.primitive(). What do you find?
# test type
typeof(mean); typeof(sum); typeof(length)
## [1] "closure"
## [1] "builtin"
## [1] "builtin"
Explanation: Many functions in R are so-called primitives. This means that the function is implemented not in R but in a lower-level language such as C. Using such functions is almost always preferable when speed is of the essence.
Let us now turn to a more complex code example. In the following exercises, we want to profile a larger code chunk and then figure out ways to make it more efficient. To do this, you first need to install and load the profvis package.
Run the profvis() function on the code. Which parts of the code are the most computationally expensive?
profvis({
# load data
data <- read_csv('https://tinyurl.com/happy-csv')
# remove rownames
data <- data[-1]
# mutate
data$attend_num = data$attend
data$attend_num[data$attend == 'never'] <- 0
data$attend_num[data$attend == 'lt once a year'] <- 1
data$attend_num[data$attend == 'once a year'] <- 2
data$attend_num[data$attend == 'sevrl times a yr'] <- 3
data$attend_num[data$attend == 'once a month'] <- 4
data$attend_num[data$attend == '2-3x a month'] <- 5
data$attend_num[data$attend == 'nrly every week'] <- 6
data$attend_num[data$attend == 'every week'] <- 7
data$attend_num[data$attend == 'more thn once wk'] <- 8
data$attend_num = as.numeric(data$attend_num)
# select
north <- data %>% filter(region %in% c('new england', 'e. nor. central','w. nor. central'))
south <- data %>% filter(region %in% c('w. sou. central', 'e. sou. central', 'south atlantic'))
pacific <- data %>% filter(region == 'pacific')
# multiple regression
# happiness predicted by hours watching tv,
# attendance at religious services, and gun ownership
north_atlantic_model <- glm(vhappy ~ tvhours + attend_num + owngun,
                            data = north,
                            family = 'binomial')
south_central_model <- glm(vhappy ~ tvhours + attend_num + owngun,
                           data = south,
                           family = 'binomial')
pacific_model <- glm(vhappy ~ tvhours + attend_num + owngun,
                     data = pacific,
                     family = 'binomial')
# evaluate model
summary(north_atlantic_model)
summary(south_central_model)
summary(pacific_model)
}, interval = .005)
In attempting to make code efficient, the first question should always be: is it really necessary? In this case, it is not; but let's pretend it was. Clearly, some parts of the code take much more time than the rest. Let's try to make those parts faster. Why is the code slow? Is it R, or is there some other reason?
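One way to isolate the bottleneck is to time the download-and-parse step on its own (a quick sketch using the same URL as above):
# time just reading the data from the web
system.time(read_csv('https://tinyurl.com/happy-csv'))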
6. To speed up the reading of the data, write it to your hard drive (inside your project) using write_csv(). Then change the code so that the file is read not from the URL but from your hard drive. Has the code become faster? By how much?
# ---- loading the data and saving it to disk
data <- read_csv('https://tinyurl.com/happy-csv')
write_csv(data, 'data/happiness.csv')
# ---- profile code
profvis({
# load data
data <- read_csv('data/happiness.csv')
# remove rownames
data <- data[-1]
# mutate
data$attend_num = data$attend
data$attend_num[data$attend == 'never'] <- 0
data$attend_num[data$attend == 'lt once a year'] <- 1
data$attend_num[data$attend == 'once a year'] <- 2
data$attend_num[data$attend == 'sevrl times a yr'] <- 3
data$attend_num[data$attend == 'once a month'] <- 4
data$attend_num[data$attend == '2-3x a month'] <- 5
data$attend_num[data$attend == 'nrly every week'] <- 6
data$attend_num[data$attend == 'every week'] <- 7
data$attend_num[data$attend == 'more thn once wk'] <- 8
data$attend_num = as.numeric(data$attend_num)
# select
north <- data %>% filter(region %in% c('new england', 'e. nor. central','w. nor. central'))
south <- data %>% filter(region %in% c('w. sou. central', 'e. sou. central', 'south atlantic'))
pacific <- data %>% filter(region == 'pacific')
# multiple regression
# happiness predicted by hours watching tv,
# attendance at religious services, and gun ownership
north_atlantic_model <- glm(vhappy ~ tvhours + attend_num + owngun,
                            data = north,
                            family = 'binomial')
south_central_model <- glm(vhappy ~ tvhours + attend_num + owngun,
                           data = south,
                           family = 'binomial')
pacific_model <- glm(vhappy ~ tvhours + attend_num + owngun,
                     data = pacific,
                     family = 'binomial')
# evaluate model
summary(north_atlantic_model)
summary(south_central_model)
summary(pacific_model)
}, interval = .005)
An even faster way to read the data is to use a specialized import function. The data.table package happens to have such a function. It's called fread() and it works much the same way as read_csv(). Try it and see what happens to the code execution time. Any changes to the profile? Note: you will want to run this twice to let the compiler kick in.
# load data.table
require(data.table)
# ---- profile code
profvis({
# load data
data <- fread('data/happiness.csv')
# remove rownames
data <- data[-1]
# mutate
data$attend_num = data$attend
data$attend_num[data$attend == 'never'] <- 0
data$attend_num[data$attend == 'lt once a year'] <- 1
data$attend_num[data$attend == 'once a year'] <- 2
data$attend_num[data$attend == 'sevrl times a yr'] <- 3
data$attend_num[data$attend == 'once a month'] <- 4
data$attend_num[data$attend == '2-3x a month'] <- 5
data$attend_num[data$attend == 'nrly every week'] <- 6
data$attend_num[data$attend == 'every week'] <- 7
data$attend_num[data$attend == 'more thn once wk'] <- 8
data$attend_num = as.numeric(data$attend_num)
# select
north <- data %>% filter(region %in% c('new england', 'e. nor. central','w. nor. central'))
south <- data %>% filter(region %in% c('w. sou. central', 'e. sou. central', 'south atlantic'))
pacific <- data %>% filter(region == 'pacific')
# multiple regression
# happiness predicted by hours watching tv,
# attendance at religious services, and gun ownership
north_atlantic_model <- glm(vhappy ~ tvhours + attend_num + owngun,
                            data = north,
                            family = 'binomial')
south_central_model <- glm(vhappy ~ tvhours + attend_num + owngun,
                           data = south,
                           family = 'binomial')
pacific_model <- glm(vhappy ~ tvhours + attend_num + owngun,
                     data = pacific,
                     family = 'binomial')
# evaluate model
summary(north_atlantic_model)
summary(south_central_model)
summary(pacific_model)
}, interval = .005)
The remaining bottleneck is the data wrangling. The mutation and filtering steps can be streamlined with dplyr's mutate() and case_when(), and the three models can be fit in one pass by splitting the data into regions and applying a small test function with lapply(). Profile the refactored version below. What do you find?
# ---- profile code
profvis({
# load data
data <- fread('data/happiness.csv')
# process data
data_list = as_tibble(data %>%
  filter(region %in% c('new england', 'e. nor. central', 'w. nor. central',
                       'w. sou. central', 'e. sou. central', 'south atlantic',
                       'pacific')) %>%
  mutate(
    attend_num = case_when(
      attend == 'never' ~ 0,
      attend == 'lt once a year' ~ 1,
      attend == 'once a year' ~ 2,
      attend == 'sevrl times a yr' ~ 3,
      attend == 'once a month' ~ 4,
      attend == '2-3x a month' ~ 5,
      attend == 'nrly every week' ~ 6,
      attend == 'every week' ~ 7,
      attend == 'more thn once wk' ~ 8
    ),
    region_groups = case_when(
      region %in% c('new england', 'e. nor. central', 'w. nor. central') ~ 'north',
      region %in% c('w. sou. central', 'e. sou. central', 'south atlantic') ~ 'south',
      region == 'pacific' ~ 'pacific'
    ))) %>%
  split(., .$region_groups)
# define test fun
my_test <- function(data){
  mod = glm(vhappy ~ tvhours + attend_num + owngun,
            data = data, family = 'binomial')
  return(summary(mod))
}
# run test fun
results = lapply(data_list, my_test)
# print results
results
}, interval = .001)
In most cases, it is possible to produce efficient code using existing R functions. Sometimes, however, data sets are so big or computations so complex that additional measures must be taken. Generally, there are two fundamental ways to make code really fast: (a) implementing code in a lower-level language that inherently runs faster and (b) running code in parallel. In this section you will get a taste of both approaches.
Running code in parallel requires a somewhat different programming style, but only slightly. To make use of parallel execution, we must split the data into parcels, often called jobs, and specify a function built to process individual parcels, often called a worker. Once both are specified, we can use the functions in R's parallel package to assign jobs to workers and collect the results. Let's begin by dealing with the two ingredients, the jobs and the worker. In this section, we will be looking at a very simple task: computing the mean of every column in a matrix.
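Before working through the steps one by one, here is a minimal sketch of the whole jobs/worker pattern on a toy task (summing parcels of numbers), so you can see where each ingredient fits; the names jobs and worker are redefined properly in the exercises below:
# load parallel
library(parallel)
# jobs: split the numbers 1 to 100 into 4 parcels of 25
jobs <- split(1:100, rep(1:4, each = 25))
# worker: a function that processes a single parcel
worker <- function(x) sum(x)
# set up a cluster, distribute the jobs, collect the results, shut down
cl <- makeCluster(2)
res <- clusterApplyLB(cl, jobs, worker)
stopCluster(cl)
unlist(res)  # one sum per parcel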
Assume we have a large data frame, my_df = as.data.frame(matrix(as.double(1:100000000), ncol = 100)), which has 100 columns and 1,000,000 rows, and we seek to split it into 100 individual parcels of 10,000 rows each (Note: as.double is needed to avoid integer overflow). One easy way of doing this is to use the split() function. split() is built to slice data frames row-wise as a function of the levels of another variable. That is, if we had a vector that had one value for the first 10,000 entries, another for the next 10,000 entries, and so on, then split() would create a separate data frame for each distinct value and, thus, 100 parcels of 10,000 rows each. One easy way to create such a vector is to use rep(). rep() creates vectors by repeating its arguments' values. For instance, rep(10, 5) will create 10 10 10 10 10. To create a vector with the desired properties, we can make use of the fact that rep() is vectorized. That is, we can provide vectors for both arguments of the function. To cut things short, we can create our split vector using split_vec <- rep(1:100, rep(10000, 100)). Now apply split() to slice my_df into 100 separate jobs using split_vec. What type of object does split() return?
# my_df
my_df = as.data.frame(matrix(as.double(1:100000000), ncol = 100))
# split_vec
split_vec <- rep(1:100, rep(10000, 100))
# jobs
jobs <- split(my_df, split_vec)
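A few quick checks on what we just created answer the question:
# split() returns a list of data frames
class(jobs)      # "list"
length(jobs)     # 100 parcels
nrow(jobs[[1]])  # 10000 rows each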
Next, we need the worker. Every job is a slice of the data frame, and the column means of the full data frame can be recovered from the column sums of the slices: summing the slices' column sums column-wise and dividing by the total number of rows, i.e., (worker_1 + worker_2 + ... + worker_100) / 1000000. Thus, what we need is a function that computes the column sums of a slice of the data frame. It so happens that R has a convenient (but not exactly fast) tool to do this, called apply(). apply() allows you to apply a function to a data frame (or matrix) row- or column-wise. apply(my_df, 2, sd), for instance, iterates through the columns (1 = rows, 2 = columns) and calculates the standard deviation using sd(). To instead calculate the sum, we merely need to replace sd with sum, i.e., to use apply(my_df, 2, sum). Now we have our worker. The last thing we want to do is store the worker in its own function. That is, we want to run worker <- function(df) apply(df, 2, sum). Let's do this.
# define worker
worker <- function(df) apply(df, 2, sum)
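A quick sanity check that the worker behaves as intended:
# column sums of the first parcel (first three columns)
worker(jobs[[1]])[1:3]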
Now it is time to run the jobs in parallel using R's parallel package, which has been part of the standard R distribution for several years now. The first step in running jobs in parallel is to set up a cluster of R instances. To do this, we use the function makeCluster(). Specify the number of R instances to be contained in the cluster and assign the cluster to an object, e.g., cl <- makeCluster(8). I have chosen 8 because I have 8 (virtual) processors in my computer. You may have more or fewer. It makes sense to choose a number close to the number of processors in your computer. If you don't know how many processors you have, check using detectCores(). OK, now that we have our cluster cl, the next step is to actually run the jobs in parallel. To do this, we use clusterApplyLB(), which takes three arguments: (1) the cluster object, (2) the jobs, and (3) the worker, each of which we should now be ready to supply. Use clusterApplyLB(cl, jobs, worker) and don't forget to assign the output to a new object. The last part of running code in parallel is to stop the cluster using stopCluster(cl). Try it out and store the result in my_result.
# load parallel
library(parallel)
# create cluster
cl <- makeCluster(8)
# run in parallel
my_result <- clusterApplyLB(cl, jobs, worker)
# stop cluster
stopCluster(cl)
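If you are unsure how many instances to request, detectCores() (mentioned above but not shown in the answer) tells you:
# number of (logical) processors available on this machine
detectCores()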
The result my_result is a list of 100 vectors of column sums. To combine them into a matrix, one convenient tool is do.call(). Try my_result_mat <- do.call(rbind, my_result). The only thing left to do now is to compute the column sums once more (here we can again make use of our worker) and then divide the results by the total number of rows. Let's do this.
# bind results to matrix
my_result_mat <- do.call(rbind, my_result)
# compute sums and divide
worker(my_result_mat) / nrow(my_df)
## V1 V2 V3 V4 V5 V6
## 500000.5 1500000.5 2500000.5 3500000.5 4500000.5 5500000.5
## V7 V8 V9 V10 V11 V12
## 6500000.5 7500000.5 8500000.5 9500000.5 10500000.5 11500000.5
## V13 V14 V15 V16 V17 V18
## 12500000.5 13500000.5 14500000.5 15500000.5 16500000.5 17500000.5
## V19 V20 V21 V22 V23 V24
## 18500000.5 19500000.5 20500000.5 21500000.5 22500000.5 23500000.5
## V25 V26 V27 V28 V29 V30
## 24500000.5 25500000.5 26500000.5 27500000.5 28500000.5 29500000.5
## V31 V32 V33 V34 V35 V36
## 30500000.5 31500000.5 32500000.5 33500000.5 34500000.5 35500000.5
## V37 V38 V39 V40 V41 V42
## 36500000.5 37500000.5 38500000.5 39500000.5 40500000.5 41500000.5
## V43 V44 V45 V46 V47 V48
## 42500000.5 43500000.5 44500000.5 45500000.5 46500000.5 47500000.5
## V49 V50 V51 V52 V53 V54
## 48500000.5 49500000.5 50500000.5 51500000.5 52500000.5 53500000.5
## V55 V56 V57 V58 V59 V60
## 54500000.5 55500000.5 56500000.5 57500000.5 58500000.5 59500000.5
## V61 V62 V63 V64 V65 V66
## 60500000.5 61500000.5 62500000.5 63500000.5 64500000.5 65500000.5
## V67 V68 V69 V70 V71 V72
## 66500000.5 67500000.5 68500000.5 69500000.5 70500000.5 71500000.5
## V73 V74 V75 V76 V77 V78
## 72500000.5 73500000.5 74500000.5 75500000.5 76500000.5 77500000.5
## V79 V80 V81 V82 V83 V84
## 78500000.5 79500000.5 80500000.5 81500000.5 82500000.5 83500000.5
## V85 V86 V87 V88 V89 V90
## 84500000.5 85500000.5 86500000.5 87500000.5 88500000.5 89500000.5
## V91 V92 V93 V94 V95 V96
## 90500000.5 91500000.5 92500000.5 93500000.5 94500000.5 95500000.5
## V97 V98 V99 V100
## 96500000.5 97500000.5 98500000.5 99500000.5
Let's see whether the parallelization was worth it by comparing the parallel and the non-parallel approach using system.time(). One convenient way to compare the two is to store all of the parallel code, beginning with splitting the data frame and ending with calculating the sums of sums and dividing by N, inside a function. What do you observe?
# my_df
my_df = as.data.frame(matrix(as.double(1:100000000), ncol = 100))
# ---- define parallel fun
my_parallel_fun <- function(n_jobs = 100){
  # split_vec
  split_vec <- rep(1:n_jobs, rep(nrow(my_df) / n_jobs, n_jobs))
  # jobs
  jobs <- split(my_df, split_vec)
  # create cluster
  cl <- makeCluster(8)
  # run in parallel (jobs in random order)
  my_result <- clusterApplyLB(cl, jobs[sample(1:length(jobs))], worker)
  # stop cluster
  stopCluster(cl)
  # bind results to matrix
  my_result_mat <- do.call(rbind, my_result)
  # compute sums and divide
  worker(my_result_mat) / nrow(my_df)
}
# ---- compare speed
system.time(my_parallel_fun(100))
## user system elapsed
## 2.911 0.503 5.550
system.time(apply(my_df, 2, mean))
## user system elapsed
## 1.690 1.077 2.879
Was the parallel version really faster than the non-parallel version? I would not be surprised if not. Parallelization needs to be carefully set up and may in some cases even hurt performance. In this case, the reason the parallel code may have been slower is that it takes R quite some time to set up the R instances in the background. It's literally the same as opening R on your computer several times in a row. Thus, (this kind of) parallelization is only worth it when the entire job runs in minutes or more rather than in seconds.
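You can measure this start-up overhead directly; spinning up and shutting down the instances alone typically takes a noticeable fraction of the total run time:
# the fixed cost of creating and stopping a cluster, in isolation
system.time({
  cl <- makeCluster(8)
  stopCluster(cl)
})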
Another option to speed up code is to implement essential code chunks in C++ using Rcpp. C++ is a compiled language, meaning that any C++ code is translated into highly efficient machine code before it is executed. For this reason, implementing R code in C++ and importing it using Rcpp can mean 10-100x faster code execution (if the original code wasn't C code in the first place; remember primitives). Using high-performance libraries such as RcppArmadillo and multi-threading, e.g., via RcppParallel, further speed-ups are possible.
Essentially, using C++, C++ libraries, and multi-threading allows you to tap into the full computing power of your computer. Unless you are a highly experienced programmer who specializes in optimizing code, you will not be able to write code that is faster than code implemented using these tools, irrespective of the programming language you are using. To tap into this potential, however, it is necessary to know at least a little bit of C++. As this is a course on R, I provide the necessary code below. The only thing you have to do is save the code in C++ (aka Cpp) scripts, load them into R, and compare the performance to R's own implementations. We begin again with the example of calculating the column means of a data frame.
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector col_means_cpp(DataFrame& df) {
  int n_rows = df.nrows(), n_cols = df.size();
  NumericVector means(n_cols);
  // iterate over the columns, summing each and dividing by the number of rows
  for(int j = 0; j < n_cols; ++j) {
    NumericVector column = df[j];
    double sum = 0;
    for(int i = 0; i < n_rows; ++i){
      sum += column[i];
    }
    means[j] = sum / n_rows;
  }
  return means;
}
To load the C++ function into R, use sourceCpp() from the Rcpp package. Provide to sourceCpp() the file path to the function's script and execute it. If there is no error, we can now use col_means_cpp(), which is the name of the function, just like any other R function. As its only argument, it takes a data frame. Thus, we can easily compare it to the worker function we defined above. Use system.time() to measure the speed of our new function and compare it to our worker. How much faster is the new function? Try also comparing it to colMeans(), which is R's own high-performance version of this function. Who wins? Why might the result of this comparison differ from that in the slides?
# ----- Load C++ code
# load Rcpp
library(Rcpp)
# load C++ function
sourceCpp('C++_code/col_mean_cpp.cpp')
# compare to alternative implementations
system.time(worker(my_df))
## user system elapsed
## 1.434 0.829 2.340
system.time(col_means_cpp(my_df))
## user system elapsed
## 0.086 0.000 0.086
system.time(colMeans(my_df))
## user system elapsed
## 0.434 0.217 0.651
Using the RcppArmadillo library, we can even implement a faster version of R's lm() function. To see how, take a look at the C++ code below. Note: before you can load this function you must run install.packages('RcppArmadillo').
#include <RcppArmadillo.h>
using namespace Rcpp;
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
List fast_lm(const arma::mat& X, const arma::colvec& y) {
  int n = X.n_rows, k = X.n_cols;
  arma::colvec coef = arma::solve(X, y);  // fit model y ~ X
  arma::colvec res  = y - X*coef;         // residuals
  // std. errors of the coefficients
  double s2 = std::inner_product(res.begin(), res.end(), res.begin(), 0.0) / (n - k);
  arma::colvec std_err = arma::sqrt(s2 * arma::diagvec(arma::pinv(arma::trans(X)*X)));
  return List::create(Named("coefficients") = coef,
                      Named("stderr")       = std_err,
                      Named("df.residual")  = n - k);
}
# Create data
d = data.frame(
  a = rnorm(10000, 0, 1),
  b = rnorm(10000, 0, 1),
  c = rnorm(10000, 0, 1)
)
# Load C++ code
sourceCpp('C++_code/fast_lm.cpp')
# Compare using microbenchmark
microbenchmark(
  lm(c ~ a + b, data = d),
  fast_lm(as.matrix(d[, c('a', 'b')]), d$c)
)
## Unit: microseconds
## expr min lq mean
## lm(c ~ a + b, data = d) 1936.927 2179.7845 3673.6864
## fast_lm(as.matrix(d[, c("a", "b")]), d$c) 357.595 443.0785 745.9288
## median uq max neval
## 2977.674 4208.382 17420.984 100
## 563.241 752.713 5503.501 100
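One caveat worth noting about this comparison: lm(c ~ a + b) fits an intercept, whereas fast_lm() fits exactly the columns you pass in X. For a like-for-like fit, prepend a column of ones (a sketch):
# add an explicit intercept column to the design matrix
X <- cbind(intercept = 1, as.matrix(d[, c('a', 'b')]))
fit <- fast_lm(X, d$c)
fit$coefficients  # first entry is the intercept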
Lastly, I want to touch upon yet another option to speed up code: using a different implementation of R. One such implementation is Microsoft R Open. R Open features high-performance, multi-threaded BLAS/LAPACK linear algebra libraries plus several other features that aim to maximize reproducibility.
In this exercise, you will compare two chunks of code in regular R and in R Open to see where the performance benefits of R Open lie and where they do not. To do this, first install R Open. This will create a separate R instance. Start it and run the code examples below, once in R Open and once in your regular R. You should see that one of the examples runs dramatically faster, even though it is already very efficiently implemented in regular R, whereas the other one does not. This implies that R Open does not yet provide substantial benefits for regular users of R.
WARNING: If you do install R Open, it will become your default R version. As it is not yet compatible with the contents of the following sessions, you will have to revert to the original R. The easiest, but unfortunately somewhat time-consuming, way is to re-install the original R.
# function that times functions
timer <- function(expr){
  t <- proc.time()[3]
  expr
  proc.time()[3] - t
}
# ---- Matrix multiplication
# create matrix
m <- matrix(1:1000000, ncol = 100)
# function that multiplies a matrix with itself
m_square <- function(m) m %*% t(m)
# time m_square
timer(m_square(m))
# ---- Regression
# create matrix
d = data.frame(
  a = rnorm(1000000, 0, 1),
  b = rnorm(1000000, 0, 1),
  c = rnorm(1000000, 0, 1)
)
# time lm
timer(lm(a ~ b * c, data = d))
For more details, check out Hadley Wickham's Advanced R.
For more details on parallel computing, see Parallel R by McCallum and Weston.
For more details on R's C++ interface, see Seamless R and C++ Integration with Rcpp by Eddelbuettel.
For more details on the fundamentals, check out R Internals.