
Overview

In this practical you’ll learn how to create efficient programs in R. By the end of this practical you will know how to:

  1. Profile your code to identify critical parts.
  2. Make your program more efficient.
  3. Run code in parallel.

Packages

Package          Installation
microbenchmark   install.packages("microbenchmark")
profvis          install.packages("profvis")

Glossary

Function           Description
proc.time()        Returns the running time (user, system, elapsed) of the current R process
system.time()      Runs a single expression once and returns its elapsed time
microbenchmark()   Runs one or more expressions many times and returns statistics on their elapsed times
profvis()          Profiles larger expressions or entire scripts

Examples
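
# Example: proc.time and system.time ----------------------------

# A minimal sketch of the two base timing functions from the
# glossary; the timed expression is an arbitrary placeholder.

# proc.time(): take a time stamp before and after the computation
t0 <- proc.time()
x  <- sort(rnorm(1000000))
proc.time() - t0   # user, system, and elapsed time since t0

# system.time(): time a single expression directly
system.time(sort(rnorm(1000000)))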

# Example: Microbenchmark ---------------------------------------   

# Minimal chunks of code can conveniently be tested
# using `microbenchmark()`. 

# load packages
library(microbenchmark)
library(tibble)

# get data
df  <- data.frame('var_1' = rnorm(10000,1),
                  'var_2' = rnorm(10000,1))
tbl <- as_tibble(df)

# microbenchmark: selecting variables
microbenchmark(df[['var_1']],
               df$var_1,
               tbl$var_1)

# Example: Profvis ---------------------------------------   
 
# Larger code chunks or even scripts can be conveniently tested using 
# `system.time()` and `profvis()` from the `profvis` package. The 
# example below is based on the `happiness` data set from the 
# `wooldridge` package.   

# install and load packages
install.packages('profvis')
library(profvis)
library(readr)
library(dplyr)

# profile using profvis
profvis({
  
  # load data
  data <- read_csv('https://tinyurl.com/happy-csv')
  
  # remove rownames
  data <- data[-1]
  
  # mutate
  data$attend_num = data$attend
  data$attend_num[data$attend == 'never'] <- 0
  data$attend_num[data$attend == 'lt once a year'] <- 1
  data$attend_num[data$attend == 'once a year'] <- 2
  data$attend_num[data$attend == 'sevrl times a yr'] <- 3
  data$attend_num[data$attend == 'once a month'] <- 4
  data$attend_num[data$attend == '2-3x a month'] <- 5
  data$attend_num[data$attend == 'nrly every week'] <- 6
  data$attend_num[data$attend == 'every week'] <- 7
  data$attend_num[data$attend == 'more thn once wk'] <- 8
  data$attend_num = as.numeric(data$attend_num)
  
  # select
  north <- data %>%
    filter(region %in% c('new england', 'e. nor. central','w. nor. central'))
  south <- data %>%
    filter(region %in% c('w. sou. central', 'e. sou. central', 'south atlantic'))
  pacific <- data %>%
    filter(region == 'pacific')
  
  # multiple regression 
  # happiness predicted by hours watching tv, 
  # attendance at religious services, and gun ownership
  north_atlantic_model <- glm(vhappy ~ tvhours + attend_num + owngun, 
                              data = north, 
                              family = 'binomial')
  south_central_model <- glm(vhappy ~ tvhours + attend_num + owngun, 
                             data = south, 
                             family = 'binomial')
  pacific_model <- glm(vhappy ~ tvhours + attend_num + owngun, 
                       data = pacific, 
                       family = 'binomial')
  
  # evaluate model
  summary(north_atlantic_model)
  summary(south_central_model)
  summary(pacific_model)
  
  }, interval = .005)

Tasks

A - Microbenchmark

  1. Run the microbenchmark example from above. What do you observe? Is retrieving a variable from a tibble faster or slower than retrieving a variable from a data.frame?

  2. Re-evaluate the results while paying attention to the time unit. How different are the durations in absolute and relative terms, and do the differences matter in realistic settings? How often would you have to select a variable from either object in order to take up more than 1 second?

  3. Although the absolute duration of selecting a variable may be small, let us nonetheless look at another option. Rerun the microbenchmark example, this time including the function .subset2(). Specifically, add .subset2(df, 'var_1') and .subset2(tbl, 'var_1') as additional arguments in the microbenchmark. What do you find?

  4. There is a particular reason for why some functions are faster than others. But before we go into that, let’s look at another striking example. Compare the function mean() to a function that uses its basic ingredients sum() and length(). You can define such a function using my_mean_fun <- function(my_vec) sum(my_vec) / length(my_vec). First, create a vector consisting of random numbers using runif() (see ?runif), e.g., my_vec <- runif(1000000). Then, test the original and your own mean function using microbenchmark(). What do you find? Which function is faster, yours or R’s?

  5. So, why are some functions faster than others? Test the types of mean, sum, and length using typeof(). Notice a difference between mean() and the rest? Now, test each of these functions again using is.primitive(). What do you find?

Explanation: Many functions in R are so-called primitives. This means that the function has not been implemented in R, but in a lower-level language such as C or C++. When speed is important, it is preferable to resort to primitive functions.
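
For illustration, the checks from task 5 produce the following:

# mean() is an R-level closure; sum() and length() are primitives
typeof(mean)          # "closure"
typeof(sum)           # "builtin"
typeof(length)        # "builtin"
is.primitive(mean)    # FALSE
is.primitive(sum)     # TRUE
is.primitive(length)  # TRUE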

B - Profiling

Let us now turn to a more complex code example. In this section, you will profile a larger chunk of code, at the scale of a small analysis script, to identify opportunities to make the code more efficient. Before you start, please install and load the profvis package.

  1. Copy the profiling example above into a new script file, save it into your project, and run it. What parts of the code are most computationally expensive?

  2. There was some part of the code that took much more time than the rest. Why is the code slow? Is it R or could there be some other reason?

C - Making code efficient

Now, try to make the code more efficient.

  1. To speed up the critical part, the reading in of data, save the data on the hard drive (inside your project) using write_csv() and change the code so that the file is read from your hard drive. Has the code become faster? By how much?

  2. Still, reading in the data takes the most time. Try using a different function that has been built for fast reading of data. The data.table package happens to have such a function. It's called fread() and it works much the same way as read_csv(). Any changes to the profile?

Note: You want to run this twice to let the compiler kick in.

  3. Finally, remember that efficient programs are not only fast, but also readable and maintainable. Try to apply what you have learned about data wrangling to improve this piece of code. Make it as readable and maintainable as you can. And, who knows, maybe the code also gets a little faster. One possible direction is sketched below.
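
Here is a minimal sketch of one possible direction for the reading and recoding steps. The local file name 'happy.csv' is a placeholder, and the lookup-vector approach is just one of several reasonable solutions.

# load packages
library(data.table)
library(dplyr)

# read the locally saved copy with data.table's fast reader
# (assumes you previously ran write_csv(data, 'happy.csv'))
data <- fread('happy.csv')

# recode attendance via a named lookup vector instead of nine assignments
attend_levels <- c('never'            = 0, 'lt once a year'   = 1,
                   'once a year'      = 2, 'sevrl times a yr' = 3,
                   'once a month'     = 4, '2-3x a month'     = 5,
                   'nrly every week'  = 6, 'every week'       = 7,
                   'more thn once wk' = 8)
data <- data %>%
  mutate(attend_num = attend_levels[attend])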

Advanced section

In most cases, it is possible to produce efficient code using existing R functions. Sometimes, however, data sets are so big or computations are so complex that additional measures must be taken. Generally, there are two fundamental ways to make code really fast: (a) to implement code in a lower-level language that runs inherently faster and (b) to run code in parallel. In this section you will be getting a taste of both of these approaches.

X - Parallel R

Running code in parallel requires using a somewhat different programming approach. That is, you must split the data into parcels, often called jobs, and specify a function built to process individual parcels, often called a worker. Then, you can use functions in R’s parallel package to assign jobs to workers and collect the results. Let’s begin by defining the jobs and the worker. In this section, we will be looking at a very simple task, to compute the means of every column in a matrix, e.g., my_df = as.data.frame(matrix(as.double(1:100000000), ncol = 100)). (Note: as.double is needed to avoid integer overflow)

  1. Define jobs by splitting the matrix into individual parcels. This could be done by rows or by columns, but we recommend splitting the data by rows. To split the matrix into, e.g., 100 parcels of 10,000 rows each, use the split() function. The split() function is built to slice data frames row-wise as a function of the levels of another variable. Thus, to use the function, create a vector with one value for the first 10,000 entries, another for the next 10,000 entries, and so on. You can create such a vector using rep(). For instance, rep(10, 5) will create 10 10 10 10 10. To create a vector with the needed properties, make use of the fact that rep() is vectorized, i.e., split_vec <- rep(1:100, rep(10000, 100)). Now apply split() to slice my_df using split_vec into 100 separate jobs. What type of object does split() return?

  2. Now, define your worker. The goal is to compute the means of the matrix's columns. However, since we sliced the data frame into 100 pieces of 10,000 rows each, every worker will only have access to pieces of each variable. Thus, we need to find a way to also split the computation of the mean. One way to do this is to have the workers compute the sum of all the values provided to them and then later, when the results of all of the workers have been collected, to compute the sum of sums and divide it by the total number of rows. In pseudo-code: sum(c(sum(worker_1), sum(worker_2), ..., sum(worker_100))) / 1000000. One function to implement this is apply(). apply() allows you to apply functions to data frames (or matrices) row- or column-wise. apply(my_df, 2, sd), for instance, iterates through the columns (Note: 1 = rows, 2 = columns) and calculates the standard deviation using sd(). To instead calculate the sum, replace sd by sum. The last step in creating a worker is to store it in its own function. That is, define worker <- function(df) apply(df, 2, sum).

  3. Now that you have created a jobs object and a worker function, run them in parallel. Do this using R's parallel package, which has become part of the standard R distribution. The first step is to initialize a cluster of R instances using makeCluster(). Specify the size of the cluster (i.e., the number of R instances) and assign the cluster to an object, e.g., cl <- makeCluster(8). I have chosen 8 because I have 8 (virtual) processors in my computer. You may have more or fewer processors. It makes sense to choose a number that is close to the number of processors in your computer. You can find out the number of processors in your computer using detectCores(). The next step is to run the jobs in parallel using clusterApplyLB(). The function takes three arguments: (1) the cluster object, (2) the jobs, and (3) the worker, in that order, i.e., clusterApplyLB(cl, jobs, worker). When you run it, don't forget to assign the output to a new object. Finally, stop the cluster using stopCluster(cl).

  4. Now process the results. This means that you sum up the values produced by the workers separately for each column and divide the respective sums by the total number of rows. One convenient way of doing this is to bind the results into a matrix, where the columns represent the original columns and the rows represent the jobs. Create this matrix using my_result_mat <- do.call(rbind, my_result). Then compute the sums of each column - here you can make use of your worker - and divide the results by the total number of rows.

  5. Finally, compare the speed of applying the worker to the entire object versus running it in parallel using system.time(). One convenient way to compare the two is to store all of the parallel code, beginning with splitting the matrix and ending with calculating the sums of sums and dividing by N, inside a function. What do you observe? (A sketch pulling these steps together follows below.)
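
For reference, here is a minimal sketch of tasks 1-5 pulled together. The wrapper name parallel_means() and the choice of 8 R instances are illustrative assumptions; adjust n_cores to your machine.

# load the parallel package
library(parallel)

# data: 1,000,000 rows, 100 columns
my_df <- as.data.frame(matrix(as.double(1:100000000), ncol = 100))

# worker: column sums of one parcel
worker <- function(df) apply(df, 2, sum)

# the whole parallel pipeline wrapped in one function for timing
parallel_means <- function(df, n_jobs = 100, n_cores = 8) {
  split_vec <- rep(1:n_jobs, each = nrow(df) / n_jobs)
  jobs <- split(df, split_vec)                 # a list of n_jobs data frames
  cl <- makeCluster(n_cores)
  my_result <- clusterApplyLB(cl, jobs, worker)
  stopCluster(cl)
  my_result_mat <- do.call(rbind, my_result)   # jobs x columns
  worker(my_result_mat) / nrow(df)             # sums of sums / N
}

# compare non-parallel and parallel versions
system.time(apply(my_df, 2, mean))
system.time(parallel_means(my_df))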

Was the parallel version really faster than the non-parallel version? I would not be surprised if not. Parallelization needs to be carefully set up and may, in some cases, even hurt execution time. In this case, the reason why the parallel code may have been slower is that it takes R quite some time to set up the R instances in the background. It's literally the same as opening R on your computer several times in a row. Thus, (this kind of) parallelization will only be worth it when the entire job runs for minutes, hours, or more.

Y - Rcpp

Another option to speed up code is to implement essential code chunks in C++ using Rcpp. C++ is a compiled language, meaning that any C++ code will be translated into highly efficient machine code before it is executed, making the code execution extremely fast. Implementing R code in C++ and importing it using Rcpp can mean 10-100x faster code execution (if the original code wasn't C code in the first place; remember primitives). Using high-performance libraries such as RcppArmadillo and multi-threading, e.g., via RcppParallel, further speed-ups are possible.

Using C++, C++ libraries, and multi-threading allows you to tap into the full computing power of your computer. To tap into this potential, it will be necessary to know at least a little bit of C++, but not much. To make things easy, the necessary code will be provided to you. What you will do in the next tasks is to save the code in a C++ (aka Cpp) script, load it into R, and compare the performance to R's own implementations. We begin again with the example of calculating column means of a data frame.

  1. Below is the C++ code. Your first task is to copy this code into a new C++ script file. To do this, select File/New File/C++ File from the RStudio menu. Next, delete everything in that script and replace it with the code below. Finally, save the file inside your project.
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector col_means_cpp(DataFrame& df) {
  int n_rows = df.nrows(), n_cols= df.size() ;
  NumericVector means(n_cols);
  for(int j = 0; j < n_cols; ++j) {
    NumericVector column = df[j];
    double sum = 0;
    for(int i = 0; i < n_rows; ++i){
      sum += column[i];
      }
    means[j] = sum / n_rows;  
    }
  return means;
  }
  2. Now load the C++ script, and with it the function col_means_cpp(), into R. To do this, use sourceCpp() from the Rcpp package. Provide sourceCpp() with the file path to the script and execute it. As a result, you can use col_means_cpp() just like any other R function. As its only argument, it takes a data frame, of which it will compute the column means. This means you can easily compare it to the worker function that we defined above. Use system.time() to measure the speed of the new function and compare it to the worker. How much faster is the new function? Try also to compare it to colMeans(), which is R's own high-performance version of this function. Who wins?

Why does the result of this comparison differ from that in the slides?
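
A minimal sketch of this comparison, assuming the script was saved as 'col_means.cpp' inside the project:

# compile and load the C++ function
library(Rcpp)
sourceCpp('col_means.cpp')

# data and worker as defined in the parallel section
my_df  <- as.data.frame(matrix(as.double(1:100000000), ncol = 100))
worker <- function(df) apply(df, 2, sum)

# compare the three implementations
system.time(worker(my_df) / nrow(my_df))
system.time(col_means_cpp(my_df))
system.time(colMeans(my_df))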

  3. Finally, I want to show you another example that illustrates that these tools can also be used to make statistical tests faster. Below is a piece of code that provides a faster implementation of the linear model (aka regression). Follow the same steps as above to load it into R and then compare it to R's own lm() function. To see how, take a look at the R code below. Note: before you can load this function you must run install.packages('RcppArmadillo').
#include <RcppArmadillo.h>
using namespace Rcpp;

// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
List fast_lm(const arma::mat& X, const arma::colvec& y) {
  int n = X.n_rows, k = X.n_cols;
  
  arma::colvec coef = arma::solve(X, y);    // fit model y ~ X
  arma::colvec res  = y - X*coef;           // residuals
  
  // std.errors of coefficients
  double s2 = std::inner_product(res.begin(), res.end(), res.begin(), 0.0)/(n - k);
  
  arma::colvec std_err = arma::sqrt(s2 * arma::diagvec(arma::pinv(arma::trans(X)*X)));
  
  return List::create(Named("coefficients") = coef,
                      Named("stderr")       = std_err,
                      Named("df.residual")  = n - k);
  }
# Create data
d = data.frame(
    a = rnorm(10000,0,1),
    b = rnorm(10000,0,1),
    c = rnorm(10000,0,1)
    )

# Load C++ code
sourceCpp('C++_code/fast_lm.cpp')

# Compare using microbenchmark
microbenchmark(
  lm(c ~ a + b, data = d),
  fast_lm(as.matrix(d[,c('a','b')]), d$c)
  )

Z - Microsoft R Open

Lastly, I want to touch upon yet another option to speed up code, namely to use a different implementation of R. One such implementation is Microsoft R Open. R Open features high-performance, multi-threaded BLAS/LAPACK linear algebra libraries plus several other features that aim to maximize reproducibility.

In this exercise, you will be comparing two chunks of code in regular R and R Open to see where the performance benefits of R Open lie and where they do not. To do this, first install R Open. This will create a separate R installation. Start it and run the code examples below, once in R Open and once in your regular R. You should see that one of the examples runs dramatically faster, even though it is already very efficiently implemented in regular R, whereas the other one doesn't. This implies that R Open does not yet provide substantial benefits for regular users of R.

WARNING: If you do install R Open, it will become your default R version. As it is not yet compatible with the contents of the following sessions, you will have to revert to the original R. The easiest, but unfortunately slightly time-consuming, way is to re-install the original R.

# function that times functions
timer <- function(expr){ 
  t <- proc.time()[3]
  expr
  proc.time()[3] - t
  }
  
# ---- Matrix multiplication

# create matrix
m <- matrix(1:1000000, ncol = 100)

# function that multiplies the matrix with its transpose
m_square <- function(m) m %*% t(m)

# time m_square
timer(m_square(m))

# ---- Regression

# create data frame
d = data.frame(
    a = rnorm(1000000,0,1),
    b = rnorm(1000000,0,1),
    c = rnorm(1000000,0,1)
    )

# time lm
timer(lm(a ~ b * c, data = d))

Additional reading