class: center, middle, inverse, title-slide # Efficient programming ### The R Bootcamp
www.therbootcamp.com
@therbootcamp
### July 2018 --- layout: true <div class="my-footer"><span> <a href="https://therbootcamp.github.io/"><font color="#7E7E7E">Basel, July 2018</font></a>                                           <a href="https://therbootcamp.github.io/"><font color="#7E7E7E">www.therbootcamp.com</font></a> </span></div> --- # What is efficient code? .pull-left4[ ><font size = 5>"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered."<br> ><font size = 5>-- Donald Knuth ] .pull-right5[ <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/donald_knuth.jpeg" width="500"> <p align="center"><font size=3>Donald E. Knuth<br>Author of <a href:"https://de.wikipedia.org/wiki/The_Art_of_Computer_Programming">The Art of Programming</a><br>source<a href="http://www-cs-faculty.stanford.edu/"> http://www-cs-faculty.stanford.edu/</a> ] --- # Why is R slow? And is it? .pull-left45[ ><font size = 5>R is not a fast language. This is not an accident. R was purposely designed to make data analysis and statistics easier for you to do. It was not designed to make life easier for your computer. While R is slow compared to other programming languages, for most purposes, it’s fast enough. ><font size = 5>-- Hadley Wickham ] .pull-right45[ <font size = 5>*Reasons for R being slow*<br> <high>Extreme dynamism</high> - allows you to code flexibly. - weak dynamic typing - copy-on-modify semantics - Pass-by-value <br2> <high>Name lookup</high> (in environments/namespaces) - allows you to import packages and name your objects flexibly. ] --- .pull-left35[ # When is R slow? <font size = 5> (R) programs are <high>slow</high> when a <high>large number computations</high> of computations are required.<br><br> Usually this involves, explicitly or implicitly (i.e., in the background), running a <high>loop</high>. </font> ] .pull-right55[ <br><br> ```r # define a vector of variables vars <- c('sex', 'age', 'height', 'income') # loop that copies my_vec = numeric() for(i in 1:length(vars)){ var <- baselers[[vars[i]]] for(j in 1:length(var)){ my_vec = c(my_vec, var[j]) } } # loop that doesn't copy my_vec = numeric(nrow(baselers) * length(vars)) ind = 0 for(i in 1:length(vars)){ var <- baselers[[vars[i]]] for(j in 1:length(var)){ ind = ind + 1 my_vec[ind] = var[j] } } ``` ] --- # Hierarchy of programming lanuguages .pull-left4[ There is a general <high>trade-off</high> between <high>readability and flexibility</high> on the one hand and <high>speed</high> on the other. Languages close to the machine are <high>fast but impossible to read</high> and very rigid, e.g., `Assembly`. <u>A function written in `Assembly`</u> <p align="center"> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/assembly.png"> </p> ] .pull-right55[ <p align="center"> <img src="https://cdn-images-1.medium.com/max/1600/1*8j2PmhExz4q87OoddaH7ag.png" height = 400 vspace="10px"> </p> ] --- # Profiling The first step to making your code efficient is to <high>identify critical parts</high> of your code. There are several options... | Function| Package| Description| |:------|:------|:------------| | `proc.time()` |`base`|Returns the time.| | `system.time()`| `base`|Runs one expression once and returns elapsed CPU time| | `microbenchmark()`|`microbenchmark`| Runs one or many expressions multiple times and returns statistics on elapsed time.| | `profvis()`|`profvis`| Evaluates larger code chunks and entire scripts.| | `lineprof(), shine()`|`lineprof`| Similar to `profvis` but deprecated (From Hadley's Github)| --- # Profiling: Example Responsible for slow code are often <high>small parts of the code</high> that take orders of magnitudes longer than everything else. Profiling is about figuring out which parts of your code take so long. .pull-left45[ ```r profvis({ # load data data <- read_csv('data/baselers.csv') # mutate data <- data %>% mutate(happy_bin = case_when( happiness <= 7 ~ 0, happiness > 7 ~ 1) ) # multiple regression model <- lm(rich ~ tattoos + happy_bin, data = data) # evaluate model summary(model) }, interval = .005) ``` ] .pull-right5[ <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/profvis.png" width="500"> ] --- # Improving performance <font size = 4><i>beginner</i><br> <font size = 6> 1. Look for existing solutions<br> 2. Do less work<br> 3. Vectorise<br> 4. Parallelise<br> 5. Avoid copies<br> 6. Byte-code compile<br> </font> <br> <font size = 4><i>advanced</i><br> <font size = 6> 7. Rcpp<br> 8. Using a different R<br> </font> --- # Look for an existing solution .pull-left3[ Almost always your problem has been solved by someone else. <font size = 4><i>Look for solutions in:</i></font> <high>Base R</high> which can be amazingly fast. <high>Other packages</high> which often provide faster versions of one and the same function. <a href="http://google.com" align="center">google</a>, <a href="http://stackoverflow.com" align="center">stackoverflow</a>, <a href="http://rseek.org/" align="center">rseek</a> ] .pull-right6[ <div style="margin:0px auto; width:100%; float:right"> <div style="float:left"; margin:0; width:48%"> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/stacksite.png" height="360" width="310" align="center"><br> <p align = "center"><a href="stackoverflow.com" align="center"><font size = 3>https://stackoverflow.com</font></a></p> </div> <div style="float:right"; margin:0; width:48%"> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/rseeksite.png" height="360" width="310" align="center"><br> <p align = "center"><a href="http://rseek.org" align="center"><font size = 3>http://rseek.org</font></a></p> </div> </div> ] --- # Do as little as possible .pull-left3[ <high>Do everything only once</high> (or exactly as often as needed). Don't repeated yourself. Use <high>tailor-made</high> functions, e.g., `data.table::fread()`. Use <high>primitive</high> functions, e.g., `sum(x)/length(x)` rather than `mean(x)`. <high>Be specific</high>, e.g., `unlist(x, use.names = F)` vs. `unlist()`. ] .pull-right65[ ```r # load package library(microbenchmark, quietly = T) # microbenchmark # microbenchmark( # local = read_csv('data/titanic.csv'), # fread = fread('data/titanic.csv'), # times = 10) ``` ] --- # Vectorise .pull-left35[ Whenever possible <high>use vector operations</high> or functions that do vectorized operations. In other words, <high>don't use loops</high>, stay away from all <high>apply idoms</high>, such as `apply()`, `sapply()`, `tapply()`, etc. Yet in other words, <high>use functions</high> that have been <high>implemented in C/C++</high>. ] .pull-right6[ ```r # create data my_data <- matrix(rnorm(1000000), ncol = 10) # microbenchmark microbenchmark( colMeans = colMeans(my_data), apply = apply(my_data, 2, mean), times = 10) ``` ``` ## Unit: microseconds ## expr min lq mean median uq max neval ## colMeans 837 859 878 871 880 934 10 ## apply 13314 15888 17645 17433 17794 25038 10 ``` ] --- .pull-left3[ # Avoid copies **R always copies.** whenever using `c()`, `cbind()`, `rbind()`, `paste()` R creates a copy large enough to contain its inputs. ] .pull-right65[ ```r # define vectors short_vec <- runif(10) long_vec <- runif(100) # define collapse function combine <- function(x) { my_vec = c() for(i in 1:length(x)) my_vec = c(my_vec, x[i]) } # microbenchmark microbenchmark( loop10 = combine(short_vec), loop100 = combine(long_vec), vec10 = c(short_vec), vec100 = c(long_vec) ) ``` ``` ## Unit: nanoseconds ## expr min lq mean median uq max neval ## loop10 2617 2930 26055 3180 3481 2239671 100 ## loop100 35294 56900 65350 61744 66720 162358 100 ## vec10 143 175 246 212 268 2291 100 ## vec100 343 415 922 474 648 7585 100 ``` ] --- .pull-left2[ # Byte-code<br>compilation R can be compiled to byte-code, to create a faster, <high>lower-level version of the code</high>. <high>R does this automatically</high> with the first execution of a function. Thus, the second time a function is executed it will be faster. ] .pull-right7[ ```r # define unneccessarily complex function and compile my_fun <- function(x, f) sapply(x, f) my_fun_c <- compiler::cmpfun(my_fun) # define some awfully complex data x <- list(1:10, letters, c(F, T), NULL) # microbenchmark microbenchmark(my_fun(x, is.null), my_fun_c(x, is.null), unit='us') ``` ``` ## Unit: microseconds ## expr min lq mean median uq max neval ## my_fun(x, is.null) 8.85 9.30 20.6 9.62 10.2 980.3 100 ## my_fun_c(x, is.null) 8.88 9.46 11.8 9.84 10.3 53.2 100 ``` ```r # microbenchmark again microbenchmark(my_fun(x, is.null), my_fun_c(x, is.null), unit='us') ``` ``` ## Unit: microseconds ## expr min lq mean median uq max neval ## my_fun(x, is.null) 8.36 8.68 9.1 8.82 9.06 16.7 100 ## my_fun_c(x, is.null) 8.33 8.61 9.3 8.79 9.00 39.6 100 ``` ] --- .pull-left2[ # Parallel<br>computing When working with large data one of the <high>best ways to speed up</high> execution is parallel execution. Parallel execution means splitting the data in many <high>jobs</high> and having many <high>workers</high> (separate R instances equipped with a worker function) complete the jobs in parallel. ] .pull-right7[ ```r # define data and splitted data (jobs) data <- matrix(rnorm(10000000), ncol = 10) split_data <- lapply(1:10, function(i) data[(1:1000)+(i-1)*1000, ]) # open cluster require(parallel) clu <- makeCluster(5) # my cluster fun my_cluster_fun <- function(split_data){ # apply cluster function out <- clusterApplyLB(clu, split_data, colMeans) # combine results colMeans(do.call(rbind, out)) } # microbenchmark microbenchmark(vectorz = colMeans(data), cluster = my_cluster_fun(split_data)) ``` ``` ## Unit: milliseconds ## expr min lq mean median uq max neval ## vectorz 7.82 8.46 9.16 9.08 9.81 11.73 100 ## cluster 4.32 4.95 5.47 5.28 5.79 8.82 100 ``` ] --- .pull-left2[ # Rcpp Another very effective, but quite advanced option is to write essential code chunks in C++ using Rcpp - <high>R's C++ interface</high>. Many functions are already implemented in C++ or Fortran. Thus, large benefits will only be seen for <high>for custom functions</high>. <a href="http://dirk.eddelbuettel.com/code/rcpp/Rcpp-quickref.pdf"><high>Quick-Guide</high></a> ] .pull-right7[ ```r # define data my_data <- matrix(rnorm(10000000),ncol = 10) # define function my_Rcpp_fun = "NumericVector colMeans_c(NumericMatrix& mat) { int n_rows = mat.nrow(), n_cols= mat.ncol() ; NumericVector means(n_cols); for(int j = 0; j < n_cols; ++j) { double sum = 0; for(int i = 0; i < n_rows; ++i) sum += mat(i, j); means[j] = sum / n_rows; } return means; }" # compile function require(Rcpp) ; cppFunction(my_Rcpp_fun) # microbenchmark microbenchmark(vector_implementation = colMeans(my_data), rcpp_implementation = colMeans_c(my_data)) ``` ``` ## Unit: milliseconds ## expr min lq mean median uq max neval ## vector_implementation 7.57 8.34 8.76 8.70 9.14 11.2 100 ## rcpp_implementation 7.66 8.12 8.69 8.59 9.05 11.6 100 ``` ] --- # Alternative R implementations from <a href="http://adv-r.had.co.nz/Performance.html"> <high>Advanced R</high></a> | R implementation| Author | Description| |:------|:------|:------------| | [`pqR`](http://www.pqr-project.org/) |Radford Neal|This *p*retty *q*uick version of *R* builds on R 2.15.0; it fixes several performance issues, provides better memory management and some support for automatic multithreading.| | [`Renjin`](http://www.renjin.org/) |BeDataDriven| Renjin uses the Java virtual machine.| | [`FastR`](http://www.oracle.com/technetwork/java/jvmls2013vitek-2013524.pdf) |Purdue University| FastR is similar to Renjin. Optimisation is more ambitious, but at this point less mature.| | [`Riposte`](https://research.tableau.com/sites/default/files/pact2012-talbot-riposte.pdf) |Justin Talbot and Zachary DeVito| Riposte is experimental and ambitious. It's work in progress, but the existing implementations are extremely fast. Riposte is described in more detail in Riposte: A Trace-Driven Compiler and Parallel VM for Vector Code in R.| --- # Microsoft R Open .pull-left4[ Microsoft [R Open](https://mran.microsoft.com/) is the enhanced distribution of R from Microsoft Corporation. Open R interfaces with the high-performance, multi-threaded [<high>BLAS/LAPACK</high>](http://www.netlib.org/lapack/lug/node11.html) linear algebra libraries for superior performance. <high>Maximize reproducibility</high> by freezing the set of base packages with every version of Open R. Sort of R's Matlab. ] .pull-right45[ <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/openr.png" width="350"> ] --- # Efficient code is readable and maintainable Time-wise <high>one of the biggest cost factors is reading, understanding, debugging</high> your own code and that of others, and this time often far outweighs the benefits of making code faster. .pull-left45[ ```r library(readr ) library( magrittr) library( dplyr) syc = read_csv('https://tinyurl.com/y8gqcht3') syc = syc%>%mutate(fcm=father/2.54,mcm=mother/2.54) a = syc$father b = syc[['mother']] t.test(a,b) ``` ] .pull-right45[ ```r # load packages library(readr) library(magrittr) library(dplyr) # read galton data galton_data <- read_csv( 'https://tinyurl.com/y8gqcht3' ) # transform to metric system galton_data = galton_data %>% mutate(father_cm = father / 2.54, mother_cm = mother / 2.54) #conduct t-test father <- galton_data$father mother <- galton_data$mother t.test(father, mother) ``` ] --- # Practical <p><font size=6><b><a href="https://therbootcamp.github.io/BaselRBootcamp_2018July/_sessions/EfficientCode/EfficientCode_practical.html">Link to practical</a>