Chapter 17 Optimizing code (and why you often shouldn’t)

Optimizing code means spending a bit more human effort to write code that runs faster on your computer. In some cases this can be a critical boost for running a simulation, where you inherently do the same things many times over. Cutting runtime down will always be tempting, as it allows you to run more replicates and get more precisely estimated performance measures for your simulation.

That being said, beyond a few obvious coding tricks we will discuss, you should optimize code only after you discover you need to. Optimizing as you go usually means spending a lot of time wrestling with code that is far more complicated than it needs to be. It is also often misdirected effort: frequently the estimation method is what takes most of the computational time, so having very fast data-generation code will not help overall simulation runtimes much, because you are tweaking only a small slice of the overall pie. Keep things simple; in general, your time is more important than the computer's time.

In the next sections we will look at a few optimization efforts applied to the ANOVA example in the prior chapters.

17.1 Hand-building functions

In the ANOVA and Welch example of the previous chapters, we used R's built-in ANOVA implementation. An alternative approach is to “hand roll” the ANOVA F statistic and test directly. Doing so by hand can set you up to implement modified versions of these tests later on. Also, although hand-building a method takes more work to program, it can result in faster code (as is actually the case here), which in turn can make the overall simulation faster.

Following the formulas on p. 129 of Brown and Forsythe (1974), we can compute the test statistic and its p-value directly from the group means, variances, and sample sizes.
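In the notation of the code below, with group means \(\bar{x}_j\), variances \(s_j^2\), and sizes \(n_j\) for groups \(j = 1, \dots, g\), grand mean \(\bar{x}\), and total sample size \(N = \sum_j n_j\), the classical one-way ANOVA statistic is

\[
F = \frac{\sum_{j=1}^g n_j (\bar{x}_j - \bar{x})^2 / (g - 1)}{\sum_{j=1}^g (n_j - 1) s_j^2 / (N - g)},
\]

with the p-value taken from an \(F(g - 1,\, N - g)\) reference distribution. Translating this into code (using data as generated in Chapter 6):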

ANOVA_F <- function(sim_data) {

  # Per-group means, variances, and sample sizes
  x_bar <- with(sim_data, tapply(x, group, mean))
  s_sq <- with(sim_data, tapply(x, group, var))
  n <- table(sim_data$group)
  g <- length(x_bar)

  df1 <- g - 1
  df2 <- sum(n) - g

  # Between- and within-group mean squares, then the F statistic and p-value
  msbtw <- sum(n * (x_bar - mean(sim_data$x))^2) / df1
  mswn <- sum((n - 1) * s_sq) / df2
  fstat <- msbtw / mswn
  pval <- pf(fstat, df1, df2, lower.tail = FALSE)

  return(pval)
}

ANOVA_F(sim_data)
## [1] 0.009208908

To see the difference between our version and R’s version, we can use an R package called microbenchmark to test how long the computations take for each version of the function. The microbenchmark function runs each expression 100 times (by default) and tracks how long the computations take. It then summarizes the distribution of timings:

library(microbenchmark)
timings <- microbenchmark(Rfunction = ANOVA_F_aov(sim_data),
                          direct    = ANOVA_F(sim_data))
timings
## Unit: microseconds
##       expr   min    lq    mean median     uq    max neval
##  Rfunction 380.9 414.6 507.974  434.7 468.45 3151.9   100
##     direct 148.8 167.5 198.507  185.3 202.65  493.5   100

Comparing mean timings, the direct function is about 2.6 times faster than the built-in R function.
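The ratio can be computed directly from the benchmark summary, which has one row per expression in the order they were passed to microbenchmark() (this is the same calculation we use again at the end of this chapter):

# Ratio of mean timings: built-in version relative to the direct version
with(summary(timings), round(mean[1] / mean[2], 1))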

This result is not unusual. Built-in R functions usually include lots of checks and error-handling, which take time to compute. These checks are crucial for messy, real-world data analysis but unnecessary with our pristine, simulated data. Here we can skip them by doing the calculations directly. In general, however, this is a trade-off: writing something yourself gives you plenty of chances to get something wrong, throwing off all your simulations. It might be faster, but you may pay dearly for it in terms of extra hours coding and debugging. Optimize only if you need to!
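
A cheap way to manage this trade-off is to check the hand-rolled function against the trusted built-in version once before using it in a simulation. A minimal sketch, assuming sim_data and ANOVA_F() as defined above (oneway.test() with var.equal = TRUE computes the same classical \(F\) test):

# One-time correctness check: stop immediately if the hand-rolled p-value
# does not match R's built-in equal-variance one-way ANOVA.
stopifnot(all.equal(
  ANOVA_F(sim_data),
  oneway.test(x ~ factor(group), data = sim_data, var.equal = TRUE)$p.value
))

We do exactly this kind of check, more systematically, later in this chapter.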

17.2 Computational efficiency versus simplicity

An alternative to writing a function that generates a single set of data per call is to write one that generates multiple sets of simulated data all at once.

For example, for our ANOVA example we could specify that we want R replications of the study and have the function spit out a matrix with R columns, one for each simulated dataset:

generate_data_matrix <- function(mu, sigma_sq, sample_size, R) {

  N <- sum(sample_size) 
  g <- length(sample_size) 
  
  group <- rep(1:g, times = sample_size) 
  mu_long <- rep(mu, times = sample_size)
  sigma_long <- rep(sqrt(sigma_sq), times = sample_size) 

  # Each column of x_mat is one simulated dataset; rnorm() recycles
  # mu_long and sigma_long down each column.
  x_mat <- matrix(rnorm(N * R, mean = mu_long, sd = sigma_long),
                  nrow = N, ncol = R)
  sim_data <- list(group = group, x_mat = x_mat)
    
  return(sim_data)
}

generate_data_matrix(mu = mu, sigma_sq = sigma_sq,
                     sample_size = sample_size, R = 4)
## $group
##  [1] 1 1 1 2 2 2 2 2 2 3 3 4 4 4 4
## 
## $x_mat
##             [,1]      [,2]       [,3]       [,4]
##  [1,]  2.1639323  1.646791 -0.9445957  1.2050393
##  [2,] -2.8225439  3.289125  0.8863700  3.7816431
##  [3,]  1.3419979 -3.020363  1.6196826  0.6425271
##  [4,] -0.5939495  1.184961  1.0200117  2.8301117
##  [5,]  0.1054279  3.499446  1.0267897  2.3166570
##  [6,]  4.0618127  4.340879  4.1291935 -1.0532910
##  [7,]  2.0611736  2.544518  1.3772973 -0.3861088
##  [8,]  2.2921786  2.299136  0.1706634  2.0329848
##  [9,]  2.5000308  3.686171  1.8712605  2.8557713
## [10,]  4.0966732  5.824634  6.6448117  2.6354699
## [11,]  5.9111997  3.480771  2.9453236  4.1680932
## [12,]  6.2201594  6.116809  5.8446326  6.2140816
## [13,]  5.9843999  6.488239  7.1954303  6.6280499
## [14,]  6.3928820  4.699057  5.9033074  8.8796297
## [15,]  7.9611464  7.282270  5.9095360  5.9131754

This approach is a bit more computationally efficient because the setup calculations (getting N, g, group, mu_long, and sigma_long) only have to be done once instead of once per replication. It also makes clever use of vector recycling in the call to rnorm(). However, the structure of the resulting data is more complicated, which will make it more difficult to do the later estimation steps. Furthermore, if the number of replicates R is large and each replication produces a large dataset, this “all-at-once” approach will entail generating and holding very large amounts of data in memory, which can create other performance issues. On balance, we recommend the simpler approach of writing a function that generates a single simulated dataset per call (unless and until you have a principled reason to do otherwise).
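
To make the added complexity concrete, here is a sketch of how you might apply the hand-rolled ANOVA_F() from above to this matrix output: each replication's data frame has to be rebuilt from a column of x_mat before the test can be run.

# Run ANOVA_F() on each simulated dataset (one column of x_mat per dataset)
dat <- generate_data_matrix(mu = mu, sigma_sq = sigma_sq,
                            sample_size = sample_size, R = 4)
p_values <- apply(dat$x_mat, 2, function(x) {
  ANOVA_F(data.frame(group = dat$group, x = x))
})

The result is a vector of four p-values, one per simulated dataset; workable, but already clunkier than simply calling the test on a standard data frame.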

17.3 Reusing code to speed up computation

Computational and programming efficiency should usually be a secondary consideration when you are starting to design a simulation study. It is better to produce accurate code, even if it is a bit slow, than to write code that is speedy but hard to follow (or even worse, that produces incorrect results). All that said, there is some glaring redundancy in the two functions used for the ANOVA simulation. Both ANOVA_F and Welch_F start by taking the simulated data and calculating summary statistics for each group, using the following code:

x_bar <- with(sim_data, tapply(x, group, mean))
s_sq <- with(sim_data, tapply(x, group, var))
n <- table(sim_data$group)
g <- length(x_bar)

In the interest of not repeating ourselves, it would be better to pull this code out as a separate function and then re-write the ANOVA_F and Welch_F functions to take the summary statistics as input. Here is a function (using dplyr) that takes simulated data and returns a table of summary statistics:

summarize_data <- function(sim_data) {
  
  res <- sim_data %>% 
    group_by( group ) %>%
    summarise( x_bar = mean( x ),
               s_sq = var( x ),
               n = n() )
  res
}

This simply packages the code from above, putting our results in a nice table (we have also pivoted to using the tidyverse to calculate these quantities):

sim_data <- generate_data(mu = mu, sigma_sq = sigma_sq,
                          sample_size = sample_size)
summarize_data(sim_data)
## # A tibble: 4 × 4
##   group x_bar  s_sq     n
##   <int> <dbl> <dbl> <int>
## 1     1  1.90 1.42      3
## 2     2  1.82 2.10      6
## 3     3  4.40 0.626     2
## 4     4  6.58 0.485     4

Now we can re-write both \(F\)-test functions to use the output of this function:

ANOVA_F_agg <- function(x_bar, s_sq, n) {
  g = length(x_bar)
  df1 <- g - 1
  df2 <- sum(n) - g
  
  msbtw <- sum(n * (x_bar - weighted.mean(x_bar, w = n))^2) / df1
  mswn <- sum((n - 1) * s_sq) / df2
  fstat <- msbtw / mswn
  pval <- pf(fstat, df1, df2, lower.tail = FALSE)
 
  return(pval)
}

summary_stats <- summarize_data(sim_data)
with(summary_stats, ANOVA_F_agg(x_bar = x_bar, s_sq = s_sq, n = n))
## [1] 0.0003129849
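
The Welch test can be refactored the same way. As a reference for the code below (these formulas mirror the implementation; see Welch, 1951, for the derivation), define weights \(w_j = n_j / s_j^2\), \(u = \sum_j w_j\), and the weighted grand mean \(\tilde{x} = \sum_j w_j \bar{x}_j / u\). The test statistic is

\[
W = \frac{\sum_j w_j (\bar{x}_j - \tilde{x})^2 / (g - 1)}{1 + \dfrac{2(g - 2)}{g^2 - 1} G},
\qquad
G = \sum_j \frac{(1 - w_j / u)^2}{n_j - 1},
\]

which is compared to an \(F(g - 1,\, f)\) reference distribution with \(f = (g^2 - 1) / (3G)\):
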
Welch_F_agg <- function(x_bar, s_sq, n) {
  g = length(x_bar)
  w <- n / s_sq
  u <- sum(w)
  x_tilde <- sum(w * x_bar) / u
  msbtw <- sum(w * (x_bar - x_tilde)^2) / (g - 1)

  G <- sum((1 - w / u)^2 / (n - 1))
  denom <- 1 +  G * 2 * (g - 2) / (g^2 - 1)
  W <- msbtw / denom
  f <- (g^2 - 1) / (3 * G)

  pval <- pf(W, df1 = g - 1, df2 = f, lower.tail = FALSE)

  return(pval)
}

These refactored functions should give exactly the same answers as the original ANOVA_F and Welch_F implementations.

We should always test optimized code against something we know is stable, since optimizing is an easy way to introduce bugs. Here we check against R's implementations:

summary_stats <- summarize_data(sim_data)
F_results <- with(summary_stats,
                  ANOVA_F_agg(x_bar = x_bar, s_sq = s_sq, n = n))
aov_results <- oneway.test(x ~ factor(group), data = sim_data, 
                           var.equal = TRUE)
all.equal(aov_results$p.value, F_results)
## [1] TRUE
W_results <- with(summary_stats,
                  Welch_F_agg( x_bar = x_bar,
                               s_sq = s_sq, n = n))
aov_results <- oneway.test(x ~ factor(group),
                           data = sim_data, 
                           var.equal = FALSE)
all.equal(aov_results$p.value, W_results)
## [1] TRUE

Here we are able to check against a known baseline. Checking estimation functions can be more difficult for procedures that are not already implemented in R. For example, the two other procedures examined by Brown and Forsythe, James's test and Brown and Forsythe's \(F^*\) test, are not available in base R. They are, however, available in the user-contributed package onewaytests, found by searching for “Brown-Forsythe” at http://rseek.org/. We could benchmark our calculations against this package, but of course there is some risk that the package itself might not be correct. Another route is to verify your results against numerical examples reported in authoritative papers, on the assumption that there is less risk of an error there. In the original paper proposing the test, Welch (1951) provides a worked numerical example of the procedure. He reports the following summary statistics:

g <- 3
x_bar <- c(27.8, 24.1, 22.2)
s_sq <- c(60.1, 6.3, 15.4)
n <- c(20, 20, 10)

He also reports \(W = 3.35\) and \(f = 22.6\). Replicating the calculations with our Welch_F_agg function:

Welch_F_agg(x_bar = x_bar, s_sq = s_sq, n = n)
## [1] 0.05479049

We get slightly different results! But we know that our function is correct—or at least consistent with oneway.test—so what’s going on? It turns out that there was an error in some of Welch’s intermediate calculations, which can only be spotted because he reported all of his work in the paper.
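
To see how far off the reported values are, we can convert Welch's \(W = 3.35\) and \(f = 22.6\) into the p-value they imply (here \(g = 3\), so the numerator degrees of freedom are 2) and compare it with the value returned by Welch_F_agg() above:

# p-value implied by Welch's reported statistic and denominator df
pf(3.35, df1 = 2, df2 = 22.6, lower.tail = FALSE)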

We then put all these pieces together in a revised one_run() function, as follows:

one_run_fast <- function( mu, sigma_sq, sample_size ) {
  sim_data <- generate_data(mu = mu, sigma_sq = sigma_sq,
                            sample_size = sample_size)
  summary_stats <- summarize_data(sim_data)
  anova_p <- with(summary_stats, 
                  ANOVA_F_agg(x_bar = x_bar, s_sq = s_sq, n = n))
  Welch_p <- with(summary_stats, 
                  Welch_F_agg(x_bar = x_bar, s_sq = s_sq, n = n))
  tibble(ANOVA = anova_p, Welch = Welch_p)
}

one_run_fast( mu = mu, sigma_sq = sigma_sq,
              sample_size = sample_size )
## # A tibble: 1 × 2
##      ANOVA  Welch
##      <dbl>  <dbl>
## 1 0.000982 0.0209

The point of this refactoring is that we now do the group aggregation only once, rather than once per method. We can use microbenchmark to see whether this translates into a speedup:

library(microbenchmark)
timings <- microbenchmark(noagg = one_run(mu = mu, sigma_sq = sigma_sq, 
                                          sample_size = sample_size),
                          agg = one_run_fast(mu = mu, sigma_sq = sigma_sq, 
                                             sample_size = sample_size) )
timings
## Unit: milliseconds
##   expr    min      lq     mean  median      uq     max neval
##  noagg 1.1381 1.19580 1.508178 1.23110 1.27585 18.4807   100
##    agg 2.6466 2.70205 3.001667 2.81065 3.14450  7.1078   100

And the relative change in speed (the mean time of the original version divided by that of the aggregated version) is:

with(summary(timings), round(mean[1] / mean[2], 1))
## [1] 0.5
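
A value below 1 means our "faster" version is actually slower here: on a dataset this small, the overhead of the dplyr-based summarize_data() likely outweighs the savings from aggregating only once. To confirm where the time is going, you can benchmark just the two aggregation steps against each other (a sketch, reusing sim_data, summarize_data(), and microbenchmark from above):

# Compare the original tapply-based aggregation with the dplyr version
agg_timings <- microbenchmark(
  tapply_agg = {
    x_bar <- with(sim_data, tapply(x, group, mean))
    s_sq  <- with(sim_data, tapply(x, group, var))
    n     <- table(sim_data$group)
  },
  dplyr_agg = summarize_data(sim_data)
)
agg_timings

If the dplyr aggregation accounts for most of the runtime, that explains the result above; rewriting summarize_data() with tapply() should recover the speedup without giving up the code reuse.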

To recap, even when it does not pay off in raw speed (as here), there are two advantages to this kind of coding:

  1. Code reuse is generally good: having the same code in multiple places makes your code harder to read and understand. If you see two blocks of code, you might worry that they are only mostly the same, not exactly the same, and waste time trying to spot the differences. With a single, well-named function, you immediately know what a block of code is doing.

  2. Saving the results of intermediate calculations can speed up your computation, since you avoid redoing the same work. This matters most when the repeated calculations are particularly time-intensive.