Chapter 16 Ensuring reproducibility

In the prior section we built a simulation driver. Because this function involves generating random numbers, re-running it with the exact same input parameters will still produce different results:

run_alpha_sim(iterations = 10, n = 50, p = 6, alpha = 0.73, df = 5)
##         criterion         est        MCSE
## 1      alpha bias -0.03322927 0.031758861
## 2      alpha RMSE  0.10090497 0.008743501
## 3 V relative bias  0.48077615 0.114109836
## 4        coverage  0.80000000 0.068920244
run_alpha_sim(iterations = 10, n = 50, p = 6, alpha = 0.73, df = 5)
##         criterion         est        MCSE
## 1      alpha bias 0.004056121 0.015203417
## 2      alpha RMSE 0.045790250 0.002703622
## 3 V relative bias 1.511919446 0.207758767
## 4        coverage 1.000000000 0.068920244

Of course, using a larger number of iterations will give us more precise estimates of the performance criteria. If we want to get the exact same results, however, we have to control the random process.

This is more possible than it sounds: Monte Carlo simulations are random, but computers are not. When we generate “random numbers” they actually come from a chain of mathematical equations that, given a number, will generate the next number in a deterministic sequence. Given that number, it will generate the next, and so on. The numbers we get back are a part of this chain of (very large) numbers that, ideally, cycles through an extremely long list of numbers in a haphazard and random looking fashion.

This is what the seed argument that we have glossed over before is all about. If we set the same seed, we get the same results:

run_alpha_sim(iterations = 10, n = 50, p = 6, alpha = 0.73, df = 5,
              seed = 6)
##         criterion         est       MCSE
## 1      alpha bias -0.02053560 0.02585963
## 2      alpha RMSE  0.08025083 0.01344827
## 3 V relative bias  0.64909209 1.43035647
## 4        coverage  0.90000000 0.06892024
run_alpha_sim(iterations = 10, n = 50, p = 6, alpha = 0.73, df = 5, 
              seed = 6)
##         criterion         est       MCSE
## 1      alpha bias -0.02053560 0.02585963
## 2      alpha RMSE  0.08025083 0.01344827
## 3 V relative bias  0.64909209 1.43035647
## 4        coverage  0.90000000 0.06892024

This is useful because it ensure the full reproducibility of the results. In practice, it is a good idea to always set seed values for your simulations, so that you (or someone else!) can exactly reproduce the results. Let’s look more at how this works.

16.1 Seeds and pseudo-random number generators

In R, we can start a sequence of deterministic but random-seeming numbers by setting a “seed”. This means all the random numbers that come after it will be the same.

Compare this:

rchisq(3, df = 5)
## [1] 1.193539 5.338936 6.294274
rchisq(3, df = 5)
## [1] 0.7189536 6.3697718 8.2913988

To this:

set.seed(20210527)
rchisq(3, df = 5)
## [1] 1.643698 6.229613 3.249164
set.seed(20210527)
rchisq(3, df = 5)
## [1] 1.643698 6.229613 3.249164

By setting the seed the second time we reset our sequence of random numbers. Similarly, to ensure reproducibility in our simulation, we will add an option to set the seed value of the random number generator.

This seed is low-level, meaning if we are generating numbers via rnorm or rexp or rchisq, it doesn’t matter. Each time we ask for a random number from the low-level pseudo-random number generator, it gives us the next number back. These other functions, like rnorm(), etc., all call this low-level generator and than transform the number to be of the correct distribution.

16.2 Including seed in our simulation driver

The easy way to ensure reproducability is to pass a seed as a parameter. If we leave it NULL, we ignore it and just continue generating random numbers from wherever we are in the system. If we specify a seed, however, we set it at the beginning of the scenario, and then all the numbers that follow will be in sequence. Our code will typically look like this:

run_alpha_sim <- function(iterations, n, p, alpha, df, coverage = 0.95, seed = NULL) {
  
  if (!is.null(seed)) set.seed(seed)

  results <- 
    replicate(n = iterations, {
      dat <- r_mvt_items(n = n, p = p, alpha = alpha, df = df)
      estimate_alpha(dat, coverage = coverage)
    }, simplify = FALSE) %>%
    bind_rows()
  
  alpha_performance(results, alpha = alpha, coverage = coverage)
}

Using our seed, we get identical results, as we saw in the intro, above.

16.3 Reasons for setting the seed

Reproducibility allows us to easily check if we are running the same code that generate the results in some report. It also helps with debugging. For example, say we had an error that showed up one in a thousand, causing our simulation to crash sometimes.

If we set a seed, and see that it crashes, we can then go try and catch the error and repair our code, and then rerun the simulation. If it runs clean, we know we got the error. If we had not set the seed, we would not know if we were just getting (un) lucky, and avoiding the error by chance.