Chapter 8 Running the Simulation Process

In the prior two chapters we saw how to write functions that generate data according to a particular model and functions that implement data-analysis procedures on simulated data. The next step in a simulation involves putting these two pieces together, running the DGP function and the data-analysis function repeatedly to obtain results (in the form of point estimates, standard errors, confidence intervals, p-values, or other quantities) from many replications of the whole process.

As with most things R-related, there are many different techniques that can be used to repeat a set of calculations over and over. In this chapter, we demonstrate several techniques for doing so. We then explain how to ensure reproducibility of simulation results by setting the seed used by R’s random number generator.

8.1 Repeating oneself

Suppose that we want to simulate Pearson’s correlation coefficient calculated based on a sample from the bivariate Poisson function. We saw a DGP function for the bivariate Poisson in Section 6.3, and an estimation function in Section 7.1. To produce a simulated correlation coefficient, we need to run these two functions in turn:

dat <- r_bivariate_Poisson( N = 30, rho = 0.4, mu1 = 8, mu2 = 14 )
r_and_z(dat)
## # A tibble: 1 × 4
##       r     z CI_lo CI_hi
##   <dbl> <dbl> <dbl> <dbl>
## 1 0.451 0.485 0.108 0.698

To execute a simulation with these components, we need to repeat this set of calculations over and over. R has many different functions for doing exactly this. As one of many alternatives, the simhelpers package includes a function called repeat_and_stack(), which can be used to evaluate an arbitrary expression many times over. We can use it to generate five replications of our correlation coefficient:

library(simhelpers)
repeat_and_stack(
  n = 5, 
  {
    dat <- r_bivariate_Poisson( N = 30, rho = 0.4, mu1 = 8, mu2 = 14 )
    r_and_z(dat)
  }, 
  id = "rep", 
  stack = TRUE
)
##           r         z       CI_lo     CI_hi rep
## 1 0.2010788 0.2038566 -0.17162321 0.5234295   1
## 2 0.1744016 0.1762028 -0.19832890 0.5030627   2
## 3 0.3409210 0.3551343 -0.02205741 0.6244884   3
## 4 0.1529235 0.1541326 -0.21943517 0.4863955   4
## 5 0.1773770 0.1792732 -0.19537754 0.5053524   5

The first argument specifies the number of times to repeat the calculation. The second argument is an R expression that will be evaluated. The expression is wrapped in curly braces ({}) because it involves more than a single line of code. Including the option id = "rep" returns a dataset that includes a variable called rep to identify each replication of the process. Setting the option stack = TRUE will stack up the output of each expression into a single tibble, which will facilitate later calculations on the results. Setting this option is not necessary because it is TRUE by default; setting stack = FALSE will return the results in a list rather than a tibble (try this for yourself to see!).

There are many other functions that work very much like repeat_and_stack(), including the base-R function replicate() and the now-deprecated function rerun() from {purrr}. The functions in the map() family from {purrr} can also be used to do the same thing as repeat_and_stack(). See Appendix A.1 for more discussion of these alternatives.

8.2 One run at a time

A slightly different technique for running multiple replications of a process is to first write a function that executes a single run of the simulation, and then repeatedly evaluate that single function. For instance, here is a function that stitches together the two steps in the bivariate-Poisson correlation simulation:

one_bivariate_Poisson_r <- function(N, rho, mu1, mu2) {
  dat <- r_bivariate_Poisson( N = N, rho = rho, mu1 = mu1, mu2 = mu2 )
  res <- r_and_z(dat)
  return(res)
}

Calling the function produces a nicely formatted set of results:

one_bivariate_Poisson_r(N = 30, rho = 0.4, mu1 = 8, mu2 = 14)
## # A tibble: 1 × 4
##       r     z CI_lo CI_hi
##   <dbl> <dbl> <dbl> <dbl>
## 1 0.479 0.521 0.143 0.716

We can then evaluate the function over and over using repeat_and_stack():

repeat_and_stack(
  n = 5, 
  one_bivariate_Poisson_r( N = 30, rho = 0.4, mu1 = 8, mu2 = 14 ), 
  id = "rep"
)
##           r         z        CI_lo     CI_hi rep
## 1 0.6389444 0.7563878  0.362006040 0.8122420   1
## 2 0.4365944 0.4680154  0.090571305 0.6885591   2
## 3 0.3637603 0.3812129  0.004017584 0.6401383   3
## 4 0.4106306 0.4363695  0.059105323 0.6715521   4
## 5 0.1547204 0.1559730 -0.217682692 0.4877992   5

This technique of wrapping the data-generating function and estimation function inside of another function might strike you as a bit cumbersome because the wrapper is only two lines of code and writing it requires repeating many of the function argument names when calling the data-generating function (N = N, rho = rho, etc.). However, the wrapper technique can be useful for more complicated simulations, such as those that involve comparison of multiple estimation methods.

Consider the cluster-randomized experiment case study presented in Section 6.6 and 7.2. In this simulation, we are interested in comparing the performance of three different estimation methods: a multi-level model, a linear regression with clustered standard errors, and a linear regression on the data aggregated to the school level. A single replication of the simulation entails generating a dataset and then applying three different estimation functions to it. Here is a function that takes our simulation parameters and runs a single trial of the full process:

one_run <- function( 
  n_bar = 30, J = 20, gamma_1 = 0.3, gamma_2 = 0.5,
  sigma2_u = 0.20, sigma2_e = 0.80, alpha = 0.75 
) {
  
  dat <- gen_cluster_RCT(
    n_bar = n_bar, J = J, gamma_1 = gamma_1, gamma_2 = gamma_2,
    sigma2_u = sigma2_u, sigma2_e = sigma2_e, alpha = alpha 
  )
  MLM <- analysis_MLM( dat )
  LR <- analysis_OLS( dat )
  Agg <- analysis_agg( dat )
  
  bind_rows( MLM = MLM, LR = LR, Agg = Agg, .id = "method" )
}

We have added a bunch of defaults to our function, so that we can run it without having to remember all the various input parameters. When we call the function, we get a nicely structured table of results:

one_run( n_bar = 30, J = 20, alpha=0.5 )
## # A tibble: 3 × 4
##   method ATE_hat SE_hat p_value
##   <chr>    <dbl>  <dbl>   <dbl>
## 1 MLM      0.265  0.210  0.224 
## 2 LR       0.369  0.210  0.0967
## 3 Agg      0.247  0.211  0.258

We organize the output in a tibble to make it easier to do subsequent data processing and analysis. The results for each method are organized in separate lines. For each method, we record the impact estimate, its estimated standard error, and a nominal \(p\)-value. Note how the bind_rows() method can take naming on the fly, and give us a column of method, which will be very useful for keeping track of which results come from which estimation.

Once we have a function to execute a single run, we can produce multiple results using repeat_and_stack():

R <- 1000
ATE <- 0.30
runs <- repeat_and_stack(R, 
                         one_run( n_bar = 30, J=20, gamma_1 = ATE ),
                         id = "runID") 
saveRDS( runs, file = "results/cluster_RCT_simulation.rds" )

Setting id = "runID" lets us keep track of which iteration number produced which result. Once our simulation is complete, we save our results to a file so that we can avoid having to re-run the full simulation if we want to explore the results in some future work session.

We now have results for each of our estimation methods applied to each of 1000 generated datasets. The next step is to evaluate how well the estimators did. For example, we will want to examine questions about bias, precision, and accuracy of the three point estimators. In Chapter 9, we will look systematically at ways to quantify the performance of estimation methods.

8.2.1 Reparameterizing

In Section 6.6.4, we discussed how to index the DGP of the cluster-randomized experiment using an intra-class correlation (ICC) instead of using two separate variance components. This type of re-parameterization can be handled as part of writing a wrapper function for executing the DGP and estimation procedures. Here is a revised version of one_run(), which also renames some of the more obscure model parameters using terms that are easier to interpret:

one_run <- function( 
  n_bar = 30, J = 20, ATE = 0.3, size_coef = 0.5,
  ICC = 0.4, alpha = 0.75 
) {
  stopifnot( ICC >= 0 && ICC < 1 )

  dat <- gen_cluster_RCT( 
    n_bar = n_bar, J=J, gamma_1 = ATE, gamma_2 = size_coef,
    sigma2_u = ICC, sigma2_e = 1-ICC, alpha = alpha 
  )
  
  MLM = analysis_MLM( dat )
  LR = analysis_OLS( dat )
  Agg = analysis_agg( dat )
  
  bind_rows( MLM = MLM, LR = LR, Agg = Agg, .id = "method" )
}

Note the stopifnot: it is wise to ensure our parameter transforms are all reasonable, so we do not get unexplained errors or strange results later on. It is best if your code fails as soon as possible! Otherwise debugging can be quite hard.

Controlling how we use the foundational elements such as our data generating code is a key technique for making the higher level simulations sensible and more easily interpretable. In the revised one_run() function, we transform the ICC input parameter into the parameters used by gen_cluster_RCT() so as to maintain the effect size interpretation of our simulation. We have not modified gen_cluster_RCT() at all: instead, we specify the parameters of the DGP function in terms of the parameters we want to directly control in the simulation. Here we have put our entire simulation into effect size units, and are now providing “knobs” to the simulation that are directly interpretable.

8.3 Bundling simulations with simhelpers

The techniques that we have demonstrated for repeating a set of calculations each involve a very specific coding pattern, which will often have the same structure even if the details of the data-generating model or the names of the input parameters are very different from the examples we have presented. The simhelpers package provides a function bundle_sim() that abstracts this common pattern and allows you to automatically stitch together (or “bundle”) a DGP function and an estimation function, so that they can be run once or multiple times. Thus, bundle_sim() provides a convenient alternative to writing your own one_run() function for each simulation, thereby saving a bit of typing (and avoiding an opportunity for bugs to creep into your code).

bundle_sim() takes a DGP function and an estimation function as inputs and gives us back a new function that will run a simulation using whatever parameters we give it. Here is a basic example, which creates a function for simulating Pearson correlation coefficients with a bivariate Poisson distribution:

sim_r_Poisson <- bundle_sim(f_generate = r_bivariate_Poisson, 
                            f_analyze = r_and_z, 
                            id = "rep")

If we specify the optional argument id = "rep", the function will include a variable called rep with a unique identifier for each replication of the simulation process. We can use the newly created function like so:

sim_r_Poisson( 4, N = 30, rho = 0.4, mu1 = 8, mu2 = 14)
##           r         z      CI_lo     CI_hi rep
## 1 0.5383113 0.6017748 0.22087860 0.7526197   1
## 2 0.3820938 0.4025091 0.02530846 0.6525370   2
## 3 0.5426774 0.6079428 0.22673760 0.7552815   3
## 4 0.3948291 0.4175082 0.04029110 0.6610653   4

To create this simulation function, bundle_sim() examined r_bivariate_Poisson(), figured out what its input arguments are, and made sure that the simulation function includes the same input arguments. You can see the full set of arguments for sim_r_Poisson() by evaluating it with args():

args(sim_r_Poisson)
## function (reps, N, rho, mu1, mu2, seed = NA_integer_) 
## NULL

In addition to the expected arguments from r_bivariate_Poisson(), the function has some additional inputs. Its first argument is reps, which controls the number of times that the simulation process will be evaluated. Its last argument is seed, which we will discuss in Section 8.4.

The bundle_sim() function requires specifying a DGP function and a single estimation function, with the data as the first argument. For our cluster-randomized experiment example, we would then need to use our estimate_Tx_Fx() function that organizes all of the estimators (see Chapter 7.2). We then use bundle_sim() to create a function for running an entire simulation:

sim_cluster_RCT <- bundle_sim( gen_cluster_RCT, estimate_Tx_Fx, id = "runID" )

We can call the newly created function like so:

sim_cluster_RCT( 2, 
                 n_bar = 30, J = 20, gamma_1 = ATE, 
                 sigma2_u = 0.3, sigma2_e = 0.7 )
##   estimator      ATE_hat    SE_hat    p_value runID
## 1       MLM  0.599364108 0.2242121 0.01550328     1
## 2       OLS  0.657654196 0.2558709 0.02643194     1
## 3       agg  0.594736991 0.2121123 0.01173875     1
## 4       MLM  0.042192269 0.3760692 0.91194164     2
## 5       OLS -0.047456707 0.3212931 0.88537851     2
## 6       agg  0.009295049 0.3720199 0.98034156     2

Again, bundle_sim() produces a function with input names that exactly match the inputs of the DGP function that we give it. It is not possible to re-parameterize or change argument names, as we did with one_run() in Section 8.2.1. See Exercise 8.5.3 for further discussion of this limitation.

To use the simulation function in practice, we call it by specifying the number of replications desired (which we have stored in R) and any relevant input parameters.

runs <- sim_cluster_RCT( R, 
                         n_bar = 30, J = 20, gamma_1 = ATE, 
                         sigma2_u = 0.3, sigma2_e = 0.7 )
saveRDS( runs, file = "results/cluster_RCT_simulation.rds" )

The bundle_sim() function is just a convenient way to create a function that pieces together the steps in the simulation process, which is especially useful when the component functions include many input parameters. The function has several further features, which we will demonstrate in subsequent chapters.

8.4 Seeds and pseudo-random number generators

In prior chapters, we have used built-in functions to generate random numbers and also written our own data-generating functions that produce artificial data following a specific random process. With either type of function, re-running it with the exact same input parameters will produce different results. For instance, running the rchisq function with the same set of inputs will produce two different sequences of \(\chi^2\) random variables:

c1 <- rchisq(4, df = 3)
c2 <- rchisq(4, df = 3)
rbind(c1, c2)
##         [,1]      [,2]     [,3]      [,4]
## c1 0.1346756 0.5870003 1.555573 0.6332007
## c2 2.1417228 0.5042039 1.239760 4.2364007

If you run the same code as above, you will get different results from these. Likewise, running the bivariate Poisson function from Section 6.3 or the sim_r_Poisson() function from Section 8.3 multiple times will produce different datasets:

dat_A <- r_bivariate_Poisson(20, rho = 0.5, mu1 = 4, mu2 = 7)
dat_B <- r_bivariate_Poisson(20, rho = 0.5, mu1 = 4, mu2 = 7)
identical(dat_A, dat_B)
## [1] FALSE
sim_A <- sim_r_Poisson( 10, N = 30, rho = 0.4, mu1 = 8, mu2 = 14)
sim_B <- sim_r_Poisson( 10, N = 30, rho = 0.4, mu1 = 8, mu2 = 14)
identical(sim_A, sim_B)
## [1] FALSE

Of course, this is the intended behavior of these functions, but it has an important consequence that needs some care and attention. Using functions like rchisq(), r_bivariate_Poisson(), or r_bivariate_Poisson() in a simulation study means that the results will not be fully reproducible.

When using DGP functions for simulations, it is useful to be able to exactly control the process of generating random numbers. This is much more feasible than it sounds: Monte Carlo simulations are random, at least in theory, but computers are deterministic. When we use R to generate what we have been referring to as “random numbers,” the functions produce what are actually pseudo-random numbers. Pseudo-random numbers are generated from chains of mathematical equations designed to produce sequences of numbers that appear random, but actually follow a deterministic sequence. Each subsequent random number is a calculated by starting from the previously generated value (i.e., the current state of the random number generator), applying a complicated function, and storing the result (i.e., updating the state). The numbers returned by the generator form a chain that, ideally, cycles through an extremely long list of values in a way that looks stochastic and unpredictable.

The state of the pseudo-random number generator is shared across different functions that produce pseudo-random numbers, so it does not matter if we are generating numbers with rnorm() or rchisq() or r_bivariate_Poisson(). Each time we ask for a random number from the generator, its state is updated. Functions like rnorm() and rchisq() all call the low-level generator and then transform the result to be of the correct distribution.

Because the generator is actually deterministic, we can control its output by specify a starting value or initial state, In R, the state of the random number generator can be controlled by setting what its known as the seed. The set.seed() function allows us to specify a seed value, so that we can exactly reproduce a calculation. For example,

set.seed(6)
c1 <- rchisq(4, df = 3)
set.seed(6)
c2 <- rchisq(4, df = 3)
rbind(c1, c2)
##        [,1]    [,2]     [,3]     [,4]
## c1 2.575556 0.93847 8.062264 3.685932
## c2 2.575556 0.93847 8.062264 3.685932

Similarly, we can set the seed and run a series of calculations involving multiple functions that make use of the random number generator:

# First time
set.seed(6)
c1 <- rchisq(4, df = 3)
dat_A <- r_bivariate_Poisson(20, rho = 0.5, mu1 = 4, mu2 = 7)

# Exactly reproduce the calculations
set.seed(6)
c2 <- rchisq(4, df = 3)
dat_B <- r_bivariate_Poisson(20, rho = 0.5, mu1 = 4, mu2 = 7)

bind_rows(A = r_and_z(dat_A), B = r_and_z(dat_B), .id = "Rep")
## # A tibble: 2 × 5
##   Rep       r     z  CI_lo CI_hi
##   <chr> <dbl> <dbl>  <dbl> <dbl>
## 1 A     0.299 0.308 -0.165 0.655
## 2 B     0.299 0.308 -0.165 0.655

The bundle_sim() function demonstrated in Section 8.3 creates a function for repeating the process of generating and analyzing data. By default, the function it produces includes an argument seed, which allows the user to set a seed value before repeatedly evaluating the DGP function and estimation function. By default, the seed argument is NULL and so the current seed is not modified. Specifying an integer-valued seed will make the results exactly reproducible:

sim_A <- sim_r_Poisson( 10, N = 30, rho = 0.4, mu1 = 8, mu2 = 14,
                        seed = 25)
sim_B <- sim_r_Poisson( 10, N = 30, rho = 0.4, mu1 = 8, mu2 = 14,
                        seed = 25)
identical(sim_A, sim_B)
## [1] TRUE

In practice, it is a good idea to always set seed values for your simulations, so that you (or someone else) can exactly reproduce the results. Attending to reproducibility allows us to easily check if we are running the same code that generated a set of results. For instance, try running the previous blocks of code on your machine; if you set the seed to the same value as we did, you should get identical output.

Setting seeds is also very helpful for debugging. Suppose we had an error that showed up in one of a thousand replications, causing the simulation to crash sometimes. If we set a seed and find that the code crashes, we can debug and then rerun the simulation. If it now runs without error, we know we fixed the problem. If we had not set the seed, we would not know if we were just getting (un)lucky, and avoiding the error by chance.

8.5 Exercises

8.5.1 Welch simulations

In the prior chapter’s exercises, you made a new BF_F function for the Welch simulation. Now incorporate the BF_F function into the one_run() function, and use your revised function to generate simulation results for this additional estimator.

8.5.2 Compare sampling distributions of Pearson’s correlation coefficients

  1. Use sim_r_Poisson() to generate 5000 replications of the Fisher-z-transformed correlation coefficient under a bivariate Poisson distribution with \(\rho = 0.7\) for a sample of \(N = 40\) observations. You pick the remaining parameters.

  2. Create a bundled simulation function by combining the data-generating function from Exercise 6.9.5 or 6.9.6 with the r_and_z() estimation function. Run the function to generate 5000 replications of the Fisher-z-transformed correlation coefficient under a bivariate negative binomial distribution, with the same parameter values as above.

  3. Create a plot that shows both sampling distributions, making it easy to compare the distributions.

8.5.3 Reparameterization, redux

In Section 8.2.1, we illustrated how the one_run() simulation wrapper function could be tweaked in order to reparameterize the model in terms of a single intra-class correlation rather than two variance parameters. But what if you want to avoid having to write your own simulation wrapper and instead use bundle_sim()? Revise gen_cluster_RCT() to accomplish the reparameterization, then bundle it with analyze_data().

8.5.4 Fancy clustered RCT simulations

In Exercise 6.9.10, your task was to write a data-generating function for a cluster-randomized trial that includes a baseline covariate. Then in Exercise 7.5.4, your task was to create an estimation function that could be applied to data from such a trial.

  1. Now, using your work from these exercises, create a bundled simulation function that combines the data-generating function and the estimation function. Ensure that the estimation function returns average treatment effect estimates both with and without controlling for the baseline covariate.

  2. Use the resulting function to simulate results from 1000 replications of a cluster-randomized trial, each analyzed two ways (with and without covariate adjustment).

  3. Create a plot that shows the sampling distribution of each average treatment effect estimator in a way that makes it easy to compare the distributions.