Chapter 12 Designing the multifactor simulation experiment

So far, we’ve created code that gives us results for a single combination of parameter values. In practice, simulation studies typically examine a range of values, varying the true parameter values and perhaps also the sample sizes, to explore a variety of scenarios. We either want reassurance that our findings are general, or we want to understand which aspects of the context drive the results we found. A single simulation scenario can answer neither question; it is only by looking across a range of settings that we can fully understand trade-offs, general rules, and limits. Let’s now look at the remaining piece of the simulation puzzle: the study’s experimental design.

Simulation studies often take the form of full factorial designed experiments. In full factorials, each factor (a particular knob a researcher might turn to change the simulation conditions) is varied across multiple levels, and the design includes every possible combination of the levels of every factor. One way to represent such a design is as a list of factors and levels.

For example, for the Cronbach alpha simulation, we might want to vary:

  • the sample size, with values of 50 or 100;
  • the number of items, with values of 4 or 8;
  • the true value of alpha, with values ranging from 0.1 to 0.9; and
  • the degrees of freedom of the multivariate \(t\) distribution, with values of 5, 10, 20, or 100.

We first express the simulation parameters as a list of factors, each factor having a list of values to explore. We will then run a simulation for every possible combination of these values. We call this a \(2 \times 2 \times 9 \times 4\) factorial design, where each element is the number of options for that factor. Here is code that generates all the scenarios we will run given the above design, storing these combinations in a data frame, params, that represents the full experimental design:

design_factors <- list(
  n = c(50, 100),
  p = c(4, 8),
  alpha = seq(0.1, 0.9, 0.1),
  df = c(5, 10, 20, 100)
)

params <- cross_df(design_factors)
params
## # A tibble: 144 × 4
##        n     p alpha    df
##    <dbl> <dbl> <dbl> <dbl>
##  1    50     4   0.1     5
##  2   100     4   0.1     5
##  3    50     8   0.1     5
##  4   100     8   0.1     5
##  5    50     4   0.2     5
##  6   100     4   0.2     5
##  7    50     8   0.2     5
##  8   100     8   0.2     5
##  9    50     4   0.3     5
## 10   100     4   0.3     5
## # ℹ 134 more rows

See what we get? The parameters we would pass to run_alpha_sim() correspond to the columns of our dataset. We have a total of \(2 \times 2 \times 9 \times 4 = 144\) rows, each row corresponding to a simulation scenario to explore. With multifactor experiments, it is easy to end up running a lot of experiments!
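To get a sense of scale: if we run, say, 500 replications of each scenario (the number we will use below), this design implies generating and analyzing \(144 \times 500 = 72000\) datasets in total:

nrow( params ) * 500
## [1] 72000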

12.1 Choosing parameter combinations

How do we go about choosing parameter values to examine? Choosing which parameters to use is a central part of good simulation design because the primary limitation of simulation studies is always their generalizability. On the one hand, it’s difficult to extrapolate findings from a simulation study beyond the set of simulation conditions that were examined. On the other hand, it’s often difficult or impossible to examine the full space of all possible parameter values, except for very simple problems. Even in the Cronbach alpha simulation, we’ve got four factors, and the last three could each take an infinite number of different levels, in theory. How can we come up with a defensible set of levels to examine?

The choice of simulation conditions needs to be made in the context of the problem or model that you’re studying, so it’s a bit difficult to offer valid, decontextualized advice. We can provide a few observations all the same:

  1. For research simulations, it is often important to be able to relate your findings to previous research. This suggests selecting parameter levels that make such comparisons possible, such as by looking at sample sizes similar to those examined in previous studies. That said, previous simulation studies are not always well designed (actually, there are a lot of really crummy ones out there!), and so prior work should not generally be your sole guide or justification.

  2. Generally, it is better to err on the side of being more comprehensive. You learn more by looking at a broader range of conditions, and you can always boil down your results to a more limited set of conditions for purposes of presentation.

  3. It is also important to explore breakdown points (e.g., what sample size is too small for a method to work?) rather than focusing only on conditions where a method might be expected to work well. Pushing the boundaries and identifying conditions where estimation methods break will help you to provide better guidance for how the methods should be used in practice.

An important point regarding (2) is that you can be more comprehensive and then have fewer replications per scenario. For example, say you were planning on doing 1000 simulations per scenario, but then you realize there is some new factor that you don’t think matters, but that you believe other researchers will worry about. You could add in that factor, say with four levels, and then do 250 simulations per scenario. The total work remains the same.

When analyzing the final simulation, you can first verify that you do not see trends along this new factor, and then marginalize it out in your summaries of results. Marginalizing out a factor (i.e., averaging your performance metrics across its levels) is a powerful technique for making claims about how your methods work on average across a range of scenarios, rather than for a specific scenario.
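As a concrete sketch, suppose we have a table of performance results with one row per scenario, columns for each design factor, and a measured bias. Here perf is a hypothetical name for such a table and extra stands in for the added nuisance factor; marginalizing it out is just a grouped average:

library(dplyr)

# `perf` (hypothetical) has one row per scenario, with columns n, p, alpha,
# extra (the nuisance factor), and a performance measure `bias`.
perf %>%
  group_by( n, p, alpha ) %>%        # group by the factors we care about, dropping `extra`
  summarise( bias = mean( bias ),    # average bias across the levels of `extra`
             .groups = "drop" )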

Overall, you generally want to vary parameters that you believe matter, or that you think other people will believe matter. The first is so you can learn. The second is to build your case.

Once you have identified your parameters, you then have to decide on the levels of each parameter to include in the simulation. There are three strategies you might take:

  1. Vary a parameter over its entire range (or nearly so).
  2. Choose parameter levels to represent a realistic practical range.
    • Empirical justification based on systematic reviews of applications
    • Or at least informal impressions of what’s realistic in practice
  3. Choose parameters to emulate one important application.

Of these, (1) is the most general, but also the most computationally intensive. (2) focuses attention, ideally, on what is of practical relevance to practitioners. (3) is usually coupled with a subsequent applied data analysis, in which case the simulation is often used to enrich that analysis. For example, if the simulation shows the methods work for data of the same form as the target application, people may be more willing to believe the application’s findings.

Regardless of how you select your primary parameters, you should also vary nuisance parameters (at least a little) to test the sensitivity of your results. While simulations will (generally) never be fully generalizable, you can at least design them to avoid the obvious gaps a critic might seize on to dismiss your findings.

To recap, as you think about your parameter selection, keep the following design principles and acknowledgements in mind:

  • The primary limitation of simulation studies is generalizability.
  • Choose conditions that allow you to relate findings to previous work.
  • Err towards being comprehensive.
    • The goal should be to build an understanding of the major moving parts.
    • Presentation of results can always be tailored to illustrate trends.
  • Explore breakdown points (e.g., what sample size is too small for applying a given method?).

And fully expect to add and subtract from your set of parameters as you get your initial simulation results! No one ever runs just a single simulation.

12.1.1 Choosing parameters for the Clustered RCT

Extending our case study presented in Section 6.3 to a multifactor simulation, let’s think about how to design our full experiment.

So far, we have only investigated a single scenario at a time, although our modular approach does make exploring a range of scenarios by re-calling our simulation function relatively straightforward. But how do our findings generalize? When are the different methods differently appropriate? To answer this, we need to extend to a multifactor simulation to systematically explore trends across contexts for our three estimators. We begin by identifying some questions we might have, given our preliminary results.

Regarding bias, in our initial simulation we noticed that Linear Regression estimates a person-weighted quantity, and so would be considered biased for the site-average ATE. We might next ask: how much does the bias change if we change the relationship between site size and impact?

For precision, we also saw that Linear Regression has a higher standard error. But is this a general finding? When does this occur? Are there contexts where linear regression will do better than the others? Originally we thought aggregation would lose information because small sites get the same weight as big sites, despite being more imprecisely estimated. Were we wrong? Or perhaps if site size were even more variable, Agg would do worse and worse.

Finally, the estimated SEs all appeared to be good, although they were rather variable, relative to the true SE. We might then ask, is this always the case? Will the estimated SEs fall apart (e.g., be way too large or way too small, in general) in different contexts?

To answer these questions we need to more systematically explore the space of models. But we have a lot of knobs to turn. In our simulation, we can generate fake cluster randomized data with the following features:

  • The treatment impact of the site can vary, and vary with the site size

  • We can have sites of different sizes if we want

  • We can also vary:

    • the site intercept variance,
    • the residual variance,
    • the treatment impact,
    • the site size,
    • the number of sites, …

We cannot easily vary all of these. We instead reflect on our research questions, speculate as to what is likely to matter, and then consider varying the following:

  • Average site size: Does the number of students/site matter?
  • Number of sites: Do cluster-robust SEs work with fewer sites?
  • Variation in site size: Do varying site sizes cause bias or break things?
  • Correlation of site size and site impact: Will correlation cause bias?
  • Cross-site variation: Does the amount of site variation matter?

When designing the final factors, it is important to ensure those factors are isolated, in that changing one of them does not change a host of other things that might impact performance. For example, in our case, if we simply added more cross-site variation by directly increasing the random effects for the clusters, the total variation would also increase. If we then saw methods deteriorate, we would have a confound: is it the cross-site variation causing the problem, or the total variation? We therefore want to vary site variation while controlling total variation; this is why we use the ICC knob discussed in the section on the data-generating process.
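A standard way to get this kind of control (sketched below with illustrative variable names; the actual data-generating function is the one discussed earlier) is to parameterize the variances through the intraclass correlation, \(ICC = \sigma^2_{site} / (\sigma^2_{site} + \sigma^2_{resid})\), while fixing the total variance at 1:

# Fix total variance at 1 and let the ICC determine how it is split
# between cross-site and within-site variation.
sigma2_site  <- ICC        # cross-site (random intercept) variance
sigma2_resid <- 1 - ICC    # residual variance, so the total stays at 1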

We might thus end up with the following for our factors:

crt_design_factors <- list(
  n_bar = c( 20, 80, 320 ),
  J = c( 5, 20, 80 ),
  ATE = c( 0.2 ),
  size_coef = c( 0, 0.2 ),
  ICC = c( 0, 0.2, 0.4, 0.6, 0.8 ),
  alpha = c( 0, 0.5, 0.8 )
)
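This is a \(3 \times 3 \times 1 \times 2 \times 5 \times 3\) design. As a quick check of how many scenarios it implies, we can multiply the number of levels of each factor:

prod( lengths( crt_design_factors ) )
## [1] 270

(With three estimators evaluated in each scenario, this is where the 810 rows of the summary table in Section 12.4 come from.)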

12.2 Using pmap to run multifactor simulations

To run simulations across all of our factor combinations, we are going to use a very useful method in the purrr package called pmap(). pmap() marches down a set of lists, running a function on each \(p\)-tuple of elements, taking the \(i^{th}\) element from each list for iteration \(i\), and passing them as parameters to the specified function. pmap() then returns the results of this sequence of function calls as a list of results.

Here is a small illustration:

my_function <- function( a, b, theta, scale ) {
    scale * (a + theta*(b-a))
}

args = list( a = 1:3, 
             b = 5:7, 
             theta = c(0.2, 0.3, 0.7) )
purrr::pmap_dbl(  args, my_function, scale = 10 )
## [1] 18 32 58

One important note: the names of the lists being iterated over must correspond exactly to the argument names of the called function. Extra parameters can be passed after the function name; these will be held constant and passed to each function call. See how scale is the same for all calls.

As we see above, pmap() has variants such as _dbl or _df just like the map() and map2() methods. These variants will automatically stack or convert the list of things returned into a tidier collection (for _dbl it will convert to a vector of numbers, for _df it will stack the results to make a large dataframe, assuming each thing returned is a little dataframe).
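For instance, here is a small, hypothetical illustration of the _df variant: each call returns a one-row data frame, and pmap_df() stacks these into a single data frame.

purrr::pmap_df( args, function( a, b, theta ) {
  data.frame( a = a, b = b, mid = a + theta * (b - a) )
} )
# Result: a 3-row data frame with columns a, b, and mid.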

So far, this is great, but it does not quite look like what we want: our factors are stored as a dataframe, not a set of lists. This is where R gets interesting: in R, a dataframe is stored as a list of columns (vectors or lists, each of exactly the same length). This works beautifully with pmap(). Witness:

args[[2]]
## [1] 5 6 7
a_df = as.data.frame(args)
a_df
##   a b theta
## 1 1 5   0.2
## 2 2 6   0.3
## 3 3 7   0.7
a_df[[2]]
## [1] 5 6 7
purrr::pmap_dbl( a_df, my_function, scale = 10)
## [1] 18 32 58

We can pass a_df to pmap(), which treats it as a list of columns and therefore does exactly what it did before.

All of this means pmap() can run a specified function on each row of a dataset. Continuing the Cronbach Alpha simulation from above, we would have the following:

params$iterations <- 500
sim_results <-  params %>%
  mutate(res = pmap(., run_alpha_sim ) )

We add a column to params to record the desired 500 replications per condition. The above code calls our run_alpha_sim() method for each row of our list of scenarios we want to explore. Even better, we are storing the results as a new variable in the same dataset.

sim_results
## # A tibble: 144 × 6
##        n     p alpha    df iterations res  
##    <dbl> <dbl> <dbl> <dbl>      <dbl> <lgl>
##  1    50     4   0.1     5        500 NA   
##  2   100     4   0.1     5        500 NA   
##  3    50     8   0.1     5        500 NA   
##  4   100     8   0.1     5        500 NA   
##  5    50     4   0.2     5        500 NA   
##  6   100     4   0.2     5        500 NA   
##  7    50     8   0.2     5        500 NA   
##  8   100     8   0.2     5        500 NA   
##  9    50     4   0.3     5        500 NA   
## 10   100     4   0.3     5        500 NA   
## # ℹ 134 more rows

We are creating a list-column, where each element in our list column is the little summary of our simulation results for that scenario. Here is the third scenario, for example:

sim_results$res[[3]]
## [1] NA

We finally use unnest() to expand the res variable, replicating the values of the main variables once for each row in the nested dataset:

library(tidyr)
sim_results <- unnest(sim_results, cols = res) %>%
  dplyr::select( -iterations )
sim_results
## # A tibble: 144 × 5
##        n     p alpha    df res  
##    <dbl> <dbl> <dbl> <dbl> <lgl>
##  1    50     4   0.1     5 NA   
##  2   100     4   0.1     5 NA   
##  3    50     8   0.1     5 NA   
##  4   100     8   0.1     5 NA   
##  5    50     4   0.2     5 NA   
##  6   100     4   0.2     5 NA   
##  7    50     8   0.2     5 NA   
##  8   100     8   0.2     5 NA   
##  9    50     4   0.3     5 NA   
## 10   100     4   0.3     5 NA   
## # ℹ 134 more rows

We can put all of this together in a tidy workflow as follows:

sim_results <- 
  params %>%
  mutate(res = pmap(., .f = run_alpha_sim)) %>%
  unnest(cols = res)

If we want to use parallel processing (more on this later), we can instead use the simhelpers package; the following code is also auto-generated by the create_skeleton() method:

library(future)
library(simhelpers)

plan(multisession) # choose an appropriate plan from the future package
evaluate_by_row(params, run_alpha_sim)

We finally save our results using tidyverse’s write_csv(); see the “R for Data Science” textbook, Section 11.5. We can ensure we have a directory to save into by making one via dir.create() (see Section 19 for more on files):

dir.create("results", showWarnings = FALSE )
write_csv( sim_results, file = "results/cronbach_results.csv" )

12.3 Ways of repeating oneself

We have three core elements in our simulation:

  • Generate data
  • Analyze data
  • Assess performance

In arranging these elements, we have a choice: do we compute performance measures for each simulation scenario as we go (inside) vs. computing after we get all of our individual results (outside)?

INSIDE (aggregate as you simulate): In this approach, illustrated above, for each scenario defined by a specific combination of factors we run our simulation, assess the performance, and return a tidy summary table of how well our methods did. This is the most straightforward option, given what we have done so far: we already have a method that runs a simulation for one scenario, and we simply call that method for a bunch of scenarios and collate the results.

After the pmap() call, we end up with a dataframe of all our simulations, with one simulation scenario per row (or perhaps a small bundle of rows, one per method), along with our measured performance outcomes. This is, ideally, all we need for the analysis.

On the pro side, we have less data to store, and it is easier to compartmentalize the work. On the con side, we have no ability to add new performance measures after the fact.

Overall, this seems pretty good. That being said, sometimes we might want to use a lot of disk space and keep much more. In particular, each row of the above corresponds to the summary of a whole collection of individual runs. We might instead store all of these runs, which brings us to the other approach.

OUTSIDE (keep all simulation runs): In this approach we do not aggregate, but instead, for each scenario, return the entire set of individual estimates. The benefit of this is, given the raw estimates, you can dynamically add or change how you calculate performance measures. You do, however, end up with massive amounts of data to store and manipulate.

To move from inside to outside, we just take the summarizing step out of run_alpha_sim(). E.g.,:

run_alpha_sim_raw <- function(iterations, n, p, alpha, df, coverage = 0.95, seed = NULL) {
  
  if (!is.null(seed)) set.seed(seed)
  
  results <- 
    replicate(n = iterations, {
      dat <- r_mvt_items(n = n, p = p, alpha = alpha, df = df)
      estimate_alpha(dat, coverage = coverage)
    }, simplify = FALSE) %>%
    bind_rows()
  
  results
}

Each call to run_alpha_sim_raw() now gives one row per simulation trial; once we unnest, the simulation parameters get replicated for each of these rows.

run_alpha_sim_raw( 4, 50, 6, 0.5, 3 )
##           A      Var_A       CI_L      CI_U
## 1 0.5402995 0.01014358 0.29374115 0.7007831
## 2 0.3867717 0.01805035 0.05786946 0.6008526
## 3 0.4617342 0.01390704 0.17303773 0.6496454
## 4 0.4810237 0.01292815 0.20267301 0.6622008

The primary advantage of this is that we can generate new performance measures, as they occur to us, later on. The disadvantage is that this result file will have \(R\) times as many rows as the older file, which can get quite large.

But disk space is cheap! Here we run the same experiment with our more complete storage. Note how unnest() stacks the multiple rows from each run, giving us everything nicely bundled up:

params$res <- params %>% 
    pmap( run_alpha_sim_raw )   # iterations (500) is already a column of params
sim_results_full <- unnest( params,
                            cols = res ) 
write_csv( sim_results_full, "results/cronbach_results_full.csv" )

We end up with a lot more rows:

nrow( sim_results_full )
## [1] 72000
nrow( sim_results )
## [1] 144

Compare the file sizes: one is a few kilobytes, the other is over seven megabytes.

file.size("results/cronbach_results.csv") / 1024
## [1] 4.103516
file.size("results/cronbach_results_full.csv") / 1024
## [1] 7516.222

12.3.1 Getting raw results ready for analysis

If we generated raw results, we need to collapse them by experimental run before we can explore trends across the experiments. We do this by grouping our data by scenario and calling alpha_performance() on each group:

results <- sim_results_full %>%
  nest_by( n, p, alpha, df, .key = "alpha_sims" )
results
## # A tibble: 144 × 5
## # Rowwise:  n, p, alpha, df
##        n     p alpha    df         alpha_sims
##    <dbl> <dbl> <dbl> <dbl> <list<tibble[,6]>>
##  1    50     4   0.1     5          [500 × 6]
##  2    50     4   0.1    10          [500 × 6]
##  3    50     4   0.1    20          [500 × 6]
##  4    50     4   0.1   100          [500 × 6]
##  5    50     4   0.2     5          [500 × 6]
##  6    50     4   0.2    10          [500 × 6]
##  7    50     4   0.2    20          [500 × 6]
##  8    50     4   0.2   100          [500 × 6]
##  9    50     4   0.3     5          [500 × 6]
## 10    50     4   0.3    10          [500 × 6]
## # ℹ 134 more rows
results$performance = map2( results$alpha_sims, 
                               results$alpha, alpha_performance )
results <- results %>%
  dplyr::select( -alpha_sims ) %>%
  unnest( cols="performance" ) 
results
## # A tibble: 576 × 7
## # Groups:   n, p, alpha, df [144]
##        n     p alpha    df criterion           est
##    <dbl> <dbl> <dbl> <dbl> <chr>             <dbl>
##  1    50     4   0.1     5 alpha bias      -0.103 
##  2    50     4   0.1     5 alpha RMSE       0.388 
##  3    50     4   0.1     5 V relative bias  0.436 
##  4    50     4   0.1     5 coverage         0.826 
##  5    50     4   0.1    10 alpha bias      -0.0527
##  6    50     4   0.1    10 alpha RMSE       0.263 
##  7    50     4   0.1    10 V relative bias  0.780 
##  8    50     4   0.1    10 coverage         0.908 
##  9    50     4   0.1    20 alpha bias      -0.0190
## 10    50     4   0.1    20 alpha RMSE       0.221 
## # ℹ 566 more rows
## # ℹ 1 more variable: MCSE <dbl>

Now, if we want to add a performance metric, we can simply change alpha_performance() and recalculate, without re-running the time-intensive simulations. Being able to re-analyze saved results is generally a far easier fix than changing the run_alpha_sim() method and running all the simulations again.
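For example, if we later decided we also wanted the median absolute error of the alpha estimates, we could compute it straight from the stored raw results, without touching the simulation code (a sketch using the A and alpha columns of sim_results_full shown above):

sim_results_full %>%
  group_by( n, p, alpha, df ) %>%
  summarise( median_abs_error = median( abs( A - alpha ) ),
             .groups = "drop" )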

Summarizing during the simulation versus afterwards, as we just did, leads to essentially the same place, although the exact layout of the results may differ (e.g., one row per performance metric in long format versus the metrics as columns).
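If we prefer the metrics laid out as columns, we can always pivot the long results wide (a sketch using tidyr; the new column names follow the criterion labels shown above):

results %>%
  pivot_wider( names_from = criterion,
               values_from = c( est, MCSE ) )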

12.4 Running the Cluster RCT multifactor experiment

Running our cluster RCT simulation uses the exact same code pattern as before. Simulations take a while to run, so we save the results to disk, allowing us to analyze them at our leisure. Because we are not yet sure exactly which performance metrics we want, we save the individual results and calculate performance metrics on the full data. I.e., we are storing the individual runs, not the analyzed results!

The code is as follows:

params <- 
  cross_df(crt_design_factors) %>%
  mutate(
    reps = 100,
    seed = 20200320 + 1:n()
  )
params$res = pmap(params, .f = run_CRT_sim )
res = params %>% unnest( cols = c(res) )
saveRDS( res, file = "results/simulation_CRT.rds" )

The seed is for reproducibility; we discuss this more later on.

We then group by our simulation factors and calculate all our performance metrics at once directly. For example, here is the code for calculating performance measures across our simulation for the cluster randomized experiments example:

res <- readRDS( file = "results/simulation_CRT.rds" )

sres <- 
  res %>% 
  group_by( n_bar, J, ATE, size_coef, ICC, alpha, method ) %>%
  summarise( 
    bias = mean(ATE_hat - ATE),
    SE = sd( ATE_hat ),
    RMSE = sqrt( mean( (ATE_hat - ATE )^2 ) ),
    ESE_hat = sqrt( mean( SE_hat^2 ) ),
    SD_SE_hat = sqrt( sd( SE_hat^2 ) ),
    power = mean( p_value <= 0.05 ),
    R = n(),
    .groups = "drop"
  )
sres
## # A tibble: 810 × 14
##    n_bar     J   ATE size_coef   ICC alpha method
##    <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl> <chr> 
##  1    20     5   0.2         0   0     0   Agg   
##  2    20     5   0.2         0   0     0   LR    
##  3    20     5   0.2         0   0     0   MLM   
##  4    20     5   0.2         0   0     0.5 Agg   
##  5    20     5   0.2         0   0     0.5 LR    
##  6    20     5   0.2         0   0     0.5 MLM   
##  7    20     5   0.2         0   0     0.8 Agg   
##  8    20     5   0.2         0   0     0.8 LR    
##  9    20     5   0.2         0   0     0.8 MLM   
## 10    20     5   0.2         0   0.2   0   Agg   
## # ℹ 800 more rows
## # ℹ 7 more variables: bias <dbl>, SE <dbl>,
## #   RMSE <dbl>, ESE_hat <dbl>, SD_SE_hat <dbl>,
## #   power <dbl>, R <int>

12.4.1 Making analyze_data() quiet

If we run our simulation when there is little cluster variation, we start getting a lot of messages and warnings from our MLM estimator (e.g., singular-fit messages or convergence warnings from lmer). For example, a single call like the following can trigger several of these complaints:

res <- analyze_data(dat)

When we scale up to our full simulations, these warnings become a nuisance. Furthermore, we have found that the lmer command can sometimes just fail (we believe there is some bug in the optimizer that errors out when things are just perfectly wrong). If this happened on simulation run 944 out of 1000, we would lose everything! To protect ourselves, we trap messages and warnings like so (see the chapter on writing safe code for more on this):

quiet_lmer = quietly( lmer )
analyze_data <- function( dat ) {
    
    # MLM (fit via the quiet wrapper, which captures messages and warnings)
    M1 <- quiet_lmer( Yobs ~ 1 + Z + (1|sid), data=dat )
    message1 = ifelse( length( M1$messages ) > 0, 1, 0 )
    warning1 = ifelse( length( M1$warnings ) > 0, 1, 0 )

   ...

    # Compile our results
    tibble( 
      method = c( "MLM", "LR", "Agg" ),
      ATE_hat = c( est1, est2, est3 ),
      SE_hat = c( se1, se2, se3 ),
      p_value = c( pv1, pv2, pv3 ),
      message = c( message1, 0, 0 ),
      warning = c( warning1, 0, 0 )
    )
}

Any convergence messages are now recorded as flags in our results rather than printed to the console:

res <- analyze_data(dat)
res
## # A tibble: 3 × 6
##   method ATE_hat SE_hat p_value message warning
##   <chr>    <dbl>  <dbl>   <dbl>   <dbl>   <dbl>
## 1 MLM    -0.376   1.44    0.842       0       0
## 2 LR     -0.0973  0.784   0.921       0       0
## 3 Agg    -0.440   0.856   0.698       0       0

See? No more printed warnings; instead, any messages and warnings are captured as variables in our results.
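Once we stack the results from the full simulation, we can also tabulate how often the MLM estimator complained, for example by averaging these flags within each scenario (a sketch, assuming the stacked results res from above include the message and warning columns):

res %>%
  filter( method == "MLM" ) %>%
  group_by( J, ICC ) %>%
  summarise( prop_message = mean( message ),
             prop_warning = mean( warning ),
             .groups = "drop" )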