Chapter 11 Designing multifactor simulations
Thus far, we have created code that will run a simulation for a single combination of parameter values. In practice, simulation studies typically examine a range of values, varying the focal parameters, auxiliary parameters, sample size, and possibly other design parameters to explore a range of different scenarios. We either want reassurance that our findings are general, or we want to understand what aspects of the context affect the performance of the estimator or estimators we are studying. A single simulation can answer neither question; it is only by looking across a range of settings that we can hope to understand trade-offs, general rules, and limits. Let’s now look at the remaining piece of the simulation puzzle: the study’s experimental design.
Simulation studies often take the form of full factorial designed experiments. In full factorials, each factor (a particular knob a researcher might turn to change the simulation conditions) is varied across multiple levels, and the design includes every possible combination of the levels of every factor. One way to represent such a design is as a list of the factors and levels to be explored.
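For example, here is a minimal sketch in R: the design is just a named list of factors with their levels, and the experiment itself is every combination of those levels. The factor names are illustrative, and we use tidyr’s expand_grid(), which accepts a spliced list via !!!:

library( tidyr )

# Each element is a factor; its values are the levels to explore
design_factors <- list(
  n = c( 50, 200, 800 ),     # sample size
  rho = c( 0, 0.3, 0.6 )     # focal parameter
)

# The full factorial: every combination of levels (9 scenarios here)
params <- expand_grid( !!!design_factors )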
The multifactor aspect of a simulation is incredibly important. It can take us from an overly narrow exploration to one with broader significance. As @little2013praise puts it:
Good simulation studies are not given the respect they deserve. Often the design is perfunctory and simplistic, neglecting to attempt a factorial experimental design to cover the relevant sample space, and results are over-generalized. Well designed simulation studies with realistic sample sizes are an antidote to a fixation on asymptotics and a useful tool for assessing calibration.
11.1 Choosing parameter combinations
How do we go about choosing parameter values to examine? Choosing which parameters to use is a central part of good simulation design because the primary limitation of simulation studies is always their generalizability. On the one hand, it is difficult to extrapolate findings from a simulation study beyond the set of simulation conditions that were examined. On the other hand, it is often difficult or impossible to examine the full space of all possible parameter values, except for very simple problems. Even in the relatively straightforward Pearson correlation simulation, we have four factors and the last three could each take on an infinite number of possible levels. How can we come up with a defensible set of levels to examine?
Broadly, you generally want to vary parameters that you believe matter, or that you think other people will believe matter. The first is so you can learn. The second is to build your case. The choice of simulation conditions needs to be made in the context of the problem or model that you are studying, so it is difficult to identify valid but acontextual principles. We nonetheless offer a few points of advice, informed by our experience conducting and reading simulation studies:
1. For research simulations, it is often important to be able to relate your findings to previous research. This suggests that you should select parameter levels to make this possible, such as by looking at sample sizes similar to those examined in previous studies. That said, previous simulation studies are not always perfect (actually, there are a lot of really crummy ones out there!), and so prior work should not generally be your sole guide or justification.
2. Generally, it is better to err on the side of being more comprehensive. You learn more by looking at a broader range of conditions, and you can always boil down your results to a more limited set of conditions for purposes of presentation.
3. It is also important to explore breakdown points (e.g., what sample size is too small for a method to work?) rather than focusing only on conditions where a method might be expected to work well. Pushing the boundaries and identifying conditions where estimation methods break will help you to provide better guidance for how the methods should be used in practice.
On point (2), comprehensiveness will generally increase the amount of computing required for a simulation. However, this can be tempered by reducing the number of replications per scenario. For example, say you were planning on doing 1000 replications per scenario, but then you realize there is some other factor that you do not think matters, but that you believe other researchers will worry about. You could add in that factor, say with four levels, and then do 250 replications per scenario; the total work remains the same. When analyzing the final simulation, you would first verify that you do not see trends along this new factor, and then marginalize it out in your summaries of results. Marginalizing out a factor (i.e., averaging your performance metrics across the levels of that factor) is a powerful technique for making claims about how your methods work on average across a range of scenarios, rather than for a specific scenario. We will discuss it further in Chapter 12.
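As a sketch of what marginalizing looks like in code, suppose we have a table of performance results, perf, with one row per scenario, and a nuisance factor extra (both names are hypothetical). With dplyr, we simply average each metric over the levels of extra:

# Average each performance metric over the levels of the nuisance
# factor, keeping only the factors of primary interest
perf_marg <- perf %>%
  group_by( n, rho ) %>%
  summarise(
    bias = mean( bias ),
    RMSE = mean( RMSE ),
    .groups = "drop"
  )

Because the design is fully crossed, each cell of the marginal summary averages over the same set of levels of the dropped factor, so the summaries remain directly comparable.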
Once you have identified your parameters, you then have to decide on the levels of each parameter to include in the simulation. There are three strategies you might take:

1. Vary a parameter over as much of its range as you can.
2. Choose parameter levels to represent a realistic practical range. These ranges would ideally be empirically justified based on systematic reviews of prior applications. Lacking such empirical evidence, you may need to rely on informal impressions of what is realistic in practice.
3. Choose parameters to emulate an important known application or context.
Of these choices, option (1) is the most general but also the most computationally intensive. Option (2), at its best, focuses attention on what is of practical relevance to a practitioner. Option (3) is usually coupled with a subsequent applied data analysis, and in this case the simulation is often used to enrich that analysis: if the simulation shows the methods work for data of the same form as the target application, people may be more willing to believe the application’s findings.
Regardless of how you select your primary parameters, you should also vary nuisance parameters (at least a little) to test the sensitivity of your results to these other aspects of the data-generating process. While simulations will (generally) never be fully generalizable, you can certainly design them to avoid the obvious things a critic might identify as a basis for dismissing your findings.
To recap, as you think about your parameter selection, always keep in mind the following design principles and acknowledgements:
- The primary limitation of simulation studies is generalizability.
- Choose conditions that allow you to relate your findings to previous work.
- Err towards being comprehensive. Your goal should be to build an understanding of the major moving parts, and you can always tailor your final presentation of results to give the simplified story, once you find it.
- Explore breakdown points and boundary conditions (e.g., what sample size is too small for applying a given method?).
Finally, you should fully expect to add and subtract from your set of simulation factors as you get your initial simulation results. Rarely does anyone nail the choice of parameters on the first pass.
11.2 Case Study: A multifactor evaluation of cluster RCT estimators
To bring the multifactor simulation to life, let us return to the case study of comparing three ways to analyze a cluster randomized trial that we presented in Section 6.6. In our original setup, we wrote code to generate cluster randomized trial data where a cluster’s size could be correlated with its average treatment effect. Then, in Section 9.1.1, we looked across a variety of performance criteria so we could see how the estimators compared for any given scenario we wanted.
So far, we have only examined a single scenario at a time. But how do our findings generalize? Under what conditions do the various estimation methods perform better or worse? To answer these questions, we need to extend to a multifactor simulation to systematically explore how our three estimators behave across a range of contexts. Happily, the modular functions that we have designed make it relatively straightforward to explore a range of scenarios by calling our simulation function over and over, using the tools of this chapter. We illustrate extending to a full multifactor simulation next, and then continue our running example to discuss how to analyze the results in Chapter 13.
11.2.1 Choosing parameters for the Clustered RCT
We begin by identifying some potential research questions suggested by our preliminary exploration. Regarding bias, we noticed in our initial simulation that linear regression targets a person-weighted average effect, so it would be considered biased for the cluster-average average treatment effect. We might then ask: how large is this bias in practice, and how much does it change as we change the cluster-size-by-impact relationship? Considering precision, we saw that linear regression has a higher standard error than the other estimators. But is this a general finding? If not, are there contexts where linear regression will have a lower standard error than the others? Further, we originally thought that aggregation would lose information because smaller clusters would get the same weight as larger clusters, while being more imprecisely estimated. Were we wrong? Or perhaps, if cluster sizes were even more variable, would aggregation do worse and worse? Finally, the estimated SEs for all three methods appeared to be good, although they were rather variable relative to the true SE. We might then ask: are the standard errors always the right size, on average? Will the estimated SEs fall apart (i.e., be far too large or far too small) in some contexts? If so, which ones?
To answer all of these questions we need to more systematically explore the space of models. But we have a lot of knobs to turn. In particular, our data-generating process will produce artificial cluster-randomized experiments where we can vary any of the following features:
- the number of clusters;
- the proportion of clusters that receive treatment;
- the average cluster size and degree of variation in cluster sizes;
- how much the average impact varies across clusters, and how strongly that is connected to cluster size;
- how much the cluster intercepts vary (degree of cross-cluster variation); and
- the degree of residual variation.
Manipulating all of these factors would lead to a huge and unwieldy number of simulation conditions to evaluate. Before proceeding, we reflect on our research questions, speculate as to what is likely to matter, and then consider varying the following:
- Number of clusters: Do cluster-robust SEs work with fewer clusters?
- Average cluster size: Does the number of students/cluster matter?
- Variation in cluster size: Do varying cluster sizes cause bias or break things?
- Correlation of cluster size and cluster impact: Will correlation cause bias?
- Cross-cluster variation: Does the amount of cross-cluster variation matter?
When selecting factors to manipulate, it is important to ensure that each factor is isolated, so that changing one of them does not change other aspects of the data-generating process that might impact performance. For example, if we simply added more cross-cluster variation by directly increasing the random effects for the clusters, the total variation in the outcome would also increase. If we then see that the performance of an estimator deteriorates as variation increases, we have a confound: is the cross-cluster variation causing the problem, or is it the total variation? To avoid this confound, we should vary cluster variation while holding the total variation fixed; this is why we use the ICC parameterization, as discussed in Section 6.6.
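As a sketch of how such a parameterization works (a simplified version of the approach in Section 6.6, with the total outcome variance standardized to 1):

# Turn an ICC into variance components with a fixed total variance,
# so raising the ICC shifts variance across clusters without
# changing the total variation in the outcome
icc_to_variances <- function( ICC, total_var = 1 ) {
  c( sigma2_u = ICC * total_var,            # cross-cluster variance
     sigma2_e = ( 1 - ICC ) * total_var )   # residual variance
}

icc_to_variances( ICC = 0.4 )
## sigma2_u sigma2_e 
##      0.4      0.6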
Given our research questions and the way we parameterize the DGP, we end up with the following factors and levels:
crt_design_factors <- list(
  n_bar = c( 20, 80, 320 ),
  J = c( 5, 20, 80 ),
  ATE = c( 0.2 ),
  size_coef = c( 0, 0.2 ),
  ICC = c( 0, 0.2, 0.4, 0.6, 0.8 ),
  alpha = c( 0, 0.5, 0.8 )
)

The ATE factor only has one level. We could later expand this if we wanted to, but based on theoretical concerns we are fairly confident the ATE will not impact our research questions, so we leave it as it is, set at a value we would consider plausible in real life.
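Fully crossing these factors gives 3 × 3 × 1 × 2 × 5 × 3 = 270 distinct scenarios, which we can check directly from the list:

prod( lengths( crt_design_factors ) )
## [1] 270

With three estimators and 1000 replications per scenario, this design will generate the 810,000 rows of results we see below.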
11.2.2 Redundant factor combinations
There is some redundancy in the parameter combinations that we have selected. In particular, if size_coef is nonzero but alpha is 0, then cluster size will not impact the cluster-average ATE, because there is no variation in cluster size. We could drop one of the redundant conditions to save some simulation runs; in essence, we are running the same simulation for alpha = 0 twice, once for each size_coef value, for each level of ICC, n_bar, and J. However, doing so would mean that our simulation is no longer fully crossed.

We will leave the redundant conditions in as a sanity check on our results: we know we should get the same results for both (up to Monte Carlo error), and if we do not, then we either have uncontrolled uncertainty in our simulation or an error in the code. If computation is cheap, keeping the fully crossed structure also makes for easier analysis. Without it, our experiment is not balanced, and so when we average performance across some factors to see the effects of others, we might end up with surprising and misleading results.
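If we did want to trim the redundancy, a single filter on the scenario table (the params table we build in the next section) would drop the duplicated cells, keeping only the size_coef = 0 version of each alpha = 0 condition:

# Drop redundant scenarios: when alpha = 0, cluster size does not
# vary, so only one size_coef level is needed
params_trim <- params %>%
  filter( !( alpha == 0 & size_coef != 0 ) )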
11.2.3 Running the simulations
We will run our cluster RCT simulation using the same code pattern as we used with the Pearson correlation simulations. Because we are not exactly sure which performance metrics we will want to use, we will save the individual replications and then calculate performance metrics after generating the simulation results. That is, we will use the outside strategy.
We first make a table of scenarios, one row per combination of factor levels. To allow reproducibility, we also specify a different seed for each scenario, just to avoid anything confusing about shared randomness across scenarios (see Section 8.4 for further discussion).
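A minimal sketch of this step, using tidyr’s expand_grid() to fully cross the factors (the seed formula here is arbitrary, so long as each scenario gets a distinct, fixed value):

params <-
  expand_grid( !!!crt_design_factors ) %>%
  mutate( seed = 20240301 + 17 * row_number() )

We then run the simulation 1000 times for each scenario, and unnest to get our final data: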
params$res <- pmap( params, .f = run_CRT_sim, reps = 1000 )
res <- unnest( params, cols = res )
saveRDS( res, file = "results/simulation_CRT.rds" )

Normally, we would speed up these calculations using parallel processing; we discuss how to do so in Chapter 18. Even with parallel processing, this simulation took overnight to run. Our final results look like this:
## # A tibble: 810,000 × 13
## n_bar J ATE size_coef ICC alpha reps
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 20 5 0.2 0 0 0 1000
## 2 20 5 0.2 0 0 0 1000
## 3 20 5 0.2 0 0 0 1000
## 4 20 5 0.2 0 0 0 1000
## 5 20 5 0.2 0 0 0 1000
## 6 20 5 0.2 0 0 0 1000
## 7 20 5 0.2 0 0 0 1000
## 8 20 5 0.2 0 0 0 1000
## 9 20 5 0.2 0 0 0 1000
## 10 20 5 0.2 0 0 0 1000
## # ℹ 809,990 more rows
## # ℹ 6 more variables: seed <dbl>, runID <chr>,
## # method <chr>, ATE_hat <dbl>, SE_hat <dbl>,
## # p_value <dbl>
We have a lot of rows of data!
11.2.4 Calculating performance metrics
Our next step is to group the results by the simulation factors and calculate the performance metrics across the replications of each simulation condition. Here we calculate our primary performance measures by hand:
res <- readRDS( file = "results/simulation_CRT.rds" )

sres <-
  res %>%
  group_by( n_bar, J, ATE, size_coef, ICC, alpha, method ) %>%
  summarise(
    bias = mean( ATE_hat - ATE ),              # average estimation error
    SE = sd( ATE_hat ),                        # true SE (SD of estimates)
    RMSE = sqrt( mean( (ATE_hat - ATE)^2 ) ),  # root mean squared error
    ESE_hat = sqrt( mean( SE_hat^2 ) ),        # average estimated SE
    SD_SE_hat = sqrt( sd( SE_hat^2 ) ),        # instability of estimated SEs
    power = mean( p_value <= 0.05 ),           # rejection rate at the 0.05 level
    R = n(),                                   # number of replications
    .groups = "drop"
  )

If we want MCSEs (as we usually do), then we could calculate them by hand or use the simhelpers package, like so:
library( simhelpers )

sres <-
  res %>%
  group_by( n_bar, J, ATE, size_coef, ICC, alpha, method ) %>%
  summarise(
    calc_absolute(
      estimates = ATE_hat, true_param = ATE,
      criteria = c( "bias", "stddev", "rmse" )
    ),
    calc_relative_var(
      estimates = ATE_hat, var_estimates = SE_hat^2,
      criteria = "relative bias"
    )
  ) %>%
  rename( SE = stddev, SE_mcse = stddev_mcse ) %>%
  dplyr::select( -K_absolute, -K_relvar ) %>%
  ungroup()
glimpse( sres )

## Rows: 810
## Columns: 15
## $ n_bar <dbl> 20, 20, 20, 20, 20, 20…
## $ J <dbl> 5, 5, 5, 5, 5, 5, 5, 5…
## $ ATE <dbl> 0.2, 0.2, 0.2, 0.2, 0.…
## $ size_coef <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
## $ ICC <dbl> 0.0, 0.0, 0.0, 0.0, 0.…
## $ alpha <dbl> 0.0, 0.0, 0.0, 0.5, 0.…
## $ method <chr> "Agg", "LR", "MLM", "A…
## $ bias <dbl> 0.0078528924, 0.007852…
## $ bias_mcse <dbl> 0.006512388, 0.0065123…
## $ SE <dbl> 0.2059398, 0.2059398, …
## $ SE_mcse <dbl> 0.004380492, 0.0043804…
## $ rmse <dbl> 0.2059865, 0.2059865, …
## $ rmse_mcse <dbl> 0.005468829, 0.0054688…
## $ rel_bias_var <dbl> 0.9834635, 0.9834635, …
## $ rel_bias_var_mcse <dbl> 0.05290317, 0.05290317…