Chapter 12 Designing the multifactor simulation experiment

So far, we’ve created code that gives us results for a single combination of parameter values. In practice, simulation studies typically examine a range of values, varying the true parameter values and perhaps also the sample sizes, to explore a variety of scenarios. We either want reassurance that our findings are general, or we want to understand which aspects of the context drive the results we found. A single simulation scenario can answer neither question; it is only by looking across a range of settings that we can fully understand trade-offs, general rules, and limits. Let’s now look at the remaining piece of the simulation puzzle: the study’s experimental design.

Simulation studies often take the form of full factorial designed experiments. In full factorials, each factor (a particular knob a researcher might turn to change the simulation conditions) is varied across multiple levels, and the design includes every possible combination of the levels of every factor. One way to represent such a design is as a list of factors and levels.

For example, for the Cronbach alpha simulation, we might want to vary:

  • the sample size, with values of 50 or 100;
  • the number of items, with values of 4 or 8;
  • the true value of alpha, with values ranging from 0.1 to 0.9; and
  • the degrees of freedom of the multivariate \(t\) distribution, with values of 5, 10, 20, or 100.

We first express the simulation parameters as a list of factors, each factor having a list of values to explore. We will then run a simulation for every possible combination of these values. We call this a \(2 \times 2 \times 9 \times 4\) factorial design, where each element is the number of options for that factor. Here is code that generates all the scenarios we will run given the above design, storing these combinations in a data frame, params, that represents the full experimental design:

design_factors <- list(
  n = c(50, 100),
  p = c(4, 8),
  alpha = seq(0.1, 0.9, 0.1),
  df = c(5, 10, 20, 100)
)

params <- cross_df(design_factors)
params
## # A tibble: 144 × 4
##        n     p alpha    df
##    <dbl> <dbl> <dbl> <dbl>
##  1    50     4   0.1     5
##  2   100     4   0.1     5
##  3    50     8   0.1     5
##  4   100     8   0.1     5
##  5    50     4   0.2     5
##  6   100     4   0.2     5
##  7    50     8   0.2     5
##  8   100     8   0.2     5
##  9    50     4   0.3     5
## 10   100     4   0.3     5
## # ℹ 134 more rows

See what we get? The parameters we would pass to run_alpha_sim() correspond to the columns of our dataset. We have a total of \(2 \times 2 \times 9 \times 4 = 144\) rows, each row corresponding to a simulation scenario to explore. With multifactor experiments, it is easy to end up running a lot of experiments!
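To get a sense of scale: if we run, say, 500 replications of each scenario (the number we will use below), this design implies generating and analyzing \(144 \times 500 = 72000\) datasets in total:

nrow( params ) * 500
## [1] 72000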

12.1 Choosing parameter combinations

How do we go about choosing parameter values to examine? Choosing which parameters to use is a central part of good simulation design because the primary limitation of simulation studies is always their generalizability. On the one hand, it’s difficult to extrapolate findings from a simulation study beyond the set of simulation conditions that were examined. On the other hand, it’s often difficult or impossible to examine the full space of all possible parameter values, except for very simple problems. Even in the Cronbach alpha simulation, we’ve got four factors, and the last three could each take an infinite number of different levels, in theory. How can we come up with a defensible set of levels to examine?

The choice of simulation conditions needs to be made in the context of the problem or model that you’re studying, so it’s a bit difficult to offer valid, decontextualized advice. We can provide a few observations all the same:

  1. For research simulations, it is often important to be able to relate your findings to previous research. This suggests selecting parameter levels that make such comparisons possible, such as by looking at sample sizes similar to those examined in previous studies. That said, previous simulation studies are not always well designed (actually, there are a lot of really crummy ones out there!), and so prior work should not generally be your sole guide or justification.

  2. Generally, it is better to err on the side of being more comprehensive. You learn more by looking at a broader range of conditions, and you can always boil down your results to a more limited set of conditions for purposes of presentation.

  3. It is also important to explore breakdown points (e.g., what sample size is too small for a method to work?) rather than focusing only on conditions where a method might be expected to work well. Pushing the boundaries and identifying conditions where estimation methods break will help you to provide better guidance for how the methods should be used in practice.

An important point regarding (2) is that you can be more comprehensive and then have fewer replications per scenario. For example, say you were planning on doing 1000 simulations per scenario, but then you realize there is some new factor that you don’t think matters, but that you believe other researchers will worry about. You could add in that factor, say with four levels, and then do 250 simulations per scenario. The total work remains the same.

When analyzing the final simulation, you can first verify that you do not see trends along this new factor, and then marginalize it out in your summaries of results. Marginalizing out a factor (i.e., averaging your performance metrics across its levels) is a powerful technique for making claims about how your methods work on average across a range of scenarios, rather than for a specific scenario.
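As a concrete sketch, suppose we have a table of performance results with one row per scenario, columns for each design factor, and a measured bias. Here perf is a hypothetical name for such a table and extra stands in for the added nuisance factor; marginalizing it out is just a grouped average:

library(dplyr)

# `perf` (hypothetical) has one row per scenario, with columns n, p, alpha,
# extra (the nuisance factor), and a performance measure `bias`.
perf %>%
  group_by( n, p, alpha ) %>%        # group by the factors we care about, dropping `extra`
  summarise( bias = mean( bias ),    # average bias across the levels of `extra`
             .groups = "drop" )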

Overall, you generally want to vary parameters that you believe matter, or that you think other people will believe matter. The first is so you can learn. The second is to build your case.

Once you have identified your parameters, you then have to decide on the levels of each parameter to include in the simulation. There are three strategies you might take:

  1. Vary a parameter over its entire range (or nearly so).
  2. Choose parameter levels to represent a realistic practical range.
    • Empirical justification based on systematic reviews of applications
    • Or at least informal impressions of what’s realistic in practice
  3. Choose parameters to emulate one important application.

Of these, (1) is the most general, but also the most computationally intensive. (2) focuses attention, ideally, on what is of practical relevance to practitioners. (3) is usually coupled with a subsequent applied data analysis, in which case the simulation is often used to enrich that analysis. For example, if the simulation shows the methods work for data of the same form as the target application, people may be more willing to believe the application’s findings.

Regardless of how you select your primary parameters, you should also vary nuisance parameters (at least a little) to test the sensitivity of your results. While simulations will (generally) never be fully generalizable, you can at least design them to avoid the obvious gaps a critic might seize on to dismiss your findings.

To recap, as you think about your parameter selection, keep the following design principles and acknowledgements in mind:

  • The primary limitation of simulation studies is generalizability.
  • Choose conditions that allow you to relate findings to previous work.
  • Err towards being comprehensive.
    • The goal should be to build an understanding of the major moving parts.
    • Presentation of results can always be tailored to illustrate trends.
  • Explore breakdown points (e.g., what sample size is too small for applying a given method?).

And fully expect to add and subtract from your set of parameters as you get your initial simulation results! No one ever runs just a single simulation.

12.1.1 Choosing parameters for the Clustered RCT

Extending our case study presented in Section 6.3 to a multifactor simulation, let’s think about how to design our full experiment.

So far, we have only investigated a single scenario at a time, although our modular approach does make exploring a range of scenarios by re-calling our simulation function relatively straightforward. But how do our findings generalize? When are the different methods differently appropriate? To answer this, we need to extend to a multifactor simulation to systematically explore trends across contexts for our three estimators. We begin by identifying some questions we might have, given our preliminary results.

Regarding bias, in our initial simulation we noticed that Linear Regression estimates a person-weighted quantity, and so would be considered biased for the site-average ATE. We might next ask: how much does the bias change if we change the relationship between site size and impact?

For precision, we also saw that Linear Regression has a higher standard error. But is this a general finding? When does this occur? Are there contexts where linear regression will do better than the others? Originally we thought aggregation would lose information because small sites get the same weight as big sites, despite being more imprecisely estimated. Were we wrong? Or perhaps if site size were even more variable, Agg would do worse and worse.

Finally, the estimated SEs all appeared to be good, although they were rather variable, relative to the true SE. We might then ask, is this always the case? Will the estimated SEs fall apart (e.g., be way too large or way too small, in general) in different contexts?

To answer these questions we need to more systematically explore the space of models. But we have a lot of knobs to turn. In our simulation, we can generate fake cluster randomized data with the following features:

  • The treatment impact of the site can vary, and vary with the site size

  • We can have sites of different sizes if we want

  • We can also vary:

    • the site intercept variance,
    • the residual variance,
    • the treatment impact,
    • the site size,
    • the number of sites, …

We cannot easily vary all of these. We instead reflect on our research questions, speculate as to what is likely to matter, and then consider varying the following:

  • Average site size: Does the number of students/site matter?
  • Number of sites: Do cluster-robust SEs work with fewer sites?
  • Variation in site size: Do varying site sizes cause bias or break things?
  • Correlation of site size and site impact: Will correlation cause bias?
  • Cross-site variation: Does the amount of site variation matter?

When designing the final factors, it is important to ensure those factors are isolated, in that changing one of them does not change a host of other things that might impact performance. For example, in our case, if we simply added more cross-site variation by directly increasing the random effects for the clusters, the total variation would also increase. If we then saw methods deteriorate, we would have a confound: is it the cross-site variation causing the problem, or the total variation? We therefore want to vary site variation while controlling total variation; this is why we use the ICC knob discussed in the section on the data-generating process.
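A standard way to get this kind of control (sketched below with illustrative variable names; the actual data-generating function is the one discussed earlier) is to parameterize the variances through the intraclass correlation, \(ICC = \sigma^2_{site} / (\sigma^2_{site} + \sigma^2_{resid})\), while fixing the total variance at 1:

# Fix total variance at 1 and let the ICC determine how it is split
# between cross-site and within-site variation.
sigma2_site  <- ICC        # cross-site (random intercept) variance
sigma2_resid <- 1 - ICC    # residual variance, so the total stays at 1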

We might thus end up with the following for our factors:

crt_design_factors <- list(
  n_bar = c( 20, 80, 320 ),
  J = c( 5, 20, 80 ),
  ATE = c( 0.2 ),
  size_coef = c( 0, 0.2 ),
  ICC = c( 0, 0.2, 0.4, 0.6, 0.8 ),
  alpha = c( 0, 0.5, 0.8 )
)
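This is a \(3 \times 3 \times 1 \times 2 \times 5 \times 3\) design. As a quick check of how many scenarios it implies, we can multiply the number of levels of each factor:

prod( lengths( crt_design_factors ) )
## [1] 270

(With three estimators evaluated in each scenario, this is where the 810 rows of the summary table in Section 12.4 come from.)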

12.2 Using pmap to run multifactor simulations

To run simulations across all of our factor combinations, we are going to use a very useful method in the purrr package called pmap(). pmap() marches down a set of lists, running a function on each \(p\)-tuple of elements, taking the \(i^{th}\) element from each list for iteration \(i\), and passing them as parameters to the specified function. pmap() then returns the results of this sequence of function calls as a list of results.

Here is a small illustration:

my_function <- function( a, b, theta, scale ) {
    scale * (a + theta*(b-a))
}

args = list( a = 1:3, 
             b = 5:7, 
             theta = c(0.2, 0.3, 0.7) )
purrr::pmap_dbl(  args, my_function, scale = 10 )
## [1] 18 32 58

One important note: the names of the lists being iterated over must correspond exactly to the argument names of the called function. Extra parameters can be passed after the function name; these will be held constant and passed to each function call. See how scale is the same for all calls.

As we see above, pmap() has variants such as _dbl or _df just like the map() and map2() methods. These variants will automatically stack or convert the list of things returned into a tidier collection (for _dbl it will convert to a vector of numbers, for _df it will stack the results to make a large dataframe, assuming each thing returned is a little dataframe).
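For instance, here is a small, hypothetical illustration of the _df variant: each call returns a one-row data frame, and pmap_df() stacks these into a single data frame.

purrr::pmap_df( args, function( a, b, theta ) {
  data.frame( a = a, b = b, mid = a + theta * (b - a) )
} )
# Result: a 3-row data frame with columns a, b, and mid.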

So far, this is great, but it does not quite look like what we want: our factors are stored as a dataframe, not a set of lists. This is where R gets interesting: in R, a dataframe is stored as a list of columns (vectors or lists, each of exactly the same length). This works beautifully with pmap(). Witness:

args[[2]]
## [1] 5 6 7
a_df = as.data.frame(args)
a_df
##   a b theta
## 1 1 5   0.2
## 2 2 6   0.3
## 3 3 7   0.7
a_df[[2]]
## [1] 5 6 7
purrr::pmap_dbl( a_df, my_function, scale = 10)
## [1] 18 32 58

We can pass a_df to pmap(), which treats it as a list of columns and therefore does exactly what it did before.

All of this means pmap() can run a specified function on each row of a dataset. Continuing the Cronbach Alpha simulation from above, we would have the following:

params$iterations <- 500
sim_results <-  params %>%
  mutate(res = pmap(., run_alpha_sim ) )

We add a column to params to record the desired 500 replications per condition. The above code calls our run_alpha_sim() method for each row of our list of scenarios we want to explore. Even better, we are storing the results as a new variable in the same dataset.

sim_results
## # A tibble: 144 × 6
##        n     p alpha    df iterations res  
##    <dbl> <dbl> <dbl> <dbl>      <dbl> <lgl>
##  1    50     4   0.1     5        500 NA   
##  2   100     4   0.1     5        500 NA   
##  3    50     8   0.1     5        500 NA   
##  4   100     8   0.1     5        500 NA   
##  5    50     4   0.2     5        500 NA   
##  6   100     4   0.2     5        500 NA   
##  7    50     8   0.2     5        500 NA   
##  8   100     8   0.2     5        500 NA   
##  9    50     4   0.3     5        500 NA   
## 10   100     4   0.3     5        500 NA   
## # ℹ 134 more rows

We are creating a list-column, where each element in our list column is the little summary of our simulation results for that scenario. Here is the third scenario, for example:

sim_results$res[[3]]
## [1] NA

We finally use unnest() to expand the res variable, replicating the values of the main variables once for each row in the nested dataset:

library(tidyr)
sim_results <- unnest(sim_results, cols = res) %>%
  dplyr::select( -iterations )
sim_results
## # A tibble: 144 × 5
##        n     p alpha    df res  
##    <dbl> <dbl> <dbl> <dbl> <lgl>
##  1    50     4   0.1     5 NA   
##  2   100     4   0.1     5 NA   
##  3    50     8   0.1     5 NA   
##  4   100     8   0.1     5 NA   
##  5    50     4   0.2     5 NA   
##  6   100     4   0.2     5 NA   
##  7    50     8   0.2     5 NA   
##  8   100     8   0.2     5 NA   
##  9    50     4   0.3     5 NA   
## 10   100     4   0.3     5 NA   
## # ℹ 134 more rows

We can put all of this together in a tidy workflow as follows:

sim_results <- 
  params %>%
  mutate(res = pmap(., .f = run_alpha_sim)) %>%
  unnest(cols = res)

If we want to use parallel processing (more on this later), we can instead use the simhelpers package; the following code is also auto-generated by the create_skeleton() method:

library(future)
library(simhelpers)

plan(multisession) # choose an appropriate plan from the future package
evaluate_by_row(params, run_alpha_sim)

We finally save our results using tidyverse’s write_csv(); see the “R for Data Science” textbook, Section 11.5. We can ensure we have a directory to save into by making one via dir.create() (see Section 19 for more on files):

dir.create("results", showWarnings = FALSE )
write_csv( sim_results, file = "results/cronbach_results.csv" )

12.3 Ways of repeating oneself

We have three core elements in our simulation:

  • Generate data
  • Analyze data
  • Assess performance

In arranging these elements, we have a choice: do we compute performance measures for each simulation scenario as we go (inside) vs. computing after we get all of our individual results (outside)?

INSIDE (aggregate as you simulate): In this approach, illustrated above, for each scenario defined by a specific combination of factors we run our simulation, assess the performance, and return a tidy summary table of how well our methods did. This is the most straightforward option, given what we have done so far: we already have a method that runs a simulation for one scenario, and we simply call that method for a bunch of scenarios and collate the results.

After the pmap() call, we end up with a dataframe of all our simulations, with one simulation scenario per row (or perhaps a small bundle of rows, one per method), along with our measured performance outcomes. This is, ideally, all we need for the analysis.

On the pro side, we have less data to store, and it is easier to compartmentalize the work. On the con side, we have no ability to add new performance measures after the fact.

Overall, this seems pretty good. That being said, sometimes we might want to use a lot of disk space and keep much more. In particular, each row of the above corresponds to the summary of a whole collection of individual runs. We might instead store all of these runs, which brings us to the other approach.

OUTSIDE (keep all simulation runs): In this approach we do not aggregate, but instead, for each scenario, return the entire set of individual estimates. The benefit of this is, given the raw estimates, you can dynamically add or change how you calculate performance measures. You do, however, end up with massive amounts of data to store and manipulate.

To move from inside to outside, we just take the summarizing step out of run_alpha_sim(). E.g.,:

run_alpha_sim_raw <- function(iterations, n, p, alpha, df, coverage = 0.95, seed = NULL) {
  
  if (!is.null(seed)) set.seed(seed)
  
  results <- 
    replicate(n = iterations, {
      dat <- r_mvt_items(n = n, p = p, alpha = alpha, df = df)
      estimate_alpha(dat, coverage = coverage)
    }, simplify = FALSE) %>%
    bind_rows()
  
  results
}

Each call to run_alpha_sim_raw() now gives one row per simulation trial; once we unnest, the simulation parameters get replicated for each of these rows.

run_alpha_sim_raw( 4, 50, 6, 0.5, 3 )
##           A      Var_A       CI_L      CI_U
## 1 0.5402995 0.01014358 0.29374115 0.7007831
## 2 0.3867717 0.01805035 0.05786946 0.6008526
## 3 0.4617342 0.01390704 0.17303773 0.6496454
## 4 0.4810237 0.01292815 0.20267301 0.6622008

The primary advantage of this is that we can generate new performance measures, as they occur to us, later on. The disadvantage is that this result file will have \(R\) times as many rows as the older file, which can get quite large.

But disk space is cheap! Here we run the same experiment with our more complete storage. Note how unnest() stacks the multiple rows from each run, giving us everything nicely bundled up:

params$res <- params %>% 
    pmap( run_alpha_sim_raw )   # iterations (500) is already a column of params
sim_results_full <- unnest( params,
                            cols = res ) 
write_csv( sim_results_full, "results/cronbach_results_full.csv" )

We end up with a lot more rows:

nrow( sim_results_full )
## [1] 72000
nrow( sim_results )
## [1] 144

Compare the file sizes: one is a few kilobytes, the other is over seven megabytes.

file.size("results/cronbach_results.csv") / 1024
## [1] 4.103516
file.size("results/cronbach_results_full.csv") / 1024
## [1] 7516.222

12.3.1 Getting raw results ready for analysis

If we generated raw results, we need to collapse them by experimental run before we can explore trends across the experiments. We do this by grouping our data by scenario and calling alpha_performance() on each group:

results <- sim_results_full %>%
  nest_by( n, p, alpha, df, .key = "alpha_sims" )
results
## # A tibble: 144 × 5
## # Rowwise:  n, p, alpha, df
##        n     p alpha    df         alpha_sims
##    <dbl> <dbl> <dbl> <dbl> <list<tibble[,6]>>
##  1    50     4   0.1     5          [500 × 6]
##  2    50     4   0.1    10          [500 × 6]
##  3    50     4   0.1    20          [500 × 6]
##  4    50     4   0.1   100          [500 × 6]
##  5    50     4   0.2     5          [500 × 6]
##  6    50     4   0.2    10          [500 × 6]
##  7    50     4   0.2    20          [500 × 6]
##  8    50     4   0.2   100          [500 × 6]
##  9    50     4   0.3     5          [500 × 6]
## 10    50     4   0.3    10          [500 × 6]
## # ℹ 134 more rows
results$performance = map2( results$alpha_sims, 
                               results$alpha, alpha_performance )
results <- results %>%
  dplyr::select( -alpha_sims ) %>%
  unnest( cols="performance" ) 
results
## # A tibble: 576 × 7
## # Groups:   n, p, alpha, df [144]
##        n     p alpha    df criterion           est
##    <dbl> <dbl> <dbl> <dbl> <chr>             <dbl>
##  1    50     4   0.1     5 alpha bias      -0.103 
##  2    50     4   0.1     5 alpha RMSE       0.388 
##  3    50     4   0.1     5 V relative bias  0.436 
##  4    50     4   0.1     5 coverage         0.826 
##  5    50     4   0.1    10 alpha bias      -0.0527
##  6    50     4   0.1    10 alpha RMSE       0.263 
##  7    50     4   0.1    10 V relative bias  0.780 
##  8    50     4   0.1    10 coverage         0.908 
##  9    50     4   0.1    20 alpha bias      -0.0190
## 10    50     4   0.1    20 alpha RMSE       0.221 
## # ℹ 566 more rows
## # ℹ 1 more variable: MCSE <dbl>

Now, if we want to add a performance metric, we can simply change alpha_performance() and recalculate, without re-running the time-intensive simulations. Being able to re-analyze saved results is generally a far easier fix than changing the run_alpha_sim() method and running all the simulations again.
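For example, if we later decided we also wanted the median absolute error of the alpha estimates, we could compute it straight from the stored raw results, without touching the simulation code (a sketch using the A and alpha columns of sim_results_full shown above):

sim_results_full %>%
  group_by( n, p, alpha, df ) %>%
  summarise( median_abs_error = median( abs( A - alpha ) ),
             .groups = "drop" )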

Summarizing during the simulation versus afterwards, as we just did, leads to essentially the same place, although the exact layout of the results may differ (e.g., one row per performance metric in long format versus the metrics as columns).
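If we prefer the metrics laid out as columns, we can always pivot the long results wide (a sketch using tidyr; the new column names follow the criterion labels shown above):

results %>%
  pivot_wider( names_from = criterion,
               values_from = c( est, MCSE ) )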

12.4 Running the Cluster RCT multifactor experiment

Running our cluster RCT simulation uses the exact same code pattern as before. Simulations take a while to run, so we save the results to disk, allowing us to analyze them at our leisure. Because we are not yet sure exactly which performance metrics we want, we save the individual results and calculate performance metrics on the full data. I.e., we are storing the individual runs, not the analyzed results!

The code is as follows:

params <- 
  cross_df(crt_design_factors) %>%
  mutate(
    reps = 100,
    seed = 20200320 + 1:n()
  )
params$res = pmap(params, .f = run_CRT_sim )
res = params %>% unnest( cols = c(res) )
saveRDS( res, file = "results/simulation_CRT.rds" )

The seed is for reproducibility; we discuss this more later on.

We then group by our simulation factors and calculate all our performance metrics at once directly. For example, here is the code for calculating performance measures across our simulation for the cluster randomized experiments example:

res <- readRDS( file = "results/simulation_CRT.rds" )

sres <- 
  res %>% 
  group_by( n_bar, J, ATE, size_coef, ICC, alpha, method ) %>%
  summarise( 
    bias = mean(ATE_hat - ATE),
    SE = sd( ATE_hat ),
    RMSE = sqrt( mean( (ATE_hat - ATE )^2 ) ),
    ESE_hat = sqrt( mean( SE_hat^2 ) ),
    SD_SE_hat = sqrt( sd( SE_hat^2 ) ),
    power = mean( p_value <= 0.05 ),
    R = n(),
    .groups = "drop"
  )
sres
## # A tibble: 810 × 14
##    n_bar     J   ATE size_coef   ICC alpha method
##    <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl> <chr> 
##  1    20     5   0.2         0   0     0   Agg   
##  2    20     5   0.2         0   0     0   LR    
##  3    20     5   0.2         0   0     0   MLM   
##  4    20     5   0.2         0   0     0.5 Agg   
##  5    20     5   0.2         0   0     0.5 LR    
##  6    20     5   0.2         0   0     0.5 MLM   
##  7    20     5   0.2         0   0     0.8 Agg   
##  8    20     5   0.2         0   0     0.8 LR    
##  9    20     5   0.2         0   0     0.8 MLM   
## 10    20     5   0.2         0   0.2   0   Agg   
## # ℹ 800 more rows
## # ℹ 7 more variables: bias <dbl>, SE <dbl>,
## #   RMSE <dbl>, ESE_hat <dbl>, SD_SE_hat <dbl>,
## #   power <dbl>, R <int>

12.4.1 Making analyze_data() quiet

If we run our simulation when there is little cluster variation, we start getting a lot of messages and warnings from our MLM estimator (e.g., singular-fit messages or convergence warnings from lmer). For example, a single call like the following can trigger several of these complaints:

res <- analyze_data(dat)

When we scale up to our full simulations, these warnings become a nuisance. Furthermore, we have found that the lmer command can sometimes just fail (we believe there is some bug in the optimizer that errors out when things are just perfectly wrong). If this happened on simulation run 944 out of 1000, we would lose everything! To protect ourselves, we trap messages and warnings like so (see the chapter on writing safe code for more on this):

quiet_lmer = quietly( lmer )
analyze_data <- function( dat ) {
    
    # MLM (fit via the quiet wrapper, which captures messages and warnings)
    M1 <- quiet_lmer( Yobs ~ 1 + Z + (1|sid), data=dat )
    message1 = ifelse( length( M1$messages ) > 0, 1, 0 )
    warning1 = ifelse( length( M1$warnings ) > 0, 1, 0 )

   ...

    # Compile our results
    tibble( 
      method = c( "MLM", "LR", "Agg" ),
      ATE_hat = c( est1, est2, est3 ),
      SE_hat = c( se1, se2, se3 ),
      p_value = c( pv1, pv2, pv3 ),
      message = c( message1, 0, 0 ),
      warning = c( warning1, 0, 0 )
    )
}

Any convergence messages are now recorded as flags in our results rather than printed to the console:

res <- analyze_data(dat)
res
## # A tibble: 3 × 6
##   method ATE_hat SE_hat p_value message warning
##   <chr>    <dbl>  <dbl>   <dbl>   <dbl>   <dbl>
## 1 MLM    -0.376   1.44    0.842       0       0
## 2 LR     -0.0973  0.784   0.921       0       0
## 3 Agg    -0.440   0.856   0.698       0       0

See? No more printed warnings; instead, any messages and warnings are captured as variables in our results.
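Once we stack the results from the full simulation, we can also tabulate how often the MLM estimator complained, for example by averaging these flags within each scenario (a sketch, assuming the stacked results res from above include the message and warning columns):

res %>%
  filter( method == "MLM" ) %>%
  group_by( J, ICC ) %>%
  summarise( prop_message = mean( message ),
             prop_warning = mean( warning ),
             .groups = "drop" )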