Chapter 13 Special Topics on Reporting Simulation Results

In this chapter we cover some special topics on reporting simulation results. We first walk through some examples of how to do regression modeling. We then dive more deeply into what to do when you have only a few iterations per scenario, and then we discuss what to do when you are evaluating methods that sometimes fail to converge or give an answer.

13.1 Using regression to analyze simulation results

In Chapter 11 we saw some examples of using regression and ANOVA to analyze simulation results. We next provide some further in-depth examples that give the code for doing this sort of thing.

13.1.1 Example 1: Biserial, revisited

We first give the code that produced the final ANOVA summary table for the biserial correlation example in Chapter 11. In the visualizations there, we saw that several factors appeared to impact bias, but we might want a sense of how much. In the modeling discussion of that same chapter, we saw a table that partitioned the variance across several factors so we could see which simulation factors mattered most for bias.

To build that table, we first fit a regression model:

options(scipen = 5)
mod = lm( bias ~ fixed + rho + I(rho^2) + p1 + n, data = r_F)
summary(mod, digits=2)
## 
## Call:
## lm(formula = bias ~ fixed + rho + I(rho^2) + p1 + n, data = r_F)
## 
## Residuals:
##        Min         1Q     Median         3Q 
## -0.0215935 -0.0013608  0.0003823  0.0015677 
##        Max 
##  0.0081802 
## 
## Coefficients:
##                        Estimate  Std. Error
## (Intercept)         -0.00186446  0.00017971
## fixedSample cut-off -0.00363520  0.00009733
## rho                 -0.00942338  0.00069578
## I(rho^2)             0.00720857  0.00070868
## p1p1 = 1/3           0.00358696  0.00015390
## p1p1 = 1/4           0.00482709  0.00015390
## p1p1 = 1/5           0.00547657  0.00015390
## p1p1 = 1/8           0.00635532  0.00015390
## n.L                  0.00362949  0.00010882
## n.Q                 -0.00103981  0.00010882
## n.C                  0.00027941  0.00010882
## n^4                  0.00001976  0.00010882
##                     t value Pr(>|t|)    
## (Intercept)         -10.375   <2e-16 ***
## fixedSample cut-off -37.347   <2e-16 ***
## rho                 -13.544   <2e-16 ***
## I(rho^2)             10.172   <2e-16 ***
## p1p1 = 1/3           23.307   <2e-16 ***
## p1p1 = 1/4           31.365   <2e-16 ***
## p1p1 = 1/5           35.585   <2e-16 ***
## p1p1 = 1/8           41.295   <2e-16 ***
## n.L                  33.352   <2e-16 ***
## n.Q                  -9.555   <2e-16 ***
## n.C                   2.568   0.0103 *  
## n^4                   0.182   0.8559    
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.003372 on 4788 degrees of freedom
## Multiple R-squared:  0.5107, Adjusted R-squared:  0.5096 
## F-statistic: 454.4 on 11 and 4788 DF,  p-value: < 2.2e-16

The above printout gives main effects for each factor, averaged across the other factors. Because p1 and n are ordered factors, the lm() command automatically generates linear, quadratic, cubic, and fourth-order contrasts for them. We smooth our rho factor, which has many levels of a continuous measure, with a quadratic curve. We could instead use splines or some local linear regression if we were worried about model fit for a more complex relationship.
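
For instance, swapping the quadratic in rho for a natural cubic spline is a one-line change to the formula; a minimal sketch, assuming the same r_F summary data as above (the df = 4 choice is arbitrary):

library(splines)

# Sketch: replace rho + rho^2 with a natural cubic spline in rho
mod_spline <- lm( bias ~ fixed + ns( rho, df = 4 ) + p1 + n, data = r_F )
summary( mod_spline )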

The main effects are summaries of trends across contexts. For example, averaged across the other contexts, the “sample cut-off” condition has bias around 0.004 lower than the population condition (the baseline).

We can also use ANOVA to get a sense of the major sources of variation in the simulation results (e.g., identifying which factors have negligible/minor influence on the bias of an estimator). To do this, we use aov() to fit an analysis of variance model:

anova_table <- aov(bias ~ rho * p1 * fixed * n, data = r_F)
summary(anova_table)
##                  Df   Sum Sq  Mean Sq  F value
## rho               1 0.002444 0.002444  1673.25
## p1                4 0.023588 0.005897  4036.41
## fixed             1 0.015858 0.015858 10854.52
## n                 4 0.013760 0.003440  2354.60
## rho:p1            4 0.001722 0.000431   294.71
## rho:fixed         1 0.003440 0.003440  2354.69
## p1:fixed          4 0.001683 0.000421   287.98
## rho:n             4 0.002000 0.000500   342.31
## p1:n             16 0.019810 0.001238   847.51
## fixed:n           4 0.013359 0.003340  2285.97
## rho:p1:fixed      4 0.000473 0.000118    80.87
## rho:p1:n         16 0.001470 0.000092    62.91
## rho:fixed:n       4 0.002929 0.000732   501.23
## p1:fixed:n       16 0.001429 0.000089    61.12
## rho:p1:fixed:n   16 0.000429 0.000027    18.36
## Residuals      4700 0.006866 0.000001         
##                Pr(>F)    
## rho            <2e-16 ***
## p1             <2e-16 ***
## fixed          <2e-16 ***
## n              <2e-16 ***
## rho:p1         <2e-16 ***
## rho:fixed      <2e-16 ***
## p1:fixed       <2e-16 ***
## rho:n          <2e-16 ***
## p1:n           <2e-16 ***
## fixed:n        <2e-16 ***
## rho:p1:fixed   <2e-16 ***
## rho:p1:n       <2e-16 ***
## rho:fixed:n    <2e-16 ***
## p1:fixed:n     <2e-16 ***
## rho:p1:fixed:n <2e-16 ***
## Residuals                
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The advantage here is that the multiple levels of each categorical factor get bundled together in the table of results, making for a tidier display.

The table in Chapter 11 is a summary of this anova table, which we generate as follows:

library(lsr)
etaSquared(anova_table) %>%
  as.data.frame() %>%
  rownames_to_column("source") %>%
  mutate( order = 1 + str_count(source, ":" ) ) %>%
  group_by( order ) %>%
  arrange( -eta.sq, .by_group = TRUE ) %>%
  relocate( order ) %>%
  knitr::kable( digits = 2 )

We group the results by the order of the interaction, so that we can see the main effects first, then two-way interactions, and so on. We then sort within each group to put the high importance factors first. The resulting variance decomposition table (see Chapter 11) shows the amount of variation explained by each combination of factors.

13.1.2 Example 2: Cluster RCT example, revisited

When we have several methods to compare, we can also use meta-regression to understand how the methods' performance changes as the other simulation factors change. We next continue our running Cluster RCT example.

We first turn our simulation factors into R factors (except for ICC, which has enough numeric levels that we keep it continuous), so that R does not treat sample size, for example, as a continuous variable:

sres_f <-
  sres %>%
  mutate( 
    across( c( n_bar, J, size_coef, alpha ), factor ),
    ICC = as.numeric(ICC)
  )

# Run the regression
M <- lm( bias ~ (n_bar + J + size_coef + ICC + alpha) * method, 
         data = sres_f )

# View the results
stargazer::stargazer(M, type = "text", single.row = TRUE )
## 
## ==================================================
##                            Dependent variable:    
##                        ---------------------------
##                                   bias            
## --------------------------------------------------
## n_bar80                      -0.003 (0.002)       
## n_bar320                     -0.001 (0.002)       
## J20                          -0.002 (0.002)       
## J80                          -0.002 (0.002)       
## size_coef0.2                  0.003 (0.002)       
## ICC                           0.001 (0.003)       
## alpha0.5                     -0.002 (0.002)       
## alpha0.8                      0.001 (0.002)       
## methodLR                    -0.012*** (0.004)     
## methodMLM                     0.001 (0.004)       
## n_bar80:methodLR             0.0001 (0.003)       
## n_bar320:methodLR           -0.00001 (0.003)      
## n_bar80:methodMLM            -0.0005 (0.003)      
## n_bar320:methodMLM           -0.001 (0.003)       
## J20:methodLR                 0.005* (0.003)       
## J80:methodLR                 0.006* (0.003)       
## J20:methodMLM                 0.001 (0.003)       
## J80:methodMLM                 0.001 (0.003)       
## size_coef0.2:methodLR       0.016*** (0.002)      
## size_coef0.2:methodMLM        0.003 (0.002)       
## ICC:methodLR                 -0.0003 (0.004)      
## ICC:methodMLM                -0.006 (0.004)       
## alpha0.5:methodLR            0.006** (0.003)      
## alpha0.8:methodLR           0.017*** (0.003)      
## alpha0.5:methodMLM            0.001 (0.003)       
## alpha0.8:methodMLM            0.003 (0.003)       
## Constant                      0.002 (0.003)       
## --------------------------------------------------
## Observations                       810            
## R2                                0.273           
## Adjusted R2                       0.248           
## Residual Std. Error         0.014 (df = 783)      
## F Statistic             11.283*** (df = 26; 783)  
## ==================================================
## Note:                  *p<0.1; **p<0.05; ***p<0.01

We can quickly generate a lot of regression coefficients, making our meta-regression somewhat hard to interpret. The above model does not even include interactions among the simulation factors themselves, even though the plots we have seen strongly suggest such interactions exist. That said, picking out the significant coefficients is a quick way to obtain clues as to what is driving performance. For instance, several factors interact with the LR method for bias; the other two methods seem less impacted.
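
One way to scan for such clues is to pull out the larger, statistically significant coefficients from the fitted model; a minimal sketch (this filtering step is ours, not part of the original analysis):

# Sketch: screen the meta-regression for sizable, significant terms.
broom::tidy( M ) %>%
  filter( term != "(Intercept)", p.value < 0.05 ) %>%
  arrange( desc( abs( estimate ) ) ) %>%
  knitr::kable( digits = 3 )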

13.1.2.1 Using LASSO to simplify the model

We can simplify our model using LASSO regression, to drop coefficients that are less relevant. This requires some work to make our model matrix of dummy variables with all the interactions.

library(modelr)
library(glmnet)
## Loaded glmnet 4.1-10
# Define formula: each simulation factor interacted with method
form <- bias ~ ( n_bar + J + size_coef + ICC + alpha) * method

# Create model matrix
X <- model.matrix(form, data = sres_f)[, -1]  # drop intercept

# Fit LASSO
fit <- cv.glmnet(X, sres_f$bias, alpha = 1)

# Coefficients
coef(fit, s = "lambda.1se") %>%
  as.matrix() %>%
  as.data.frame() %>%
  rownames_to_column("term") %>%
  filter(abs(lambda.1se) > 0) %>%
  knitr::kable(digits = 3)
term                      lambda.1se
(Intercept)                    0.000
size_coef0.2                   0.002
alpha0.8                       0.001
size_coef0.2:methodLR          0.008
alpha0.8:methodLR              0.007

When using regression, and especially LASSO, the choice of baseline (reference) levels can impact the final results. Here “Agg” is our baseline method, so our coefficients show how the other methods differ from the Agg method. If we instead selected LR as the baseline, then we might suddenly see Agg and MLM as having large coefficients.
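
You could check this directly by refitting with a different reference level; a minimal sketch, reusing form and the data from above:

# Sketch: refit the LASSO with LR as the reference method.
sres_LR <- sres_f %>%
  mutate( method = relevel( factor( method ), ref = "LR" ) )
X_LR <- model.matrix( form, data = sres_LR )[, -1]
fit_LR <- cv.glmnet( X_LR, sres_LR$bias, alpha = 1 )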

One trick is to give every level of every factor its own dummy variable, including the baseline levels, so that no level gets absorbed into an intercept. We do this with an identity contrast matrix and by dropping the intercept from the formula:

# ICC is numeric, so we leave it out of this identity-contrast version
form <- bias ~ 0 + ( n_bar + J + size_coef + alpha ) * method

sres_f$method <- factor( sres_f$method )
vars <- c( "n_bar", "J", "size_coef", "alpha", "method" )

# Identity contrasts: one dummy per level, with no level dropped
contr.identity <- function(x) {
  n <- nlevels(x)
  m <- diag(n)
  rownames(m) <- colnames(m) <- levels(x)
  m
}
contr.identity(sres_f$n_bar)
contr.identity(sres_f$n_bar)
##     20 80 320
## 20   1  0   0
## 80   0  1   0
## 320  0  0   1
X <- model.matrix( form, 
                   data = sres_f,
                   contrasts.arg = lapply( sres_f[, vars], contr.identity ) )

colnames(X)
##  [1] "n_bar20"               
##  [2] "n_bar80"               
##  [3] "n_bar320"              
##  [4] "J5"                    
##  [5] "J20"                   
##  [6] "J80"                   
##  [7] "size_coef0"            
##  [8] "size_coef0.2"          
##  [9] "alpha0"                
## [10] "alpha0.5"              
## [11] "alpha0.8"              
## [12] "methodAgg"             
## [13] "methodLR"              
## [14] "methodMLM"             
## [15] "n_bar20:methodAgg"     
## [16] "n_bar80:methodAgg"     
## [17] "n_bar320:methodAgg"    
## [18] "n_bar20:methodLR"      
## [19] "n_bar80:methodLR"      
## [20] "n_bar320:methodLR"     
## [21] "n_bar20:methodMLM"     
## [22] "n_bar80:methodMLM"     
## [23] "n_bar320:methodMLM"    
## [24] "J5:methodAgg"          
## [25] "J20:methodAgg"         
## [26] "J80:methodAgg"         
## [27] "J5:methodLR"           
## [28] "J20:methodLR"          
## [29] "J80:methodLR"          
## [30] "J5:methodMLM"          
## [31] "J20:methodMLM"         
## [32] "J80:methodMLM"         
## [33] "size_coef0:methodAgg"  
## [34] "size_coef0.2:methodAgg"
## [35] "size_coef0:methodLR"   
## [36] "size_coef0.2:methodLR" 
## [37] "size_coef0:methodMLM"  
## [38] "size_coef0.2:methodMLM"
## [39] "alpha0:methodAgg"      
## [40] "alpha0.5:methodAgg"    
## [41] "alpha0.8:methodAgg"    
## [42] "alpha0:methodLR"       
## [43] "alpha0.5:methodLR"     
## [44] "alpha0.8:methodLR"     
## [45] "alpha0:methodMLM"      
## [46] "alpha0.5:methodMLM"    
## [47] "alpha0.8:methodMLM"

Now we run the LASSO on this collinear mess:

fit <- cv.glmnet(X, sres_f$bias, alpha = 1)
coef(fit, s = "lambda.1se") %>%
  as.matrix() %>%
  as.data.frame() %>%
  rownames_to_column("term") %>%
  filter(abs(lambda.1se) > 0) %>%
  knitr::kable(digits = 3)
term                      lambda.1se
(Intercept)                    0.003
size_coef0                    -0.001
size_coef0.2                   0.000
alpha0.8                       0.000
size_coef0.2:methodLR          0.008
alpha0.8:methodLR              0.006

We can also extend to allow for pairwise interactions of simulation factors:

form2 <- bias ~ ( n_bar + J + size_coef + ICC + alpha)^2 * method
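
The refitting code is not shown here; a minimal sketch of one way it could go, mirroring the first LASSO fit (with standard treatment contrasts):

# Sketch: LASSO with pairwise factor interactions crossed with method.
X2 <- model.matrix( form2, data = sres_f )[, -1]   # drop intercept
fit2 <- cv.glmnet( X2, sres_f$bias, alpha = 1 )

coef( fit2, s = "lambda.1se" ) %>%
  as.matrix() %>%
  as.data.frame() %>%
  rownames_to_column( "term" ) %>%
  filter( abs( lambda.1se ) > 0 ) %>%
  knitr::kable( digits = 3 )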

Interestingly, we get basically the same result:

term                             lambda.1se
(Intercept)                           0.002
size_coef0.2:alpha0.8                 0.002
size_coef0.2:methodLR                 0.002
size_coef0.2:alpha0.8:methodLR        0.021

13.1.2.2 Fitting models to each method

We know each method responds differently to the simulation factors, so we could fit three models, one for each method, and compare them.

library(broom)  # for tidy()

meth = c( "LR", "MLM", "Agg" )
models <- map( meth, function(m) {
  M <- lm( bias ~ (n_bar + J + size_coef + ICC + alpha)^2, 
      data = sres_f %>% filter( method == m ) )
  tidy( M )
} )

models <- 
  models %>% 
  set_names(meth) %>% 
  bind_rows( .id = "model" )

m_res <- models %>% 
  dplyr::select( model, term, estimate ) %>%
  pivot_wider( names_from="model", values_from="estimate" )

m_res %>%
  knitr::kable( digits = 2 )
term                       LR    MLM    Agg
(Intercept)              0.01   0.00   0.01
n_bar80                 -0.01  -0.01  -0.01
n_bar320                 0.00   0.00   0.00
J20                     -0.01   0.00  -0.01
J80                     -0.01   0.00   0.00
size_coef0.2             0.00   0.01   0.00
ICC                      0.01   0.02   0.01
alpha0.5                -0.01  -0.01  -0.01
alpha0.8                -0.01   0.00   0.00
n_bar80:J20              0.01   0.01   0.01
n_bar320:J20             0.00   0.00   0.00
n_bar80:J80              0.01   0.01   0.01
n_bar320:J80             0.00   0.00   0.00
n_bar80:size_coef0.2     0.00   0.00   0.00
n_bar320:size_coef0.2    0.00   0.00   0.00
n_bar80:ICC              0.00   0.00   0.00
n_bar320:ICC            -0.01   0.00  -0.01
n_bar80:alpha0.5         0.00   0.00   0.00
n_bar320:alpha0.5        0.00   0.00   0.01
n_bar80:alpha0.8         0.00   0.00   0.00
n_bar320:alpha0.8        0.00   0.00   0.00
J20:size_coef0.2         0.00   0.00   0.00
J80:size_coef0.2         0.00   0.00  -0.01
J20:ICC                 -0.01  -0.02  -0.01
J80:ICC                  0.00  -0.01   0.00
J20:alpha0.5             0.01   0.01   0.01
J80:alpha0.5             0.01   0.01   0.01
J20:alpha0.8             0.02   0.01   0.01
J80:alpha0.8             0.01   0.00   0.00
size_coef0.2:ICC         0.01  -0.01   0.01
size_coef0.2:alpha0.5    0.02   0.00   0.00
size_coef0.2:alpha0.8    0.04   0.01   0.00
ICC:alpha0.5            -0.01  -0.01  -0.01
ICC:alpha0.8             0.00  -0.01   0.00

Of course, this table is hard to read. It is better to instead plot the coefficients, or to use LASSO to simplify the model specification.

m_resL <- m_res %>%
  pivot_longer( -term, 
                names_to = "model", values_to = "estimate" ) %>%
  mutate( term = factor(term, levels = unique(term)) ) %>%
  mutate( has_nbar = str_detect(term, "n_bar" ),
          has_J = str_detect(term, "J"),
          has_size_coef = str_detect(term, "size_coef"),
          has_ICC = str_detect(term, "ICC"),
          has_alpha = str_detect(term, "alpha") )

ggplot( m_resL,
        aes( x = term, y = estimate, 
             fill = model, group = model ) ) +
  facet_wrap( ~ has_nbar, scales="free_y" ) +
  geom_bar( stat = "identity", position = "dodge" ) +
  coord_flip()

Here we see how LR stands out, but also how MLM stands out under different simulation factor combinations. Staring at this provides some understanding of how the methods are similar, and dissimilar.

For another example, we turn to the standard error. Here we regress \(\log(SE)\) onto the simulation factors, rescaling ICC to a 5-point scale so that its coefficient is comparable in size to those of the dummy variables. We then exponentiate the coefficients to get the relative change in SE. We can then interpret an exponentiated coefficient of, say, 0.64 for MLM for n_bar80 as a 36% reduction in the standard error when we increase n_bar from the baseline of 20 to 80.
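
The code for this model is not shown in the text; a minimal sketch of one way to do it (the ICC_5 rescaling and the exact model specification are our assumptions):

# Sketch: per-method regressions of log(SE), with exponentiated
# coefficients giving relative changes in SE.
se_res <- map( meth, function(m) {
  d <- sres_f %>%
    filter( method == m ) %>%
    mutate( ICC_5 = ICC * 5 )   # rescaled ICC (assumed form of rescaling)
  M_se <- lm( log( SE ) ~ (n_bar + J + size_coef + ICC_5 + alpha)^2,
              data = d )
  tidy( M_se ) %>%
    mutate( rel_change = exp( estimate ) )
} ) %>%
  set_names( meth ) %>%
  bind_rows( .id = "model" )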

Here we make a plot like above, but with these relative changes:
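
The plotting code is also omitted; a sketch in the style of the bias coefficient plot above:

# Sketch: plot relative changes in SE (1 = no change) by model and term.
se_res %>%
  filter( term != "(Intercept)" ) %>%
  mutate( term = factor( term, levels = unique( term ) ) ) %>%
  ggplot( aes( x = term, y = rel_change,
               fill = model, group = model ) ) +
  geom_col( position = "dodge" ) +
  geom_hline( yintercept = 1 ) +
  coord_flip()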

This clearly shows that the methods are basically the same in terms of their standard errors. We also see some interesting trends, such as how the impact of n_bar declines when ICC is higher (see the interaction terms at the right of the plot).

13.2 Using regression trees to find important factors

With more complex experiments, where the various factors interact with each other in complicated ways, it can be tricky to decipher which factors are important and to identify stable patterns. Another exploratory approach is to fit a regression tree to the simulation results.

source( here::here( "code/create_analysis_tree.R" ) )
## Loading required package: rpart
set.seed(4344443)
create_analysis_tree( sres_f,
                      outcome = "bias",
                      predictor_vars = c("method", "n_bar", "J",
                                         "size_coef", "ICC", "alpha"),
                      tree_title = "Cluster RCT Bias Analysis Tree" )

We will not walk through the tree code, but you can review it in code/create_analysis_tree.R. The function is a wrapper around the rpart package.

The default pruning is based on cross-validation, and our sample size is not terribly high (we only have one row per scenario and method). Rerunning the code with a different seed can therefore give a different tree. In general, it can be worth forcibly simplifying the tree: trees are built greedily, so aggressive trimming usually still captures the big patterns. For example:

create_analysis_tree( sres_f,
                      outcome = "bias",
                      predictor_vars = c("method", "n_bar", "J",
                                         "size_coef", "ICC", "alpha"),
                      tree_title = "Smaller Cluster RCT Bias Analysis Tree" )

A very straightforward story: if size_coef is not 0, we are using LR, and alpha is large, then we have large bias.
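
If you were calling rpart directly rather than through the wrapper, you could force a smaller tree yourself; a minimal sketch (the control settings here are illustrative, not from the original analysis):

library(rpart)

# Sketch: limit tree depth up front, then prune using the complexity table.
tree <- rpart( bias ~ method + n_bar + J + size_coef + ICC + alpha,
               data = sres_f,
               control = rpart.control( maxdepth = 3, cp = 0.01 ) )
printcp( tree )                          # inspect cross-validated complexity
small_tree <- prune( tree, cp = 0.02 )   # keep only the strongest splits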

We can also zero in on specific methods to understand how they engage with the simulation factors, like so:

create_analysis_tree( filter( sres_f, method=="LR" ),
                      outcome = "bias",
                      min_leaves = 4,
                      predictor_vars = c("n_bar", "J",
                                         "size_coef", "ICC", "alpha"),
                      tree_title = "Drivers of Bias for LR method" )

We force more leaves to get at some more nuance. We again immediately see, for the LR method, that bias is large when we have a non-zero size coefficient and a large alpha value. Then, when \(J\) is also small, the bias is even larger.

Generally we would not use a tree like this for the final reporting of results, but trees can be important tools for understanding your results, which in turn helps you make and select more conventional figures for final reporting.

13.3 Analyzing results with few iterations per scenario

When your simulation iterations are expensive to run (e.g., when each model fit takes several minutes), running thousands of iterations for many scenarios may not be computationally feasible. But running simulations with a smaller number of iterations will yield very noisy estimates of estimator performance. For a given scenario, if the methods being evaluated are substantially different, then the main patterns in performance might be evident even with only a few iterations. More generally, however, the Monte Carlo Standard Errors (MCSEs) may be so large that you will have a hard time discriminating between systematic patterns and noise.

One tool to handle this is aggregation: if you use visualization methods that average across scenarios, those averages will be more precise estimates of (average) performance. Do not, by contrast, trust the bundling approaches: the MCSEs will make your boxes wider, giving the impression that there is more variation across scenarios than there really is. Regression approaches can be particularly useful: the regressions effectively average performance across scenarios and give summaries of overall trends. You can even fit a random effects regression that explicitly accounts for the noise in the scenario-specific performance measures; for more on this approach, see Gilbert and Miratrix (2024).
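
One simple version of the regression idea is to weight each scenario-level summary by its precision; a rough sketch (the MCSE calculation here anticipates the code in the next section):

# Sketch: precision-weighted meta-regression of scenario-level bias.
sres_w <- sres_f %>%
  mutate( bias.mcse = SE / sqrt( R ) )

M_w <- lm( bias ~ (n_bar + J + size_coef + ICC + alpha) * method,
           data = sres_w,
           weights = 1 / bias.mcse^2 )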

13.3.1 Example: ClusterRCT with only 100 replicates per scenario

In the prior chapter we analyzed the results of our cluster RCT simulation with 1000 replicates per scenario. But say we only had 100 per scenario. Using the prior chapter as a guide, we recreate some of the plots to show how MCSE can distort the picture of what is going on.

First, we look at our single plot of the raw results. Before we plot, however, we calculate MCSEs and add them to the plot as error bars.

sres_sub <- 
  sres %>%
  filter( n_bar == 320, J == 20 ) %>%
  mutate( bias.mcse = SE / sqrt( R ) )

ggplot( sres_sub, aes( as.factor(alpha), bias, 
                       col=method, pch=method, group=method ) ) +
  facet_grid( size_coef ~ ICC, labeller = label_both ) +
  geom_point() + 
  geom_errorbar( aes( ymin = bias - 2*bias.mcse, 
                      ymax = bias + 2*bias.mcse ), 
                 width = 0 ) +
  geom_line() +
  geom_hline( yintercept = 0 ) +
  theme_minimal() 

Aggregation should smooth out some of our uncertainty. When we aggregate across 9 scenarios, our number of replicates goes from 100 to 900; our MCSEs should be about a third the size. Here is our aggregated bias plot:

sres_sub2 <- 
  sres %>%
  mutate( bias.mcse = SE / sqrt( R ) ) %>%
  group_by( method, size_coef, ICC, alpha ) %>%
  summarise( 
    bias = mean( bias ),
    bias.mcse = sqrt( mean( bias.mcse^2 ) ) / sqrt( n() ),
    .groups = "drop" 
  )

ggplot( sres_sub2, aes( as.factor(alpha), bias, 
                        col=method, pch=method, group=method ) ) +
  facet_grid( size_coef ~ ICC, labeller = label_both ) +
  geom_point() + 
  geom_errorbar( aes( ymin = bias - 2*bias.mcse, 
                      ymax = bias + 2*bias.mcse ), 
                 width = 0 ) +
  geom_line() +
  geom_hline( yintercept = 0 ) +
  theme_minimal() 

To get aggregate MCSE, we aggregate our scenario-specific MCSEs as follows: \[ MCSE_{agg} = \sqrt{ \frac{1}{K^2} \sum_{k=1}^{K} MCSE_k^2 } \]

where \(MCSE_k\) is the Monte Carlo Standard Error for scenario \(k\), and \(K\) is the number of scenarios being averaged. Assuming the scenario-level estimates are independent, the squared SE of the average is the average squared SE divided by \(K\).

Even with the additional replicates per point, we see noticeable noise in our plot. Note how our three methods track each other up and down in the zero-bias scenarios, giving a sense of shared bias in some cases. This is because all methods are analyzing the same set of datasets; they have shared uncertainty. This shared uncertainty can be deceptive. It can also be a boon: if we are explicitly comparing the performance of one method versus another, the shared uncertainty can be subtracted out, similar to what happens in a blocked experiment (Gilbert and Miratrix 2024).
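
To take advantage of this, you can difference the methods within each simulated dataset before summarizing; a minimal sketch (the rep_id column identifying individual datasets is hypothetical and may be named differently in your results):

# Sketch: paired LR vs. MLM comparison on the same simulated datasets.
# rep_id is a hypothetical per-dataset identifier within each scenario.
method_diffs <-
  res %>%
  filter( method %in% c( "LR", "MLM" ) ) %>%
  select( rep_id, n_bar, J, size_coef, ICC, alpha, method, ATE_hat ) %>%
  pivot_wider( names_from = method, values_from = ATE_hat ) %>%
  mutate( diff = LR - MLM ) %>%
  group_by( n_bar, J, size_coef, ICC, alpha ) %>%
  summarise(
    mean_diff = mean( diff ),
    diff.mcse = sd( diff ) / sqrt( n() ),   # shared noise cancels in diff
    .groups = "drop"
  )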

Here we fit a multilevel model directly to the iteration-level data:

library(lme4)
sub_res <-
  res %>% 
  filter( runID <= 100 ) %>%
  mutate(
    error = ATE_hat - ATE,
    simID = paste(n_bar, J, size_coef, ICC, alpha, sep = "_"),
    J = as.factor(J),
    n_bar = as.factor(n_bar),
    alpha = as.factor(alpha),
    size_coef = as.factor(size_coef)
  )

M <- lmer( 
  error ~ method*(J + n_bar + ICC + alpha + size_coef) + (1|runID) + (1|simID),
  data = sub_res 
)
arm::display(M)
## lmer(formula = error ~ method * (J + n_bar + ICC + alpha + size_coef) + 
##     (1 | runID) + (1 | simID), data = sub_res)
##                        coef.est coef.se
## (Intercept)            -0.09     0.05  
## methodLR               -0.02     0.05  
## methodMLM               0.00     0.05  
## J20                     0.05     0.04  
## J80                     0.04     0.04  
## n_bar80                 0.06     0.04  
## n_bar320                0.04     0.04  
## ICC                    -0.03     0.06  
## alpha0.5               -0.01     0.04  
## alpha0.8                0.04     0.04  
## size_coef0.2            0.04     0.03  
## methodLR:J20            0.01     0.04  
## methodMLM:J20           0.00     0.04  
## methodLR:J80            0.02     0.04  
## methodMLM:J80           0.00     0.04  
## methodLR:n_bar80        0.01     0.04  
## methodMLM:n_bar80       0.00     0.04  
## methodLR:n_bar320       0.00     0.04  
## methodMLM:n_bar320      0.00     0.04  
## methodLR:ICC           -0.01     0.06  
## methodMLM:ICC          -0.01     0.06  
## methodLR:alpha0.5       0.00     0.04  
## methodMLM:alpha0.5      0.00     0.04  
## methodLR:alpha0.8       0.01     0.04  
## methodMLM:alpha0.8      0.00     0.04  
## methodLR:size_coef0.2   0.01     0.03  
## methodMLM:size_coef0.2  0.00     0.03  
## 
## Error terms:
##  Groups   Name        Std.Dev.
##  simID    (Intercept) 0.19    
##  runID    (Intercept) 0.02    
##  Residual             0.33    
## ---
## number of obs: 2430, groups: simID, 270; runID, 3
## AIC = 2046.9, DIC = 1712.4
## deviance = 1849.6

We can look at the estimated variance components:

ranef_vars <-
  as.data.frame(VarCorr(M)) %>%
  dplyr::select(grp = grp, sd = vcov) %>%   # vcov column holds the variances
  mutate( sd = sqrt(sd),                    # convert variances to SDs
          ICC = sd^2 / sum(sd^2) )          # share of total variance

knitr::kable(ranef_vars, digits = 2)
grp             sd    ICC
simID         0.19   0.24
runID         0.02   0.00
Residual      0.33   0.75

The above model is a multilevel model that allows us to estimate how bias varies with method and the simulation factors, while accounting for the Monte Carlo uncertainty in the simulation. The random variation for simID captures unexplained variation due to interactions of the simulation factors. (The ICC column in this table is each component's share of total variance, not the simulation factor of the same name.) We see a large value for simID, indicating that many interactions are present and that our main effects are not fully capturing all trends.
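
To see which scenarios drive the simID variation, you could inspect the estimated random effects; a minimal sketch:

# Sketch: pull the scenario-level random intercepts and find the most
# extreme scenarios.
re_sim <- ranef( M )$simID %>%
  rownames_to_column( "simID" ) %>%
  rename( intercept = `(Intercept)` ) %>%
  arrange( desc( abs( intercept ) ) )
head( re_sim, 10 )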

13.4 Analyzing results when some trials have failed

If the methods you are evaluating sometimes fail, then when they tend to fail is something to investigate in its own right. Ideally, failure would not be too common, meaning we could drop those trials, or keep them, without really impacting our overall results. But one should at least know what one is ignoring.

For example, in our cluster RCT running example, we know that our multilevel model, at least sometimes, has convergence issues. We also know that ICC is an important driver of when these convergence issues might occur, so we can explore how often we get a convergence message by ICC level:

res %>% 
  group_by( method, ICC ) %>%
  summarise( 
    # message is not numeric, so count how often a message was recorded
    # (assuming NA means no message was captured)
    message = mean( !is.na( message ) ),
    .groups = "drop"
  ) %>%
  pivot_wider( names_from = "method", values_from = "message" )

We see that when the ICC is 0 we get a lot of convergence issues, but as soon as we pull away from 0 the rate drops off considerably. At this point we might decide to drop those runs with a message, or keep them. In this case, we decide to keep them. It should not matter much, except possibly when ICC = 0, where we know the convergence issues are driven by trying to estimate a zero variance and thus are in some sense expected. Furthermore, we know people using these methods would likely ignore these messages, so keeping the runs faithfully captures how the methods would be used in practice. We might eventually, however, want to do a separate analysis of the ICC = 0 context to see if the MLM approach is actually falling apart, or if it is just throwing warnings.
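
Such a follow-up might compare the flagged and unflagged MLM runs directly; a minimal sketch (again assuming NA in message means no message):

# Sketch: within ICC = 0, does the flagged subset of MLM runs behave
# differently from the rest?
res %>%
  filter( ICC == 0, method == "MLM" ) %>%
  mutate( flagged = !is.na( message ) ) %>%
  group_by( n_bar, J, flagged ) %>%
  summarise(
    bias = mean( ATE_hat - ATE ),
    SE   = sd( ATE_hat ),
    R    = n(),
    .groups = "drop"
  )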

References

Gilbert, Joshua, and Luke Miratrix. 2024. “Multilevel Metamodels: A Novel Approach to Enhance Efficiency and Generalizability in Monte Carlo Simulation Studies.” arXiv Preprint arXiv:2401.07294.