# Chapter 9 Performance criteria

Once we have run our simulation, we have a pile of results to sort through. Given these results, the question is now: how do we assess the performance of our evaluated estimation procedures?

In this chapter, we look at a variety of performance criteria that are commonly used to compare the relative performance of multiple estimators or measure how well an estimator works. These performance criteria are all assessments of how the estimator behaves if you repeat the experimental process an infinite number of times. In statistical terms, these criteria are summaries of the true sampling distribution of the estimator, given a specified data generating process. For example, the bias of an estimator in a given scenario is how far off, on average, the estimator would be from the true parameter value if you repeated the experiment an infinite number of times.

Although we cannot observe the sampling distribution of an estimator directly (and it can only rarely be worked out in full mathematical detail), the set of estimates generated by a simulation constitute a (typically large) sample from the sampling distribution of the studied estimator. (Say that six times fast!) We can then use that sample to estimate the performance criteria of interest. For example, when we wanted to know what percent of the time we would reject the null hypothesis (for a given, specified situation) we estimated that by seeing how often we rejected in 1000 trials.

Now, because we have only a sample of trials rather than the full distribution, our estimates are merely estimates. In other words, they can be wrong, just due to random chance. We can describe how wrong with the Monte Carlo standard error (MCSE). The MCSE is the standard error of our estimate of performance due to the simulation only having a finite number of trials. Just as with statistical uncertainty when analyzing data, we can estimate our MCSE and even use them to generate confidence intervals for our performance estimates. The MCSE is not related to the estimators being evaluated; the MCSE is a function of how much simulation we can do. In a simulation study, we could, in theory, know exactly how well our estimators do for a given context, if we ran an infinite number of simulations; the MCSE tells us how far we are from this ideal, given how many simulation trials we actually ran. Given a desired MCSE, we could similarly determine how many replications were needed to ensure our performance estimates have a desired level of precision.

## 9.1 Inference vs. Estimation

There are two general classes of analysis one typically does with data: inference and estimation. To illustrate, we continue to reflect on the question of best practices for analyzing a cluster randomized experiment. For this problem, we are focused on the estimand of the site-average treatment effect, $$\gamma_1$$. The estimand is the thing we are trying to estimate. We can ask whether $$\gamma_1$$ is non-zero (inference), and we can further ask what $$\gamma_1$$ actually is (estimation). More expanded we have:

Inference is when we do hypothesis testing, asking whether there is evidence for some sort of effect, or asking whether there is evidence that some coefficient is greater than or less than some specified value. In particular, for our example, to know if there is evidence that there is an average treatment effect at all we would test the null of $$H_0: \gamma_1 = 0$$.

Estimation is when we try to measure the size of an estimand such as an actual average treatment effect $$\gamma_1$$. Estimation has two major components, the point estimator and the uncertainty estimator. We generally evaluate both the properties of the point estimator and the performance of the properties of the point estimator. For example, consider a specific estimate $$\hat{\gamma_1}$$ of our average treatment effect. We first wish to know the actual bias and true standard error ($$SE$$) of $$\hat{\gamma_1}$$. These are its actual properties. However, for each estimated $$\hat{\gamma_1}$$, we also estimate $$\widehat{SE}$$, as our estimated measure of how precise our estimate is. We need to understand the properties of $$\widehat{SE}$$ as well.

Inference and estimation are clearly highly related–if we have a good estimate of the treatment effect and it is not zero, then we are willing to say that there is a treatment effect–but depending on the framing, the way you would set up a simulation to investigate the behavior of your estimators could be different.

## 9.2 Ways of Assessing and Comparing Estimation Procedures

We often have different methods for obtaining some estimate, and we often want to know which is best. For example, comparison is the core question behind our running example of identifying which estimation strategy (aggregation, linear regression, or multilevel modeling) we should generally use when analyzing cluster randomized trial data. The goal of a simulation comparing our estimators would be to identify whether our estimation strategies were different, whether one was superior to the other (and when), and what the salient differences were. To fully understand the trade-offs and benefits, we would examine and compare the properties of our different approaches across a variety of circumstances, and with respect to a variety of metrics of success.

For inference, we first might ask whether our methods are valid, i.e., ask whether the methods work correctly when we test for a treatment effect when there is none. For example, we might wonder whether using multilevel models could open the door to inference problems if we had model misspecification, such as in a scenario where the residuals had some non-normal distribution. These sorts of questions are questions of validity.

Also for inference, we might ask which method is better for detecting an effect when there is one. Here, we want to know how our estimators perform in circumstances with a non-zero average treatment effect. Do they reject the null often, or rarely? How much does using aggregation decrease (or increase?) our chances of rejection? These are questions about power.

For estimation, we generally are concerned with two things: bias and variance. An estimator is biased if it would generally give estimates that are systematically higher (or lower) than the parameter being estimated in a given scenario. The variance of an estimator is a measure of how much the estimates vary from trial to trial. The variance is the true standard error, squared.

We might also be concerned with how well we can estimate the uncertainty of our estimators (i.e., estimate our standard error). For example, we might have an estimator that works very well, but we have no ability to estimate how well. Continuing our example, we might want to examine how well, for example, the standard errors we get from aggregation work as compared to the standard errors we get out of our linear regression approach.

Finally, we might want to know how well confidence intervals based on our methods work. Do the intervals capture the true estimands with the desired level of accuracy, across the simulation trials? Are the intervals for one method generally shorter or longer than those of another?

## 9.3 Assessing a Point Estimator

Assessing the actual properties of a point estimator is generally fairly simple. For a given scenario, repeatedly generate data and estimate effects. Then take summary measures such as the mean and standard deviation of these repeated trials’ estimates to estimate the actual properties of the estimator via Monte Carlo. Given sufficient simulation trials, we can obtain arbitrarily accurate measures.

The most common measures of an estimator are the bias, variance, and mean squared error. For example, we can ask what the variance (or standard error) of our estimator is. We can ask if our estimator is biased. We can ask what the overall $$RMSE$$ (root mean squared error) of our estimator is.

To be more formal, consider an estimator $$T$$ that is targeting a parameter $$\theta$$. A simulation study generates a (typically large) sample of estimates $$T_1,...,T_R$$, all of the target $$\theta$$. We know $$\theta$$ because we generated the data.

We can first assess whether our estimator is biased, by comparing the mean of our $$R$$ estimates $\bar{T} = \frac{1}{R}\sum_{r=1}^R T_r$ to $$\theta$$. The bias of our estimator is $$bias = \bar{T} - \theta$$.

We can also ask how variable our estimator is, by calculating the variance of our $$R$$ estimates $\displaystyle{S_T^2 = \frac{1}{R - 1}\sum_{r=1}^R \left(T_r - \bar{T}\right)^2} .$

The square root of this, $$S_T$$ is the true standard error of our estimator (up to Monte Carlo simulation uncertainty).

Finally, the Root Mean Square Error (RMSE) is a combination of the above two measures: $RMSE = \left\{ \frac{1}{R} \sum_{r = 1}^R \left( T_r - \theta\right)^2 \right\}^{1/2}.$ Often people talk about the MSE (Mean Squared Error)–this is just the RMSE squared.

An important relationship connecting these three measures is $RMSE^2 = bias^2 + variance = bias^2 + SE^2 .$ It is important to clarify an important point: the true standard error of an estimator $$\hat{\gamma_1}$$ is the standard deviation of $$\hat{\gamma_1}$$ across multiple datasets. In practice, we never know this value, but in a simulation we can obtain it as the standard deviation of our simulation trial estimates. People generally, when they say “Standard Error” mean estimated Standard Error, ($$\widehat{SE}$$) which is when one uses the emperical data to estimate $$SE$$. For assessing actual properties, we have the true standard error (up to Monte Carlo simulation error).

For absolute assessments of performance, an estimator with low bias, low variance, and thus low RMSE is desired. For comparisons of relative performance, an estimator with lower RMSE is usually preferable to an estimator with higher RMSE; if two estimators have comparable RMSE, then the estimator with lower bias would usually be preferable.

It is important to recognize that the above performance measures depend on the scale of the parameter. For example, if our estimators are measuring a treatment impact in dollars, then our bias, SE, and RMSE are all in dollars. The variance and MSE would be in dollars squared, which is why we take their square roots to put them back on an intepretable dollars scale.

Usually in a simulation, the scale of the outcome is irrelevant as we are comparing one estimator to the other. To ease interpretation, we might want to assess estimators relative to the baseline variation. To achieve this, we can generate data so the outcome has unit variance (i.e., we generate standardized data). Then the bias, median bias, and root mean-squared error would all be in standard deviation units.

By contrast, a nonlinear change of scale of a parameter can lead to nonlinear changes in the performance measures. For instance, suppose that $$\theta$$ is a measure of the proportion of time that a behavior occurs. A natural way to transform this parameter would be to put it on the log-odds (logit) scale. However, because of the nonlinear aspect of the logit, $\text{Bias}\left[\text{logit}(T)\right] \neq \text{logit}\left(\text{Bias}[T]\right), \qquad \text{RMSE}\left[\text{logit}(T)\right] \neq \text{logit}\left(\text{RMSE}[T]\right),$ and so on. This is fine, but one should be aware that this can happen and do it on purpose.

### 9.3.1 Comparing the Performances of the Cluster RCT Estimation Procedures

Given our simulation results generated in the last chapter, we next assess the bias, standard error, and RMSE of our three different estimators of the ATE. These performance criteria address these primary questions:

• Is the estimator systematically off? (bias)
• Is it precise? (standard error)
• Does it predict well? (RMSE)

Let us see how the three estimators compare on these criteria.

Are the estimators biased? Bias is with respect to a target estimand. Here we assess whether our estimates are systematically different from the $$\gamma_1$$ parameter we used to generate the data (this is the ATE parameter, which we had set to 0.30).

runs %>%
group_by( method ) %>%
summarise(
mean_ATE_hat = mean( ATE_hat ),
bias = mean( ATE_hat - ATE )  )
## # A tibble: 3 × 3
##   method mean_ATE_hat    bias
##   <chr>         <dbl>   <dbl>
## 1 Agg           0.306 0.00561
## 2 LR            0.390 0.0899
## 3 MLM           0.308 0.00788

Linear regression, with a bias of about 0.09 effect size units, appears about ten times as biased as the other estimators. There is no evidence of major bias for Agg or MLM. This is because the linear regression is targeting the person-average average treatment effect. Our data generating process makes larger sites have larger effects, so the person average is going to be higher since those larger sites will count more. Our estimand, by contrast, is the site average treatment effect, i.e., the simple average of each site’s true impact, which our DGP has set to 0.30. The Agg and MLM methods, by contrast, estimate this site-average effect, putting them in line with our DGP.

If we had instead decided our target estimand was the person average effect, then we would see linear regression as unbiased, and Agg and MLM as biased; it is important to think carefully about what the estimators are targeting, and report bias with respect to a clearly articulated goal.

Which method has the smallest standard error? The true Standard Error is simply how variable a point estimator is, and is calculated as the standard deviation of the point estimates for a given estimator. The Standard Error reflects how stable our estimates are across datasets that all came from the same data generating process. We calculate the standard error, and also the relative standard error using linear regression as a baseline:

true_SE <- runs %>%
group_by( method ) %>%
summarise(
SE = sd( ATE_hat )
)
true_SE %>%
mutate( per_SE = SE / SE[method=="LR"] )
## # A tibble: 3 × 3
##   method    SE per_SE
##   <chr>  <dbl>  <dbl>
## 1 Agg    0.168  0.916
## 2 LR     0.183  1
## 3 MLM    0.168  0.916

These standard errors are all what we would be trying to estimate with a standard error estimator in a normal data analysis. The other methods appear to have SEs about 8% smaller than Linear Regression.

Which method has the smallest Root Mean Squared Error?

So far linear regression is not doing well: it has more bias and a larger standard error than the other two. We can assess overall performance by combining these two quantities with the RMSE:

runs %>%
group_by( method ) %>%
summarise(
bias = mean( ATE_hat - ATE ),
SE = sd( ATE_hat ),
RMSE = sqrt( mean( (ATE_hat - ATE)^2 ) )
) %>%
mutate( per_RMSE = RMSE / RMSE[method=="LR"] )
## # A tibble: 3 × 5
##   method    bias    SE  RMSE per_RMSE
##   <chr>    <dbl> <dbl> <dbl>    <dbl>
## 1 Agg    0.00561 0.168 0.168    0.823
## 2 LR     0.0899  0.183 0.204    1
## 3 MLM    0.00788 0.168 0.168    0.823

We also include SE and bias as reference.

RMSE is a way of taking both bias and variance into account, all at once. For Agg and MLM, the RMSE is basically the standard error; this makes sense as they are not biased. For LR, the combination of bias plus increased variability gives a higher RMSE. That said, clearly the standard error dominates the bias term (note how RMSE and SE are more similar than RMSE and bias). This is especially the case as RMSE is the square root of the bias and standard errors squared; this makes difference between them even more extreme. Overall, Agg and MLM have RMSEs around 16% smaller than LR–this seems notable.

### 9.3.2 Handling Estimands Not Represented By a Parameter

In our Cluster RCT example, we have focused on an estimand of the ATE as captured by our model parameter $$\gamma_1$$. But say we were interested in the person-average effect. This is not represented by any number, so we would have to calculate it and then compare all of our estimates to it.

We offer two ways of doing this. The first is to simply generate a massive dataset, and then average across it to get a good estimate of the true person-average effect. If our dataset is big enough, then the uncertainty in this estimate will be neglidgable compared to the uncertainty in our simulation.

Here we try this:

dat = gen_dat_model( n_bar = 30, J=100000,
gamma_1 = 0.3, gamma_2 = 0.5,
sigma2_u = 0.20, sigma2_e = 0.80,
alpha = 0.75  )
ATE_person = mean( dat$Yobs[dat$Z==1] ) - mean( dat$Yobs[dat$Z==0] )
ATE_person
## [1] 0.395412

Note our estimate of the person-average effect of 0 is about what we would expect given the bias of the linear model!

Note how bias and RMSE have shifted, but SE is the same, when we compare to ATE_person:

runs %>%
group_by( method ) %>%
summarise(
bias = mean( ATE_hat - ATE_person ),
SE = sd( ATE_hat ),
RMSE = sqrt( mean( (ATE_hat - ATE_person)^2 ) )
) %>%
mutate( per_RMSE = RMSE / RMSE[method=="LR"] )
## # A tibble: 3 × 5
##   method     bias    SE  RMSE per_RMSE
##   <chr>     <dbl> <dbl> <dbl>    <dbl>
## 1 Agg    -0.0898  0.168 0.190     1.04
## 2 LR     -0.00554 0.183 0.183     1
## 3 MLM    -0.0875  0.168 0.189     1.03

We see Agg and MLM are now biased, and LR is unbiased. RMSE is now a tension between bias and reduced variance. Overall, Agg and MLM are 4% worse than LR in terms of RMSE, because they have lower SEs but more bias.

The second method of calculating ATE_person would be to record the true person average effect of the dataset of each simulation iteration, and then average those at the end. To do this we would need to modify our gen_dat_model() DGP code to track this additional information. We might have, for example

tx_effect = gamma_1 + gamma_2 * (nj-n_bar)/n_bar
beta_0j = gamma_0 + Zj * tx_effect + u0j

and then we would return tx_effect as well as Yobs and Z as a column in our dataset. This is similar to directly calculating potential outcomes, as discussed in Chapter 23.

Once we modified our DGP code, we also need to modify our analysis functions to record this information. We might have, for example:

analyze_data = function( dat ) {
MLM = analysis_MLM( dat )
LR = analysis_OLS( dat )
Agg = analysis_agg( dat )
res <- bind_rows( MLM = MLM, LR = LR, Agg = Agg,
.id = "method" )
res$ATE_person = mean( dat$tx_effect )
return( res )
}

Now when we run our simulation, we would have a column which is the true person average treatment effect for each dataset. We could then take the average of those across our datasets to estimate the true person average treatment effect in the population, and then compare our point estimators to that value.

Clearly, an estimand that is not represented by a parameter is more difficult to work with, but it is not impossible. As always, be clear as to what you are trying to estimate.

## 9.4 Assessing a Standard Error Estimator

Statistics is perhaps more about assessing how good an estimate is than making an estimate in the first place. This translates to simulation studies: in our simulation we can know an estimator’s actual properties, but if we were to use this estimator in practice we would have to also estimate its associated standard error, and generate confidence intervals and so forth using this standard error estimate. To understand if this would work in practice, we would need to evaluate not only the behavior of the estimator itself, but the behavior of these associated things. In other words, we generally not only want to know whether our point estimator is doing a good job, but we usually want to know whether we are able to get a good standard error for that point estimator as well.

To do this we first compare the expected value of $$\widehat{SE}$$ (estimated with the average $$\widehat{SE}$$ across our simulation trials) to the actual $$SE$$. This tells us whether our uncertainty estimates are biased. We could also examine the standard deviation of $$\widehat{SE}$$ across trials, which tells us whether our estimates of uncertainty are relatively stable. We finally could examine whether there is correlation between $$\widehat{SE}$$ and the actual error (e.g., $$\left|T - \theta \right|$$). Good estimates of uncertainty should predict error in a given context (especially if calculating conditional estimates); see Sundberg (2003).

For the first assessment, we usually assess the quality of a standard error estimator with a relative performance criteria, rather than an absolute one, meaning we compare the estimated standard error to the true standard error as a ratio.

For an example, suppose that in our simulation we are examining the performance of a point-estimator $$T$$ for a parameter $$\theta$$ along with an estimator $$\widehat{SE}$$ for the standard error of $$T$$. In this case, we likely do not know the true standard error of $$T$$, for our simulation context, prior to the simulation. However, we can use the variance of $$T$$ across the replications ($$S_T^2$$) to directly estimate the true sampling variance $$\text{Var}(T) = SE^2(T)$$. The relative bias of $$\widehat{SE}^2$$ would then be estimated by $$RB = \bar{V} / S_T^2$$, where $$\bar{V}$$ is the average of $$\widehat{SE}^2$$ across simulation runs. Note that a value of 1 for relative bias corresponds to exact unbiasedness of the variance estimator. The relative bias measure is a measure of proportionate under- or over-estimation. For example, a relative bias of 1.12 would mean the standard error was, on average, 12% too large. We discuss relative performance measures further in Section 9.7.1.

### 9.4.1 Why Not Assess the Estimated SE directly?

We typically see assessment of $$\widehat{SE}^2$$, not $$\widehat{SE}$$. In other words, we typically work with assessing whether the variance estimator is unbiased, etc., rather than the standard error estimator. This comes out of a few reasons. First, in practice, so-called unbiased standard errors usually are not in fact actually unbiased (see the delightfully titled section 11.5, “The Joke Is on Us: The Standard Deviation Estimator is Biased after All,” in Westfall and Henning (2013) for further discussion). For linear regression, for example, the classic standard error estimator is an unbiased variance estimator, meaning that we have a small amount of bias due to the square-rooting because:

$E[ \sqrt{ V } ] \neq \sqrt{ E[ V ] } .$

Variance is also the component that gives us the classic bias-variance breakdown of $$MSE = Variance + Bias^2$$, so if we are trying to assign whether an overall MSE is due to instability or systematic bias, operating in this squared space may be preferable.

That being said, to put things in terms of performance criteria humans understand, it is usually nicer to put final evaluation metrics back into standard error units. For example, saying there is a 10% reduction in the standard error is more meaningful (even if less impressive sounding) than saying there is a 19% reduction in the variance.

### 9.4.2 Assessing SEs for Our Cluster RCT Simulation

To assess whether our estimated SEs are about right, we can look at the average estimated (squared) standard error and compare it to the true standard error. Our standard errors are inflated if they are systematically larger than they should be, across the simulation runs. We can also look at how stable our standard error estimates are, by taking the standard deviation of our standard error estimates. We interpret this quantity relative to the actual standard error to get how far off, as a percent of the actual standard error, we tend to be.

runs %>%  group_by( method ) %>%
summarise(
SE = sd( ATE_hat ),
mean_SEhat = sqrt( mean( SE_hat^2 ) ),
infl = 100 * mean_SEhat / SE,
sd_SEhat = sd( SE_hat ),
stability = 100 * sd_SEhat / SE )
## # A tibble: 3 × 6
##   method    SE mean_SEhat  infl sd_SEhat stability
##   <chr>  <dbl>      <dbl> <dbl>    <dbl>     <dbl>
## 1 Agg    0.168      0.174  104.   0.0232      13.8
## 2 LR     0.183      0.185  101.   0.0309      16.8
## 3 MLM    0.168      0.174  104.   0.0232      13.8

The SEs for Agg and MLM appear to be a bit conservative on average. (3 or 4 percentage points too big).

The last column (stability) shows how variable the standard error estimates are relative to the true standard error. 50% would mean the standard error estimates can easily be off by 50% of the truth, which would not be particularly good. Here we see the linear regression is more unstable than the other methods (cluster-robust standard errors are generally known to be a bit unstable, so this is not too surprising). It is a bad day for linear regression.

## 9.5 Assessing an Inferential Procedure (Hypothesis Testing)

When hypothesis tests are used in practice, the researcher specifies a null (e.g., no treatment effect), collects data, and generates a $$p$$-value, which is a measure of how extreme the observed data are from what we would expect to naturally occur, if the null were true. When we assess a method for hypothesis testing, we are therefore typically concerned with two aspects: validity and power.

### 9.5.1 Validity

Validity revolves around whether we erroneously reject a true null more than we should. Put another way, we say an inference method is valid if it has no more than an $$\alpha$$ chance of rejecting the null, when it is true, when we are testing at the $$\alpha$$ level. This means if we used this method 1000 times, where the null was true for all of those 1000 times, we should not see more than about $$1000 \alpha$$ rejections (so, 50, if we were using the classic $$\alpha = 0.05$$ rule).

To assess validity we would therefore specify a data generating process where the null is in fact true. We then, for a series of such data sets with a true null, conduct our inferential processes on the data, record the $$p$$-value, and score whether we reject the null hypothesis or not.

We might then test our methods by exploring more extreme data generation processes, where the null is true but other aspects of the data (such as outliers or heavy skew) make estimation difficult. This allows us to understand if our methods are robust to strange data patterns in finite sample contexts.

The key concept for validity is that the date we generate, no matter how we do it, must be data with a true null. The check is always then to see if we reject the null more than we should.

### 9.5.2 Power

Power is, loosely speaking, how often we notice an effect when one is there. Power is a much more nebulous concept than validity, because some effects (e.g. large effects) are clearly easier to notice than others. If we are comparing estimators to each other, the overall chance of noticing is less of a concern, because we are typically interested in relative performance. That being said, in order to generate data for a power evaluation, we have to generate data where there is something to detect. In other words, we need to commit to what the alternative is, and this can be a tricky business.

Typically, we think of power as a function of sample size or effect size. Therefore, we will typically examine a sequence of scenarios with steadily increasing sample size or effect size, estimating the power for each scenario in the sequence.

We then, for each sample in our series, estimate the power by the same process as for validity, above. When assessing validity, we want rejection rates to be low, below $$\alpha$$, and when assessing power we want them to be as high as possible. But the simulation process itself, other than the data generating process, is exactly the same.

### 9.5.3 The Rejection Rate

To put some technical terms to this framing, for both validity and power assessment the main performance criterion is the rejection rate of the hypothesis test. When the data are simulated from a model in which the null hypothesis being tested is true, then the rejection rate is equivalent to the Type-I error rate of the test. When the data are simulated from a model in which the null hypothesis is false, then the rejection rate is equivalent to the power of the test (for the given alternate hypothesis represented by the DGP). Ideally, a testing procedure should have actual Type-I error equal to the nominal level $$\alpha$$ (this is the definition of validity), but such exact tests are rare.

There are some different perspectives on how close the actual Type-I error rate should be in order to qualify as suitable for use in practice. Following a strict statistical definition, a hypothesis testing procedure is said to be level-$$\alpha$$ if its actual Type-I error rate is always less than or equal to $$\alpha$$. Among a set of level-$$\alpha$$ tests, the test with highest power would be preferred. If looking only at null rejection rates, then the test with Type-I error closest to $$\alpha$$ would usually be preferred. A less stringent criteria is sometimes used instead, where type I error would be considered acceptable if it is within 50% of the desired $$\alpha$$.

Often, it is of interest to evaluate the performance of the test at several different $$\alpha$$ levels. A convenient way to calculate a set of different rejection rates is to record the simulated $$p$$-values and then calculate from those. To illustrate, suppose that $$P_r$$ is the $$p$$-value from simulation replication $$k$$, for $$k = 1,...,R$$. Then the rejection rate for a level-$$\alpha$$ test is defined as $$\rho_\alpha = \text{Pr}\left(P_r < \alpha\right)$$ and estimated as, using the recorded $$p$$-values, $r_\alpha = \frac{1}{R} \sum_{r=1}^R I(P_r < \alpha).$

For a null DGP, one can also plot the emperical cumulative density function of the $$p$$-values; a valid test should give a $$45^\circ$$ line as the $$p$$-values should be standard uniform in distribution.

### 9.5.4 Inference in our Cluster RCT Simulation

For our scenario, we generated data with an actual treatment effect. Without further simulation, we therefore could only assess power, not validity. This is easily solved! We simply rerun our simulation code that we made last chapter with simhelpers, but with setting ATE = 0.

set.seed( 404044 )
runs_val <- sim_function( R, n_bar = 30, J = 20, gamma_1 = 0 )
saveRDS( runs_val, file = "results/cluster_RCT_simulation_validity.rds" )

Assessing power and validity is exactly the same calculation: we see how often we have a $$p$$-value less than 0.05. For power we have:

runs %>% group_by( method ) %>%
summarise( power = mean( p_value <= 0.05 ) )
## # A tibble: 3 × 2
##   method power
##   <chr>  <dbl>
## 1 Agg    0.376
## 2 LR     0.503
## 3 MLM    0.383

For validity:

runs_val %>% group_by( method ) %>%
summarise( power = mean( p_value <= 0.05 ) )
## # A tibble: 3 × 2
##   method power
##   <chr>  <dbl>
## 1 Agg    0.051
## 2 LR     0.059
## 3 MLM    0.048

The power when there is an effect (for this specific scenario) is not particularly high, and the validity is around 0.05, as desired.

Linear regression has notabily higher power… but this may be in part due to the invalidity of the test (note the rejection rate is around 6%, rather than the target of 5%). The elevated power is also likely due to the upward bias in estimation. As discussed above, LR is targeting the person-average impact which, in this case, is not 0 even under our null because we have kept our impact heterogeniety parameter to its default of $$\gamma_2=0.2$$, meaning we have treatment variation around 0. We could run our simulation with truly null effects to see if the false rejection rate goes down.

## 9.6 Assessing Confidence Intervals

Some estimation procedures result in confidence intervals (or sets) which are ranges of values that should contain the true answer with some specified degree of confidence. For example, a normal-based confidence interval is a combination of an estimator and its estimated uncertainty.

We typically score a confidence interval along two dimensions, coverage rate and average length. To calculate coverage rate, we score whether each interval “captured” the true parameter. A success is if the true parameter is inside the interval. To calculate average length, we record each confidence interval’s length, and then average across simulation runs. We say an estimator has good properties if it has good coverage, i.e. it is capturing the true value at least $$1-\alpha$$ of the time, and if it is generally short (i.e., the average length of the interval is less than the average length for other methods).

Confidence interval coverage is simultaneously evaluating the estimators in terms of how well they estimate (precision) and their inferential properties. We have combined inference and estimation.

Suppose that the confidence intervals are for the target parameter $$\theta$$ and have coverage level $$\beta$$. Let $$A_r$$ and $$B_r$$ denote the lower and upper end-points of the confidence interval from simulation replication $$r$$, and let $$W_r = B_r - A_r$$, all for $$r = 1,...,R$$. The coverage rate $$\omega_\beta$$ and average length $$\text{E}(W)$$ criteria are then as defined in the table below.

Criterion Definition Estimate
Coverage $$\omega_\beta = \text{Pr}(A \leq \theta \leq B)$$ $$\frac{1}{R}\sum_{r=1}^R I(A_r \leq \theta \leq B_r)$$
Expected length $$\text{E}(W) = \text{E}(B - A)$$ $$\bar{W} = \bar{B} - \bar{A}$$

Just as with hypothesis testing, a strict statistical interpretation would deem a hypothesis testing procedure acceptable if it has actual coverage rate greater than or equal to $$\beta$$. If multiple tests satisfy this criterion, then the test with the lowest expected length would be preferable. Some analysts prefer to look at lower and upper coverage separately, where lower coverage is $$\text{Pr}(A \leq \theta)$$ and upper coverage is $$\text{Pr}(\theta \leq B)$$.

### 9.6.1 Confidence Intervals in our Cluster RCT Example

For our CRT simulation, we first have to calculate confidence intervals, and then assess coverage. We could have used methods such as confint() in the estimation approaches; this would be preferred if we wanted more accurately calculated confidence intervals that used $$t$$-distributions and so forth to account for the moderate number of clusters.

But if we want to use normal assumption confidence intervals we can calculate them post-hoc:

runs %>% mutate( CI_l = ATE_hat - 1.96*SE_hat,
CI_h = ATE_hat + 1.96*SE_hat,
covered = CI_l <= ATE & ATE <= CI_h,
width = CI_h - CI_l ) %>%
group_by( method ) %>%
summarise( coverage = mean( covered ),
width = mean( width ))
## # A tibble: 3 × 3
##   method coverage width
##   <chr>     <dbl> <dbl>
## 1 Agg       0.942 0.677
## 2 LR        0.908 0.717
## 3 MLM       0.943 0.677

Our coverage is about right for Agg and MLM, and around 5 percentage points too low for LR. Linear regression is taking a hit from the bias term. The CIs of LR are a bit wider than the other methods due to the estimated SEs being slightly larger.

## 9.7 Additional Thoughts on Measuring Performance

In this section we provide some additional thoughts on performance measures. We first discuss relative vs. absolute criteria some more, then touch on robust measures of performance. We finally summarize the measures we discuss in this chapter.

### 9.7.1 Selecting Relative vs. Absolute Criteria

We have primarily examined performance estimators for point estimators using absolute criteria, focusing on measures like bias directly on the scale of the outcome. In contrast, for evaluation things such as estimated standard errors, which are always positive and scale-dependent, it often makes sense to use relative criteria, i.e., criteria calculated as proportions of the target parameter ($$T/\theta$$) rather than as differences ($$T - \theta$$). We typically apply absolute criteria to point estimators and relative criteria to standard error estimators (we are setting aside, for the moment, the relative criteria of a measure from one estimation procedure to another, as we saw earlier when we compared the SEs to a baseline SE of linear regression for the cluster randomized trial simulation. So how do we select when to use what?

As a first piece of guidance, establish whether we expect the performance (e.g., bias, standard error, or RMSE) of a point estimate to depend on the magnitude of the estimand. For example, if we are estimating some mean $$\theta$$, and we generate data where $$\theta = 100$$ vs where $$\theta = 1000$$ (or any other arbitrary number), we would not generally expect the value of $$\theta$$ to change the magnitude of bias, variance, or MSE. On the other hand, these different $$\theta$$s would have a large impact on the relative bias and relative MSE. (Want smaller relative bias? Just add a million to the parameter!) For these sorts of “location parameters” we generally use absolute measures of performance.

That being said, a more principled approach for determining whether to use absolute or relative performance criteria depends on assessing performance for multiple values of the parameter. In many simulation studies, replications are generated and performance criteria are calculated for several different values of a parameter, say $$\theta = \theta_1,...,\theta_p$$. Let’s focus on bias for now, and say that we’ve estimated (from a large number of replications) the bias at each parameter value. We present two hypothetical scenarios, A and B, in the figures below.

If the absolute bias is roughly the same for all values of $$\theta$$ (as in Scenario A), then it makes sense to report absolute bias as the summary performance criterion. On the other hand, if the bias grows roughly in proportion to $$\theta$$ (as in Scenario B), then relative bias might be a better summary criterion.

Performance relative to a baseline estimator.

Another relative measure, as we saw earlier, is to calculate performance relative to some baseline. For example, if one of the estimators is the “generic method,” we could calculate ratios of the RMSE of our estimators to the baseline RMSE. This can provide a way of standardizing across simulation scenarios where the overall scale of the RMSE changes radically. This could be critical to, for example, examining trends across simulations that have different sample sizes, where we would expect all estimators’ performance measures to improve as sample size grows. This kind of relative standardization allows us to make statements such as “Aggregation has standard errors around 8% smaller than linear regression”–which is very interpretable, more interpretable than saying “Aggregation has standard errors around 0.01 smaller than linear regression.” In the latter case, we do not know if that is big or small.

While a powerful tool, standardization is not without risks: if you scale relative to something, then higher or lower ratios can either be due to the primary method of interest (the numerator) or due to the behavior of the reference method in the denominator. These relative ratios can end up being confusing to interpret due to this tension.

They can also break when everything is on a constrained scale, like power. If we have a power of 0.05, and we improve it to 0.10, we have doubled our power, but if it is 0.10 and we increase to 0.15, we have only increased by 50%. Ratios when near zero can be very deceiving.

### 9.7.2 Robust Measures of Performance

Depending on the model and estimation procedures being examined, a range of different criteria might be used to assess estimator performance. For point estimation, we have seen bias, variance and MSE as the three core measures of performance. Other criteria exist, such as the median bias and the median absolute deviation of $$T$$, where we use the median $$\tilde{T}$$ of our estimates rather than the mean $$\bar{T}$$.

The usual bias, variance and MSE measures can be sensitive to outliers. If an estimator generally does well, except for an occasional large mistake, these classic measures can return very poor overall performance. Instead, we might turn to quantities such as the median bias (sort all the estimation errors across the simulation scenarios, and take the middle), or the Median Absolute Distance (MAD, where you take the median of the absolute values of the errors, which is an alternative to RMSE) as a measure of performance.

Other robust measures are also possible, such as simply truncating all errors to a maximum size (this is called Windsorizing). This is a way of saying “I don’t care if you are off by 1000, I am only going to count it as 10.”

### 9.7.3 Summary of Peformance Measures

We list most of the performance criteria we saw in this chapter in the table below, for reference:

Criterion Definition Estimate
Bias $$\text{E}(T) - \theta$$ $$\bar{T} - \theta$$
Median bias $$\text{M}(T) - \theta$$ $$\tilde{T} - \theta$$
Variance $$\text{E}\left[\left(T - \text{E}(T)\right)^2\right]$$ $$S_T^2$$
MSE $$\text{E}\left[\left(T - \theta\right)^2\right]$$ $$\left(\bar{T} - \theta\right)^2 + S_T^2$$
MAD $$\text{M}\left[\left|T - \theta\right|\right]$$ $$\left[\left|T - \theta\right|\right]_{R/2}$$
Relative bias $$\text{E}(T) / \theta$$ $$\bar{T} / \theta$$
Relative median bias $$\text{M}(T) / \theta$$ $$\tilde{T} / \theta$$
Relative MSE $$\text{E}\left[\left(T - \theta\right)^2\right] / \theta^2$$ $$\frac{\left(\bar{T} - \theta\right)^2 + S_T^2}{\theta^2}$$
• Bias and median bias are measures of whether the estimator is systematically higher or lower than the target parameter.
• Variance is a measure of the precision of the estimator—that is, how far it deviates from its average. We might look at the square root of this, to assess the precision in the units of the original measure. This is the true SE of the estimator.
• Mean-squared error is a measure of overall accuracy, i.e. is a measure how far we typically are from the truth. We more frequently use the root mean-squared error, or RMSE, which is just the square root of the MSE.
• The median absolute deviation (MAD) is another measure of overall accuracy that is less sensitive to outlier estimates. The RMSE can be driven up by a single bad egg. The MAD is less sensitive to this.

## 9.8 Uncertainty in Performance Estimates (the MCSE)

Our performance criteria are defined as average performance across an infinite number of trials. Of course, in our simulations we only run a finite number of trials, and estimate the performance criteria with the sample of trials we generate. For example, if we are assessing coverage across 100 trials, we can calculate what fraction rejected the null for that 100. This is an estimate of the true coverage rate. Due to random chance, we might see a higher, or lower, proportion rejected than what we would see if we ran the simulation forever.

To account for estimation uncertainty we want associated uncertainty estimates to go with our point estimates of performance. We want to, in other words, treat our simulation results as a dataset in its own right. (And yes, this is quite meta!)

Once we frame the problem in these terms, it is relatively straightforward to calculate standard errors for most of the performance critera because we have an independent and identically distributed set of measurements. We call these standard errors Monte Carlo Simulation Errors, or MCSEs. For some of the performance criteria we have to be a bit more clever, as we will discuss below.

We list MCSE expressions for many of our straightforward performance measures on the following table. In reading the table, recall that, for an estimator $$T$$, we have $$S_T$$ being the standard deviation of $$T$$ across our simulation runs (i.e., our estimated true Standard Error). We also have

• Sample skewness (standardized): $$\displaystyle{g_T = \frac{1}{R S_T^3}\sum_{r=1}^R \left(T_r - \bar{T}\right)^3}$$
• Sample kurtosis (standardized): $$\displaystyle{k_T = \frac{1}{R S_T^4} \sum_{r=1}^R \left(T_r - \bar{T}\right)^4}$$
Criterion for T MCSE
Bias ($$T-\theta$$) $$\sqrt{S_T^2/ R}$$
Variance ($$S_T^2$$) $$\displaystyle{S_T^2 \sqrt{\frac{k_T - 1}{R}}}$$
MSE see below
Power & Validity ($$r_\alpha$$) $$\sqrt{ r_\alpha \left(1 - r_\alpha\right) / R}$$
Coverage ($$\omega_\beta$$) $$\sqrt{\omega_\beta \left(1 - \omega_\beta\right) / R}$$
Average length ($$\text{E}(W)$$) $$\sqrt{S_W^2 / R}$$

The MCSE for the MSE is a bit more complicated, and does not quite fit on our table: $\widehat{MCSE}( \widehat{MSE} ) = \displaystyle{\sqrt{\frac{1}{R}\left[S_T^4 (k_T - 1) + 4 S_T^3 g_T\left(\bar{T} - \theta\right) + 4 S_T^2 \left(\bar{T} - \theta\right)^2\right]}} .$

For relative quantities with respect to an estimand, simply divide the criterion by the target estimand. E.g., for relative bias $$T / \theta$$, the standard error would be $SE\left( \frac{T}{\theta} \right) = \frac{1}{\theta} SE(T) = \sqrt{\frac{S_T^2}{R\theta^2}} .$

For square rooted quantities, such as the SE for the true SE (square root of the Variance) or the RMSE (square root of MSE) we can use the Delta method. The Delta method says (with some conditions), that if we assume $$X \sim N( \phi, V )$$, then we can approximate the distribution of $$g(X)$$ for some continuous function $$g(\cdot)$$ as $g(X) \sim N\left( g(\phi), \;\; g'(\phi)^2\cdot V \right) ,$ where $$g'(\phi)$$ is the derivative of $$g(\cdot)$$ evaluated at $$\phi$$. In other words, $SE( g(\hat{X}) ) \approx g'(\theta) \times SE(\hat{X}) .$ For estimation, we plug in $$\hat{\theta}$$ and our estimate of $$SE(\hat{X})$$ into the above. Back to the square root, we have $$g(x) = \sqrt(x)$$ and $$g'(x) = 1/2\sqrt(x)$$. This gives, for example, the estimated MCSE of the SE as $\widehat{SE}( \widehat{SE} ) = \widehat{SE}( S^2_T ) = \frac{1}{2S^2_T} \widehat{SE}( S^2_T ) = \frac{1}{2S^2_T} S_T^2 \sqrt{\frac{k_T - 1}{R}} = \frac{1}{2} \sqrt{\frac{k_T - 1}{R}} .$

### 9.8.1 MCSE for Relative Variance Estimators

Estimating the MCSE of the relative bias or relative MSE of a (squared) standard error estimator, i.e., of $$E( \widehat{SE^2} - SE^2 ) / SE^2 )$$ or $$\widehat{MSE} / MSE$$, is complicated by the appearance of an estimated quantity, $$SE^2$$ or $$MSE$$, in the denominator of the ratio. This renders the simple division approach from above unusable, technically speaking. The problem is we cannot use our clean expressions for MCSEs of relative performance measures since we are not taking the uncertainty of our denominator into account.

To properly assess the overall MCSE, we need to do something else. One approach is to use the jackknife technique. Let $$\bar{V}_{(j)}$$ and $$S_{T(j)}^2$$ be the average squared standard error estimate and the true variance estimate calculated from the set of replicates that excludes replicate $$j$$, for $$j = 1,...,R$$. The relative bias estimate, excluding replicate $$j$$ would then be $$\bar{V}_{(j)} / S_{T(j)}^2$$. Calculating all $$R$$ versions of this relative bias estimate and taking the variance of these $$R$$ versions yields the jackknife variance estimator:

$MCSE\left( \frac{ \widehat{SE}^2 }{SE^2} \right) = \frac{1}{R} \sum_{j=1}^R \left(\frac{\bar{V}_{(j)}}{S_{T(j)}^2} - \frac{\bar{V}}{S_T^2}\right)^2.$

This would be quite time-consuming to compute if we did it by brute force. However, a few algebra tricks provide a much quicker way. The tricks come from observing that

\begin{aligned} \bar{V}_{(j)} &= \frac{1}{R - 1}\left(R \bar{V} - V_j\right) \\ S_{T(j)}^2 &= \frac{1}{R - 2} \left[(R - 1) S_T^2 - \frac{R}{R - 1}\left(T_j - \bar{T}\right)^2\right] \end{aligned} These formulas can be used to avoid re-computing the mean and sample variance from every subsample. Instead, you calculate the overall mean and overall variance, and then do a small adjustment with each jackknife iteration. You can even implement this with vector processing in R!

### 9.8.2 Calculating MCSEs With the simhelpers Package

The simhelper package is designed to calculate MCSEs (and the performance metrics themselves) for you. It is easy to use: take this set of simulation runs on the Welch dataset:

library( simhelpers )
data( welch_res )
welch <- welch_res %>%
filter( method == "t-test" ) %>%
dplyr::select( -method, -seed, -iterations )

welch
## # A tibble: 8,000 × 8
##       n1    n2 mean_diff      est    var p_val
##    <dbl> <dbl>     <dbl>    <dbl>  <dbl> <dbl>
##  1    50    50         0  0.0258  0.0954 0.934
##  2    50    50         0  0.00516 0.0848 0.986
##  3    50    50         0 -0.0798  0.0818 0.781
##  4    50    50         0 -0.0589  0.102  0.854
##  5    50    50         0  0.0251  0.118  0.942
##  6    50    50         0 -0.115   0.106  0.725
##  7    50    50         0  0.157   0.115  0.645
##  8    50    50         0 -0.213   0.121  0.543
##  9    50    50         0  0.509   0.117  0.139
## 10    50    50         0 -0.354   0.0774 0.206
## # ℹ 7,990 more rows
## # ℹ 2 more variables: lower_bound <dbl>,
## #   upper_bound <dbl>

We can calculate performance metrics across all the range of scenarios. Here is the rejection rate:

welch_sub = filter( welch, n1 == 50, n2 == 50, mean_diff==0 )
calc_rejection(welch_sub, p_val)
##   K_rejection rej_rate rej_rate_mcse
## 1        1000    0.048   0.006759882

And coverage:

calc_coverage(welch_sub, lower_bound, upper_bound, mean_diff)
## # A tibble: 1 × 5
##   K_coverage coverage coverage_mcse width
##        <int>    <dbl>         <dbl> <dbl>
## 1       1000    0.952       0.00676  1.25
## # ℹ 1 more variable: width_mcse <dbl>

Using tidyverse it is easy to process across scenarios (more on experimental design and multiple scenarios later):

welch %>% group_by(n1,n2,mean_diff) %>%
summarise( calc_rejection( p_values = p_val ) )
## # A tibble: 8 × 6
## # Groups:   n1, n2 [2]
##      n1    n2 mean_diff K_rejection rej_rate
##   <dbl> <dbl>     <dbl>       <int>    <dbl>
## 1    50    50       0          1000    0.048
## 2    50    50       0.5        1000    0.34
## 3    50    50       1          1000    0.876
## 4    50    50       2          1000    1
## 5    50    70       0          1000    0.027
## 6    50    70       0.5        1000    0.341
## 7    50    70       1          1000    0.904
## 8    50    70       2          1000    1
## # ℹ 1 more variable: rej_rate_mcse <dbl>

### 9.8.3 MCSE Calculation in our Cluster RCT Example

We can check our MCSEs for our performance measures to see if we have enough simulation trials to give us precise enough estimates to believe the differences we reported earlier. In particular, we have:

library( simhelpers )
runs\$ATE = ATE
runs %>%
summarise( calc_absolute( estimates = ATE_hat,
true_param = ATE,
criteria = c("bias","stddev", "rmse")) ) %>%
dplyr::select( -K_absolute ) %>%
knitr::kable(digits=3)
bias bias_mcse stddev stddev_mcse rmse rmse_mcse
0.034 0.003 0.178 0.002 0.181 0.003

We see the MCSEs are quite small relative to the linear regression bias term and all the SEs (stddev) and RMSEs: we have simulated enough runs to see the gross trends identified. We have not simulated enough to for sure know if MLM and Agg are not slightly biased. Given our MCSEs, they could have true bias of around 0.01 (two MCSEs).

## 9.9 Exercises

1. Continuing the exercises from the prior chapters, estimate rejection rates of the BFF* test for the parameter values in the fifth line of Table 1 of Brown and Forsythe (1974).

2. Implement the jackknife as described above in code. Check your answers against the simhelpers package for the built-in t_res dataset:

library( simhelpers )
calc_relative(data = t_res, estimates = est, true_param = true_param)
## # A tibble: 1 × 7
##   K_relative rel_bias rel_bias_mcse rel_mse
##        <int>    <dbl>         <dbl>   <dbl>
## 1       1000     1.00        0.0128   0.163
## # ℹ 3 more variables: rel_mse_mcse <dbl>,
## #   rel_rmse <dbl>, rel_rmse_mcse <dbl>
1. As foreground to the following chapters, can you explore multiple scenarios for the cluster RCT example to see if the trends are common? First write a function that takes a parameter, runs the entire simulation, and returns the results as a small table. You pick which parameter, e.g., average treatment effect, alpha, or whatever you like), that you wish to vary. Here is a skeleton for the function:
my_simulation <- function( my_param ) {
# call the sim_function() simulation function from the end of last
# chapter, setting the parameter you want to vary to my_param

# Analyze the results, generating a table of performance metrics,
# e.g., bias or coverage. Make sure your analysis is a data frame,
# like we saw earlier this chapter.

# Return results
}

Then use code like the following to generate a set of results measured as a function of a varying parameter:

vals = seq( start, stop, length.out = 5 )
res = map_df( vals, my_simulation ) 

The above code will give you a data frame of results, one column for each performance measure. Finally, you can use this table and plot the performance measure as a function of the varying parameter.

### References

Sundberg, Rolf. 2003. Conditional statistical inference and quantification of relevance.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65 (1): 299–315.
Westfall, Peter H, and Kevin SS Henning. 2013. Understanding Advanced Statistical Methods. Vol. 543. CRC Press Boca Raton, FL.