Chapter 7 Estimation procedures
We do simulation studies to understand how to analyze data. Thus, the central object of study is a data analysis procedure, that is, a set of steps or calculations carried out on a dataset. We want to know how well a procedure would work when applied in practice, and the data generating processes we described in Chapter 6 provide a stand-in for real data. To revisit our consumer product testing analogy from Chapter 1, the data analysis procedure is the product, and the data generating process is the set of trials to which we subject the product.
Depending on the research question, a data-analysis procedure might be very simple—as simple as just computing a sample correlation—or it might involve something more complex, such as fitting a multilevel model and generating a confidence interval for a specific coefficient. A data analysis procedure might even involve a combination of several components. For example, the procedure might entail first running a variable screening procedure and then fitting a random forest on the selected predictor variables. As another example, a data-analysis procedure might involve using multiple imputation for missingness on key variables, then fitting a statistical model, and then generating predicted values based on the model. For the sake of brevity, we will use the term estimation procedure or just estimator to encompass all of these procedures, even ones that involve multiple steps or components.
In this chapter, we demonstrate how to implement an estimator in the form of an R function, which we call an estimation function, so that its performance can eventually be evaluated by repeatedly applying it to artificial data. We start by describing the high-level design of an estimation function and illustrate with some simple examples in Section 7.1.
Depending on the research question, we will often be interested in comparing several competing estimators. In this case, we will create several functions that implement the estimators that we plan to compare. The easiest approach here is to implement each estimator in turn, and then bundle the collection in a final overall function. We describe how to do this in Section 7.2.
In Section 7.3, we describe strategies for validating the coded-up estimator before running a full simulation. We offer a few strategies for validating one’s code, including checking against existing implementations, checking theoretical properties, and using simulations to check for bugs.
For a full simulation, an estimator needs to be reliable: it should run without crashing and return answers across the wide range of datasets to which it is applied. An estimator will need to be able to handle odd edge cases or pathological datasets that might be generated as we explore a full range of simulation scenarios. To allow for this, Section 7.4 demonstrates several methods for handling common computational problems, such as errors or warnings.
7.1 Writing estimation functions
In the abstract, a function that implements an estimation procedure should have the following form:
estimator <- function(data) {
# calculations/model-fitting/estimation procedures
return(estimates)
}
An estimator function should take a dataset as input, fit a model or otherwise calculate an estimate, possibly with associated standard errors and so forth, and return these quantities as output. It can return point estimates of parameters, standard errors, confidence intervals, \(p\)-values, predictions, or other quantities. The calculations in the body of the function should be set up to use datasets that have the same structure (i.e., same dimensions, same variable names) as the output of the corresponding function for generating simulated data. However, in principle, we should also be able to run the estimation function on real data.
As a first example, suppose we want to evaluate a method for generating a confidence interval for Pearson’s sample correlation coefficient when faced with non-normal data.
For bivariate normal data, statistical theory provides some very useful facts.
In particular, it tells us that applying Fisher's \(z\)-transformation to Pearson's correlation coefficient, which is equivalent to applying the inverse hyperbolic tangent function (atanh() in R), produces a statistic that is very close to normally distributed.
It also tells us that the standard error of the \(z\)-transformed correlation coefficient is approximately \(1 / \sqrt{N - 3}\), and thus does not depend on the correlation parameter.
This makes \(z\)-transformation very useful for computing confidence intervals, which can then be back-transformed to the Pearson-\(r\) scale.
However, these results are specific to bivariate normal distributions.
Would this transformation work well in the face of non-normal data, such as the bivariate Poisson distribution we coded in Chapter 6?
For studying this question with simulation, a simple estimation function would take a dataset with two variables as input and compute the sample correlation and its \(z\)-transformation, compute confidence intervals for \(z\), and then back-transform the confidence interval end-points. Here is an implementation of this sequence of calculations:
r_and_z <- function(data) {
r <- cor(data$C1, data$C2)
z <- atanh(r)
se_z <- 1 / sqrt(nrow(data) - 3)
ci_z <- z + c(-1, 1) * qnorm(.975) * se_z
ci_r <- tanh(ci_z)
data.frame( r = r, z = z, CI_lo = ci_r[1], CI_hi = ci_r[2] )
}
To check that the function returns a result of the expected form, we generate a small dataset using the r_bivariate_Poisson() function developed in the last chapter, then apply our estimation function to the result:
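For example, a call along these lines does the job (a sketch: the sample size and parameter values here are illustrative, and the printed estimates depend on the random seed):
# Generate a small bivariate Poisson dataset and apply the estimator to it
Pois_dat <- r_bivariate_Poisson( 40, rho = 0.6, mu1 = 14, mu2 = 8 )
r_and_z( Pois_dat )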
## r z CI_lo CI_hi
## 1 0.6371712 0.753397 0.4063077 0.7915666
Although it is a little cumbersome to do so, we could also apply the estimation function to a real dataset.
Here is an example, which calculates the correlation between ratings of judicial integrity and familiarity with the law from the USJudgeRatings dataset (which is included in base R).
For the function to work on this dataset, we first need to rename the relevant variables because our function takes a data.frame with two columns named C1 and C2:
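A sketch of this step, assuming the INTG and FAMI columns of USJudgeRatings hold the integrity and familiarity-with-the-law ratings:
# Rename the two ratings to match the column names expected by r_and_z()
judge_dat <- data.frame( C1 = USJudgeRatings$INTG, C2 = USJudgeRatings$FAMI )
r_and_z( judge_dat )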
## r z CI_lo CI_hi
## 1 0.868858 1.328401 0.7692563 0.9272343
The function returns a valid result—a quite strong correlation in this case!
It is good practice to test out a newly-developed estimation function on real data as a check that it is working as intended. Testing on real data helps ensure that the estimation function is not using information outside of the dataset, such as known parameter values, to construct an estimate. Applying the function to a real dataset demonstrates that the function implements a procedure that could actually be applied in real data analysis contexts.
7.2 Including Multiple Data Analysis Procedures
Many simulations involve head-to-head comparisons of two or more data-analysis procedures. As a design principle, we generally recommend writing different functions for each estimation method one is planning on evaluating. Doing so makes it easier to add in additional methods as desired or to focus on just a subset of methods. Writing separate functions also leads to a code base that is flexible and useful for other purposes (such as analyzing real data). Finally (repeating one of our favorite mantras), separating functions makes debugging easier because it lets you focus attention on one thing at a time, without worrying about how errors in one area might propagate to others.
To see how this works in practice, we return to the case study from Section 6.6, where we developed a data-generating function for simulating a cluster-randomized trial with student-level outcomes but school-level treatment assignment. Our data-generating process allowed for varying school sizes and heterogeneous treatment effects that are correlated with school size. Several different procedures might be used to estimate an overall average effect from a clustered experiment. We will consider three different procedures:
- Fitting a multi-level regression model (also known as a hierarchical linear model),
- Fitting an ordinary least squares (OLS) regression model and applying cluster-robust standard errors, or
- Averaging the outcomes by school, then fitting a linear regression model on the mean outcomes.
All three of these methods are widely used and have some theoretical guarantees supporting their use. Education researchers tend to be more comfortable using multi-level regression models, whereas economists tend to use OLS with clustered standard errors.
We next develop estimation functions for each of these procedures, focusing on a simple model that does not include any covariates besides the treatment indicator.
Each function needs to produce a point estimate, standard error, and \(p\)-value for the average treatment effect.
To have data to practice on, we generate a sample dataset using a revised version of gen_cluster_RCT(), which corrects the bug discussed in Exercise 6.9.8:
dat <- gen_cluster_RCT(
J=16, n_bar = 30, alpha = 0.8, p = 0.5,
gamma_0 = 0, gamma_1 = 0.2, gamma_2 = 0.2,
sigma2_u = 0.4, sigma2_e = 0.6
)
For the multi-level modeling strategy, there are several different existing packages that we could use.
We will implement an estimator using the popular lme4 package, calling it via the lmerTest package so that we also get a \(p\)-value for the average effect.
Here is a basic implementation:
analysis_MLM <- function( dat ) {
M1_test <- lmerTest::lmer( Yobs ~ 1 + Z + (1 | sid), data = dat )
M1_summary <- summary(M1_test)$coefficients
tibble(
ATE_hat = M1_summary["Z","Estimate"],
SE_hat = M1_summary["Z","Std. Error"],
p_value = M1_summary["Z", "Pr(>|t|)"]
)
}
The function fits a multi-level model with a fixed coefficient for the treatment indicator and random intercepts for each school. It outputs only the statistics in which we are interested.
Our function makes use of the lmerTest package.
Rather than assuming that this package will be loaded, we call the relevant function using the package name as a prefix: lmerTest::lmer().
This way, we can run the function even if we have not loaded the package in the global environment.
Referencing functions with their package prefix is also preferable to loading packages inside the function itself (e.g., with require(lmerTest)) because calling the function then does not change which packages are loaded in the global environment.
Here is a function implementing OLS regression with cluster-robust standard errors:
analysis_OLS <- function( dat, se_type = "CR2" ) {
M2 <- estimatr::lm_robust(
Yobs ~ 1 + Z, data = dat,
clusters = sid, se_type = se_type
)
tibble(
ATE_hat = M2$coefficients[["Z"]],
SE_hat = M2$std.error[["Z"]],
p_value = M2$p.value[["Z"]]
)
}
To get cluster-robust standard errors, we use the lm_robust() function from the estimatr package, again calling only the relevant function using the package prefix rather than loading the whole package.
A novel aspect of this estimation function is that it includes an additional input argument, se_type, which allows us to control the type of standard error calculated by lm_robust().
Adding this option would let us use the same function to compute (and compare) different types of clustered standard errors for the average treatment effect estimate.
We set a default option of "CR2", just like the default of lm_robust().
Sometimes an analytic procedure involves multiple steps.
For example, the aggregation estimator first involves collapsing the data to a school-level dataset, and then analyzing at the school level.
This is no problem:
we just wrap all the steps in a single estimation function.
Once written, all we need to do is call the function—no matter how complicated the process inside.
Here is the code for the aggregate-then-analyze approach:
analysis_agg <- function( dat, se_type = "HC2" ) {
datagg <- dplyr::summarise(
dat,
Ybar = mean( Yobs ),
n = n(),
.by = c(sid, Z)
)
stopifnot( nrow( datagg ) == length(unique(dat$sid) ) )
M3 <- estimatr::lm_robust(
Ybar ~ 1 + Z, data = datagg,
se_type = se_type
)
tibble(
ATE_hat = M3$coefficients[["Z"]],
SE_hat = M3$std.error[["Z"]],
p_value = M3$p.value[["Z"]]
)
}
Note the stopifnot command: it will throw an error if the condition is not true.
This stopifnot ensures we do not have both treatment and control students within a single school; if we did, we would have more aggregated rows than school IDs due to the grouping!
Putting assert statements in your code like this is a good way to guarantee you are not introducing weird and hard-to-track errors.
A stopifnot statement halts your code as soon as something goes wrong, rather than letting that initial error flow on to further work, potentially creating odd results that are hard to understand.
Here we are protecting ourselves from strange results if, for example, we messed up our DGP code to have treatment not nested within school, or if we were using data that did not actually come from a cluster randomized experiment.
See Section 19.4 for more.
All of our functions produce output in the same format:
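For example, applying each function in turn to the sample dataset generated above (the printed values depend on the generated data):
analysis_MLM( dat )
analysis_OLS( dat )
analysis_agg( dat )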
## # A tibble: 1 × 3
## ATE_hat SE_hat p_value
## <dbl> <dbl> <dbl>
## 1 -0.176 0.383 0.653
## # A tibble: 1 × 3
## ATE_hat SE_hat p_value
## <dbl> <dbl> <dbl>
## 1 -0.124 0.462 0.793
## # A tibble: 1 × 3
## ATE_hat SE_hat p_value
## <dbl> <dbl> <dbl>
## 1 -0.178 0.378 0.645
Ensuring that the output of all the functions is structured in the same way will make it easy to keep the results organized once we start running multiple iterations of the simulation. If each estimation method returns a dataset with the same variables, we can simply stack the results on top of each other. Here is a function that bundles all the estimation procedures together:
estimate_Tx_Fx <- function(
data,
CR_se_type = "CR2", agg_se_type = "HC2"
) {
dplyr::bind_rows(
MLM = analysis_MLM( data ),
OLS = analysis_OLS( data, se_type = CR_se_type),
agg = analysis_agg( data, se_type = agg_se_type),
.id = "estimator"
)
}
## # A tibble: 3 × 4
## estimator ATE_hat SE_hat p_value
## <chr> <dbl> <dbl> <dbl>
## 1 MLM -0.176 0.383 0.653
## 2 OLS -0.124 0.462 0.793
## 3 agg -0.178 0.378 0.645
This is a common coding pattern for simulations that involve multiple estimators. Each procedure is expressed in its own function, then these are assembled together in a single function so that they can all easily be applied to the same dataset. Stacking the results row-wise will make it easy to compute performance measures for all methods at once. The benefit of stacking will become even more evident once we are working across multiple replications of the simulation process, as we will in Chapter 8.
7.3 Validating an Estimation Function
Just as with data-generating functions, it is critical to verify that an estimation function is correctly implemented. If an estimation function involves a known procedure that has been implemented in R or one of its contributed packages, then a straightforward way to do this is to compare your implementation to another existing implementation. For estimation functions that involve multi-step procedures or novel methods, other approaches to verification may be needed.
7.3.1 Checking against existing implementations
In Chapter 5, we wrote a function called ANOVA_Welch_F() for computing \(p\)-values from two different procedures for testing equality of means in a heteroskedastic ANOVA:
ANOVA_Welch_F <- function(data) {
anova_F <- oneway.test(x ~ group, data = data, var.equal = TRUE)
Welch_F <- oneway.test(x ~ group, data = data, var.equal = FALSE)
result <- tibble(
ANOVA = anova_F$p.value,
Welch = Welch_F$p.value
)
return(result)
}
We can test this function by comparing its output against the built-in oneway.test function, as follows:
sim_data <- generate_ANOVA_data(
mu = c(1, 2, 5, 6),
sigma_sq = c(3, 2, 5, 1),
sample_size = c(3, 6, 2, 4)
)
aov_results <- oneway.test(x ~ factor(group), data = sim_data, var.equal = FALSE)
aov_results
##
## One-way analysis of means (not assuming
## equal variances)
##
## data: x and factor(group)
## F = 15.577, num df = 3.0000, denom df =
## 3.0993, p-value = 0.02275
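We can then compare the Welch \(p\)-value computed by our function against the oneway.test() result. A comparison along these lines (a sketch) should return TRUE:
all.equal( ANOVA_Welch_F(sim_data)$Welch, aov_results$p.value )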
## [1] TRUE
We use all.equal() because it checks equality up to a small numerical tolerance, which can avoid some perplexing failures due to floating-point rounding.
For the bivariate correlation example introduced in Section 7.1, we can check the output of r_and_z() against R’s built-in cor.test() function:
Pois_dat <- r_bivariate_Poisson(15, rho = 0.6, mu1 = 14, mu2 = 8)
my_result <- r_and_z(Pois_dat) |> subset(select = -z)
my_result
## r CI_lo CI_hi
## 1 0.7474359 0.3810838 0.9109217
R_result <- cor.test(~ C1 + C2, data = Pois_dat)
R_result <- tibble(r = R_result$estimate[["cor"]],
CI_lo = R_result$conf.int[1],
CI_hi = R_result$conf.int[2])
R_result
## # A tibble: 1 × 3
## r CI_lo CI_hi
## <dbl> <dbl> <dbl>
## 1 0.747 0.381 0.911
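A comparison of the two sets of results (a sketch; check.attributes = FALSE ignores the differing class and name attributes) then confirms that they agree:
all.equal( my_result, as.data.frame(R_result), check.attributes = FALSE )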
## [1] TRUE
This type of check is all the more useful here because r_and_z() uses our own implementation of the confidence interval calculations, rather than relying on R’s built-in functions as we did with ANOVA_Welch_F().
One might ask why you should bother implementing your own function if you already have a version implemented in R. In some cases, it might be possible to implement a faster version of a function because you can cut corners by skipping input data verification, providing fewer estimation options, or cutting out post-estimation processing. Any of these could help with the overall speed of your simulation. On the other hand, gains in computational efficiency might not be worth the human time it would take to implement the calculations from scratch. See Section A.4 for more discussion of this trade-off.
7.3.2 Checking novel procedures
Simulations are usually an integral part of projects to develop novel statistical methods. Checking estimation functions in such projects presents a challenge: if an estimator truly is new, how do you check that your code is correct? Effective methods for doing so will vary from problem to problem, but an over-arching strategy is to use theoretical results about the performance of the estimator to check that your implementation works as expected. For instance, we might work out the algebraic properties of an estimator for a special case and then check that the result of the estimation function agrees with our algebra. For some estimation problems, we might be able to identify theoretical properties of an estimator when applied to a very large sample of data and when the model is correctly specified. If we can find results about large-sample behavior, then we can test an estimation function by applying it to a very large sample and checking whether the resulting estimates are very close to specified parameter values. We illustrate each of these approaches using our functions for estimating treatment effects from cluster-randomized trials.
We start by testing an algebraic property. With each of the three methods we have implemented, the treatment effect estimator is a difference between weighted averages of the outcomes of students in each treatment condition; the only difference among the estimators is in the weights used. In the special case where all schools have the same number of students, the weights used by all three methods end up being the same: all three methods allocate equal weight to each school. Therefore, we know that there should be no difference between the three point estimates. Furthermore, a bit of algebra will show that the cluster-robust standard error from the OLS approach will end up being identical to the robust standard error from the aggregation approach. If there are also equal numbers of schools assigned to both conditions, then the standard error from the multilevel model will also be identical to the other standard errors.
Let’s verify that our estimation functions produce results that are consistent with these theoretical properties. To do so, we will need to generate a dataset with equal cluster sizes, setting \(\alpha = 0\):
dat <- gen_cluster_RCT(
J=12, n_bar = 30, alpha = 0, p = 0.5,
gamma_0 = 0, gamma_1 = 0.2, gamma_2 = 0.2,
sigma2_u = 0.4, sigma2_e = 0.6
)
table(dat$sid) # verify equal-sized clusters
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 30 30 30 30 30 30 30 30 30 30 30 30
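Applying the bundled estimation function to this dataset gives the following results:
estimate_Tx_Fx( dat )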
## # A tibble: 3 × 4
## estimator ATE_hat SE_hat p_value
## <chr> <dbl> <dbl> <dbl>
## 1 MLM 0.456 0.401 0.282
## 2 OLS 0.456 0.401 0.282
## 3 agg 0.456 0.401 0.282
All three methods yield identical results. Now let’s try equal school sizes but unequal allocation to treatment:
dat <- gen_cluster_RCT(
J=12, n_bar = 30, alpha = 0, p = 2 / 3,
gamma_0 = 0, gamma_1 = 0.2, gamma_2 = 0.2,
sigma2_u = 0.4, sigma2_e = 0.6
)
estimate_Tx_Fx(dat)
## # A tibble: 3 × 4
## estimator ATE_hat SE_hat p_value
## <chr> <dbl> <dbl> <dbl>
## 1 MLM 0.292 0.309 0.367
## 2 OLS 0.292 0.260 0.303
## 3 agg 0.292 0.260 0.287
As expected, all three point estimators match, but the SE from the multilevel model is a little bit discrepant from the others.
We can also use large-sample theory to check the multilevel modeling estimator.
If the model is correctly specified, then all the parameters of the model should be accurately estimated if the model is fit to a very large sample of data.
To check this property, we will need access to the full model output, not just the selected results returned by analysis_MLM().
One way to handle this is to make a small tweak to the estimation function, adding an option to control whether to return the entire model or just selected results.
Here is the tweaked function:
analysis_MLM <- function( dat, all_results = FALSE) {
M1_test <- lmerTest::lmer( Yobs ~ 1 + Z + (1 | sid), data = dat )
if (all_results) {
return(summary(M1_test))
}
M1_summary <- summary(M1_test)$coefficients
tibble(
ATE_hat = M1_summary["Z","Estimate"],
SE_hat = M1_summary["Z","Std. Error"],
p_value = M1_summary["Z", "Pr(>|t|)"]
)
}
Setting all_results to TRUE will return the full summary of the fitted model; keeping it at the default value of FALSE will return the same output as the other functions.
Now let’s apply the estimation function to a very large dataset, with variation in cluster sizes.
We set gamma_2 = 0 so that the estimation model is correctly specified:
dat <- gen_cluster_RCT(
J=5000, n_bar = 20, alpha = 0.9, p = 2 / 3,
gamma_0 = 2, gamma_1 = 0.30, gamma_2 = 0,
sigma2_u = 0.4, sigma2_e = 0.6
)
analysis_MLM(dat, all_results = TRUE)
## Linear mixed model fit by REML. t-tests use
## Satterthwaite's method [lmerModLmerTest]
## Formula: Yobs ~ 1 + Z + (1 | sid)
## Data: dat
##
## REML criterion at convergence: 241831
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.5540 -0.6605 0.0009 0.6593 4.3081
##
## Random effects:
## Groups Name Variance Std.Dev.
## sid (Intercept) 0.3924 0.6264
## Residual 0.5984 0.7736
## Number of obs: 98739, groups: sid, 5000
##
## Fixed effects:
## Estimate Std. Error df
## (Intercept) 2.005e+00 1.627e-02 4.960e+03
## Z 3.028e-01 1.992e-02 4.949e+03
## t value Pr(>|t|)
## (Intercept) 123.2 <2e-16 ***
## Z 15.2 <2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## Z -0.817
The intercept and treatment effect coefficient estimates are very close to their true parameter values, as are the estimated school-level variance and student-level residual variance.
This all gives some assurance that the analysis_MLM() function is working properly.
Of course, it is important to bear in mind that these tests are only partial verifications.
With the algebraic test, we have only checked that the functions seem to be working properly for scenarios with equal school sizes, but they still might have errors that only appear when school sizes vary.
Likewise, analysis_MLM() seems to be working properly for very large datasets, but our test does not rule out the possibility of bugs that only crawl out when \(J\) is small.
Our large-sample test also relies on the correctness of the gen_cluster_RCT() function;
if we had seen a discrepancy between parameters and estimates from the multilevel model, it could have been because of a problem with the data-generation function rather than with the estimation function.
These limitations are typical of what can be accomplished through tests based on theoretical results, because theoretical results typically only hold under specific conditions. After all, if we had comprehensive theoretical results, we would not need to simulate anything in the first place! Nonetheless, it is good to work through such tests to the extent that relevant theory is available for the problem you are studying.
7.3.3 Checking with simulations
Checking, debugging, and revising should not be limited to when you are initially developing estimation functions. It often happens that later steps in the process of conducting a simulation will reveal problems with the code for earlier steps. For instance, once you have run the data-generating and estimation steps repeatedly, calculated performance summaries, and created some graphs of the results, you might find an unusual or anomalous pattern in the performance of an estimator. This might be a legitimate result—perhaps the estimator really does behave weirdly or not work well—or it might be due to a problem in how you implemented the estimator or data-generating process. When faced with an unusual pattern, revisit the estimation code to double check for bugs and also think further about what might lead to the anomaly. Further exploration might lead you to a deeper understanding of how a method works and perhaps even an idea for how to improve the estimator or refine the data-generating process.
A good illustration of this process comes from one of Luke's past research projects (see @pashley2024improving), in which he and his co-authors were working on a way to improve Instrumental Variable (IV) estimation using post-stratification. The method they studied involved grouping units based on a covariate that predicts compliance status, then calculating estimates within each group, then averaging the estimates across groups. They used simulations to see whether this method would improve the accuracy of the overall summary effect estimate. In the first simulation, the estimates were full of missing values and odd results because the estimation function failed to account for what happens in groups of observations where the number of compliers was estimated to be zero. After repairing that problem and re-running everything, the simulation results still indicated serious and unexpected bias, which turned out to be due to an error in how the estimation function implemented the step of averaging estimates across groups. After again correcting and re-running, the simulation results showed that the gains in accuracy from this new method were minimal, even when the groups were formed based on a variable that was almost perfectly predictive of compliance status. Eventually, the team understood that the groups with very few compliers produced such unstable estimates that they spoiled the overall average estimate. This inspired them to revise the estimation strategy and introduce a method that dropped or down-weighted strata with few compliers, which ultimately strengthened the contribution of the work.
As this experience highlights, simulations seldom follow a single, well-defined trajectory. The point of conducting simulations is to help us, as researchers, learn about estimation methods so that we can analyze real data better. Simulations can strengthen our methodological and theoretical understanding, which can then inspire ideas for better approaches which we can then test in subsequent simulations. Of course, at some point one needs to step off this merry-go-round, write up the findings, cook dinner, and clean the bathroom. But, just like many other research endeavors, simulations are often a highly iterative process.
7.4 Handling errors, warnings, and other hiccups
Especially when working with more advanced estimation methods, it is possible that your estimation function will fail, throw an error, or return something uninterpretable for certain input datasets. For instance, maximum likelihood estimation often involves using iterative, numerical optimization algorithms that sometimes fail to converge. This might happen rarely enough that it takes a while to even notice that it is a problem, but even quite rare things can occur when you run simulations with many thousands of repetitions. Less dire but still annoying, your estimation function might occasionally generate warnings, which can pile up if you are running many repetitions. In some cases, such warnings might also signal that the estimator produced a bad result, and it may not be clear whether we should retain this result in the estimator’s overall performance assessments. After all, the function tried to warn us that something is off!
Errors and warnings in estimation functions pose two problems, one purely technical and one conceptual. On a technical level, R functions stop running if they hit an error, so we need ways to handle the errors in order to get our simulations up and running. Furthermore, warnings can clutter up the console and slow down code execution, so we may want to capture and suppress them as well. On a conceptual level, we need to decide how to use the information contained in errors and warnings, whether that be by further elaborating the estimators to address different contingencies or by evaluating the performance of the estimators in a way that appropriately accounts for these events. We consider both these problems here, and then revisit the conceptual considerations in Chapter 9, where we discuss assessing estimator performance.
7.4.1 Capturing errors and warnings
Some estimation functions will require complicated or stochastic calculations that can sometimes produce errors.
Intermittent errors can really be annoying and time-consuming if not addressed.
To protect yourself, it is good practice to anticipate potential errors, preventing them from stopping code execution and allowing your simulations to keep running.
We next demonstrate some techniques for error-handling using tools from the purrr package.
For illustrative purposes, consider the following error-prone function that sometimes returns what we want, sometimes returns NaN due to taking the square root of a negative number, and sometimes crashes completely because broken_code() does not exist:
my_complex_function = function( param ) {
vals = rnorm( param, mean = 0.5 )
if ( sum( vals ) > 5 ) {
broken_code( 4 )
} else {
sqrt( sum( vals ) * sign( vals )[[1]] )
}
}
Running it a few times produces a mix of results and warnings:
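For instance, a few repeated calls (a sketch: the argument value is illustrative, and results vary from run to run because the function is stochastic):
my_complex_function( 4 )
my_complex_function( 4 )
my_complex_function( 4 )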
## Warning in sqrt(sum(vals) * sign(vals)[[1]]):
## NaNs produced
## [1] NaN
## [1] 2.11999
## [1] 0.09896136
Running it many times produces warnings, then an error:
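A sketch of one way to do this, mapping over many calls with purrr (a larger argument makes the broken_code() branch more likely):
# purrr supplies map_dbl() and the wrapper functions used below
library( purrr )
map_dbl( 1:20, ~ my_complex_function( 7 ) )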
## Warning in sqrt(sum(vals) * sign(vals)[[1]]):
## NaNs produced
## Warning in sqrt(sum(vals) * sign(vals)[[1]]):
## NaNs produced
## Warning in sqrt(sum(vals) * sign(vals)[[1]]):
## NaNs produced
## Error in broken_code(4): could not find function "broken_code"
We need to “trap” these warnings and errors so they do not clutter or stop our simulation. Let’s do the errors first.
The purrr package includes a function called possibly() that makes it easy to trap errors:
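A sketch of its use (the wrapper name my_possibly_function is our own; with otherwise = NA, any error is converted to an NA result):
my_possibly_function <- possibly( my_complex_function, otherwise = NA )
my_possibly_function( 7 )
map_dbl( 1:20, ~ my_possibly_function( 7 ) )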
## Warning in sqrt(sum(vals) * sign(vals)[[1]]):
## NaNs produced
## [1] NaN
## Warning in sqrt(sum(vals) * sign(vals)[[1]]):
## NaNs produced
## Warning in sqrt(sum(vals) * sign(vals)[[1]]):
## NaNs produced
## Warning in sqrt(sum(vals) * sign(vals)[[1]]):
## NaNs produced
## Warning in sqrt(sum(vals) * sign(vals)[[1]]):
## NaNs produced
## Warning in sqrt(sum(vals) * sign(vals)[[1]]):
## NaNs produced
## Warning in sqrt(sum(vals) * sign(vals)[[1]]):
## NaNs produced
## Warning in sqrt(sum(vals) * sign(vals)[[1]]):
## NaNs produced
## Warning in sqrt(sum(vals) * sign(vals)[[1]]):
## NaNs produced
## Warning in sqrt(sum(vals) * sign(vals)[[1]]):
## NaNs produced
## Warning in sqrt(sum(vals) * sign(vals)[[1]]):
## NaNs produced
## [1] 0.8653335 NaN 2.0056463 NaN
## [5] NA NaN NaN 1.0718040
## [9] 1.5937092 2.0111968 NaN NA
## [13] 1.7401242 NaN 0.9721895 NaN
## [17] 2.0947465 NaN NaN NaN
possibly() is an example of a functional (also called an adverb or "wrapper function"), which takes a function and returns a new function that does something slightly different.
The new function has the exact same parameters as the function we put into it.
The difference is the new version of the function will, if an error happens, return the value specified by the otherwise argument (here, NA), rather than stopping execution with an error.
It does not silence the warnings, however.
To handle warnings, the quietly() functional creates a function whose results are bundled together with any console output, warnings, and messages, rather than being printed to the console:
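For example (a sketch; the wrapper name my_quiet_function is our own):
my_quiet_function <- quietly( my_complex_function )
my_quiet_function( 2 )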
## $result
## [1] 0.6065094
##
## $output
## [1] ""
##
## $warnings
## character(0)
##
## $messages
## character(0)
Wrapping a function with quietly() can be especially useful to reduce extraneous printing in a simulation, which can slow down code execution more than you might expect.
However, quietly() does not trap errors:
## Error in broken_code(4): could not find function "broken_code"
To handle both errors and warnings, we double-wrap the function, first with possibly() and then with quietly():
my_safe_quiet_function <- quietly( possibly( my_complex_function, otherwise = NA ) )
my_safe_quiet_function(7)
## $result
## [1] 1.538658
##
## $output
## [1] ""
##
## $warnings
## character(0)
##
## $messages
## character(0)
To see how this works in practice, we will adapt our analysis_MLM() function, which makes use of lmer() for fitting a multilevel model.
Currently, the estimation function sometimes prints messages to the console:
set.seed(101012) # (hand-picked to show a warning.)
dat <- gen_cluster_RCT( J = 50, n_bar = 100, sigma2_u = 0 )
mod <- analysis_MLM(dat)
## boundary (singular) fit: see help('isSingular')
Wrapping lmer() with quietly() makes it possible to catch such output and store it along with other results, as in the following:
quiet_safe_lmer <- quietly( possibly( lmerTest::lmer, otherwise=NULL ) )
M1 <- quiet_safe_lmer( Yobs ~ 1 + Z + (1|sid), data=dat )
M1
## $result
## Linear mixed model fit by REML ['lmerModLmerTest']
## Formula: ..1
## Data: ..2
## REML criterion at convergence: 14026.44
## Random effects:
## Groups Name Std.Dev.
## sid (Intercept) 0.0000
## Residual 0.9828
## Number of obs: 5000, groups: sid, 50
## Fixed Effects:
## (Intercept) Z
## -0.013930 -0.008804
## optimizer (nloptwrap) convergence code: 0 (OK) ; 0 optimizer warnings; 1 lme4 warnings
##
## $output
## [1] ""
##
## $warnings
## character(0)
##
## $messages
## [1] "boundary (singular) fit: see help('isSingular')\n"
We then pick apart the pieces and construct a dataset of results:
if ( is.null( M1$result ) ) {
# we had an error!
tibble( ATE_hat = NA, SE_hat = NA, p_value = NA,
message = M1$message,
warning = M1$warning,
error = TRUE )
} else {
sum <- summary( M1$result )
tibble(
ATE_hat = sum$coefficients["Z","Estimate"],
SE_hat = sum$coefficients["Z","Std. Error"],
p_value = sum$coefficients["Z", "Pr(>|t|)"],
message = list( M1$message ),
warning = list( M1$warning ),
error = FALSE )
}
## # A tibble: 1 × 6
## ATE_hat SE_hat p_value message warning error
## <dbl> <dbl> <dbl> <list> <list> <lgl>
## 1 -0.00880 0.0278 0.751 <chr [1]> <chr> FALSE
Now we can plug in the above code to make a nicely quieted and safe function. This version runs without extraneous messages:
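Here is one way to assemble the pieces (a sketch that simply wraps the code from above into the analysis_MLM_safe() function used below):
analysis_MLM_safe <- function( dat ) {
  M1 <- quiet_safe_lmer( Yobs ~ 1 + Z + (1 | sid), data = dat )
  if ( is.null( M1$result ) ) {
    # lmer() errored: return placeholders plus the captured diagnostics
    tibble( ATE_hat = NA, SE_hat = NA, p_value = NA,
            message = M1$message,
            warning = M1$warning,
            error = TRUE )
  } else {
    # lmer() ran: pull out the treatment effect results and diagnostics
    sum <- summary( M1$result )
    tibble(
      ATE_hat = sum$coefficients["Z", "Estimate"],
      SE_hat = sum$coefficients["Z", "Std. Error"],
      p_value = sum$coefficients["Z", "Pr(>|t|)"],
      message = list( M1$message ),
      warning = list( M1$warning ),
      error = FALSE )
  }
}
analysis_MLM_safe( dat )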
## # A tibble: 1 × 6
## ATE_hat SE_hat p_value message warning error
## <dbl> <dbl> <dbl> <list> <list> <lgl>
## 1 -0.00880 0.0278 0.751 <chr [1]> <chr> FALSE
Now we have the estimation results along with any diagnostic information from messages or warnings. Storing this information will let us evaluate what proportion of the time there was a warning or message, run additional analyses on the subset of replications where there was no such warning, or even modify the estimator to take the diagnostics into account. We have solved the technical problem—our code will run and give results—but not the conceptual one: what does it mean when our estimator gives an NA or a convergence warning with a nominal answer? How do we decide how good our estimator is when it does this?
7.4.2 Adapting estimators for errors and warnings
So far, we have seen techniques for handling technical hiccups that occur when data analysis procedures do not always produce results. But how do we account for the absence of results in a simulation? In Chapter 9, we will delve into the conceptual issues with summarizing the performance of methods that do not always provide an answer. However, one of the best solutions to such problems still concerns the formulation of estimation functions, and so we introduce it here. That solution is to re-define the estimator to include contingencies for handling a lack of results.
Consider a data analyst who was planning to apply a fancy statistical model to their data, but then finds that the model does not converge. What would that analyst do in practice (besides cussing and taking a snack break)? Rather than giving up entirely, they would probably think of an alternative analysis and attempt to apply it, perhaps by simplifying the model in some way. To the extent that we can anticipate such possibilities, we can build these error-contingent alternative analyses into our estimation function. Because of this, it would be more precise to talk about an estimation procedure or a data-analysis procedure rather than just an estimator: an analysis may consist of many different steps arranged in a branching structure.
To illustrate, let’s look at an error (a not-particularly-subtle one) that could crop up in the cluster-randomized trial example when clusters are very small:
set.seed(65842)
tiny_dat <- gen_cluster_RCT( J = 10, n_bar = 2, alpha = 0.5)
analysis_MLM_safe(tiny_dat)
## # A tibble: 0 × 6
## # ℹ 6 variables: ATE_hat <lgl>, SE_hat <lgl>,
## # p_value <lgl>, message <chr>, warning <chr>,
## # error <lgl>
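Tabulating the cluster sizes shows what happened:
table( tiny_dat$sid )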
##
## 1 2 3 4 5 6 7 8 9 10
## 1 1 1 1 1 1 1 1 1 1
The error occurs because all 10 simulated schools happened to include a single student, making it impossible to estimate a random-intercepts multilevel model. A natural fall-back analysis here would be to estimate an ordinary least squares regression analysis.
Suppose that our imaginary analyst is not especially into nuance, and so will fall back onto ordinary least squares whenever the multilevel model produces an error.
We can express this logic in our estimation function by first catching the error thrown by lmer() and then running an OLS regression in the event an error is thrown:
analysis_MLM_contingent <- function( dat ) {
M1 <- quiet_safe_lmer( Yobs ~ 1 + Z + (1 | sid), data=dat )
if (!is.null(M1$result)) {
sum <- summary( M1$result )
res <- tibble(
ATE_hat = sum$coefficients["Z","Estimate"],
SE_hat = sum$coefficients["Z","Std. Error"],
p_value = sum$coefficients["Z", "Pr(>|t|)"] )
} else {
# If lmer() errors, fall back on OLS
M_ols <- summary(lm(Yobs ~ Z, data = dat))
res <- tibble(
ATE_hat = M_ols$coefficients["Z","Estimate"],
SE_hat = M_ols$coefficients["Z", "Std. Error"],
p_value = M_ols$coefficients["Z","Pr(>|t|)"] )
}
# Store original messages, warnings, errors
res$message <- ifelse( length( M1$message ) > 0, M1$message, NA_character_ )
res$warning <- ifelse( length( M1$warning ) > 0, M1$warning, NA_character_ )
res$error <- is.null( M1$result )
return(res)
}
We still store the messages, warnings, and errors from the initial lmer() fit so that we can keep track of how often errors occur.
The function now returns a treatment effect estimate even if lmer() errors:
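For example, applying it to the tiny dataset from above:
analysis_MLM_contingent( tiny_dat )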
## # A tibble: 1 × 6
## ATE_hat SE_hat p_value message warning error
## <dbl> <dbl> <dbl> <chr> <chr> <lgl>
## 1 0.400 0.603 0.525 <NA> <NA> TRUE
Of course, we can easily anticipate the conditions under which this particular error will occur: all we need to do is check whether all the clusters are single observations. Because it is easily anticipated, a better strategy for handling this error is to check before fitting the multilevel model and proceeding accordingly in the event that the clusters are all singletons. Exercise 7.5.5 asks you to implement this approach and further refine this contingent analysis strategy.
Adapting estimation functions to address errors can be an effective—and often very interesting—strategy for studying the performance of estimation methods. Rather than studying the performance of a data-analysis method that is only sometimes well-defined, we shift to studying a stylized cognitive model for the analyst’s decision-making process, which handles contingencies that might crop up whether analyzing simulated data or real empirical data. Of course, studying such a model is only interesting to the extent that the decision-making process it implements is a plausible representation of what an analyst might actually do in practice.
The adaptive estimation approach does lead to more complex estimation functions, which entail implementing multiple estimation methods and a set of decision rules for applying them. Often, the set of contingencies that need to be handled will not be immediately obvious, so you may find that you need to build and refine the decision rules as you learn more about how they work. Running an estimator over multiple, simulated datasets is an excellent (if aggravating!) way to identify errors and edge cases. We turn to procedures for doing so in Chapter 8.
7.4.3 The safely() option
There is one final purrr option, safely(), which traps the full error message, instead of just replacing the output with a fixed value as possibly() does.
safely() returns a list with two entries: the result of the original function (or a fixed value if there was an error), and the error message (or NULL if there was no error).
To use it, we feed our function into safely() to create a new version:
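For example (a sketch; the wrapper name my_safely_function is our own):
my_safely_function <- safely( my_complex_function )
my_safely_function( 2 )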
## $result
## [1] 1.277916
##
## $error
## NULL
Just as with possibly(), we can include otherwise = NA to set a return value if there is an error.
We can use the safe function repeatedly and it will always return a result:
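For example (a sketch; adding otherwise = NA makes errors show up as NA results):
my_safely_function <- safely( my_complex_function, otherwise = NA )
results <- map( 1:20, ~ my_safely_function( 7 ) )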
## Warning in sqrt(sum(vals) * sign(vals)[[1]]):
## NaNs produced
## Warning in sqrt(sum(vals) * sign(vals)[[1]]):
## NaNs produced
## Warning in sqrt(sum(vals) * sign(vals)[[1]]):
## NaNs produced
## Warning in sqrt(sum(vals) * sign(vals)[[1]]):
## NaNs produced
## Warning in sqrt(sum(vals) * sign(vals)[[1]]):
## NaNs produced
We still get warnings, but any errors are trapped so the function does not crash.
We end up with a 20-entry list, with each element consisting of a pair of the result and error message:
## [1] 20
## $result
## [1] NaN
##
## $error
## NULL
We can massage our data into an easier-to-parse format:
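For instance, we can use transpose() (explained below) to separate the results from the errors and then flatten the results into a vector (a sketch; the name results_t is our own):
results_t <- transpose( results )
unlist( results_t$result )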
## [1] NA NA NaN NaN
## [5] NA 1.1766144 NA 1.6221953
## [9] NA NaN NA 1.6086350
## [13] NaN 1.5356493 NA 1.1105727
## [17] NA NaN 1.1576487 0.8228205
The transpose() function takes a list of lists, and reorganizes them to give you a list of all the first elements, a list of all the second elements, etc.
transpose() is very powerful for wrangling data, because then we can make a tibble with list columns, like so:
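A sketch of that step, reusing the transposed list created above:
tibble(
  result = unlist( results_t$result ),
  error = results_t$error
)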
## # A tibble: 20 × 2
## result error
## <dbl> <list>
## 1 NA <smplErrr>
## 2 NA <smplErrr>
## 3 NaN <NULL>
## 4 NaN <NULL>
## # ℹ 16 more rows
We now have our results all organized, and we can see which iterations produced errors.
Unfortunately, when we use safely() and quietly() together, we get a bit of a mess.
Extending our cluster RCT example, we have:
quiet_safe_lmer <- quietly( safely( lmerTest::lmer, otherwise=NULL ) )
M1 <- quiet_safe_lmer( Yobs ~ 1 + Z + (1 | sid), data = dat )
M1
## $result
## $result$result
## Linear mixed model fit by REML ['lmerModLmerTest']
## Formula: ..1
## Data: ..2
## REML criterion at convergence: 14026.44
## Random effects:
## Groups Name Std.Dev.
## sid (Intercept) 0.0000
## Residual 0.9828
## Number of obs: 5000, groups: sid, 50
## Fixed Effects:
## (Intercept) Z
## -0.013930 -0.008804
## optimizer (nloptwrap) convergence code: 0 (OK) ; 0 optimizer warnings; 1 lme4 warnings
##
## $result$error
## NULL
##
##
## $output
## [1] ""
##
## $warnings
## character(0)
##
## $messages
## [1] "boundary (singular) fit: see help('isSingular')\n"
The resulting object has the estimation results buried inside it. Even though the result is a bit of a mess, this structure provides all the pieces that we need, including any errors, warnings, and other output.
We can then have, for example, code such as this in our estimation function:
if ( is.null( M1$result$result ) ) {
# we had an error!
error = M1$result$error
} else {
# We did not. Do the same as above
error = NULL
}
# etc.
Fortunately, once we have written all this code, we can tuck it inside our estimation function, and then forget about it. Once written, applying the function will produce a tidy table of results that we can easily analyze.
7.5 Exercises
7.5.1 More Heteroskedastic ANOVA
In their classic simulation study, Brown and Forsythe (1974) not only looked at the performance of the homoskedastic ANOVA-F test and Welch's heteroskedastic F-test, they also proposed their own new hypothesis testing procedure.
Write a function that implements the Brown-Forsythe F* test (the BFF* test!) as described on p. 130 of Brown and Forsythe (1974), using the following code skeleton:
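A minimal skeleton along these lines should work (a sketch; we assume the same data structure as in Chapter 5, with columns x and group, and an output format matching ANOVA_Welch_F()):
BF_F <- function( data ) {
  # Compute group sizes, means, and variances from the x and group columns,
  # then form the Brown-Forsythe F* statistic and its degrees of freedom.
  # ...
  # Return a one-row dataset holding the p-value, e.g., tibble( BFF = p_val )
}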
- Run the following code to check that your function produces a result:
- Try calling your BF_F() function on a variety of datasets of different sizes and shapes, to make sure it works. What kinds of datasets should you test out?
- Add the BFF* test into the output of ANOVA_Welch_F() by calling your BF_F() function inside the body of ANOVA_Welch_F().
- The onewaytests package implements a variety of different hypothesis testing procedures for one-way ANOVA. Validate your ANOVA_Welch_F() function by comparing the results to the output of the relevant functions from onewaytests.
7.5.2 Contingent testing
In the one-way ANOVA problem, one approach that an analyst might think to take is to conduct a preliminary significance test for heterogeneity of variances (such as Levene's test or Bartlett's test), and then report the \(p\)-value from the homoskedastic ANOVA F test if variance heterogeneity is not detected but the \(p\)-value from the BFF* test if variance heterogeneity is detected.
Modify the ANOVA_Welch_F() function to return the \(p\)-value from this contingent BFF* test in addition to the \(p\)-values from the (non-contingent) ANOVA-F, Welch, and BFF* tests.
Include an input option that allows the user to control the \(\alpha\) level of the preliminary test for heterogeneity of variances, as in the following skeleton.
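A sketch of such a skeleton (the argument name het_alpha is our own):
ANOVA_Welch_F <- function( data, het_alpha = 0.05 ) {
  # 1. Run a preliminary test for heterogeneity of variances,
  #    e.g., bartlett.test( x ~ factor(group), data = data ).
  # 2. Compute the ANOVA-F, Welch, and BFF* p-values as before.
  # 3. Report the contingent p-value: the ANOVA-F p-value if the preliminary
  #    test p-value exceeds het_alpha, and the BFF* p-value otherwise.
  # ...
}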
7.5.3 Check the cluster-RCT functions
Section 7.2 presented functions implementing several different strategies for estimating an average treatment effect from a cluster-randomized trial.
Write some code to validate these functions by comparing their output to the results of other tools for doing the same calculation.
Use one or more datasets simulated with gen_cluster_RCT().
For each of these tests, you will need to figure out the appropriate syntax by reading the package documentation.
- For analysis_MLM(), check the output by fitting the same model using lme() from the nlme package or glmmTMB() from the package of the same name.
- For analysis_OLS(), check the output by fitting the linear model using the base R function lm(), then computing standard errors using vcovCL() from the sandwich package. Also compare the output to the results of feeding the fitted model through coef_test() from the clubSandwich package.
- For analysis_agg(), check the output by aggregating the data to the school level, fitting the linear model using lm(), and computing standard errors using vcovHC() from the sandwich package.
7.5.4 Extending the cluster-RCT functions
Exercise 6.9.10 from Chapter 6 asked you to extend the data-generating function for the cluster-randomized trial to include generating a student-level covariate, \(X\), that is predictive of the outcome. Use your modified function to generate a dataset.
Modify the estimation functions from Section 7.2 to use models that include the covariate as a predictor.
Further extend the functions to include an input argument for the set of predictors to be included in the model, as in the following skeleton for the multi-level model estimator:
analysis_MLM <- function(dat, predictors = "Z") { }
analysis_MLM( dat )
analysis_MLM( dat, predictors = c("Z","X") )
analysis_MLM( dat, predictors = c("Z","X","X:Z") )
Hint: Check out the reformulate() function, which makes it easy to build formulas for different sets of predictors.
7.5.5 Contingent estimator processing
In Section 7.4.2 we developed a version of analysis_MLM() that fell back on OLS regression in the event that lmer() produced any error.
The implementation that we demonstrated is not especially smart.
For one, it does not anticipate that the error will occur.
For another, if we are using this function as one of several estimation strategies, it will require fitting the OLS regression multiple times.
Can you fix these problems?
- Revise analysis_MLM_contingent() so that it checks whether all clusters are single observations before fitting the multilevel model. Handle the contingency where all clusters are singletons by skipping the model-fitting step, returning NA values for the point estimate, standard error, and p-value, and returning an informative message in the error variable. Test that the function is working as expected.
- Revise estimate_Tx_Fx() to use your new version of analysis_MLM_contingent(). The revised function will sometimes return NA values for the MLM results. To implement the strategy of falling back on OLS regression, add some code that replaces any NA values with the corresponding results of analysis_OLS(). Test that the function is working as expected.
7.5.6 Estimating 3-parameter item response theory models
Exercise 6.9.11 asked you to write a data-generating function for the 3-parameter IRT model described in Section 6.7.
Use your function to generate a large dataset.
Using functions from the {ltm}, {mirt}, or {TAM} packages, estimate the parameters of the 3-parameter IRT model based on the simulated dataset.
Write a function to estimate a 3-parameter IRT model and return a dataset containing estimates of the item characteristics \((\alpha_m,\beta_m, \gamma_m)\).
Add an option to the function to allow the user to specify known values of \(\gamma_m\).
Create a graphic showing how the item parameter estimates compare to the true item characteristics.
Write a function or set of functions to apply 1-parameter, 2-parameter, and 3-parameter models and return datasets containing the person ability estimates \(\theta_1,...,\theta_N\) and corresponding standard errors of measurement.
7.5.7 Meta-regression with selective reporting
Exercise 6.9.15 asked you to write a data-generating function for the @vevea1995general selection model.
The {metafor} package includes a function for fitting this model (as well as a variety of other selection models).
Here is an example of the syntax for estimating this model, using a dataset from the {metadat} package:
library(metafor)
data("dat.assink2016", package = "metadat")
# rename variables and tidy up
dat <-
dat.assink2016 %>%
mutate( si = sqrt(vi) ) %>%
rename( Ti = yi, s_sq = vi, Xi = year )
# fit a random effects meta-regression model
rma_fit <- rma.uni(yi = Ti, sei = si, mods = ~ Xi, data = dat)
# fit two-step selection model
selmodel(rma_fit, type = "step", steps = c(.025, .500))
##
## Mixed-Effects Model (k = 100; tau^2 estimator: ML)
##
## tau^2 (estimated amount of residual heterogeneity): 0.2314 (SE = 0.0500)
## tau (square root of estimated tau^2 value): 0.4810
##
## Test for Residual Heterogeneity:
## LRT(df = 1) = 349.3718, p-val < .0001
##
## Test of Moderators (coefficient 2):
## QM(df = 1) = 42.1433, p-val < .0001
##
## Model Results:
##
## estimate se zval pval
## intrcpt 0.4149 0.1013 4.0942 <.0001
## Xi -0.0782 0.0120 -6.4918 <.0001
## ci.lb ci.ub
## intrcpt 0.2163 0.6135 ***
## Xi -0.1018 -0.0546 ***
##
## Test for Selection Model Parameters:
## LRT(df = 2) = 1.7814, p-val = 0.4104
##
## Selection Model Results:
##
## k estimate se zval
## 0 < p <= 0.025 59 1.0000 --- ---
## 0.025 < p <= 0.5 23 0.6990 0.2307 -1.3049
## 0.5 < p <= 1 18 0.5027 0.2743 -1.8132
## pval ci.lb ci.ub
## 0 < p <= 0.025 --- --- ---
## 0.025 < p <= 0.5 0.1919 0.2470 1.1511
## 0.5 < p <= 1 0.0698 0.0000 1.0403 .
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The selmodel() function can also be used to fit selection models in which one or both of the selection parameters are fixed to user-specified values.
For example:
# Fixing lambda_1 = 0.5, lambda_2 = 0.2
fix_both <- selmodel(rma_fit, type = "step", steps = c(.025, .500),
delta = c(1, 0.5, 0.2))
# Fixing lambda_1 = 0.5, estimating lambda_2
fix_one <- selmodel(rma_fit, type = "step", steps = c(.025, .500),
delta = c(1, 0.5, NA))
- Write an estimation function that fits the selection model and returns estimates, standard errors, and confidence intervals for each of the model parameters \((\beta_0,\beta_1,\tau,\lambda_1,\lambda_2)\).
- The selmodel() fitting function sometimes produces warnings, as in the following example:
dat_sig <- dat %>% filter(Ti / si > 0)
rma_pos <- rma.uni(yi = Ti, sei = si, mods = ~ Xi, data = dat_sig)
# fit two-step selection model
sel_fit <- selmodel(rma_pos, type = "step", steps = c(.025, .500))
## Warning: One or more intervals do not contain any
## observed p-values.
## Warning: One or more 'delta' estimates are (almost) equal to their lower or upper bound.
## Treat results with caution (or consider adjusting 'delta.min' and/or 'delta.max').
Modify your estimation function to catch and return warnings such as these. Write code demonstrating that the function works as expected.
- The selmodel() function throws warnings when the dataset contains no observations with \(p_i\) in one of the specified intervals. Modify your estimation function to set the selection probability for an interval to \(\lambda_1 = 0.1\) if there are no \(p_i\) values between .025 and .500 and to set \(\lambda_2 = 0.1\) if there are no \(p_i\) values larger than .500. Write code demonstrating that the function works as expected.