Chapter 7 Data analysis procedures

The overall aims of many simulation studies have to do with understanding how a particular data-analysis procedure works or with comparing the performance of multiple competing procedures. Thus, the data-analysis procedure or procedures are the central object of study. Depending on the research question, the data-analysis procedure might be very simple (as simple as computing a sample correlation) or it might involve a combination of several components. For example, the procedure might entail first computing a diagnostic test for heteroskedasticity and then, depending on the outcome of the test, applying either a conventional formula or a heteroskedasticity-robust formula for standard errors. As another example, a data-analysis procedure might involve using multiple imputation for missingness on key variables, then fitting a statistical model, and then generating predicted values based on the model. Depending on the research question, we might also need to create several functions that implement different estimation procedures to be compared.

In this chapter, we demonstrate how to implement data-analysis procedures in the form of R functions, which we call estimation functions, so that their performance can eventually be evaluated by repeatedly applying them to artificial data. We start by describing the high-level design of an estimation function, and illustrate with some simple examples. We then discuss approaches for writing simulations that compare multiple data analysis procedures. Next, we describe strategies for validating the coded-up estimation functions before running a full simulation. Finally, we examine methods for handling common computational problems with estimation functions, such as handling non-convergence when using maximum likelihood estimation.

7.1 Writing estimation functions

In the abstract, a function that implements an estimation procedure should have the following form:

estimate <- function(data) {

  # calculations/model-fitting/estimation procedures
  
  return(estimates)
}

The function takes a dataset as input, fits a model or otherwise calculates an estimate, possibly with associated standard errors and so forth, and returns these quantities as output. The estimates could be point estimates of parameters, standard errors, confidence intervals, p-values, predictions, or other quantities. The calculations in the body of the function should be set up to use datasets that have the same structure (i.e., same dimensions, same variable names) as the output of the corresponding function for generating simulated data. However, in principle, we should also be able to run the estimation function on real data.

In Chapter 5 we wrote a function called ANOVA_Welch_F() for computing \(p\)-values from two different procedures for testing equality of means in a heteroskedastic ANOVA:

ANOVA_Welch_F <- function(data) {
  anova_F <- oneway.test(x ~ group, data = data, var.equal = TRUE)
  Welch_F <- oneway.test(x ~ group, data = data, var.equal = FALSE)
  
  result <- tibble(
    ANOVA = anova_F$p.value,
    Welch = Welch_F$p.value
  )
  
  return(result)
}

Applying this function to a simulated dataset returns two p-values, one for the usual ANOVA \(F\) test (which assumes homoskedasticity) and one for Welch’s heteroskedastic \(F\) test:

sim_data <- generate_ANOVA_data(
  mu = c(1, 2, 5, 6), 
  sigma_sq = c(3, 2, 5, 1),
  sample_size = c(3, 6, 2, 4)
)
ANOVA_Welch_F(sim_data)
## # A tibble: 1 × 2
##      ANOVA  Welch
##      <dbl>  <dbl>
## 1 0.000293 0.0179

Our ANOVA_Welch_F() function is designed to work with the output of generate_ANOVA_data() in that it assumes that the grouping variable is called group and the outcome is called x. Relying on this assumption would be a poor choice if we were designing a function as part of an R package or for general-purpose use. However, because the primary use of the function is for simulation, it is reasonable to assume that the input data will always have appropriate variable names.

In Chapter 6, we looked at a data-generating function for a bivariate Poisson distribution, an example of a non-normal bivariate distribution. We might use such a distribution to understand the behavior of Pearson’s sample correlation coefficient and its normalizing transformation, known as Fisher’s \(z\)-transformation, which is equivalent to the inverse hyperbolic tangent function (atanh() in R). When the sample measurements follow a bivariate normal distribution, Fisher’s \(z\)-transformed correlation is very close to normally distributed and its standard error is simply \(1 / \sqrt{N - 3}\), and thus independent of the correlation. This makes the \(z\)-transformation very useful for computing confidence intervals, which can then be back-transformed to the Pearson-\(r\) scale.
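
In symbols, for a sample correlation \(r\) computed from \(N\) observations, the 95% confidence interval procedure just described amounts to:

\[
z = \operatorname{atanh}(r) = \frac{1}{2} \ln\left(\frac{1 + r}{1 - r}\right), \qquad
CI_z = z \pm \frac{1.96}{\sqrt{N - 3}}, \qquad
CI_r = \tanh(CI_z).
\]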

In this problem, a simple estimation function would take a dataset with two variables as input and compute the sample correlation and its \(z\)-transformation, compute confidence intervals for \(z\), and then back-transform the confidence interval end-points. Here is an implementation of these calculations:

r_and_z <- function(data) {
  
  r <- cor(data$C1, data$C2)
  z <- atanh(r)
  se_z <- 1 / sqrt(nrow(data) - 3)
  ci_z <- z + c(-1, 1) * qnorm(.975) * se_z
  ci_r <- tanh(ci_z)
  
  tibble( r = r, z = z, CI_lo = ci_r[1], CI_hi = ci_r[2] )
}

To check that the function returns a result of the expected form, we generate a small dataset using the r_bivariate_Poisson() function developed in the last chapter, then apply our estimation function to the result:

Pois_dat <- r_bivariate_Poisson(40, rho = 0.5, mu1 = 4, mu2 = 4)
r_and_z(Pois_dat)
## # A tibble: 1 × 4
##       r     z CI_lo CI_hi
##   <dbl> <dbl> <dbl> <dbl>
## 1 0.557 0.628 0.296 0.740

Although it is a little cumbersome to do so, we could also apply the estimation function to a real dataset. Here is an example, which calculates the correlation between ratings of judicial integrity and familiarity with the law from the USJudgeRatings dataset (which is included in base R). For the function to work on this dataset, we first need to rename the relevant variables.

data(USJudgeRatings)

USJudgeRatings %>%
  dplyr::select(C1 = INTG, C2 = FAMI) %>%
  r_and_z()
## # A tibble: 1 × 4
##       r     z CI_lo CI_hi
##   <dbl> <dbl> <dbl> <dbl>
## 1 0.869  1.33 0.769 0.927

The function returns a valid result—a quite strong correlation!

It is good practice to test a newly developed estimation function on real data as a check that it is working as intended. This type of check helps ensure that the estimation function does not rely on information outside of the dataset, such as known parameter values, to construct an estimate. Applying the function to a real dataset demonstrates that it implements a procedure that could actually be applied in real data analysis contexts.

7.2 Including Multiple Data Analysis Procedures

Many simulations involve head-to-head comparisons of two or more data-analysis procedures. As a design principle, we generally recommend writing a separate function for each estimation method one plans to evaluate. Doing so makes it easier to add additional methods as desired or to focus on just a subset of methods. Writing separate functions also leads to a code base that is flexible and useful for other purposes (such as analyzing real data). Finally (repeating one of our favorite mantras), separating functions makes debugging easier because it lets you focus attention on one thing at a time, without worrying about how errors in one area might propagate to others.

To see how this works in practice, we will return to the case study from Section 6.6, where we developed a data-generating function for simulating a cluster-randomized trial with student-level outcomes but school-level treatment assignment. Our data-generating process allowed for varying school sizes and heterogeneous treatment effects, which might be correlated with school size. Several different procedures might be used to estimate an overall average effect from a clustered experiment, including:

  • Estimating a multi-level regression model (also known as a hierarchical linear model),
  • Estimating an ordinary least squares (OLS) regression model and applying cluster-robust standard errors, or
  • Averaging the outcomes by school, then estimating a linear regression model on the mean outcomes.

All three of these methods are widely used and have some theoretical guarantees supporting their use. Education researchers tend to be more comfortable using multi-level regression models, whereas economists tend to use OLS with clustered standard errors.

We next develop estimation functions for each of these procedures. We analyze the data as we expect would be done in practice: even though we generated data with a school-size covariate, we do not include it in our estimation functions. Each function needs to produce a point estimate, standard error, and \(p\)-value for the average treatment effect. To have data to practice on, we generate a sample dataset using a revised version of gen_cluster_RCT(), which corrects the bug discussed in Exercise 6.9.8:

dat <- gen_cluster_RCT( 
  J=16, n_bar = 30, alpha = 0.8, p = 0.5, 
  gamma_0 = 0, gamma_1 = 0.2, gamma_2 = 0.2,
  sigma2_u = 0.4, sigma2_e = 0.6
)

For the multi-level modeling strategy, there are several different existing packages that we could use. We will implement an estimator using the popular lme4 package, along with the lmerTest package for computing a \(p\)-value for the average effect. Here is a basic implementation:

analysis_MLM <- function( dat ) {
  
  M1 <- lme4::lmer( Yobs ~ 1 + Z + (1 | sid), data = dat )
  M1_test <- lmerTest::as_lmerModLmerTest(M1)
  M1_summary <- summary(M1_test)$coefficients
  
  tibble( 
    ATE_hat = M1_summary["Z","Estimate"], 
    SE_hat = M1_summary["Z","Std. Error"], 
    p_value = M1_summary["Z", "Pr(>|t|)"] 
  )
}

The function fits a multi-level model with a fixed coefficient for the treatment indicator and random intercepts for each school. To get a p-value for the treatment coefficient, we have to convert the model into an lmerModLmerTest object and then pass it through summary(). The function outputs only the statistics in which we are interested.

Our function makes use of the lme4 and lmerTest packages. Rather than assuming that these packages will be loaded, we call relevant functions using the package name as a prefix, as in lme4::lmer(). This way, we can run the function even if we have not loaded the packages in the global environment. This approach is also preferable to loading packages inside the function itself (e.g., with require(lme4)) because calling the function does not change which packages are loaded in the global environment.

Here is a function implementing OLS regression with cluster-robust standard errors:

analysis_OLS <- function( dat, se_type = "CR2" ) {
  
  M2 <- estimatr::lm_robust( 
    Yobs ~ 1 + Z, data = dat, 
    clusters = sid,  se_type = se_type
  )
  
  tibble( 
    ATE_hat = M2$coefficients[["Z"]], 
    SE_hat = M2$std.error[["Z"]], 
    p_value = M2$p.value[["Z"]] 
  )
}

To get cluster-robust standard errors, we use the lm_robust() function from the estimatr package, again calling only the relevant function using the package prefix rather than loading the whole package. A novel aspect of this estimation function is that it includes an additional input argument, se_type, which allows us to control the type of standard error calculated by lm_robust(). Adding this option lets us use the same function to compute (and compare) different types of clustered standard errors for the average treatment effect estimate. We set a default of "CR2", matching the default used by lm_robust() when clusters are specified.
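
For instance, a call like the following (just a usage sketch) would compute the classic CR0 cluster-robust standard errors, rather than the default CR2 small-sample adjustment:

analysis_OLS( dat, se_type = "CR0" )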

Sometimes an analytic procedure involves multiple steps. For example, the aggregation estimator first involves collapsing the data to a school-level dataset and then analyzing at the school level. This is fine: we just wrap all the steps in a single estimation function. From the point of view of using the function, it is a single call, no matter how complicated the process inside. Here is the code for the aggregate-then-analyze approach:

analysis_agg <- function( dat, se_type = "HC2" ) {
  
  datagg <- dplyr::summarise( 
    dat,
    Ybar = mean( Yobs ),
    n = n(),
    .by = c(sid, Z)
  )
  
  stopifnot( nrow( datagg ) == length(unique(dat$sid) ) )
  
  M3 <- estimatr::lm_robust( 
    Ybar ~ 1 + Z, data = datagg, 
    se_type = se_type 
  )
  
  tibble( 
    ATE_hat = M3$coefficients[["Z"]], 
    SE_hat = M3$std.error[["Z"]], 
    p_value = M3$p.value[["Z"]] 
  )
}

Note the stopifnot command: it will throw an error if the condition is not true. This stopifnot ensures we do not have both treatment and control students within a single school; if we did, we would have more aggregated values than school IDs due to the grouping! Putting assert statements like this in your code is a good way to guard against weird and hard-to-track errors. A stopifnot statement halts your code as soon as something goes wrong, rather than letting that initial wrongness flow on to later steps and create odd results that are hard to understand. Here we are protecting ourselves from strange results if, for example, we messed up our DGP code so that treatment is not nested within school, or we were using data that did not actually come from a cluster-randomized experiment. See Section A.6 for more.
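
As another illustration of this defensive style, here is a sketch of an assertion that checks the nesting condition directly on the raw data (it assumes a dataset with sid and Z columns, like the output of gen_cluster_RCT()):

# Each school should have exactly one value of Z (all treated or all control)
Z_values_per_school <- tapply( dat$Z, dat$sid, function(z) length( unique( z ) ) )
stopifnot( all( Z_values_per_school == 1 ) )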

All of our functions produce output in the same format:

analysis_MLM( dat )
## # A tibble: 1 × 3
##   ATE_hat SE_hat p_value
##     <dbl>  <dbl>   <dbl>
## 1  -0.111  0.323   0.737
analysis_OLS( dat )
## # A tibble: 1 × 3
##   ATE_hat SE_hat p_value
##     <dbl>  <dbl>   <dbl>
## 1  -0.177  0.307   0.576
analysis_agg( dat )
## # A tibble: 1 × 3
##   ATE_hat SE_hat p_value
##     <dbl>  <dbl>   <dbl>
## 1 -0.0818  0.339   0.813

Ensuring that the output of all the functions is structured in the same way will make it easy to keep the results organized once we start running multiple iterations of the simulation. If each estimation method returns a dataset with the same variables, we can simply stack the results on top of each other. Here is a function that bundles all the estimation procedures together:

estimate_Tx_Fx <- function(
    dat, 
    CR_se_type = "CR2", agg_se_type = "HC2"
) {
  
  dplyr::bind_rows(
    MLM = analysis_MLM( dat ),
    OLS = analysis_OLS( dat, se_type = CR_se_type),
    agg = analysis_agg( dat, se_type = agg_se_type),
    .id = "estimator"
  )
  
}
estimate_Tx_Fx(dat)
## # A tibble: 3 × 4
##   estimator ATE_hat SE_hat p_value
##   <chr>       <dbl>  <dbl>   <dbl>
## 1 MLM       -0.111   0.323   0.737
## 2 OLS       -0.177   0.307   0.576
## 3 agg       -0.0818  0.339   0.813

This is a common coding pattern for simulations that involve multiple estimation procedures. Each procedure is expressed in its own function, then these are assembled together in a single function so that they can all easily be applied to the same dataset. Stacking the results row-wise will make it easy to compute performance measures for all methods at once. The benefit of stacking will become even more evident once we are working across multiple replications of the simulation process, as we will in Chapter 8.

7.3 Validating an Estimation Function

Just as with data-generating functions, it is critical to verify the accuracy of an implemented estimation function. If an estimation function involves a known procedure that has been implemented in R or one of its contributed packages, then a straightforward way to do this is to compare your implementation to another existing implementation. For estimation functions that involve multi-step procedures or novel methods, other approaches to verification may be needed, which rely more on statistical theory.

7.3.1 Checking against existing implementations

For our Welch test function, we can check the output of ANOVA_Welch_F() against the built-in oneway.test function. Let’s do that with a fresh set of data:

sim_data <- generate_ANOVA_data(
  mu = c(1, 2, 5, 6), 
  sigma_sq = c(3, 2, 5, 1),
  sample_size = c(3, 6, 2, 4)
)

aov_results <- oneway.test(x ~ factor(group), data = sim_data, 
                           var.equal = FALSE)
aov_results
## 
##  One-way analysis of means (not assuming
##  equal variances)
## 
## data:  x and factor(group)
## F = 28.367, num df = 3.0000, denom df =
## 3.3253, p-value = 0.007427
Welch_results <- ANOVA_Welch_F(sim_data)
all.equal(aov_results$p.value, Welch_results$Welch)
## [1] TRUE

We use all.equal() because it checks equality up to a small numerical tolerance, which avoids some perplexing failures due to floating-point rounding.
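
A classic illustration of why exact comparison of floating-point numbers can mislead:

0.1 + 0.2 == 0.3
## [1] FALSE
all.equal(0.1 + 0.2, 0.3)
## [1] TRUE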

For the bivariate correlation example, we can check the output of r_and_z() against R’s built-in cor.test() function:

Pois_dat <- r_bivariate_Poisson(15, rho = 0.6, mu1 = 14, mu2 = 8)

my_result <- r_and_z(Pois_dat) |> subset(select = -z)
my_result
## # A tibble: 1 × 3
##       r   CI_lo CI_hi
##   <dbl>   <dbl> <dbl>
## 1 0.503 -0.0125 0.807
R_result <- cor.test(~ C1 + C2, data = Pois_dat)
R_result <- tibble(r = R_result$estimate[["cor"]], 
                   CI_lo = R_result$conf.int[1], 
                   CI_hi = R_result$conf.int[2])
R_result
## # A tibble: 1 × 3
##       r   CI_lo CI_hi
##   <dbl>   <dbl> <dbl>
## 1 0.503 -0.0125 0.807
all.equal(R_result, my_result)
## [1] TRUE

This type of test is even more useful here because r_and_z() uses our own implementation of the confidence interval calculations, rather than relying on R’s built-in functions as we did with ANOVA_Welch_F().

7.3.2 Checking novel procedures

Simulations are usually an integral part of projects to develop novel statistical methods. Checking estimation functions in such projects presents a challenge: if an estimation procedure truly is new, how do you check that your code is correct? Effective methods for doing so will vary from problem to problem, but an over-arching strategy is to use theoretical results about the performance of the estimator to check that your implementation works as expected. For instance, we might work out the algebraic properties of an estimator for a special case and then check that the result of the estimation function agrees with our algebra. For some estimation problems, we might be able to identify theoretical properties of an estimator when applied to a very large sample of data and when the model is correctly specified. If we can find results about large-sample behavior, then we can test an estimation function by applying it to a very large sample and checking whether the resulting estimates are very close to specified parameter values. We illustrate each of these approaches using our functions for estimating treatment effects from cluster-randomized trials.

We start by testing an algebraic property. For each of the three methods we have implemented, the treatment effect estimator is the difference between weighted averages of the outcomes for students in the two treatment conditions; the estimators differ only in what weights are used. In the special case where all schools have the same number of students, the weights used by all three methods end up being the same: each method allocates equal weight to each school. Therefore, there should be no difference between the three point estimates. Furthermore, a bit of algebra shows that the cluster-robust standard error from the OLS approach will be identical to the robust standard error from the aggregation approach. If equal numbers of schools are assigned to each condition, then the standard error from the multilevel model will also be identical to the other standard errors.
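
To make the equal-size case concrete, write \(\bar{Y}_j\) for the mean outcome in school \(j\) and \(Z_j\) for the school-level treatment indicator, and suppose every school has exactly \(n\) students. Then (a sketch of the algebra, with notation introduced only for this illustration) each of the three point estimators reduces to a simple difference in averages of school means:

\[
\widehat{ATE} = \frac{1}{J_T} \sum_{j : Z_j = 1} \bar{Y}_j \;-\; \frac{1}{J_C} \sum_{j : Z_j = 0} \bar{Y}_j,
\]

where \(J_T\) and \(J_C\) are the numbers of treated and control schools.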

Let’s verify that our estimation functions produce results that are consistent with these theoretical properties. To do so, we will need to generate a dataset with equal cluster sizes, setting \(\alpha = 0\):

dat <- gen_cluster_RCT( 
  J=12, n_bar = 30, alpha = 0, p = 0.5, 
  gamma_0 = 0, gamma_1 = 0.2, gamma_2 = 0.2,
  sigma2_u = 0.4, sigma2_e = 0.6
)
table(dat$sid) # verify equal-sized clusters
## 
##  1  2  3  4  5  6  7  8  9 10 11 12 
## 30 30 30 30 30 30 30 30 30 30 30 30
estimate_Tx_Fx(dat)
## # A tibble: 3 × 4
##   estimator ATE_hat SE_hat p_value
##   <chr>       <dbl>  <dbl>   <dbl>
## 1 MLM        -0.525  0.366   0.183
## 2 OLS        -0.525  0.366   0.183
## 3 agg        -0.525  0.366   0.183

All three methods yield identical results. Now let’s try equal school sizes but unequal allocation to treatment:

dat <- gen_cluster_RCT( 
  J=12, n_bar = 30, alpha = 0, p = 2 / 3, 
  gamma_0 = 0, gamma_1 = 0.2, gamma_2 = 0.2,
  sigma2_u = 0.4, sigma2_e = 0.6
)
estimate_Tx_Fx(dat)
## # A tibble: 3 × 4
##   estimator ATE_hat SE_hat p_value
##   <chr>       <dbl>  <dbl>   <dbl>
## 1 MLM         0.189  0.547   0.737
## 2 OLS         0.189  0.413   0.663
## 3 agg         0.189  0.413   0.657

As expected, all three point estimates match, but the SE from the multilevel model differs slightly from the others.

We can also use large-sample theory to check the multilevel modeling estimator. If the model is correctly specified, then all the parameters of the model should be accurately estimated if the model is fit to a very large sample of data. To check this property, we will need access to the full model output, not just the selected results returned by analysis_MLM(). One way to handle this is to make a small tweak to the estimation function, adding an option to control whether to return the entire model or just selected results. Here is the tweaked function:

analysis_MLM <- function( dat, all_results = FALSE) {
  
  M1 <- lme4::lmer( Yobs ~ 1 + Z + (1 | sid), data = dat )
  M1_test <- lmerTest::as_lmerModLmerTest(M1)
  
  if (all_results) {
    return(summary(M1_test))
  } 
  
  M1_summary <- summary(M1_test)$coefficients
  
  tibble( 
    ATE_hat = M1_summary["Z","Estimate"], 
    SE_hat = M1_summary["Z","Std. Error"], 
    p_value = M1_summary["Z", "Pr(>|t|)"] 
  )
}

Setting all_results to TRUE will return the entire model summary; keeping it at the default value of FALSE will return the same output as the other estimation functions. Now let’s apply the estimation function to a very large dataset, with variation in cluster sizes. We set gamma_2 = 0 so that the estimation model is correctly specified:

dat <- gen_cluster_RCT( 
  J=5000, n_bar = 20, alpha = 0.9, p = 2 / 3, 
  gamma_0 = 2, gamma_1 = 0.30, gamma_2 = 0,
  sigma2_u = 0.4, sigma2_e = 0.6
)

analysis_MLM(dat, all_results = TRUE)
## Linear mixed model fit by REML. t-tests use
##   Satterthwaite's method [lmerModLmerTest]
## Formula: Yobs ~ 1 + Z + (1 | sid)
##    Data: dat
## 
## REML criterion at convergence: 242678
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.3080 -0.6581  0.0007  0.6594  4.2134 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  sid      (Intercept) 0.3972   0.6302  
##  Residual             0.6030   0.7765  
## Number of obs: 98772, groups:  sid, 5000
## 
## Fixed effects:
##               Estimate Std. Error         df
## (Intercept)    2.01677    0.01636 4968.87446
## Z              0.28378    0.02003 4963.88795
##             t value Pr(>|t|)    
## (Intercept)  123.29   <2e-16 ***
## Z             14.17   <2e-16 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##   (Intr)
## Z -0.817

The intercept and treatment effect coefficient estimates are very close to their true parameter values, as are the estimated school-level variance and student-level residual variance. This all gives some assurance that the analysis_MLM() function is working properly.
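
If we wanted to automate this kind of check (say, as part of a script we re-run whenever the code changes), we could turn the eyeball comparison into a loose assertion. Here is a minimal sketch, with a tolerance chosen generously relative to the reported standard error:

res <- analysis_MLM(dat)
stopifnot( abs( res$ATE_hat - 0.30 ) < 0.10 )  # true gamma_1 = 0.30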

Of course, it is important to bear in mind that these tests are only partial verifications. With the algebraic test, we’ve checked that the functions seem to be working properly for scenarios with equal school sizes, but they still might have errors that only appear when school sizes vary. Likewise, analysis_MLM() seems to be working properly for very large datasets, but our test does not rule out the possibility of bugs that only crawl out when \(J\) is small. Our large-sample test also relies on the correctness of the gen_cluster_RCT() function; if we had seen a discrepancy between parameters and estimates from the multilevel model, it could have been because of a problem with the data-generation function rather than with the estimation function.

These limitations are typical of what can be accomplished through tests based on theoretical results, because theoretical results typically only hold under specific conditions. After all, if we had comprehensive theoretical results, we would not need to simulate anything in the first place! Nonetheless, it is good to work through such tests to the extent that relevant theory is available for the problem you are studying.

7.3.3 Checking with simulations

Checking, debugging, and revising should not be limited to when you are initially developing estimation functions. It often happens that later steps in the process of conducting a simulation will reveal problems with the code for earlier steps. For instance, once you have run the data-generating and estimation steps repeatedly, calculated performance summaries, and created some graphs of the results, you might find an unusual or anomalous pattern in the performance of an estimator. This might be a legitimate result—it might be that the estimator really does behave weirdly or not work well—or it might be due to a problem in how you implemented the estimator or data-generating process. When faced with an unusual pattern, we recommend revisiting the estimation code to double check for bugs and also thinking further about what might lead to the anomaly. Further exploration might lead you to a deeper understanding of how a method works and perhaps even an idea for how to improve the estimator or refine the data-generating process.

A good illustration of this process comes from one of Luke’s past research projects (see Pashley, Keele, and Miratrix (2024)), in which he and other co-authors were working on a way to improve Instrumental Variable (IV) estimation using post-stratification. The method they studied involved grouping units based on a covariate that predicts compliance status, then calculating estimates within each group, then summarizing the estimates across groups. They used simulations to see whether this method would improve the accuracy of the overall summary effect estimate. In the first simulation, the estimates were full of NAs and odd results because the estimation function failed to account for what happens in groups of observations where the number of compliers was estimated to be zero. After repairing that problem and re-running everything, the simulation results still indicated serious and unexpected bias, which turned out to be due to an error in how the estimation function implemented the step of summarizing estimates across groups. After again correcting and re-running, the simulation results showed that the gains in accuracy from this new method were minimal, even when the groups were formed based on a variable that was almost perfectly predictive of compliance status. Eventually, they understood that the groups with very few compliers produced such unstable estimates that these spoiled the overall average estimate. This inspired them to revise their estimation strategy and introduce a method that dropped or down-weighted strata with few compliers, which ultimately strengthened the contribution of the work.

As this experience highlights, simulations seldom follow a single, well-defined trajectory. The point of conducting simulations is to help us, as researchers, learn about estimation methods so that we can analyze real data better. What we learn from simulation gives us a better understanding of the methods (potentially including a better understanding of theoretical results), leading to ideas for better methods and new scenarios to explore in further simulations. Of course, at some point one needs to step off this merry-go-round, write up the findings, cook dinner, and clean the bathroom. But, just like many other research endeavors, simulations follow a highly iterative process.

7.4 Handling errors, warnings, and other hiccups

Especially when working with more advanced estimation methods, it is possible that your estimation function will fail, throw an error, or return something uninterpretable for certain input datasets. For instance, maximum likelihood estimation often requires iterative, numerical optimization algorithms that sometimes fail to converge. This might happen rarely enough that it takes a while to even notice that it is a problem, but even quite rare things can occur when you run simulations with many thousands of repetitions. Less dire but still annoying, your estimation function might generate warnings, which can pile up if you are running many repetitions. In some cases, such warnings might also signal that the estimator produced a bad result, and it may not be clear whether we should retain this result (or include it in overall performance assessments). After all, the function tried to warn us that something is off!

Errors and warnings in estimation functions pose two problems, one purely technical and one conceptual. On a technical level, R functions stop running if they hit errors (though not warnings), so we need ways to handle the errors in order to get our simulations up and running. On a conceptual level, we need to decide how to use the information contained in errors and warnings, whether that be by further elaborating the estimation procedures to address different contingencies or by evaluating the performance of the estimators in a way that appropriately accounts for errors. We consider each of the problems here, then revisit the conceptual considerations in Chapter 9.

7.4.1 Capturing errors and warnings

Some estimation functions will require complicated or stochastic calculations that can sometimes produce errors. Intermittent errors can really be annoying and time-consuming if not addressed. To protect yourself, it is good practice to anticipate potential errors, preventing them from stopping code execution and allowing your simulations to keep running. We will demonstrate some techniques for error-handling using tools from the purrr package.

For illustrative purposes, consider the following error-prone function that sometimes returns what we want, sometimes returns NaN due to taking the square root of a negative number, and sometimes crashes completely because broken_code() does not exist:

my_complex_function <- function( param ) {
    
    vals <- rnorm( param, mean = 0.5 )
    if ( sum( vals ) > 5 ) {
        broken_code( 4 )
    } else {
        sqrt( sum( vals ) * sign( vals )[[1]] )
    }
}

Running it a few times produces some results and an occasional warning:

set.seed(3)
my_complex_function( 1 )
## [1] 0.6796568
my_complex_function( 10 )
## [1] 2.132087
my_complex_function( 5 )
## Warning in sqrt(sum(vals) * sign(vals)[[1]]):
## NaNs produced
## [1] NaN

Running it many times produces warnings, then an error:

resu <- replicate(20, my_complex_function( 7 ))
## Warning in sqrt(sum(vals) * sign(vals)[[1]]):
## NaNs produced
## Warning in sqrt(sum(vals) * sign(vals)[[1]]):
## NaNs produced
## Error in broken_code(4): could not find function "broken_code"

The purrr package includes a function called safely that makes it easy to trap errors. To use it, we feed the estimation function into safely() to create a new version:

my_safe_function <- safely( my_complex_function, otherwise = NA )
my_safe_function( 7 )
## $result
## [1] 2.175561
## 
## $error
## NULL

The safe version of the function returns a list with two entries: the result (or NULL if there was an error) and the error (or NULL if there was no error). safely() is an example of a functional (or an adverb), which takes a function and returns a new function that does something slightly different. We include otherwise = NA so that we always get a result, rather than NULL, when there is an error.

We can use the safe function repeatedly and it will always return a result:

resu <- replicate(20, my_safe_function( 7 ), simplify = FALSE)
## Warning in sqrt(sum(vals) * sign(vals)[[1]]):
## NaNs produced
## Warning in sqrt(sum(vals) * sign(vals)[[1]]):
## NaNs produced
## Warning in sqrt(sum(vals) * sign(vals)[[1]]):
## NaNs produced
## Warning in sqrt(sum(vals) * sign(vals)[[1]]):
## NaNs produced
## Warning in sqrt(sum(vals) * sign(vals)[[1]]):
## NaNs produced
resu <- transpose( resu )
unlist(resu$result)
##  [1] 1.3870195 0.5638654        NA 1.6995292
##  [5]        NA        NA       NaN 2.2121710
##  [9]        NA       NaN       NaN 1.8801925
## [13]       NaN 1.9154618       NaN        NA
## [17] 2.2245636        NA        NA 0.8854747

The transpose() function takes a list of lists and reorganizes it to give you a list of all the first elements, a list of all the second elements, and so on. This is very powerful for wrangling data, because we can then make a tibble with list columns like so:

tb <- tibble( result = unlist( resu$result ), error = resu$error )
head( tb, n = 4 )
## # A tibble: 4 × 2
##   result error     
##    <dbl> <list>    
## 1  1.39  <NULL>    
## 2  0.564 <NULL>    
## 3 NA     <smplErrr>
## 4  1.70  <NULL>

The purrr package includes several other functionals that are useful for handling errors and warnings. The possibly() wrapper will try to run a function and will return a specified value in the event of an error:

my_possible_function <- possibly( my_complex_function, otherwise = NA )
my_possible_function( 7 )
## [1] 1.506734
rs <- replicate(20, my_possible_function( 7 ))
## Warning in sqrt(sum(vals) * sign(vals)[[1]]):
## NaNs produced
## Warning in sqrt(sum(vals) * sign(vals)[[1]]):
## NaNs produced
## Warning in sqrt(sum(vals) * sign(vals)[[1]]):
## NaNs produced
rs
##  [1]        NA 0.7023915 0.9495728        NA
##  [5]        NA 1.8784947 0.9838187        NA
##  [9]        NA 1.7676572 1.4809897        NA
## [13] 2.1019082        NA       NaN       NaN
## [17] 1.8629289 1.3467997       NaN 1.3017348

It is essentially a simpler version of safely() that does not record error messages.

The quietly() functional bundles the result together with any console output, warnings, and messages, rather than printing them to the console:

my_quiet_function <- quietly( my_complex_function )

my_quiet_function( 1 )
## $result
## [1] 0.1724504
## 
## $output
## [1] ""
## 
## $warnings
## character(0)
## 
## $messages
## character(0)

This can be especially useful to reduce extraneous printing in a simulation, which can slow down code execution more than you might expect. However, quietly() does not trap errors:

rs <- replicate(20, my_quiet_function( 7 ))
## Error in broken_code(4): could not find function "broken_code"

Double-wrapping your function will handle both errors and warnings, but the structure it produces gets a bit complicated:

my_safe_quiet_function <- quietly( safely( my_complex_function, otherwise = NA ) )
my_safe_quiet_function(7)
## $result
## $result$result
## [1] NA
## 
## $result$error
## <simpleError in broken_code(4): could not find function "broken_code">
## 
## 
## $output
## [1] ""
## 
## $warnings
## character(0)
## 
## $messages
## character(0)

Even though the result is a bit of a mess, this structure provides all the pieces that we need to do further calculations on the result (when available), along with errors, warnings, and other output.
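
For reference, here is how we can pull the pieces of the double-wrapped result apart (a sketch using the my_safe_quiet_function() wrapper defined above):

res <- my_safe_quiet_function( 7 )
res$result$result   # the estimate, or NA if there was an error
res$result$error    # the error condition, or NULL if there was none
res$warnings        # character vector of any captured warnings
res$messages        # character vector of any captured messages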

To see how this works in practice, we will adapt our analysis_MLM() function, which makes use of lmer() for fitting a multilevel model. Currently, the estimation function sometimes prints messages to the console:

set.seed(101012)  # (I picked this to show a warning.)
dat <- gen_cluster_RCT( J = 50, n_bar = 100, sigma2_u = 0 )
mod <- analysis_MLM(dat)
## boundary (singular) fit: see help('isSingular')

Wrapping lmer() with quietly() makes it possible to catch such output and store it along with other results, as in the following:

quiet_lmer <- quietly(lme4::lmer)
qmod <- quiet_lmer( Yobs ~ 1 + Z + (1|sid), data=dat )
qmod
## $result
## Linear mixed model fit by REML ['lmerMod']
## Formula: Yobs ~ 1 + Z + (1 | sid)
##    Data: ..2
## REML criterion at convergence: 14026.44
## Random effects:
##  Groups   Name        Std.Dev.
##  sid      (Intercept) 0.0000  
##  Residual             0.9828  
## Number of obs: 5000, groups:  sid, 50
## Fixed Effects:
## (Intercept)            Z  
##   -0.013930    -0.008804  
## optimizer (nloptwrap) convergence code: 0 (OK) ; 0 optimizer warnings; 1 lme4 warnings 
## 
## $output
## [1] ""
## 
## $warnings
## character(0)
## 
## $messages
## [1] "boundary (singular) fit: see help('isSingular')\n"

However, the lmerTest package does not like the structure of the results, and produces an error:

lmerTest::as_lmerModLmerTest(qmod$result)
## Error in lmerTest::as_lmerModLmerTest(qmod$result): Unable to extract deviance function from model fit

We can side-step this by combining as_lmerModLmerTest() and lmer() into a single function. While we are at it, we also layer on summary(). To do so, we use the compose() functional from purrr, which takes a set of functions and chains them into one:

lmer_with_test <- purrr::compose(
  summary,
  lmerTest::as_lmerModLmerTest, 
  lme4::lmer
)
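
If the order of composition is confusing, a toy example may help: compose() applies its arguments from last to first, so the last function listed is the first one applied.

double <- function(x) 2 * x
add_one <- function(x) x + 1
add_one_then_double <- purrr::compose(double, add_one)
add_one_then_double(3)   # double(add_one(3))
## [1] 8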

The resulting lmer_with_test() function acts as if we were calling lmer(), then feeding the result into as_lmerModLmerTest(), then feeding the result into summary(). We then wrap the combination function with safely() and quietly():

quiet_safe_lmer <- purrr::quietly(purrr::safely(lmer_with_test))

Now we can use our suitably quieted and safe function in a new version of the estimation function:

analysis_MLM_safe <- function( dat, all_results = FALSE ) {
  
  M1 <- quiet_safe_lmer( Yobs ~ 1 + Z + (1 | sid), data=dat )
  
  if (all_results) {
    return(M1)
  } 
  
  message <- ifelse( length( M1$messages ) > 0, M1$messages, NA_character_ )
  warning <- ifelse( length( M1$warnings ) > 0, M1$warnings, NA_character_ )
  error <- ifelse( length( M1$result$error ) > 0, M1$result$error$message, NA_character_ )
  
  tibble( 
    ATE_hat = M1$result$result$coefficients["Z","Estimate"], 
    SE_hat = M1$result$result$coefficients["Z","Std. Error"], 
    p_value = M1$result$result$coefficients["Z", "Pr(>|t|)"],
    message = message,
    warning = warning,
    error = error
  )
}

This quiet version runs without extraneous messages:

mod <- analysis_MLM_safe(dat)
mod
## # A tibble: 1 × 6
##    ATE_hat SE_hat p_value message    warning error
##      <dbl>  <dbl>   <dbl> <chr>      <chr>   <chr>
## 1 -0.00880 0.0278   0.751 "boundary… <NA>    <NA>

Now we have the estimation results along with any diagnostic information from messages or warnings. Storing this information will let us evaluate what proportion of the time there was a warning or message, run additional analyses on the subset of replications where there was no such warning, or even modify the estimation procedure to take the diagnostics into account.
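
As a preview of how this pays off (a sketch, assuming results is a tibble of stacked replications of the kind we will build in Chapter 8), summarizing the stored diagnostics becomes a small calculation:

# `results` is a hypothetical tibble of stacked replications
results %>%
  dplyr::summarise(
    prop_message = mean( !is.na( message ) ),
    prop_warning = mean( !is.na( warning ) ),
    prop_error   = mean( !is.na( error ) )
  )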

7.4.2 Adapting estimation procedures for errors and warnings

So far, we have seen techniques for handling technical hiccups that occur when data analysis procedures do not always produce results. But how do we account for the absence of results in a simulation? In Chapter 9, we will delve into the conceptual issues with summarizing the performance of methods that do not always provide an answer. One of the best solutions to such problems still concerns the formulation of estimation functions, and so we introduce it here. That solution is to re-define the estimator to include contingencies for handling lack of results.

Consider a data analyst who was planning to apply a fancy statistical model to their data, but then finds that the model does not converge. What would that analyst do in practice (besides cussing and taking a snack break)? Rather than giving up entirely, they would probably think of an alternative analysis and attempt to apply it, perhaps by simplifying the model in some way. To the extent that we can anticipate such possibilities, we can build these error-contingent alternative analyses into our estimation function.

To illustrate, let’s look at an error (a not-particularly-subtle one) that can crop up in the cluster-randomized trial example when clusters are very small:

set.seed(65842)
tiny_dat <- gen_cluster_RCT( J = 10, n_bar = 2, alpha = 0.5)
analysis_MLM_safe(tiny_dat)
## # A tibble: 1 × 3
##   message warning error                           
##   <chr>   <chr>   <chr>                           
## 1 <NA>    <NA>    number of levels of each groupi…
table(tiny_dat$sid)
## 
##  1  2  3  4  5  6  7  8  9 10 
##  1  1  1  1  1  1  1  1  1  1

The error occurs because all 10 simulated schools include a single student, making it impossible to estimate a random-intercepts multilevel model. A natural fall-back analysis here would be an ordinary least squares regression.

Suppose that our imaginary analyst is not especially into nuance, and so will fall back onto ordinary least squares whenever the multilevel model produces an error. We can express this logic in our estimation function by first catching the error thrown by lmer() and then running an OLS regression in the event an error is thrown:

analysis_MLM_contingent <- function( dat, all_results = FALSE ) {
  
  M1 <- quiet_safe_lmer( Yobs ~ 1 + Z + (1 | sid), data=dat )
  
  if (all_results) {
    return(M1)
  } 
  
  if (!is.null(M1$result$result)) { 
    # If lmer() returns a result
    res <- tibble( 
      ATE_hat = M1$result$result$coefficients["Z","Estimate"], 
      SE_hat = M1$result$result$coefficients["Z","Std. Error"], 
      p_value = M1$result$result$coefficients["Z", "Pr(>|t|)"]
    )
  } else {
    # If lmer() errors, fall back on OLS
    M_ols <- summary(lm(Yobs ~ Z, data = dat))
    res <- tibble( 
      ATE_hat = M_ols$coefficients["Z","Estimate"], 
      SE_hat = M_ols$coefficients["Z", "Std. Error"], 
      p_value = M_ols$coefficients["Z","Pr(>|t|)"]
    )
  }

  # Store original messages, warnings, errors  
  res$message <- ifelse( length( M1$messages ) > 0, M1$messages, NA_character_ )
  res$warning <- ifelse( length( M1$warnings ) > 0, M1$warnings, NA_character_ )
  res$error <- ifelse( length( M1$result$error ) > 0, M1$result$error$message, NA_character_ )
  
  return(res)
}

We still store the messages, warnings, and errors from the initial lmer() fit so that we can keep track of how often errors occur. The function now returns a treatment effect estimate even if lmer() errors:

analysis_MLM_contingent(tiny_dat)
## # A tibble: 1 × 6
##   ATE_hat SE_hat p_value message warning error    
##     <dbl>  <dbl>   <dbl> <chr>   <chr>   <chr>    
## 1   0.400  0.603   0.525 <NA>    <NA>    number o…

Of course, we can easily anticipate the conditions under which this particular error will occur: all we need to do is check whether all the clusters are single observations. Because it is easily anticipated, a better strategy for handling this error is to check before fitting the multilevel model and proceeding accordingly in the event that the clusters are all singletons. Exercise 7.5.5 asks you to implement this approach and further refine this contingent analysis strategy.

Adapting estimation functions to address errors can be an effective—and often very interesting—strategy for studying the performance of estimation methods. Rather than studying the performance of a data-analysis method that is only sometimes well-defined, we shift to studying a stylized cognitive model for the analyst’s decision-making process, which handles contingencies that might crop up whether analyzing simulated data or real empirical data. Of course, studying such a model is only interesting to the extent that the decision-making process it implements is a plausible representation of what an analyst might actually do in practice.

The adaptive estimation approach does lead to more complex estimation functions, which entail implementing multiple estimation methods and a set of decision rules for applying them. Often, the set of contingencies that need to be handled will not be immediately obvious, so you may find that you need to build and refine the decision rules as you learn more about how they work. Running an estimation procedure over multiple simulated datasets is an excellent (if aggravating!) way to identify errors and edge cases. We turn to procedures for doing so in the next chapter.

7.5 Exercises

7.5.1 More Heteroskedastic ANOVA

In the classic simulation by Brown and Forsythe (1974), they not only looked at the performance of the homoskedastic ANOVA-F test and Welch’s heteroskedastic-F test, they also proposed their own new hypothesis testing procedure.

  1. Write a function that implements the Brown-Forsythe F* test (the BFF* test!) as described on p. 130 of Brown and Forsythe (1974), using the following code skeleton:

    BF_F <- function( data ) {
    
      # fill in the guts here
    
      return(pval)
    }

    Run the following code to check that your function produces a result:

    sim_data <- generate_ANOVA_data(
      mu = c(1, 2, 5, 6), 
      sigma_sq = c(3, 2, 5, 1),
      sample_size = c(3, 6, 2, 4)
    )
    BF_F( sim_data )
  2. Try calling your BF_F function on a variety of datasets of different sizes and shapes, to make sure it works. What kinds of datasets should you test out?

  3. Add the BFF* test into the output of ANOVA_Welch_F() by calling your BF_F() function inside the body of ANOVA_Welch_F().

  4. The onewaytests package implements a variety of different hypothesis testing procedures for one-way ANOVA. Validate your ANOVA_Welch_F() function by comparing the results to the output of the relevant functions from onewaytests.

7.5.2 Contingent testing

In the one-way ANOVA problem, one approach that an analyst might think to take is to conduct a preliminary significance test for heterogeneity of variances (such as Levene’s test or Bartlett’s test), and then report the \(p\)-value from the homoskedastic ANOVA F test if variance heterogeneity is not detected but the \(p\)-value from the BFF* test if variance heterogeneity is detected. Modify the ANOVA_Welch_F() function to return the \(p\)-value from this contingent BFF* test in addition to the \(p\)-values from the (non-contingent) ANOVA-F, Welch, and BFF* tests. Include an input option that allows the user to control the \(\alpha\) level of the preliminary test for heterogeneity of variances, as in the following skeleton.

ANOVA_Welch_F <- function( data, pre_alpha = .05 ) {
  # preliminary test for variance heterogeneity
  # compute non-contingent F tests for group differences
  # compute contingent test
  # compile results
  return(result)
}

7.5.3 Check the cluster-RCT functions

Section 7.2 presented functions implementing several different strategies for estimating an average treatment effect from a cluster-randomized trial. Write some code to validate these functions by comparing their output to the results of other tools for doing the same calculation. Use one or more datasets simulated with gen_cluster_RCT(). For each of these tests, you will need to figure out the appropriate syntax by reading the package documentation.

  1. For analysis_MLM(), check the output by fitting the same model using lme() from the nlme package or glmmTMB() from the package of the same name.

  2. For analysis_OLS(), check the output by fitting the linear model using the base R function lm(), then computing standard errors using vcovCL() from the sandwich package. Also compare the output to the results of feeding the fitted model through coef_test() from the clubSandwich package.

  3. For analysis_agg(), check the output by aggregating the data to the school-level, fitting the linear model using lm(), and computing standard errors using vcovHC() from the sandwich package.

7.5.4 Extending the cluster-RCT functions

Exercise 6.9.10 from Chapter 6 asked you to extend the data-generating function for the cluster-randomized trial to include generating a student-level covariate, \(X\), that is predictive of the outcome. Use your modified function to generate a dataset.

  1. Modify the estimation functions from Section 7.2 to use models that include the covariate as a predictor.

  2. Further extend the functions to include an input argument for the set of predictors to be included in the model, as in the following skeleton for the multi-level model estimator:

    analysis_MLM <- function(dat, predictors = "Z") {
    
    }
    
    analysis_MLM( dat )
    analysis_MLM( dat, predictors = c("Z","X"))
    analysis_MLM( dat, predictors = c("Z","X", "X:Z"))

    Hint: Check out the reformulate() function, which makes it easy to build formulas for different sets of predictors.

7.5.5 Contingent estimator processing

In Section 7.4.2 we developed a version of analysis_MLM() that fell back on OLS regression in the event that lmer() produced any error. The implementation that we demonstrated is not especially smart. For one, it does not anticipate that the error will occur. For another, if we are using this function as one of several estimation strategies, it will require fitting the OLS regression multiple times. Can you fix these problems?

  1. Revise analysis_MLM_contingent() so that it checks whether all clusters are single observations before fitting the multilevel model. Handle the contingency where all clusters are singletons by skipping the model-fitting step, returning NA values for the point estimator, standard error, and p-value, and returning an informative message in the error variable. Test that the function is working as expected.

  2. Revise estimate_Tx_Fx() to use your new version of analysis_MLM_contingent(). The revised function will sometimes return NA values for the MLM results. To implement the strategy of falling-back on OLS regression, add some code that replaces any NA values with corresponding results of analysis_OLS(). Test that the function is working as expected.

7.5.6 Estimating 3-parameter item response theory models

Exercise 6.9.11 asked you to write a data-generating function for the 3-parameter IRT model described in Section 6.7. Use your function to generate a large dataset. Then, using functions from the {ltm}, {mirt}, or {TAM} packages, estimate the parameters of the 3-parameter IRT model based on the simulated dataset.

  1. Write a function to estimate a 3-parameter IRT model and return a dataset containing estimates of the item characteristics \((\alpha_m,\beta_m, \gamma_m)\).

  2. Add an option to the function to allow the user to specify known values of \(\gamma_m\).

  3. Create a graphic showing how the item parameter estimates compare to the true item characteristics.

  4. Write a function or set of functions to apply 1-parameter, 2-parameter, and 3-parameter models and return datasets containing the person ability estimates \(\theta_1,...,\theta_N\) and corresponding standard errors of measurement.

7.5.7 Meta-regression with selective reporting

Exercise 6.9.15 asked you to write a data-generating function for the Vevea and Hedges (1995) selection model. The {metafor} package includes a function for fitting this model (as well as a variety of other selection models). Here is an example of the syntax for estimating this model, using a dataset from the {metadat} package:

library(metafor)
data("dat.assink2016", package = "metadat")

# rename variables and tidy up
dat <- 
  dat.assink2016 %>%
  mutate( si = sqrt(vi) ) %>%
  rename( Ti = yi, s_sq = vi, Xi = year )

# fit a random effects meta-regression model
rma_fit <- rma.uni(yi = Ti, sei = si, mods = ~ Xi, data = dat)
# fit two-step selection model
selmodel(rma_fit, type = "step", steps = c(.025, .500))
## 
## Mixed-Effects Model (k = 100; tau^2 estimator: ML)
## 
## tau^2 (estimated amount of residual heterogeneity): 0.2314 (SE = 0.0500)
## tau (square root of estimated tau^2 value):         0.4810
## 
## Test for Residual Heterogeneity:
## LRT(df = 1) = 349.3718, p-val < .0001
## 
## Test of Moderators (coefficient 2):
## QM(df = 1) = 42.1433, p-val < .0001
## 
## Model Results:
## 
##          estimate      se     zval    pval 
## intrcpt    0.4149  0.1013   4.0942  <.0001 
## Xi        -0.0782  0.0120  -6.4918  <.0001 
##            ci.lb    ci.ub      
## intrcpt   0.2163   0.6135  *** 
## Xi       -0.1018  -0.0546  *** 
## 
## Test for Selection Model Parameters:
## LRT(df = 2) = 1.7814, p-val = 0.4104
## 
## Selection Model Results:
## 
##                      k  estimate      se     zval 
## 0     < p <= 0.025  59    1.0000     ---      --- 
## 0.025 < p <= 0.5    23    0.6990  0.2307  -1.3049 
## 0.5   < p <= 1      18    0.5027  0.2743  -1.8132 
##                       pval   ci.lb   ci.ub    
## 0     < p <= 0.025     ---     ---     ---    
## 0.025 < p <= 0.5    0.1919  0.2470  1.1511    
## 0.5   < p <= 1      0.0698  0.0000  1.0403  . 
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The selmodel() function can also be used to fit selection models in which one or both of the selection parameters are fixed to user-specified values. For example:

# Fixing lambda_1 = 0.5, lambda_2 = 0.2
fix_both <- selmodel(rma_fit, type = "step", steps = c(.025, .500), 
                     delta = c(1, 0.5, 0.2))

# Fixing lambda_1 = 0.5, estimating lambda_2
fix_one <- selmodel(rma_fit, type = "step", steps = c(.025, .500), 
                    delta = c(1, 0.5, NA))

  1. Write an estimation function that fits the selection model and returns estimates, standard errors, and confidence intervals for each of the model parameters \((\beta_0,\beta_1,\tau,\lambda_1,\lambda_2)\).

  2. The selmodel() fitting function sometimes produces warnings, as in the following example:

    dat_sig <- dat %>% filter(Ti / si > 0)
    rma_pos <- rma.uni(yi = Ti, sei = si, mods = ~ Xi, data = dat_sig)
    # fit two-step selection model
    sel_fit <- selmodel(rma_pos, type = "step", steps = c(.025, .500))
    ## Warning: One or more intervals do not contain any
    ## observed p-values.
    ## Warning: One or more 'delta' estimates are (almost) equal to their lower or upper bound.
    ## Treat results with caution (or consider adjusting 'delta.min' and/or 'delta.max').

    Modify your estimation function to catch and return warnings such as these. Write code demonstrating that the function works as expected.

  3. The selmodel() function throws warnings when the dataset contains no observations with \(p_i\) in one of the specified intervals. Modify your estimation function to set the selection probability to \(\lambda_1 = 0.1\) if there are no \(p_i\) values between .025 and .500 and to \(\lambda_2 = 0.1\) if there are no \(p_i\) values larger than .500. Write code demonstrating that the function works as expected.

References

Pashley, Nicole E, Luke Keele, and Luke W Miratrix. 2024. “Improving Instrumental Variable Estimators with Poststratification.” Journal of the Royal Statistical Society Series A: Statistics in Society, qnae073.
Vevea, Jack L, and Larry V Hedges. 1995. “A General Linear Model for Estimating Effect Size in the Presence of Publication Bias.” Psychometrika 60 (3): 419–35. https://doi.org/10.1007/BF02294384.