Chapter 5 Case Study: Heteroskedastic ANOVA

In this chapter, we present another detailed example of a simulation study to demonstrate how to put the principles of tidy, modular simulation into practice. Specifically, we reconstruct the simulations reported in Brown and Forsythe (1974). We also return to this case study as a recurring example in several of the following chapters.

Brown and Forsythe (1974) studied methods for null hypothesis testing in studies that measure a characteristic \(X\) on samples from each of several groups. They consider a population consisting of \(G\) separate groups, with population means \(\mu_1,...,\mu_G\) and population variances \(\sigma_1^2,...,\sigma_G^2\) for the characteristic \(X\). We obtain samples of size \(n_1,...,n_G\) from the respective groups and take measurements of the characteristic for each sampled unit. Let \(x_{ig}\) denote the measurement from unit \(i\) in group \(g\), for \(i = 1,...,n_g\) and \(g = 1,...,G\). The analyst’s goal is to use the sample data to test the hypothesis that the population means are all equal: \[ H_0: \mu_1 = \mu_2 = \cdots = \mu_G. \]

If the population variances were all equal (so \(\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_G^2\)), we could use a conventional one-way analysis of variance (ANOVA) to test this hypothesis. However, one-way ANOVA might not work well if the variances are not equal. The question, then, is what the best practice is for testing the null hypothesis of equal group means while allowing for the possibility that the variances differ across groups (a form of heteroskedasticity).

To tackle this question, Brown and Forsythe evaluated two different hypothesis testing procedures, developed by James (1951) and Welch (1951), which avoid the assumption that all groups have equal variances. Brown and Forsythe also evaluated the conventional one-way ANOVA F-test as a benchmark, even though this procedure maintains the assumption of equal variances. They also proposed and evaluated a new procedure of their own devising.¹ Overall, the simulation involves comparing the performance of these different hypothesis testing procedures (the methods) under a range of conditions (different data-generating processes) with different sample sizes and different degrees of heteroskedasticity.

When evaluating hypothesis testing procedures, there are two main performance metrics of interest: type-I error rate and power. The type-I error rate is the rate at which a test rejects the null hypothesis when the null hypothesis is actually true. To apply a hypothesis testing procedure, one has to specify a desired, or nominal, type-I error rate, often denoted as the \(\alpha\)-level. For a specified \(\alpha\), a valid or well-calibrated test should have an actual type-I error rate no greater than the nominal level, and ideally very close to it. Power is the rate at which a test correctly rejects the null hypothesis when it is in fact false; it measures how sensitive a method is to violations of the null.

Brown and Forsythe estimated error rates and power for nominal \(\alpha\)-levels of 1%, 5%, and 10%. Table 1 of their paper reports the simulation results for type-I error (labeled as “size”). They examined the scenarios shown in Table 5.1, varying the number of groups, the group sample sizes, and the amount of variation within each group. Table 5.2 reproduces some of the type-I error results from their Table 1 for these scenarios; for a properly calibrated method, the reported rejection rates should be close to the nominal \(\alpha\)-levels listed at the top of the table. We can compare the four tests to each other across each row, where each row is a specific scenario defined by a specific data-generating process. Looking at the ANOVA F-test, for example, we see some scenarios with badly inflated rejection rates: in Scenario E, the test rejects 21.9% of the time when it should reject only 10% of the time. By contrast, Scenario A looks fine, which is what we expect because all of the groups have the same variance. It is wise to always include conditions where we expect methods to work, as well as conditions where we expect them to break down, in our simulations.

Table 5.1: Simulation scenarios explored by Brown and Forsythe (1974). A blank sample-size entry indicates the same sample sizes as the scenario above it.

| Scenario | Groups | Sample Sizes | Standard Deviations |
|----------|--------|--------------|---------------------|
| A | 4 | 4,4,4,4 | 1,1,1,1 |
| B | 4 |  | 1,2,2,3 |
| C | 4 | 4,8,10,12 | 1,1,1,1 |
| D | 4 |  | 1,2,2,3 |
| E | 4 |  | 3,2,2,1 |
| F | 4 | 11,11,11,11 | 1,1,1,1 |
| G | 4 |  | 1,2,2,3 |
| H | 4 | 11,16,16,21 | 1,1,1,1 |
| I | 4 |  | 3,2,2,1 |
| J | 4 |  | 1,2,2,3 |
| K | 6 | 4,4,4,4,4,4 | 1,1,1,1,1,1 |
| L | 6 |  | 1,1,2,2,3,3 |
| M | 6 | 4,6,6,8,10,12 | 1,1,1,1,1,1 |
| N | 6 |  | 1,1,2,2,3,3 |
| O | 6 |  | 3,3,2,2,1,1 |
| P | 6 | 6,6,6,6,6,6 | 1,1,2,2,3,3 |
| Q | 6 | 11,11,11,11,11,11 | 1,1,2,2,3,3 |
| R | 6 | 16,16,16,16,16,16 | 1,1,2,2,3,3 |
| S | 6 | 21,21,21,21,21,21 | 1,1,2,2,3,3 |
| T | 10 | 20,20,20,20,20,20,20,20,20,20 | 1,1,1.5,1.5,2,2,2.5,2.5,3,3 |
Table 5.2: Portion of “Table 1” reproduced from Brown and Forsythe (1974). Entries are estimated type-I error rates (in percent) at nominal \(\alpha\)-levels of 10%, 5%, and 1% for each test.

| Scenario | ANOVA F 10% | ANOVA F 5% | ANOVA F 1% | B & F’s F* 10% | B & F’s F* 5% | B & F’s F* 1% | Welch 10% | Welch 5% | Welch 1% | James 10% | James 5% | James 1% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 10.2 | 4.9 | 0.9 | 7.8 | 3.4 | 0.5 | 9.6 | 4.5 | 0.8 | 13.3 | 7.9 | 2.4 |
| B | 12.0 | 6.7 | 1.7 | 8.9 | 4.1 | 0.7 | 10.3 | 4.7 | 0.8 | 13.8 | 8.1 | 2.7 |
| C | 9.9 | 5.1 | 1.1 | 9.5 | 4.8 | 1.0 | 10.8 | 5.7 | 1.6 | 12.1 | 6.7 | 2.1 |
| D | 5.9 | 3.0 | 0.6 | 10.3 | 5.7 | 1.4 | 9.8 | 4.9 | 0.9 | 10.8 | 5.6 | 1.3 |
| E | 21.9 | 14.4 | 5.6 | 11.0 | 6.2 | 1.8 | 11.3 | 6.5 | 2.0 | 12.9 | 7.7 | 2.9 |
| F | 10.1 | 5.1 | 1.0 | 9.8 | 5.7 | 1.5 | 10.0 | 5.0 | 0.9 | 10.6 | 5.5 | 1.1 |
| G | 11.4 | 6.3 | 1.8 | 10.7 | 5.7 | 1.5 | 10.1 | 5.0 | 1.1 | 10.6 | 5.4 | 1.3 |
| H | 10.3 | 4.9 | 1.1 | 10.3 | 5.1 | 1.0 | 10.2 | 5.0 | 1.0 | 10.5 | 5.3 | 1.2 |
| I | 17.3 | 10.8 | 3.9 | 11.1 | 6.2 | 1.8 | 10.5 | 5.5 | 1.2 | 10.9 | 5.8 | 1.3 |
| J | 7.3 | 4.0 | 1.0 | 11.5 | 6.5 | 1.8 | 10.6 | 5.4 | 1.1 | 10.9 | 5.6 | 1.1 |
| K | 9.6 | 4.9 | 1.0 | 7.3 | 3.4 | 0.4 | 11.4 | 6.1 | 1.4 | 14.7 | 9.5 | 3.8 |

To replicate the Brown and Forsythe simulation, we will first write functions to generate data for a specified scenario and to analyze data with a given structure. We will then use these functions to build a simulation that evaluates the hypothesis testing procedures in a specific scenario with a specific set of core parameters (e.g., sample sizes, number of groups, and so forth). Finally, we will scale up to a range of scenarios in which we vary the parameters of the data-generating model.

5.1 The data-generating model

In the heteroskedastic one-way ANOVA simulation, there are three sets of parameter values: population means, population variances, and sample sizes. Rather than attempting to write a general data-generating function immediately, it is often easier to write code for a specific case first and then use that code as a starting point for developing a function. For example, say that we have four groups with means of 1, 2, 5, 6; variances of 3, 2, 5, 1; and sample sizes of 3, 6, 2, 4:

mu <- c(1, 2, 5, 6)
sigma_sq <- c(3, 2, 5, 1)
sample_size <- c(3, 6, 2, 4)

Following Brown and Forsythe (1974), we will assume that the measurements are normally distributed within each sub-group of the population. The following code generates a vector of group IDs and a vector of simulated measurements:

N <- sum(sample_size) # total sample size
G <- length(sample_size) # number of groups

# group id factor
group <- factor(rep(1:G, times = sample_size))

# mean for each unit of the sample
mu_long <- rep(mu, times = sample_size) 

# sd for each unit of the sample
sigma_long <- rep(sqrt(sigma_sq), times = sample_size) 

# See what we have?
tibble( group = group, mu = mu_long, sigma = sigma_long )
## # A tibble: 15 × 3
##    group    mu sigma
##    <fct> <dbl> <dbl>
##  1 1         1  1.73
##  2 1         1  1.73
##  3 1         1  1.73
##  4 2         2  1.41
##  5 2         2  1.41
##  6 2         2  1.41
##  7 2         2  1.41
##  8 2         2  1.41
##  9 2         2  1.41
## 10 3         5  2.24
## 11 3         5  2.24
## 12 4         6  1   
## 13 4         6  1   
## 14 4         6  1   
## 15 4         6  1
# Now make our data
x <- rnorm(N, mean = mu_long, sd = sigma_long)
dat <- tibble(group = group, x = x)
dat
## # A tibble: 15 × 2
##    group      x
##    <fct>  <dbl>
##  1 1      1.24 
##  2 1      3.07 
##  3 1     -0.681
##  4 2      2.43 
##  5 2      2.50 
##  6 2      2.15 
##  7 2      0.612
##  8 2      0.860
##  9 2      2.09 
## 10 3      1.56 
## 11 3      5.08 
## 12 4      5.68 
## 13 4      5.66 
## 14 4      5.92 
## 15 4      4.38

We have made a small dataset of group membership and outcome. We followed the strategy of first constructing a dataset with the parameters for each observation in each group, making heavy use of base R’s rep() function to repeat values in a vector. We then called rnorm() to generate all N observations at once. This works because rnorm() is vectorized: if you give it vectors of parameter values, it generates each observation using the corresponding entries of those vectors. As a result, the first value of x is simulated from a normal distribution with mean mu_long[1] and standard deviation sigma_long[1], the second from mu_long[2] and sigma_long[2], and so on.
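
To see this vectorization on a minimal example, we can hand rnorm() a short vector of means and standard deviations directly (the specific numbers here are arbitrary, purely for illustration):

# three draws, each from a normal distribution with its own mean and sd
rnorm(3, mean = c(0, 10, 100), sd = c(1, 5, 10))

Each draw uses the corresponding entries of the mean and sd vectors, which is exactly what happens with mu_long and sigma_long above.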

As usual, there are many different and legitimate ways of doing this in R. For instance, instead of using rep() to build everything at once, we could generate the observations for each group separately and then stack them into a single dataset, as in the sketch below. Do not worry about trying to write code the “best” way, especially when you are initially putting a simulation together. We strongly believe in the adage that if you can do it at all, then you should feel good about yourself.
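
Here is one sketch of that group-by-group alternative (assuming, as elsewhere in this chapter, that the tidyverse packages are loaded; pieces and dat_alt are simply names we make up here):

# Simulate each group separately, then stack the pieces into one dataset
pieces <- lapply(1:G, function(g) {
  tibble(
    group = g,
    x = rnorm(sample_size[g], mean = mu[g], sd = sqrt(sigma_sq[g]))
  )
})
dat_alt <- bind_rows(pieces)
dat_alt$group <- factor(dat_alt$group)

Either way, we end up with the same kind of rectangular dataset of group memberships and outcomes.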

5.1.1 Now make a function

Because we will need to generate datasets over and over, we wrap our code in a function. The inputs to the function will be the parameters of the model that we specified at the very beginning: the set of population means mu, the population variances sigma_sq, and sample sizes sample_size. We make these quantities arguments of the data-generating function so that we can make datasets of different sizes and shapes:

generate_data <- function(mu, sigma_sq, sample_size) {

  N <- sum(sample_size)
  G <- length(sample_size)

  group <- factor(rep(1:G, times = sample_size))
  mu_long <- rep(mu, times = sample_size)
  sigma_long <- rep(sqrt(sigma_sq), times = sample_size)

  x <- rnorm(N, mean = mu_long, sd = sigma_long)
  sim_data <- tibble(group = group, x = x)

  return(sim_data)
}

The above code is simply the code we built previously, all bundled up. We developed the function by first writing code that makes the data-generating process work once, the way we want it, and only then turning that final code into a function for later reuse.

Once we have turned the code into a function, we can call it to get a new set of simulated data. For example, to generate a dataset with the same parameters as before, we can do:

sim_data <- generate_data(
  mu = mu, 
  sigma_sq = sigma_sq,
  sample_size = sample_size
)

sim_data
## # A tibble: 15 × 2
##    group     x
##    <fct> <dbl>
##  1 1     0.777
##  2 1     2.11 
##  3 1     1.31 
##  4 2     1.85 
##  5 2     3.04 
##  6 2     1.92 
##  7 2     1.51 
##  8 2     0.226
##  9 2     0.933
## 10 3     5.92 
## 11 3     5.64 
## 12 4     3.35 
## 13 4     5.30 
## 14 4     5.50 
## 15 4     7.40

To generate one with population means of zero in all the groups, but the same group variances and sample sizes as before, we can do:

sim_data_null <- generate_data(
  mu = c( 0, 0, 0, 0 ),
  sigma_sq = sigma_sq, 
  sample_size = sample_size
)

sim_data_null

Following the principles of tidy, modular simulation, we have written a function that returns a rectangular dataset for further analysis. Also note that the dataset returned by generate_data() includes only the variables group and x, not mu_long or sigma_long. This is by design. Including mu_long or sigma_long would amount to making the population parameters available to the data analysis procedures, which is not something that happens when analyzing real data.

5.1.2 Cautious coding

In the above, we built some sample code and then bundled it into a function by literally cutting and pasting our initial work into a function skeleton. In the process, we shifted from having variables in our workspace with particular names to using those same names as parameters in our function.

Developing code in this way is not without hazards. In particular, after we finish building the function, our workspace has a variable mu in it and our function also has a parameter named mu. Inside the function, R will use the parameter mu rather than the workspace variable, but the shared name is potentially confusing. So too are lines such as mu = mu, which means “set the function’s parameter called mu to the value of the workspace variable called mu.” These are different things that happen to have the same name.
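
To see the scoping rule in action, here is a toy example (the names a and f are made up purely for illustration):

a <- 100            # a variable in the workspace
f <- function(a) {  # a function with a parameter of the same name
  a                 # inside the function, the parameter is used
}
f(a = 3)            # returns 3, not 100
f(a = a)            # returns 100: set the parameter a to the workspace variable a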

One way to check your code, once the function is built, is to comment out (or delete) the initial exploratory code, restart R or at least clear out the workspace, and then re-run the code that uses the function. If things still work, you can be reasonably confident that you successfully bundled your code into the function.
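
A minimal version of this check might look like the following (the script name here is hypothetical; use whatever file holds your function definitions):

rm(list = ls())            # clear the workspace (or simply restart R)
source("generate_data.R")  # hypothetical script containing generate_data()
generate_data(
  mu = c(0, 0),
  sigma_sq = c(1, 4),
  sample_size = c(5, 5)
)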

You can also, once you have bundled your code, do a search-and-replace to change the variable names inside your function to something more generic, making the separation between workspace variables and function parameters clearer.

5.2 The hypothesis testing procedures

Brown and Forsythe considered four different hypothesis testing procedures for heteroskedastic ANOVA, but we will focus on just two of the tests for now. We start with the conventional one-way ANOVA, which (incorrectly, in our setting) assumes homoskedasticity. R’s oneway.test() function will calculate this test automatically:

sim_data <- generate_data(
  mu = mu, 
  sigma_sq = sigma_sq,
  sample_size = sample_size
)

anova_F <- oneway.test(x ~ group, data = sim_data, var.equal = TRUE)
anova_F
## 
##  One-way analysis of means
## 
## data:  x and group
## F = 8.9503, num df = 3, denom df = 11,
## p-value = 0.002738

We can use the same function to calculate Welch’s test by setting var.equal = FALSE:

Welch_F <- oneway.test(x ~ group, data = sim_data, var.equal = FALSE)
Welch_F
## 
##  One-way analysis of means (not assuming
##  equal variances)
## 
## data:  x and group
## F = 22.321, num df = 3.0000, denom df =
## 3.0622, p-value = 0.01399

The main results we need here are the \(p\)-values of the tests, which will let us assess Type-I error and power for a given nominal \(\alpha\)-level. The following function takes simulated data as input and returns as output the \(p\)-values from the one-way ANOVA test and Welch test:

ANOVA_Welch_F <- function(data) {
  anova_F <- oneway.test(x ~ group, data = data, var.equal = TRUE)
  Welch_F <- oneway.test(x ~ group, data = data, var.equal = FALSE)
  
  result <- tibble(
    ANOVA = anova_F$p.value,
    Welch = Welch_F$p.value
  )
  
  return(result)
}

ANOVA_Welch_F(sim_data)
## # A tibble: 1 × 2
##     ANOVA  Welch
##     <dbl>  <dbl>
## 1 0.00274 0.0140

Following our tidy, modular simulation principles, this function returns a small dataset with the p-values from both tests. Eventually, we might want to use this function on real data. Our estimation function does not care whether the data are simulated or not; we name the input argument data rather than sim_data to reflect this.
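
For example, nothing stops us from applying the function to a real dataset, so long as that dataset has columns named group and x. Here is a quick illustration using R’s built-in iris data, relabeled to match the expected column names:

real_data <- tibble(group = iris$Species, x = iris$Sepal.Length)
ANOVA_Welch_F(real_data)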

As an alternative to relying on oneway.test(), we could instead write code to implement the ANOVA and Welch tests ourselves. This has some potential advantages, such as avoiding the extraneous calculations that oneway.test() performs, which take time and slow down our simulation. However, there are also drawbacks, including that writing our own code takes time and opens up the possibility of errors. For further discussion of the trade-offs, see Chapter 17, where we do implement these tests by hand and see what kind of speed-ups we can obtain.
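
To give a flavor of what hand-coding would involve, here is a rough sketch of the one-way ANOVA F statistic computed directly from the standard sums-of-squares formulas (this is only an illustration, not necessarily the implementation used in Chapter 17; ANOVA_F_by_hand is a name we invent here):

# A sketch of a hand-coded ANOVA F test, returning just the p-value
ANOVA_F_by_hand <- function(data) {
  grand_mean <- mean(data$x)
  by_group <- data %>%
    group_by(group) %>%
    summarise(n = n(), m = mean(x), v = var(x))
  G <- nrow(by_group)
  N <- sum(by_group$n)
  SSB <- sum(by_group$n * (by_group$m - grand_mean)^2)  # between-group sum of squares
  SSW <- sum((by_group$n - 1) * by_group$v)             # within-group sum of squares
  Fstat <- (SSB / (G - 1)) / (SSW / (N - G))
  pf(Fstat, df1 = G - 1, df2 = N - G, lower.tail = FALSE)
}

If the formulas are coded correctly, ANOVA_F_by_hand(sim_data) should reproduce the p-value from oneway.test() with var.equal = TRUE.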

5.3 Running the simulation

We now have functions that implement steps 2 and 3 of the simulation. Given a set of parameters, generate_data() produces a simulated dataset and, given a dataset, ANOVA_Welch_F() calculates \(p\)-values in two different ways. We now want to know which way is better, and by how much. To answer this question, we need to repeat this chain of calculations many times.

We first make a function that puts this chain together into a single function:

one_run <- function( mu, sigma_sq, sample_size ) {
  sim_data <- generate_data(
    mu = mu, 
    sigma_sq = sigma_sq,
    sample_size = sample_size
  )
  ANOVA_Welch_F(sim_data)
}

one_run( mu = mu, sigma_sq = sigma_sq, sample_size = sample_size )
## # A tibble: 1 × 2
##    ANOVA Welch
##    <dbl> <dbl>
## 1 0.0167 0.107

This function implements a single simulation trial by generating artificial data and then analyzing the data, ending with a nice dataframe or tibble that has results for the single run.

We next call one_run() over and over; see A.1 for some discussion of the options for doing so. The following uses map_df() from the purrr package to run one_run() four times and then stack the results into a single data frame:

p_values <- map_df(
  1:4, 
  ~ one_run(
      mu = mu, 
      sigma_sq = sigma_sq, 
      sample_size = sample_size
    )
)
p_values
## # A tibble: 4 × 2
##     ANOVA  Welch
##     <dbl>  <dbl>
## 1 0.0262  0.0125
## 2 0.00451 0.0698
## 3 0.00229 0.0380
## 4 0.0108  0.0423

Voila! We have simulated \(p\)-values!

5.4 Summarizing Test Performance

We now have all the pieces in place to reproduce the results from Brown and Forsythe (1974). We first focus on calculating the actual type-I error rate of these tests, meaning the proportion of the time that they reject the null hypothesis of equal means when that null is actually true, for an \(\alpha\)-level of .05. We therefore need to simulate data from a process in which the population means are indeed all equal. Arbitrarily, we start with \(G = 4\) groups and set all of the means equal to zero:

mu <- rep(0, 4)

In the fifth row of Table 1 (Scenario E in our Table 5.1), Brown and Forsythe examine performance for the following parameter values for sample size and population variance:

sample_size <- c(4, 8, 10, 12)
sigma_sq <- c(3, 2, 2, 1)^2

With these parameter values, we can use map_dfr to simulate 10,000 \(p\)-values:

p_vals <- map_dfr(1:10000, 
  ~ one_run(
      mu = mu,
      sigma_sq = sigma_sq,
      sample_size = sample_size
    ) 
)

p_vals
## # A tibble: 10,000 × 2
##      ANOVA Welch
##      <dbl> <dbl>
##  1 0.111   0.131
##  2 0.383   0.750
##  3 0.406   0.818
##  4 0.00653 0.181
##  5 0.0950  0.110
##  6 0.127   0.261
##  7 0.352   0.381
##  8 0.529   0.589
##  9 0.0303  0.167
## 10 0.0716  0.190
## # ℹ 9,990 more rows

We next use our replications to calculate rejection rates. A test rejects the null hypothesis whenever its \(p\)-value is less than \(\alpha\), so the rejection rate is the proportion of replications in which the \(p\)-value falls below \(\alpha\).

sum(p_vals$ANOVA < 0.05) / 10000
## [1] 0.1388

This is equivalent to taking the mean of a vector of logical values:

mean(p_vals$ANOVA < 0.05)
## [1] 0.1388

We get a rejection rate that is much larger than \(\alpha = .05\). We have learned that the ANOVA F-test does not adequately control Type-I error under this set of conditions.

mean(p_vals$Welch < 0.05)
## [1] 0.0616

The Welch test does much better, although its rejection rate still appears to be slightly above 0.05.

Note that these two numbers are quite close (though not identical) to the corresponding entries in Table 1 of Brown and Forsythe (1974). The difference arises because both their Table 1 and our results are estimated rejection rates, based on a finite rather than an infinite number of replications. The estimation error arising from using a finite number of replications is called simulation error (or Monte Carlo error). In Chapter 9, we will look more closely at how to estimate and control Monte Carlo error in performance measures.
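
As a rough guide (a back-of-the-envelope sketch, not a result from the original paper), a rejection rate estimated from \(R\) replications has a Monte Carlo standard error of about \(\sqrt{p(1-p)/R}\), where \(p\) is the true rejection rate. With 10,000 replications and a rejection rate near .05, that works out to roughly 0.002:

# Approximate Monte Carlo standard error of a rejection rate near 0.05,
# based on 10,000 replications
sqrt(0.05 * (1 - 0.05) / 10000)   # roughly 0.0022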

So there you have it! Each part of the simulation is a distinct block of code, and together we have a modular simulation that can be easily extended to other scenarios or other tests. In the exercises, you will extend this framework, and get to experience first-hand how an initial structure such as this is easier to work with than a single, monolithic block of code.

5.5 Exercises

The following exercises involve exploring and tweaking the simulation code we developed above to replicate the results of Brown and Forsythe (1974).

  1. Table 1 from Brown and Forsythe reported rejection rates for \(\alpha = .01\) and \(\alpha = .10\) in addition to \(\alpha = .05\). Calculate the rejection rates of the ANOVA F and Welch tests for all three \(\alpha\)-levels and compare to the table.

  2. Try simulating the Type-I error rates for the parameter values in the first two rows of Table 1 of the original paper. Use 10,000 replications. How do your results compare to the reported results?

  3. In the original paper, Table 1 is about Type-I error and Table 2 is about power. A portion of Table 2 follows:

pow <- tribble( ~Variances, ~ Means, ~ `Brown's F`, ~ `B & F's F*`, ~`Welch's W`,
                "1,1,1,1",   "0,0,0,0", 4.9, 5.1, 5.0,
                "",          "1,0,0,0", 68.6, 67.6, 65.0,
                "3,2,2,1",   "0,0,0,0",  NA, 6.2, 5.5, 
                "",         "1.3,0,0,1.3", NA, 42.4, 68.2
                )
knitr::kable( pow )
| Variances | Means | Brown’s F | B & F’s F* | Welch’s W |
|-----------|-------|-----------|------------|-----------|
| 1,1,1,1 | 0,0,0,0 | 4.9 | 5.1 | 5.0 |
|  | 1,0,0,0 | 68.6 | 67.6 | 65.0 |
| 3,2,2,1 | 0,0,0,0 | NA | 6.2 | 5.5 |
|  | 1.3,0,0,1.3 | NA | 42.4 | 68.2 |

In the table, the sample sizes of the four groups are 11, 16, 16, and 21 for all of the scenarios. Try simulating the power levels for a couple of sets of parameter values from the table above. Use 10,000 replications. How do your results compare to the results reported in the table?

  4. Instead of making ANOVA_Welch_F() return a single row with columns for the \(p\)-values, one could instead return a dataset with one row for each test. This “long” format is often nicer when evaluating more than two methods, or when each method returns not just a \(p\)-value but other quantities of interest. For our current simulation, we might also want to store the \(F\) statistic, for example. The resulting dataset would then look like the following:

    ANOVA_Welch_F_long(sim_data)
    ## # A tibble: 2 × 3
    ##   method Fstat  pvalue
    ##   <chr>  <dbl>   <dbl>
    ## 1 ANOVA   8.46 0.00338
    ## 2 Welch  14.3  0.0241

    Modify ANOVA_Welch_F() to do this, update your simulation code, and then use group_by() plus summarise() to calculate the rejection rates of both tests. group_by() is a dplyr function for dividing your data into distinct groups and conducting an operation on each. The classic form of this would be something like:

    sres <- res %>% group_by( method ) %>%
      summarise( rejection_rate = mean( pvalue < 0.05 ) )
  5. The onewaytests package in R includes functions for calculating Brown and Forsythe’s \(F^*\) test and James’ test for differences in population means. Modify the data analysis function ANOVA_Welch_F (or, better yet, ANOVA_Welch_F_long from Exercise #4) to also include results from these hypothesis tests. Re-run the simulation to estimate the type-I error rate of all four tests under Scenarios A and B of Table 5.1.

References

Brown, Morton B., and Alan B. Forsythe. 1974. “The Small Sample Behavior of Some Statistics Which Test the Equality of Several Means.” Technometrics 16 (1): 129–32. https://doi.org/10.1080/00401706.1974.10489158.
James, G. S. 1951. “The Comparison of Several Groups of Observations When the Ratios of the Population Variances Are Unknown.” Biometrika 38 (3/4): 324. https://doi.org/10.2307/2332578.
Welch, B. L. 1951. “On the Comparison of Several Mean Values: An Alternative Approach.” Biometrika 38 (3/4): 330. https://doi.org/10.2307/2332579.

  1. This latter piece makes Brown and Forsythe’s study a prototypical example of a statistical methodology paper: find some problem that current procedures do not perfectly solve, invent something to do a better job, and then do simulations and/or math to build a case that the new procedure is better.↩︎