Chapter 5 Case Study: Heteroskedastic ANOVA

In this chapter, we present another detailed example of a simulation study to demonstrate how to put the principles of tidy, modular simulation into practice. Specifically, we reconstruct the simulations reported in Brown and Forsythe (1974). We also return to this case study as a recurring example in several of the following chapters.

Brown and Forsythe (1974) studied methods for null hypothesis testing in studies that measure a characteristic \(X\) on samples from each of several groups. They consider a population consisting of \(G\) separate groups, with population means \(\mu_1,...,\mu_G\) and population variances \(\sigma_1^2,...,\sigma_G^2\) for the characteristic \(X\). We obtain samples of size \(n_1,...,n_G\) from the respective groups and take measurements of the characteristic for each sampled unit. Let \(x_{ig}\) denote the measurement from unit \(i\) in group \(g\), for \(i = 1,...,n_g\) and \(g = 1,...,G\). The analyst’s goal is to use the sample data to test the hypothesis that the population means are all equal: \[ H_0: \mu_1 = \mu_2 = \cdots = \mu_G. \]

If the population variances were all equal (so \(\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_G^2\)), we could use a conventional one-way analysis of variance (ANOVA) to test this hypothesis. However, one-way ANOVA might not work well if the variances are not equal. The question, then, is what the best practice is for testing the null hypothesis of equal group means while allowing for the possibility that the variances differ across groups (a form of heteroskedasticity).

To tackle this question, Brown and Forsythe evaluated two different hypothesis testing procedures, developed by James (1951) and Welch (1951), which avoid the assumption that all groups have equal variances. Brown and Forsythe also evaluated the conventional one-way ANOVA F-test as a benchmark, even though this procedure maintains the assumption of equal variances. They also proposed and evaluated a new procedure of their own devising.¹ Overall, the simulation involves comparing the performance of these different hypothesis testing procedures (the methods) under a range of conditions (different data-generating processes) with different sample sizes and different degrees of heteroskedasticity.

When evaluating hypothesis testing procedures, there are two main performance metrics of interest: type-I error rate and power. The type-I error rate is the rate at which a test rejects the null hypothesis when the null hypothesis is actually true. To apply a hypothesis testing procedure, one has to specify a desired, or nominal, type-I error rate, often denoted as the \(\alpha\)-level. For a specified \(\alpha\), a valid or well-calibrated test should have an actual type-I error rate no greater than the nominal level, and ideally very close to it. Power is the rate at which a test correctly rejects the null hypothesis when it is in fact false; it measures how sensitive a method is to violations of the null.

Brown and Forsythe estimated error rates and power for nominal \(\alpha\)-levels of 1%, 5%, and 10%. Table 1 of their paper reports the simulation results for type-I error (labeled as “size”). They examined the scenarios shown in Table 5.1, varying the number of groups, the group sample sizes, and the amount of variation within each group. Table 5.2 reproduces some of the type-I error results from their Table 1 for these scenarios; for a properly calibrated method, the reported rejection rates should be close to the nominal \(\alpha\)-levels listed at the top of the table. We can compare the four tests to each other across each row, where each row is a specific scenario defined by a specific data-generating process. Looking at the ANOVA F-test, for example, we see some scenarios with badly inflated rejection rates: in Scenario E, the test rejects 21.9% of the time when it should reject only 10% of the time. By contrast, Scenario A looks fine, which is what we expect because all of the groups have the same variance. It is wise to always include conditions where we expect methods to work, as well as conditions where we expect them to break down, in our simulations.

Table 5.1: Simulation scenarios explored by Brown and Forsythe (1974). A blank sample-size entry indicates the same sample sizes as the scenario above it.

| Scenario | Groups | Sample Sizes | Standard Deviations |
|----------|--------|--------------|---------------------|
| A | 4 | 4,4,4,4 | 1,1,1,1 |
| B | 4 |  | 1,2,2,3 |
| C | 4 | 4,8,10,12 | 1,1,1,1 |
| D | 4 |  | 1,2,2,3 |
| E | 4 |  | 3,2,2,1 |
| F | 4 | 11,11,11,11 | 1,1,1,1 |
| G | 4 |  | 1,2,2,3 |
| H | 4 | 11,16,16,21 | 1,1,1,1 |
| I | 4 |  | 3,2,2,1 |
| J | 4 |  | 1,2,2,3 |
| K | 6 | 4,4,4,4,4,4 | 1,1,1,1,1,1 |
| L | 6 |  | 1,1,2,2,3,3 |
| M | 6 | 4,6,6,8,10,12 | 1,1,1,1,1,1 |
| N | 6 |  | 1,1,2,2,3,3 |
| O | 6 |  | 3,3,2,2,1,1 |
| P | 6 | 6,6,6,6,6,6 | 1,1,2,2,3,3 |
| Q | 6 | 11,11,11,11,11,11 | 1,1,2,2,3,3 |
| R | 6 | 16,16,16,16,16,16 | 1,1,2,2,3,3 |
| S | 6 | 21,21,21,21,21,21 | 1,1,2,2,3,3 |
| T | 10 | 20,20,20,20,20,20,20,20,20,20 | 1,1,1.5,1.5,2,2,2.5,2.5,3,3 |
Table 5.2: Portion of “Table 1” reproduced from Brown and Forsythe (1974). Entries are estimated type-I error rates (in percent) at nominal \(\alpha\)-levels of 10%, 5%, and 1% for each test.

| Scenario | ANOVA F 10% | ANOVA F 5% | ANOVA F 1% | B & F’s F* 10% | B & F’s F* 5% | B & F’s F* 1% | Welch 10% | Welch 5% | Welch 1% | James 10% | James 5% | James 1% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 10.2 | 4.9 | 0.9 | 7.8 | 3.4 | 0.5 | 9.6 | 4.5 | 0.8 | 13.3 | 7.9 | 2.4 |
| B | 12.0 | 6.7 | 1.7 | 8.9 | 4.1 | 0.7 | 10.3 | 4.7 | 0.8 | 13.8 | 8.1 | 2.7 |
| C | 9.9 | 5.1 | 1.1 | 9.5 | 4.8 | 1.0 | 10.8 | 5.7 | 1.6 | 12.1 | 6.7 | 2.1 |
| D | 5.9 | 3.0 | 0.6 | 10.3 | 5.7 | 1.4 | 9.8 | 4.9 | 0.9 | 10.8 | 5.6 | 1.3 |
| E | 21.9 | 14.4 | 5.6 | 11.0 | 6.2 | 1.8 | 11.3 | 6.5 | 2.0 | 12.9 | 7.7 | 2.9 |
| F | 10.1 | 5.1 | 1.0 | 9.8 | 5.7 | 1.5 | 10.0 | 5.0 | 0.9 | 10.6 | 5.5 | 1.1 |
| G | 11.4 | 6.3 | 1.8 | 10.7 | 5.7 | 1.5 | 10.1 | 5.0 | 1.1 | 10.6 | 5.4 | 1.3 |
| H | 10.3 | 4.9 | 1.1 | 10.3 | 5.1 | 1.0 | 10.2 | 5.0 | 1.0 | 10.5 | 5.3 | 1.2 |
| I | 17.3 | 10.8 | 3.9 | 11.1 | 6.2 | 1.8 | 10.5 | 5.5 | 1.2 | 10.9 | 5.8 | 1.3 |
| J | 7.3 | 4.0 | 1.0 | 11.5 | 6.5 | 1.8 | 10.6 | 5.4 | 1.1 | 10.9 | 5.6 | 1.1 |
| K | 9.6 | 4.9 | 1.0 | 7.3 | 3.4 | 0.4 | 11.4 | 6.1 | 1.4 | 14.7 | 9.5 | 3.8 |

To replicate the Brown and Forsythe simulation, we will first write functions to generate data for a specified scenario and to analyze data with a given structure. We will then use these functions to build a simulation that evaluates the hypothesis testing procedures in a specific scenario with a specific set of core parameters (e.g., sample sizes, number of groups, and so forth). Finally, we will scale up to a range of scenarios in which we vary the parameters of the data-generating model.

5.1 The data-generating model

In the heteroskedastic one-way ANOVA simulation, there are three sets of parameter values: population means, population variances, and sample sizes. Rather than attempting to write a general data-generating function immediately, it is often easier to write code for a specific case first and then use that code as a starting point for developing a function. For example, say that we have four groups with means of 1, 2, 5, 6; variances of 3, 2, 5, 1; and sample sizes of 3, 6, 2, 4:

mu <- c(1, 2, 5, 6)
sigma_sq <- c(3, 2, 5, 1)
sample_size <- c(3, 6, 2, 4)

Following Brown and Forsythe (1974), we will assume that the measurements are normally distributed within each sub-group of the population. The following code generates a vector of group IDs and a vector of simulated measurements:

N <- sum(sample_size) # total sample size
G <- length(sample_size) # number of groups

# group id factor
group <- factor(rep(1:G, times = sample_size))

# mean for each unit of the sample
mu_long <- rep(mu, times = sample_size) 

# sd for each unit of the sample
sigma_long <- rep(sqrt(sigma_sq), times = sample_size) 

# See what we have?
tibble( group = group, mu = mu_long, sigma = sigma_long )
## # A tibble: 15 × 3
##    group    mu sigma
##    <fct> <dbl> <dbl>
##  1 1         1  1.73
##  2 1         1  1.73
##  3 1         1  1.73
##  4 2         2  1.41
##  5 2         2  1.41
##  6 2         2  1.41
##  7 2         2  1.41
##  8 2         2  1.41
##  9 2         2  1.41
## 10 3         5  2.24
## 11 3         5  2.24
## 12 4         6  1   
## 13 4         6  1   
## 14 4         6  1   
## 15 4         6  1
# Now make our data
x <- rnorm(N, mean = mu_long, sd = sigma_long)
dat <- tibble(group = group, x = x)
dat
## # A tibble: 15 × 2
##    group      x
##    <fct>  <dbl>
##  1 1      1.24 
##  2 1      3.07 
##  3 1     -0.681
##  4 2      2.43 
##  5 2      2.50 
##  6 2      2.15 
##  7 2      0.612
##  8 2      0.860
##  9 2      2.09 
## 10 3      1.56 
## 11 3      5.08 
## 12 4      5.68 
## 13 4      5.66 
## 14 4      5.92 
## 15 4      4.38

We have made a small dataset of group membership and outcome. We followed the strategy of first constructing a dataset with the parameters for each observation in each group, making heavy use of base R’s rep() function to repeat values in a vector. We then called rnorm() to generate all N observations at once. This works because rnorm() is vectorized: if you give it vectors of parameter values, it generates each observation using the corresponding entries of those vectors. As a result, the first value of x is simulated from a normal distribution with mean mu_long[1] and standard deviation sigma_long[1], the second from mu_long[2] and sigma_long[2], and so on.
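
To see this vectorization on a minimal example, we can hand rnorm() a short vector of means and standard deviations directly (the specific numbers here are arbitrary, purely for illustration):

# three draws, each from a normal distribution with its own mean and sd
rnorm(3, mean = c(0, 10, 100), sd = c(1, 5, 10))

Each draw uses the corresponding entries of the mean and sd vectors, which is exactly what happens with mu_long and sigma_long above.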

As usual, there are many different and legitimate ways of doing this in R. For instance, instead of using rep() to build everything at once, we could generate the observations for each group separately and then stack them into a single dataset, as in the sketch below. Do not worry about trying to write code the “best” way, especially when you are initially putting a simulation together. We strongly believe in the adage that if you can do it at all, then you should feel good about yourself.
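
Here is one sketch of that group-by-group alternative (assuming, as elsewhere in this chapter, that the tidyverse packages are loaded; pieces and dat_alt are simply names we make up here):

# Simulate each group separately, then stack the pieces into one dataset
pieces <- lapply(1:G, function(g) {
  tibble(
    group = g,
    x = rnorm(sample_size[g], mean = mu[g], sd = sqrt(sigma_sq[g]))
  )
})
dat_alt <- bind_rows(pieces)
dat_alt$group <- factor(dat_alt$group)

Either way, we end up with the same kind of rectangular dataset of group memberships and outcomes.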

5.1.1 Now make a function

Because we will need to generate datasets over and over, we wrap our code in a function. The inputs to the function will be the parameters of the model that we specified at the very beginning: the set of population means mu, the population variances sigma_sq, and sample sizes sample_size. We make these quantities arguments of the data-generating function so that we can make datasets of different sizes and shapes:

generate_data <- function(mu, sigma_sq, sample_size) {

  N <- sum(sample_size)
  G <- length(sample_size)

  group <- factor(rep(1:G, times = sample_size))
  mu_long <- rep(mu, times = sample_size)
  sigma_long <- rep(sqrt(sigma_sq), times = sample_size)

  x <- rnorm(N, mean = mu_long, sd = sigma_long)
  sim_data <- tibble(group = group, x = x)

  return(sim_data)
}

The above code is simply the code we built previously, all bundled up. We developed the function by first writing code that makes the data-generating process work once, the way we want it, and only then turning that final code into a function for later reuse.

Once we have turned the code into a function, we can call it to get a new set of simulated data. For example, to generate a dataset with the same parameters as before, we can do:

sim_data <- generate_data(
  mu = mu, 
  sigma_sq = sigma_sq,
  sample_size = sample_size
)

sim_data
## # A tibble: 15 × 2
##    group     x
##    <fct> <dbl>
##  1 1     0.777
##  2 1     2.11 
##  3 1     1.31 
##  4 2     1.85 
##  5 2     3.04 
##  6 2     1.92 
##  7 2     1.51 
##  8 2     0.226
##  9 2     0.933
## 10 3     5.92 
## 11 3     5.64 
## 12 4     3.35 
## 13 4     5.30 
## 14 4     5.50 
## 15 4     7.40

To generate one with population means of zero in all the groups, but the same group variances and sample sizes as before, we can do:

sim_data_null <- generate_data(
  mu = c( 0, 0, 0, 0 ),
  sigma_sq = sigma_sq, 
  sample_size = sample_size
)

sim_data_null

Following the principles of tidy, modular simulation, we have written a function that returns a rectangular dataset for further analysis. Also note that the dataset returned by generate_data() includes only the variables group and x, not mu_long or sigma_long. This is by design. Including mu_long or sigma_long would amount to making the population parameters available to the data analysis procedures, which is not something that happens when analyzing real data.

5.1.2 Cautious coding

In the above, we built some sample code and then bundled it into a function by literally cutting and pasting our initial work into a function skeleton. In the process, we shifted from having variables in our workspace with particular names to using those same names as parameters in our function.

Developing code in this way is not without hazards. In particular, after we finish building the function, our workspace has a variable mu in it and our function also has a parameter named mu. Inside the function, R will use the parameter mu rather than the workspace variable, but the shared name is potentially confusing. So too are lines such as mu = mu, which means “set the function’s parameter called mu to the value of the workspace variable called mu.” These are different things that happen to have the same name.
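
To see the scoping rule in action, here is a toy example (the names a and f are made up purely for illustration):

a <- 100            # a variable in the workspace
f <- function(a) {  # a function with a parameter of the same name
  a                 # inside the function, the parameter is used
}
f(a = 3)            # returns 3, not 100
f(a = a)            # returns 100: set the parameter a to the workspace variable a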

One way to check your code, once the function is built, is to comment out (or delete) the initial exploratory code, restart R or at least clear out the workspace, and then re-run the code that uses the function. If things still work, you can be reasonably confident that you successfully bundled your code into the function.
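
A minimal version of this check might look like the following (the script name here is hypothetical; use whatever file holds your function definitions):

rm(list = ls())            # clear the workspace (or simply restart R)
source("generate_data.R")  # hypothetical script containing generate_data()
generate_data(
  mu = c(0, 0),
  sigma_sq = c(1, 4),
  sample_size = c(5, 5)
)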

You can also, once you have bundled your code, do a search-and-replace to change the variable names inside your function to something more generic, making the separation between workspace variables and function parameters clearer.

5.2 The hypothesis testing procedures

Brown and Forsythe considered four different hypothesis testing procedures for heteroskedastic ANOVA, but we will focus on just two of the tests for now. We start with the conventional one-way ANOVA, which (incorrectly, in our setting) assumes homoskedasticity. R’s oneway.test() function will calculate this test automatically:

sim_data <- generate_data(
  mu = mu, 
  sigma_sq = sigma_sq,
  sample_size = sample_size
)

anova_F <- oneway.test(x ~ group, data = sim_data, var.equal = TRUE)
anova_F
## 
##  One-way analysis of means
## 
## data:  x and group
## F = 8.9503, num df = 3, denom df = 11,
## p-value = 0.002738

We can use the same function to calculate Welch’s test by setting var.equal = FALSE:

Welch_F <- oneway.test(x ~ group, data = sim_data, var.equal = FALSE)
Welch_F
## 
##  One-way analysis of means (not assuming
##  equal variances)
## 
## data:  x and group
## F = 22.321, num df = 3.0000, denom df =
## 3.0622, p-value = 0.01399

The main results we need here are the \(p\)-values of the tests, which will let us assess Type-I error and power for a given nominal \(\alpha\)-level. The following function takes simulated data as input and returns as output the \(p\)-values from the one-way ANOVA test and Welch test:

ANOVA_Welch_F <- function(data) {
  anova_F <- oneway.test(x ~ group, data = data, var.equal = TRUE)
  Welch_F <- oneway.test(x ~ group, data = data, var.equal = FALSE)
  
  result <- tibble(
    ANOVA = anova_F$p.value,
    Welch = Welch_F$p.value
  )
  
  return(result)
}

ANOVA_Welch_F(sim_data)
## # A tibble: 1 × 2
##     ANOVA  Welch
##     <dbl>  <dbl>
## 1 0.00274 0.0140

Following our tidy, modular simulation principles, this function returns a small dataset with the p-values from both tests. Eventually, we might want to use this function on real data. Our estimation function does not care whether the data are simulated or not; we name the input argument data rather than sim_data to reflect this.
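
For example, nothing stops us from applying the function to a real dataset, so long as that dataset has columns named group and x. Here is a quick illustration using R’s built-in iris data, relabeled to match the expected column names:

real_data <- tibble(group = iris$Species, x = iris$Sepal.Length)
ANOVA_Welch_F(real_data)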

As an alternative to relying on oneway.test(), we could instead write code to implement the ANOVA and Welch tests ourselves. This has some potential advantages, such as avoiding the extraneous calculations that oneway.test() performs, which take time and slow down our simulation. However, there are also drawbacks, including that writing our own code takes time and opens up the possibility of errors. For further discussion of the trade-offs, see Chapter 17, where we do implement these tests by hand and see what kind of speed-ups we can obtain.
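
To give a flavor of what hand-coding would involve, here is a rough sketch of the one-way ANOVA F statistic computed directly from the standard sums-of-squares formulas (this is only an illustration, not necessarily the implementation used in Chapter 17; ANOVA_F_by_hand is a name we invent here):

# A sketch of a hand-coded ANOVA F test, returning just the p-value
ANOVA_F_by_hand <- function(data) {
  grand_mean <- mean(data$x)
  by_group <- data %>%
    group_by(group) %>%
    summarise(n = n(), m = mean(x), v = var(x))
  G <- nrow(by_group)
  N <- sum(by_group$n)
  SSB <- sum(by_group$n * (by_group$m - grand_mean)^2)  # between-group sum of squares
  SSW <- sum((by_group$n - 1) * by_group$v)             # within-group sum of squares
  Fstat <- (SSB / (G - 1)) / (SSW / (N - G))
  pf(Fstat, df1 = G - 1, df2 = N - G, lower.tail = FALSE)
}

If the formulas are coded correctly, ANOVA_F_by_hand(sim_data) should reproduce the p-value from oneway.test() with var.equal = TRUE.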

5.3 Running the simulation

We now have functions that implement steps 2 and 3 of the simulation. Given a set of parameters, generate_data() produces a simulated dataset and, given a dataset, ANOVA_Welch_F() calculates \(p\)-values in two different ways. We now want to know which way is better, and by how much. To answer this question, we need to repeat this chain of calculations many times.

We first make a function that puts this chain together into a single function:

one_run <- function( mu, sigma_sq, sample_size ) {
  sim_data <- generate_data(
    mu = mu, 
    sigma_sq = sigma_sq,
    sample_size = sample_size
  )
  ANOVA_Welch_F(sim_data)
}

one_run( mu = mu, sigma_sq = sigma_sq, sample_size = sample_size )
## # A tibble: 1 × 2
##    ANOVA Welch
##    <dbl> <dbl>
## 1 0.0167 0.107

This function implements a single simulation trial by generating artificial data and then analyzing the data, ending with a nice dataframe or tibble that has results for the single run.

We next call one_run() over and over; see A.1 for some discussion of the options for doing so. The following uses map_df() from the purrr package to run one_run() four times and then stack the results into a single data frame:

p_values <- map_df(
  1:4, 
  ~ one_run(
      mu = mu, 
      sigma_sq = sigma_sq, 
      sample_size = sample_size
    )
)
p_values
## # A tibble: 4 × 2
##     ANOVA  Welch
##     <dbl>  <dbl>
## 1 0.0262  0.0125
## 2 0.00451 0.0698
## 3 0.00229 0.0380
## 4 0.0108  0.0423

Voila! We have simulated \(p\)-values!

5.4 Summarizing Test Performance

We now have all the pieces in place to reproduce the results from Brown and Forsythe (1974). We first focus on calculating the actual type-I error rate of these tests, meaning the proportion of the time that they reject the null hypothesis of equal means when that null is actually true, for an \(\alpha\)-level of .05. We therefore need to simulate data from a process in which the population means are indeed all equal. Arbitrarily, we start with \(G = 4\) groups and set all of the means equal to zero:

mu <- rep(0, 4)

In the fifth row of Table 1 (Scenario E in our Table 5.1), Brown and Forsythe examine performance for the following parameter values for sample size and population variance:

sample_size <- c(4, 8, 10, 12)
sigma_sq <- c(3, 2, 2, 1)^2

With these parameter values, we can use map_dfr to simulate 10,000 \(p\)-values:

p_vals <- map_dfr(1:10000, 
  ~ one_run(
      mu = mu,
      sigma_sq = sigma_sq,
      sample_size = sample_size
    ) 
)

p_vals
## # A tibble: 10,000 × 2
##      ANOVA Welch
##      <dbl> <dbl>
##  1 0.111   0.131
##  2 0.383   0.750
##  3 0.406   0.818
##  4 0.00653 0.181
##  5 0.0950  0.110
##  6 0.127   0.261
##  7 0.352   0.381
##  8 0.529   0.589
##  9 0.0303  0.167
## 10 0.0716  0.190
## # ℹ 9,990 more rows

We next use our replications to calculate rejection rates. A test rejects the null hypothesis whenever its \(p\)-value is less than \(\alpha\), so the rejection rate is the proportion of replications in which the \(p\)-value falls below \(\alpha\).

sum(p_vals$ANOVA < 0.05) / 10000
## [1] 0.1388

This is equivalent to taking the mean of a vector of logical values:

mean(p_vals$ANOVA < 0.05)
## [1] 0.1388

We get a rejection rate that is much larger than \(\alpha = .05\). We have learned that the ANOVA F-test does not adequately control Type-I error under this set of conditions.

mean(p_vals$Welch < 0.05)
## [1] 0.0616

The Welch test does much better, although its rejection rate still appears to be slightly above 0.05.

Note that these two numbers are quite close (though not identical) to the corresponding entries in Table 1 of Brown and Forsythe (1974). The difference arises because both their Table 1 and our results are estimated rejection rates, based on a finite rather than an infinite number of replications. The estimation error arising from using a finite number of replications is called simulation error (or Monte Carlo error). In Chapter 9, we will look more closely at how to estimate and control Monte Carlo error in performance measures.
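
As a rough guide (a back-of-the-envelope sketch, not a result from the original paper), a rejection rate estimated from \(R\) replications has a Monte Carlo standard error of about \(\sqrt{p(1-p)/R}\), where \(p\) is the true rejection rate. With 10,000 replications and a rejection rate near .05, that works out to roughly 0.002:

# Approximate Monte Carlo standard error of a rejection rate near 0.05,
# based on 10,000 replications
sqrt(0.05 * (1 - 0.05) / 10000)   # roughly 0.0022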

So there you have it! Each part of the simulation is a distinct block of code, and together we have a modular simulation that can be easily extended to other scenarios or other tests. In the exercises, you will extend this framework, and get to experience first-hand how an initial structure such as this is easier to work with than a single, monolithic block of code.

5.5 Exercises

The following exercises involve exploring and tweaking the simulation code we developed above to replicate the results of Brown and Forsythe (1974).

  1. Table 1 from Brown and Forsythe reported rejection rates for \(\alpha = .01\) and \(\alpha = .10\) in addition to \(\alpha = .05\). Calculate the rejection rates of the ANOVA F and Welch tests for all three \(\alpha\)-levels and compare to the table.

  2. Try simulating the Type-I error rates for the parameter values in the first two rows of Table 1 of the original paper. Use 10,000 replications. How do your results compare to the reported results?

  3. In the original paper, Table 1 is about Type-I error and Table 2 is about power. A portion of Table 2 follows:

pow <- tribble( ~Variances, ~ Means, ~ `Brown's F`, ~ `B & F's F*`, ~`Welch's W`,
                "1,1,1,1",   "0,0,0,0", 4.9, 5.1, 5.0,
                "",          "1,0,0,0", 68.6, 67.6, 65.0,
                "3,2,2,1",   "0,0,0,0",  NA, 6.2, 5.5, 
                "",         "1.3,0,0,1.3", NA, 42.4, 68.2
                )
knitr::kable( pow )
| Variances | Means | Brown’s F | B & F’s F* | Welch’s W |
|-----------|-------|-----------|------------|-----------|
| 1,1,1,1 | 0,0,0,0 | 4.9 | 5.1 | 5.0 |
|  | 1,0,0,0 | 68.6 | 67.6 | 65.0 |
| 3,2,2,1 | 0,0,0,0 | NA | 6.2 | 5.5 |
|  | 1.3,0,0,1.3 | NA | 42.4 | 68.2 |

In the table, the sample sizes of the four groups are 11, 16, 16, and 21 for all of the scenarios. Try simulating the power levels for a couple of sets of parameter values from the table above. Use 10,000 replications. How do your results compare to the results reported in the table?

  4. Instead of making ANOVA_Welch_F() return a single row with columns for the \(p\)-values, one could instead return a dataset with one row for each test. This “long” format is often nicer when evaluating more than two methods, or when each method returns not just a \(p\)-value but other quantities of interest. For our current simulation, we might also want to store the \(F\) statistic, for example. The resulting dataset would then look like the following:

    ANOVA_Welch_F_long(sim_data)
    ## # A tibble: 2 × 3
    ##   method Fstat  pvalue
    ##   <chr>  <dbl>   <dbl>
    ## 1 ANOVA   8.46 0.00338
    ## 2 Welch  14.3  0.0241

    Modify ANOVA_Welch_F() to do this, update your simulation code, and then use group_by() plus summarise() to calculate the rejection rates of both tests. group_by() is a dplyr function for dividing your data into distinct groups and conducting an operation on each. The classic form of this would be something like:

    sres <- res %>% group_by( method ) %>%
      summarise( rejection_rate = mean( pvalue < 0.05 ) )
  5. The onewaytests package in R includes functions for calculating Brown and Forsythe’s \(F^*\) test and James’ test for differences in population means. Modify the data analysis function ANOVA_Welch_F (or, better yet, ANOVA_Welch_F_long from Exercise #4) to also include results from these hypothesis tests. Re-run the simulation to estimate the type-I error rate of all four tests under Scenarios A and B of Table 5.1.

References

Brown, Morton B., and Alan B. Forsythe. 1974. “The Small Sample Behavior of Some Statistics Which Test the Equality of Several Means.” Technometrics 16 (1): 129–32. https://doi.org/10.1080/00401706.1974.10489158.
James, G. S. 1951. “The Comparison of Several Groups of Observations When the Ratios of the Population Variances Are Unknown.” Biometrika 38 (3/4): 324. https://doi.org/10.2307/2332578.
Welch, B. L. 1951. “On the Comparison of Several Mean Values: An Alternative Approach.” Biometrika 38 (3/4): 330. https://doi.org/10.2307/2332579.

  1. This latter piece makes Brown and Forsythe’s study a prototypical example of a statistical methodology paper: find some problem that current procedures do not perfectly solve, invent something to do a better job, and then do simulations and/or math to build a case that the new procedure is better.↩︎