# Chapter 10 Project: Cronbach Alpha

In this section we walk through a case study of Cronbach Alpha to give an extended “project,” or series of exercises, that walk you through writing a complete simulation starting with the filling out of the code skeleton we get from simhelpers’s create_skeleton() package.

## 10.1 Background

Cronbach’s $$\alpha$$ coefficient is commonly reported as a measure of the internal consistency among a set of test items. Consider a set of $$p$$ test items with population variance-covariance matrix $$\boldsymbol\Phi = \left[\phi_{ij}\right]_{i,j=1}^p$$, where $$\phi_{ij}$$ is the covariance of item $$i$$ and item $$j$$ on the test, across all students taking the test. This population variance-covariance matrix describes how our $$p$$ test items co-vary.

Cronback’s $$\alpha$$ is, under this model, defined as $\alpha = \frac{p}{p - 1}\left(1 - \frac{\sum_{i=1}^p \phi_{ii}}{\sum_{i=1}^p \sum_{j=1}^p \phi_{ij}}\right).$

Given a sample of size $$n$$, the usual estimate of $$\alpha$$ is obtained by replacing the population variances and covariances with corresponding sample estimates. Letting $$s_{ij}$$ denote the sample covariance of items $$i$$ and $$j$$

$A = \frac{p}{p - 1}\left(1 - \frac{\sum_{i=1}^p s_{ii}}{\sum_{i=1}^p \sum_{j=1}^p s_{ij}}\right).$

If we assume that the items follow a multivariate normal distribution, then $$A$$ corresponds to the maximum likelihood estimator of $$\alpha$$.

Our goal is to examine the properties of $$A$$ when the set of $$P$$ items is not multi-variate normal, but rather follows a multivariate $$t$$ distribution with $$v$$ degrees of freedom. For simplicity, we shall assume that the items have common variance and have a compound symmetric covariance matrix, such that $$\phi_{11} = \phi_{22} = \cdots = \phi_{pp} = \phi$$ and $$\phi_{ij} = \rho \phi$$. In this case we can simplify our expression for $$\alpha$$ to

$\alpha = \frac{p \rho}{1 + \rho (p - 1)}.$

## 10.2 Getting started

First create the skeleton of our simulation. We will then walk through filling in all the pieces.

library( simhelpers )
create_skeleton()

## 10.3 The data-generating function

The first two sections in the skeleton are about the data-generating model:

rm(list = ls())

#------------------------------------------------------
# Set development values for simulation parameters
#------------------------------------------------------

# What are your model parameters?
# What are your design parameters?

#------------------------------------------------------
# Data Generating Model
#------------------------------------------------------

dgm <- function(model_params) {

return(dat)
}

# Test the data-generating model - How can you verify that it is correct?

We need to create and test a function that takes model parameters (and sample sizes and such) as inputs, and produces a simulated dataset. The following function generates a sample of $$n$$ observations of $$p$$ items from a multivariate $$t$$-distribution with a compound symmetric covariance matrix, intra-class correlation $$\rho$$, and $$v$$ degrees of freedom:

# model parameters
alpha <- 0.73 # true alpha
df <- 12 # degrees of freedom

# design parameters
n <- 50 # sample size
p <- 6 # number of items

library(mvtnorm)

r_mvt_items <- function(n, p, alpha, df) {
icc <- alpha / (p - alpha * (p - 1))
V_mat <- icc + diag(1 - icc, nrow = p)
X <- rmvt(n = n, sigma = V_mat, df = df)
colnames(X) <- LETTERS[1:p]
X
}

Note how we translate the target $$\alpha$$ to $$ICC$$ for our DGP; we will see this type of translation more later on.

We check our method first to see if we get the right kind of data:

small_sample <- r_mvt_items(n = 8, p = 3, alpha = 0.73, df = 5)
small_sample
##               A          B          C
## [1,]  0.3341106  0.5437804 -0.3933968
## [2,]  0.4003379  0.5999212  1.2887095
## [3,]  1.6445830  0.1904682  2.0970807
## [4,]  0.4581853  0.2394757  0.4814409
## [5,]  0.6105510  1.0636554 -0.1447137
## [6,]  0.1949575  0.3631333  1.9007522
## [7,] -1.0117863 -2.2646167 -0.5493305
## [8,] -2.6297928 -0.1129586  0.2033775

It looks like we have 8 observations of 3 items, as desired.

To check that the function is indeed simulating data following the intended distribution, let’s next generate a very large sample of items. We can then verify that the correlation matrix of the items is compound-symmetric and that the marginal distributions of the items follow $$t$$-distributions with specified degrees of freedom.

big_sample <- r_mvt_items(n = 100000, p = 4, alpha = 0.73, df = 5)
round(cor(big_sample), 3)
##       A     B     C     D
## A 1.000 0.395 0.398 0.395
## B 0.395 1.000 0.407 0.397
## C 0.398 0.407 1.000 0.405
## D 0.395 0.397 0.405 1.000

Is this what it should look like?

We can also check normality:

qqplot(qt(ppoints(200), df = 5), big_sample[,2], ylim = c(-4,4))

Looks good! A nice straight line.

## 10.4 The estimation function

The next section of the template looks like this:

#------------------------------------------------------
# Model-fitting/estimation/testing functions
#------------------------------------------------------

estimate <- function(dat, design_params) {

return(result)
}

# Test the estimation function

van Zyl, Neudecker, and Nel (2000) demonstrate that, if the items have a compound-symmetric covariance matrix, then the asymptotic variance of $$A$$ is $\text{Var}(A) \approx \frac{2p(1 - \alpha)^2}{(p - 1) n}.$ Substituting $$A$$ in place of $$\alpha$$ on the right hand side gives an estimate of the variance of $$A$$. The following function calculates $$A$$ and its variance estimator from a sample of data:

estimate_alpha <- function(dat) {
V <- cov(dat)
p <- ncol(dat)
n <- nrow(dat)

# Calculate A with our formula
A <- p / (p - 1) * (1 - sum(diag(V)) / sum(V))

# Calculate our estimate of the variance (SE^2) of A
Var_A <- 2 * p * (1 - A)^2 / ((p - 1) * n)

# Pack up our results
data.frame(A = A, Var = Var_A)
}

estimate_alpha(small_sample)
##           A        Var
## 1 0.6849894 0.03721188

The psych package provides a function for calculating $$\alpha$$, which can be used to verify that the calculation of $$A$$ in estimate_alpha is correct:

library(psych)

## 10.6 Replication (and the simulation)

We now have all the components we need to get simulation results, given a set of parameter values. In the next section of the template, we put all these pieces together in a function—which we might call the simulation driver—that takes as input 1) parameter values, 2) the desired number of replications, and 3) optionally, a seed value (this allows for reproducability, see Chapter 16. The function produces as output a single set of performance estimates. Generically, the function looks like this:

#-----------------------------------------------------------
# Simulation Driver - should return a data.frame or tibble
#-----------------------------------------------------------

runSim <- function(iterations, model_params, design_params, seed = NULL) {
if (!is.null(seed)) set.seed(seed)

results <- rerun(iterations, {
dat <- dgm(model_params)
estimate(dat, design_params)
}) %>%
bind_rows()

performance(results, model_params)
}

# demonstrate the simulation driver

The runSim function should require very little modification for a new simulation. Essentially, all we need to change is the names of the functions that are called, so that they line up with the functions we have designed for our simulation. Here’s what this looks like for the Cronbach alpha simulation (we pull out the code to replicate into its own method, one_run(), which helps with debugging):

#-----------------------------------------------------------
# Simulation Driver - should return a data.frame or tibble
#-----------------------------------------------------------

one_run <- function( n, p, alpha, df ) {
dat <- r_mvt_items(n = n, p = p, alpha = alpha, df = df)
estimate_alpha(dat)
}

run_alpha_sim <- function(iterations, n, p, alpha, df, seed = NULL) {

if (!is.null(seed)) set.seed(seed)

results <-
rerun(iterations, one_run(n, p, alpha, df) ) %>%
bind_rows()

alpha_performance(results, alpha = alpha)
}

## 10.7 Extension: Confidence interval coverage

However, van Zyl, Neudecker, and Nel (2000) suggest that a better approximation involves first applying a transformation to $$A$$ (to make it more normal in shape), then calculating a confidence interval, then back-transforming to the original scale (this is very similar to the procedure for calculating confidence intervals for correlation coefficients, using Fisher’s $$z$$ transformation). Let our transformed parameter and estimator be

\begin{aligned} \beta &= \frac{1}{2} \ln\left(1 - \alpha\right) \\ B &= \frac{1}{2} \ln\left(1 - A\right) \end{aligned}

and our transformed variance estimator be

$V^B = \frac{p}{2 n (p - 1)}.$ (This expression comes from a Delta method expansion on $$A$$.)

An approximate confidence interval for $$\beta$$ is given by $$[B_L, B_U]$$, where

$B_L = B - z \sqrt{V^B}, \qquad B_U = B + z \sqrt{V^B}.$

Applying the inverse of the transformation gives a confidence interval for $$\alpha$$:

$\left[1 - \exp(2B_U), \ 1 - \exp(2 B_L)\right].$

## 10.8 A taste of multiple scenarios

In the previous sections, we’ve created code that will generate a set of performance estimates, given a set of parameter values. We can created a dataset that represents every combination of parameter values that we want to examine. How do we put the pieces together?

If we only had a couple of parameter combinations, it would be easy enough to just call our run_alpha_sim function a couple of times:

run_alpha_sim(iterations = 100, n = 50, p = 4, alpha = 0.7, df = 5)
## Warning: rerun() was deprecated in purrr 1.0.0.
## ℹ Please use map() instead.
##   # Previously
##   rerun(100, one_run(n, p, alpha, df))
##
##   # Now
##   map(1:100, ~ one_run(n, p, alpha, df))
## This warning is displayed once every 8 hours.
## Call lifecycle::last_lifecycle_warnings() to
## see where this warning was generated.
##          bias    bias_SE        SE       MSE
## 1 -0.01211522 0.01088426 0.1088426 0.1089725
##    bias_Var
## 1 0.4913622
run_alpha_sim(iterations = 100, n = 100, p = 4, alpha = 0.7, df = 5)
##         bias    bias_SE        SE        MSE
## 1 -0.0117337 0.00761985 0.0761985 0.07671916
##    bias_Var
## 1 0.4727169
run_alpha_sim(iterations = 100, n = 50, p = 8, alpha = 0.7, df = 5)
##          bias    bias_SE        SE       MSE
## 1 -0.02601087 0.01120983 0.1120983 0.1145292
##    bias_Var
## 1 0.4319068
run_alpha_sim(iterations = 100, n = 100, p = 8, alpha = 0.7, df = 5)
##          bias     bias_SE         SE        MSE
## 1 0.007758062 0.006203414 0.06203414 0.06220884
##    bias_Var
## 1 0.5299059

But in an actual simulation we will probably have too many different combinations to do this “by hand.” The final sections of the simulation template demonstrate two different approaches to doing the calculations for every combination of parameter values, given a set of parameter values one wants to explore.

This is discussed further in Chapter @ref(exp_design), but let’s get a small taste of doing this now. In particular, the following code will evaluate the performance of $$A$$ for true values of $$\alpha$$ ranging from 0.5 to 0.9 (i.e., alpha_true_seq <- seq(0.5, 0.9, 0.1)) via map_df():

alpha_true_seq <- seq(0.5, 0.9, 0.1)
results <- map_df( alpha_true_seq,
run_simulation,
R = 100,
n = 50, p = 5, df = 5 )

How does coverage change for different values of $$A$$?

### 10.8.1 Exercises

1. Show the inverse transform of $$B = g(A)$$ gives the above expression.

2. Make a new function, estimate_alpha_xform() that, given a dataset, calculates a confidence interval for $$\alpha$$ following the method described above.

3. Using the modified estimate_alpha_xform(), generate 1000 replicated confidence intervals for $$n = 40$$, $$p = 6$$ items, a true $$\alpha = 0.8$$, and $$v = 5$$ degrees of freedom. Using these replicates, calculate the true coverage rate of the confidence interval. Also calculate the Monte Carlo standard error (MCSE) of this coverage rate.

4. Calculate the average length of the confidence interval for $$\alpha$$, along with its MCSE.

5. Compare the results of this approach to the more naive approach. Are there gains in performance?

6. Challenge Derive the variance expression for the transformed estimator using the Delta method on the variance expression for $$A$$ coupled with the transform. The Delta method says that:

$var( f(A) ) \approx \frac{1}{f'(\alpha)} (A - \alpha)^2 .$