Chapter 2 Programming Preliminaries

In this chapter, we introduce some essential programming concepts that may be less familiar to readers, but which are central to how we approach writing code for simulation studies. We also explain some of the rationale and reasoning behind how we present example code throughout the book.

2.1 Welcome to the tidyverse

Layered on top of R is a collection of contributed packages that make data wrangling and management much, much easier. This collection is called the tidyverse, and it includes popular packages such as dplyr, tidyr, and ggplot2. We use methods from the tidyverse throughout the book because they facilitate writing clean, concise code. In particular, we make heavy use of the dplyr package for group-wise data manipulation, the purrr package for functional programming, and the ggplot2 package for statistical graphics. The 1st and 2nd editions of the free online textbook R for Data Science provide an excellent, thorough introduction to these packages, along with much more background on the tidyverse. We will cite portions of this text throughout the book.

Loading the tidyverse packages is straightforward:

library( tidyverse )
options( dplyr.summarise.inform = FALSE )

(The second line is to turn off some of the persistent warnings generated by the dplyr function summarize().) These lines of code appear in the header of nearly every script we use.

2.2 Functions

If you are comfortable using R for data analysis tasks, you will be familiar with many of R's functions. R has functions to do things like calculate a summary statistic from a list of numbers (e.g., mean(), median(), or sd()), calculate linear regression coefficient estimates from a dataset (lm()), or count the number of rows in a dataset (nrow()). In the abstract, a function is a little machine for transforming ingredients into outputs, like a microwave (put a bag of kernels in and it will return hot, crunchy popcorn), a cheese shredder (put a block of mozzarella in and it transforms it into topping for your pizza), or a washing machine (put in dirty clothes and detergent and it will return clean but damp clothes). A function takes in pieces of information specified by the user (the inputs), follows a set of instructions for transforming or summarizing those inputs, and then returns the result of the calculations (the outputs).
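For instance, here is a handful of these built-in functions at work on a small, made-up dataset:

x <- c(2, 4, 6, 8)
mean(x)                # average of the values
sd(x)                  # standard deviation of the values

dat <- data.frame(x = x, y = c(1, 3, 2, 5))
nrow(dat)              # number of rows in the dataset
lm(y ~ x, data = dat)  # linear regression of y on x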

A function can do nearly anything as long as the calculation can be expressed in code—it can even produce output that is random. For example, the rnorm() function takes as input a number n and returns that many random numbers, drawn from a standard normal distribution:

rnorm(3)
## [1]  0.01311744  1.05431516 -0.20010145

Each time the function is called, it returns a different set of numbers:

rnorm(3)
## [1] 0.7915223 0.2086413 1.0808711

The rnorm() function also has further input arguments that let the user specify the mean and standard deviation of the distribution from which numbers are drawn:

rnorm(3, mean = 10, sd = 0.5)
## [1] 10.888172 10.121414  9.980158

In writing code for simulations, we will make extensive use of this particular function and other functions that produce sequences of random numbers. We will have more to say about random number generation later.
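Other functions for generating random numbers follow the same pattern of taking a number of draws along with parameters of the distribution; for example:

runif(3, min = 0, max = 1)         # three draws from a uniform distribution
rbinom(3, size = 10, prob = 0.5)   # three draws from a binomial distribution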

2.2.1 Rolling your own

In R, you can create your own function by specifying the pieces of input information, the steps to follow in transforming the inputs, and the result to return as output. Learning to write your own functions to carry out calculations is an immensely useful skill that will greatly enhance your ability to accomplish a range of tasks. Writing custom functions is also central to our approach to coding Monte Carlo simulations, and so we highlight some of the key considerations here. Chapter 19 of R for Data Science (1st edition) provides an in-depth discussion of how to write your own functions.

Here is an example of a custom function called one_run():

one_run <- function( N, mn, sd ) {
  vals <- rnorm( N, mean = mn, sd = sd )
  tt <- t.test( vals )
  pvalue <- tt$p.value
  return(pvalue)
}

The first line specifies that we are creating a function that takes inputs N, mn, and sd. These are called the parameters, inputs, or arguments of the function. The remaining lines inside the curly brackets are called the body of the function. These lines specify the instructions to follow in transforming the inputs into an output:

  1. Generate a random sample of N observations from a normal distribution with mean mn and standard deviation sd, then store the result in vals.
  2. Use the built-in function t.test() to compute a one-sample t-test for the null hypothesis that the population mean is zero, then store the result in tt.
  3. Extract the p-value from the t-test and store the result in pvalue.
  4. Return pvalue as output.

Having created the function, we can then use it with any inputs that we like:

one_run( 100, 5, 1 )
## [1] 5.797385e-70
one_run( 10, 0.3, 1 )
## [1] 0.9370292
one_run( 10, 0.3, 0.2 )
## [1] 0.0001216494

In each case, the output of the function is a p-value from a simulated sample of data. The function produces a different answer each time because its instructions involve generating random numbers each time it is called. In essence, our custom function is just a short-cut for carrying out its instructions. Writing it saves us from having to repeatedly write or copy-paste the lines of code inside its body.
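To see the short-cut at work, note that a call like one_run( 100, 5, 1 ) carries out exactly the same steps as writing out the body by hand:

# Equivalent to calling one_run( 100, 5, 1 )
vals <- rnorm( 100, mean = 5, sd = 1 )
tt <- t.test( vals )
tt$p.value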

2.2.2 A dangerous function

Writing custom functions will prove to be crucial for effectively implementing Monte Carlo simulations. However, designing custom functions does take practice to master. It also requires a degree of care above and beyond what is needed just to use R’s built-in functions.

One of the common mistakes encountered in writing custom functions is to let the function depend on information that is not part of the input arguments. For example, consider the following script, which includes a nonsensical custom function called funky():

secret_ingredient <- 3

funky <- function(input1, input2, input3) {
  
  # do funky stuff
  ratio <- input1 / (input2 + 4)
  funky_output <- input3 * ratio + secret_ingredient
  
  return(funky_output)  
}

funky(3, 2, 5)
## [1] 5.5

The funky() function takes inputs input1, input2, and input3, but its instructions also depend on the quantity secret_ingredient. What happens if we change the value of secret_ingredient?

secret_ingredient <- 100
funky(3, 2, 5)
## [1] 102.5

Even though we give it the same arguments as previously, the output of the function is different. This sort of behavior is confusing. Unless the function involves generating random numbers, we would generally expect it to return the exact same output if we give it the same inputs. Even worse, we get a rather cryptic error if the value of secret_ingredient is not compatible with what the function expects:

secret_ingredient <- "A"
funky(3, 2, 5)
## Error in input3 * ratio + secret_ingredient: non-numeric argument to binary operator

If we are not careful, we will end up with confusing code that can easily lead to unintended results and errors.

To avoid this issue, it is important for a function to use only information that is explicitly provided to it through its arguments. This is the principle of isolating the inputs. If the result of a function is supposed to depend on a quantity, then we should include that quantity among the input arguments. We can fix our example function by including secret_ingredient as an argument:

secret_ingredient <- 3

funkier <- function(input1, input2, input3, secret_ingredient) {
  
  # do funky stuff
  ratio <- input1 / (input2 + 4)
  funky_output <- input3 * ratio + secret_ingredient
  
  return(funky_output)  
}

funkier(3, 2, 5, 3)
## [1] 5.5

Now the output of the function is always the same, regardless of the value of other objects in R:

secret_ingredient <- 100
funkier(3, 2, 5, 3)
## [1] 5.5

The input parameter secret_ingredient holds sway here, even though there is also an object with the same name.1 If we want to use 100, we have to do so explicitly:

funkier(3, 2, 5, 100)
## [1] 102.5

When writing your own functions, it may not be obvious that your function depends on external quantities and does not isolate the inputs. In our experience, one of the best ways to detect this issue is to clear the R environment to start from a clean slate, run the code to create the function, and call the function (perhaps more than once) to ensure that it works as expected. Here is an illustration of what happens when we follow this process with our problematic custom function:

# clear environment
rm(list=ls()) 

# create function
funky <- function(input1, input2, input3) {
  
  # do funky stuff
  ratio <- input1 / (input2 + 4)
  funky_output <- input3 * ratio + secret_ingredient
  
  return(funky_output)  
}

# test the function
funky(3, 2, 5)
## Error in funky(3, 2, 5): object 'secret_ingredient' not found

We get an error because the external quantity is not available. Here is the same process using our corrected function:

# clear environment
rm(list=ls()) 

# create function
funkier <- function(input1, input2, input3, secret_ingredient) {
  
  # do funky stuff
  ratio <- input1 / (input2 + 4)
  funky_output <- input3 * ratio + secret_ingredient
  
  return(funky_output)  
}


# test the function
funkier(3, 2, 5, secret_ingredient = 3)
## [1] 5.5

2.2.3 Using Named Arguments

When calling a function (whether it is a built-in function or a custom function that you developed), you can specify which values correspond to which arguments using argument names. Using argument names greatly enhances the readability and flexibility of function calls. When you specify inputs by name, R matches the values to the arguments based on the names rather than the order in which they appear. This feature is particularly useful in complex functions with many optional arguments.

For example, consider the function one_run() from 2.2.1:

one_run <- function( N, mn, sd ) {
  vals <- rnorm( N, mean = mn, sd = sd )
  tt <- t.test( vals )
  pvalue <- tt$p.value
  return(pvalue)
}

You could call one_run() with named arguments in any order:

result <- one_run(sd = 0.5, mn = 2, N = 500)

In this call, R knows which value to assign to each argument, so we could list the arguments however we like.

Without naming, we would have to specify the arguments in the exact order that they appear in the function definition:

result <- one_run( 500, 2, 0.5 )

Without the argument names, this line of code is harder to follow—you have to know more about the design of the function to understand how it is being used in this instance.

Getting in the habit of using named arguments will help you avoid errors. If you pass arguments without naming, and in the wrong order, you can end up with very strange results that are hard to diagnose. Even if you get it right, if someone later changes the function (say by adding a new argument in the middle of the list), your code will suddenly break with no explanation.
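To illustrate, suppose a later (hypothetical) revision of one_run() inserts a new argument mu0, the null value for the test, in the middle of the argument list:

one_run <- function( N, mu0 = 0, mn, sd ) {
  vals <- rnorm( N, mean = mn, sd = sd )
  tt <- t.test( vals, mu = mu0 )
  return(tt$p.value)
}

one_run( 500, 2, 0.5 )                # now broken: 2 is taken as mu0,
                                      # 0.5 as mn, and sd is never supplied
one_run( N = 500, mn = 2, sd = 0.5 )  # still works as intended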

2.2.4 Argument Defaults

Default arguments allow you to specify typical values for parameters that a user might not need to change every time they use the function. This can make the function easier to use and less error-prone because the defaults ensure that the function behaves sensibly even when some arguments are not explicitly provided. For example, let us revise the one_run() function from above to use default arguments:

one_run <- function( N = 10, mn = 0, sd = 1 ) {
  vals <- rnorm( N, mean = mn, sd = sd )
  tt <- t.test( vals )
  pvalue <- tt$p.value
  return(pvalue)
}

Now our function has a default for N of 10, for mn of 0, and for sd of 1. This means a user can run the function simply by calling one_run() without any inputs. Doing so will generate a p-value from a sample of 10 observations drawn from a population with a mean of zero and a standard deviation of 1.
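For example, a call with no arguments relies entirely on the defaults:

one_run()   # equivalent to one_run( N = 10, mn = 0, sd = 1 )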

Once we have our function with defaults, we can call the function while specifying only the inputs that differ from the default values:

bigger_sample <- one_run( N = 50 )

The function will use its default values for mn and sd. Using defaults lets the user call the function more succinctly.

Later chapters will have much more to say about the process of writing custom functions, as well as many further illustrations and examples.

2.2.5 Function skeletons

In discussing how to write functions for simulations, we will often present function skeletons. By a skeleton, we mean code that creates a function with a specific set of input arguments, but where the body is left partially or fully unspecified. Here is a cursory example of a function skeleton:

run_simulation <- function( N, J, mu, sigma, tau ) {
  # simulate data
  # apply estimation procedure
  # repeat
  # summarize results
  return(results)
}

In subsequent chapters, we will use function skeletons to outline the organization of code for simulation studies. The skeleton headers make clear what the inputs to the function need to be. Sometimes, we will leave comments in the body of the skeleton to sketch out the general flow of calculations that need to happen. Depending on the details of the simulation, the specifics of these steps might be quite different, but the general structure will often be quite consistent. Finally, the last line of the skeleton indicates the value that should be returned as output of the function. Thus, skeletons are kind of like Mad Libs, but with R code instead of parts of speech.
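As a preview of how a skeleton gets fleshed out, here is a minimal sketch, using a simpler argument list than the skeleton above and the one_run() function from Section 2.2.1, that estimates the rejection rate of the one-sample t-test:

run_simulation <- function( reps = 1000, N = 10, mn = 0, sd = 1 ) {
  # repeat: generate a sample and apply the t-test, reps times
  pvalues <- replicate( reps, one_run( N = N, mn = mn, sd = sd ) )
  # summarize results: proportion of replications rejected at alpha = .05
  results <- mean( pvalues < 0.05 )
  return(results)
}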

2.3 |> (Pipe) dreams

Many of the functions from tidyverse packages are designed to be easy to use in sequence via the |> symbol, or pipe.2 The pipe allows us to compose several functions, writing them as a chain in which the result of each function becomes the first input to the next function. In code written with the pipe, the order of function calls reads like a story book or cake recipe, making it easier to see what is happening at each step in the sequence.

Consider the hypothetical functions f(), g(), and h(), and suppose we want to do a calculation that involves composing all three functions. One way to write this calculation is

res1 <- f(my_data, a = 4)
res2 <- g(res1, b = FALSE)
result <- h(res2, c = "hot sauce")

We have to store the result of each intermediate step in an object, and it takes a careful read of the code to see that we are using res1 as input to g() and res2 as input to h().

Alternatively, we could try to write all the calculations as one line:

result <- h( g( f( my_data, a = 4 ), b = FALSE ), c = "hot sauce" )

This is a mess. It takes very careful parsing to see that the b argument is called as part of g() and the c argument is part of h(), and the order in which the functions appear is not the same as the order in which they are calculated.

With the pipe we can write the same calculation as

result <- 
  my_data |>         # initial dataset
  f(a = 4) |>        # do f() to it
  g(b = FALSE) |>    # then do g()
  h(c = "hot sauce") # then do h()

This addresses all of the issues with our previous attempts: the order in which the functions appear is the same as the order in which they are executed; the additional arguments are clearly associated with the relevant functions; and there is only a single object holding the results of the calculations. Pipes are a very nice technique for writing clear code that is easy for others to follow.3
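As a more concrete (though hypothetical) illustration with real tidyverse functions, suppose dat is a dataset with a grouping variable g and a numeric outcome y:

summary_stats <- 
  dat |>                                 # initial dataset
  filter( !is.na(y) ) |>                 # drop rows with missing outcomes
  group_by( g ) |>                       # then group by g
  summarise( M = mean(y), SD = sd(y) )   # then summarize each group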

2.4 Recipes versus Patterns

As we will elaborate in subsequent chapters, we follow a modular approach to writing simulations, in which each component of the simulation is represented by its own custom function or its own object in R. This modular approach leads to code that always has the same broad structure and where the process of implementing the simulation follows a set sequence of steps. We start by coding a data-generating process, then write one or more data-analysis methods, then determine how to evaluate the performance of the methods, and finally implement an experimental design to examine the performance of the methods across multiple scenarios. Over the next several chapters, we will walk through this process several times.
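In code, this broad structure looks roughly like the following sketch (the function names here are illustrative placeholders, not functions we have defined):

dat  <- generate_data( params )       # data-generating process
res  <- analyze_data( dat )           # data-analysis method(s)
perf <- evaluate_performance( res )   # performance criteria
# ...all repeated across scenarios in an experimental design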

Although we always follow the same broad process, the case studies that we will present are not intended as a cookbook that must be rigidly followed. In our experience, the specific features of a data-generating model, estimator, or research question sometimes require tweaking the template or switching up how we implement some aspect of the simulation. And sometimes, it might just be a question of style or preference. Because of this, we have purposely constructed the examples presented throughout the book to use different variations of our central theme rather than always following the exact same style and structure. We hope that presenting these variants and adaptations will both expand your sense of what is possible and also help you to recognize the core design principles—in other words, to distinguish the forest from the trees. Of course, we would welcome and encourage you to take any of the code verbatim, tweak and adapt it for your own purposes, and use it however you see fit. Adapting a good example is usually much easier than starting from a blank screen.

2.5 Exercises

  1. Revise the one_run() function to use |> instead of storing the simulated sample in vals.

  2. Modify the one_run() function to return a tibble() that includes separate columns for the test statistic (called tt$statistic), the p-value, and the lower and upper end-points of the confidence interval (called tt$conf.int).

  3. Modify the one_run() function so that the sample of data is generated from a non-central t distribution by substituting R’s rt() function in place of rnorm(). Make sure to modify the arguments (and argument names) of one_run() to allow the user to specify the non-centrality and degrees of freedom parameters of the non-central t distribution.

  4. The non-central t distribution is usually parameterized in terms of non-centrality parameter \(\delta\) and degrees of freedom \(\nu\), and these parameters determine the mean and spread of the distribution. Specifically, the mean of the non-central t distribution is \[\text{E}(T) = \delta \times \sqrt{\frac{\nu}{2}} \times \frac{\Gamma((\nu - 1) / 2)}{\Gamma(\nu / 2)},\] where \(\Gamma()\) is the gamma function (called gamma() in R). Create a version of the one_run() function that generates data based on a non-central t distribution, but where the input arguments are mn for the mean and df for the degrees of freedom. Here is a function skeleton to get started:

    one_run <- function( N = 10, mn = 5, df = 4) {
    
      # generate data from non-central t distribution
      # vals <- 
    
      # calculate one-sample t-test
      tt <- t.test( vals )
    
      # compile results into a tibble and return
      # res <- 
    
      return(res)
    }
  5. Modify one_run() to allow the user to specify a hypothesized value for the population mean, to use when calculating the one-sample t-test results.


  1. To learn more about how R determines which values to use when executing a function, see Section 6.4 of Advanced R.↩︎

  2. The pipe is a relatively recent addition to R’s basic syntax. Prior to its inclusion in base R, the magrittr package provided—and still provides—a pipe symbol %>% that works similarly but has some additional syntactic nuances. We use base R’s |> because it is always available, even without loading any additional packages. To learn more about the nuanced %>% pipe and similar operators, see the magrittr package.↩︎

  3. Section 3.4 of R for Data Science (2nd edition) provides more discussion and examples of how to use |>. Chapter 18 of R for Data Science (1st edition) provides more discussion and examples of how to use magrittr's %>%.↩︎