Chapter 2 Programming Preliminaries

Our goal in this book is not only to introduce the conceptual principles of Monte Carlo simulation, but also to provide a practical guide to actually conducting simulation studies (whether for personal learning purposes or for formal methodological research). And conducting simulations requires writing computer code—lots of code! The computational principles and practices that we will describe are very general, not specific to any particular programming language, but for purposes of demonstrating, presenting examples, and practicing the process of developing a simulation, it helps to be specific. To that end, we will be using R, a popular programming language that is widely used among statisticians, quantitative methodologists, and data scientists. Our presentation will assume that readers are comfortable with writing R scripts to carry out tasks such as cleaning variables, summarizing data, creating data-based graphics, and running regression models (or more generally, estimating statistical models).

In this chapter, we introduce some essential programming concepts that may be less familiar to readers, but which are central to how we approach programming simulation studies. We also explain some of the rationale and reasoning behind how we present example code throughout the book.

2.1 Why R?

We have chosen to focus on R (rather than some other programming language) because both of us are intimately familiar with R and use it extensively in our day-to-day work. Simply put, it is much easier to write in your native language than in one in which you are less fluent. But beyond our own habits and preferences, there are several more principled reasons for using R.

R is free and open source software, which can be run under many different operating systems (Windows, Mac, Linux, etc.). This is advantageous not only because of the cost, but also because it means that anyone with a computer—anywhere in the world—can access the software and could, if they wanted, re-run our provided code for themselves. This makes R a good choice for practicing transparent and open science processes.

There is a very large, active, and diverse community of people who use, teach, and develop R. It is used widely for applied data analysis and statistical work in such fields as education, psychology, economics, epidemiology, public health, and political science, and is widely taught in quantitative methods and applied statistics courses. Integral to the appeal of R is that it includes tens of thousands of contributed packages, which extend the core functionality of the language in myriad ways. New statistical techniques are often quickly available in R, or can be accessed through R interfaces. Increasingly, R can also be used to interface with other languages and platforms, such as running Python code via the reticulate package, running Stan programs for Bayesian modeling via RStan, or calling the h2o machine learning library using the h2o package (Fryda et al. 2014). The huge variety of statistical tools available in R makes it a fascinating place to learn and practice.

R does have a persistent reputation as being a challenging and difficult language to use. This reputation might be partly attributable to its early roots, having been developed by highly technical statisticians who did not put great emphasis on accessibility, legibility of code, or ease of use. However, as the R community has grown, the availability of introductory documentation and learning materials has improved drastically, so that it is now much easier to access pedagogical materials and find help.

R’s reputation also probably partly stems from being a decentralized, open source project with many, many contributors. Contributed R packages vary hugely in quality and depth of development; there are some amazingly powerful tools available but also much that is half-baked, poorly executed, or flat out wrong. Because there is no central oversight or quality control, the onus is on the user to critically evaluate the packages that they use. For newer users especially, we recommend focusing on more established and widely used packages, seeking input and feedback from more knowledgeable users, and taking time to validate functionality against other packages or software when possible.

A final contributor to R’s intimidating reputation might be its extreme flexibility. As both a statistical analysis package and a fully functional programming language, R can do many things that other software packages cannot, but this also means that there are often many different ways to accomplish the same task. In light of this situation, it is good to keep in mind that knowing a single way to do something is usually adequate—there is no need to learn six different words for hello, when one is enough to start a conversation.

On balance, we think that the many strengths of R make it worthwhile to learn and continue exploring. For simulation in particular, R’s facilities for writing functions (bundles of commands that can be called with different inputs), for working with multiple datasets at the same time, and for leveraging the vast array of other people’s work all make it a very attractive language.

2.2 Welcome to the tidyverse

Layered on top of R is a collection of contributed packages that make data wrangling and management much, much easier. This collection is called the tidyverse, and it includes popular packages such as dplyr, tidyr, and ggplot2. We use methods from the tidyverse throughout the book because it facilitates writing clean, concise code. In particular, we make heavy use of the dplyr package for group-wise data manipulation, the purrr package for functional programming, and the ggplot2 package for statistical graphics. Both the 1st and 2nd editions of the free online textbook R for Data Science provide an excellent, thorough introduction to these packages, along with much more background on the tidyverse. We will cite portions of this text throughout the book.

Loading the tidyverse packages is straightforward:

library( tidyverse )
options( dplyr.summarise.inform = FALSE )

(The second line turns off some persistent messages generated by the dplyr function summarize().) These lines of code appear in the header of nearly every script we use.

2.3 Functions

If you are comfortable using R for data analysis tasks, you are likely familiar with many of R’s functions. R has functions to do things like calculate a summary statistic from a list of numbers (e.g., mean(), median(), or sd()), calculate linear regression coefficient estimates from a dataset (lm()), or count the number of rows in a dataset (nrow()). In the abstract, a function is a little machine for transforming ingredients into outputs, like a microwave (put a bag of kernels in and it will return hot, crunchy popcorn), a cheese shredder (put a block of mozzarella in and it transforms it into topping for your pizza), or a washing machine (put in dirty clothes and detergent and it will return clean but damp clothes). A function takes in pieces of information specified by the user (the inputs), follows a set of instructions for transforming or summarizing those inputs, and then returns the result of the calculations (the outputs).

A function can do nearly anything as long as the calculation can be expressed in code—it can even produce output that is random. For example, the rnorm() function takes as input a number n and returns that many random numbers, drawn from a standard normal distribution:

rnorm(3)
## [1] -0.1833913 -0.4656615  0.6691481

Each time the function is called, it returns a different set of numbers:

rnorm(3)
## [1] -0.4005410  1.0681366 -0.5178925

The rnorm() function also has further input arguments that let the user specify the mean and standard deviation of the distribution from which numbers are drawn:

rnorm(3, mean = 10, sd = 0.5)
## [1] 10.284333  9.086475  9.441631
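
As an aside, if you ever need random results to be reproducible (for example, to check that a script gives the same answer twice), you can fix the state of R’s random number generator by calling set.seed() before generating numbers. Here is a minimal sketch:

set.seed( 42 )               # fix the random number generator's state
draws1 <- rnorm( 3 )

set.seed( 42 )               # reset to the same seed
draws2 <- rnorm( 3 )

identical( draws1, draws2 )  # the two sets of draws are identical: TRUE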

2.3.1 Rolling your own

In R, you can create your own function by specifying the pieces of input information, the steps to follow in transforming the inputs, and the result to return as output. Learning to write your own functions to carry out calculations is an immensely useful skill that will greatly enhance your ability to accomplish a range of tasks. Writing custom functions is also central to our approach to coding Monte Carlo simulations, and so we highlight some of the key considerations here. Chapter 19 of R for Data Science (1st edition) provides an in-depth discussion of how to write your own functions.

Here is an example of a custom function called one_run():

one_run <- function( N, mn ) {
  vals <- rnorm( N, mean = mn )
  tt <- t.test( vals )
  pvalue <- tt$p.value
  return(pvalue)
}

The first line specifies that we are creating a function that takes inputs N and mn. These are called the parameters, inputs, or arguments of the function. The remaining lines inside the curly brackets are called the body of the function. These lines specify the instructions to follow in transforming the inputs into an output:

  1. Generate a random sample of N observations from a normal distribution with mean mn and store the result in vals.
  2. Use the built-in function t.test() to compute a one-sample t-test for the null hypothesis that the population mean is zero, then store the result in tt.
  3. Extract the p-value from the t-test and store the result in pvalue.
  4. Return pvalue as output.

Having created the function, we can then use it with any inputs that we like:

one_run( 100, 5 )
## [1] 1.553092e-72
one_run( 10, 0.3 )
## [1] 0.8752302
one_run( 10, 0.3 )
## [1] 0.448616

In each case, the output of the function is a p-value from a simulated sample of data. The function produces a different answer each time because its instructions involve generating random numbers each time it is called. In essence, our custom function is just a short-cut for carrying out its instructions. Writing it saves us from having to repeatedly write or copy-paste the lines of code inside the body.
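
For example, we can call one_run() many times with the built-in replicate() function and then summarize the resulting p-values. This is a minimal sketch of the kind of repetition that simulations are built on:

# call one_run() 1000 times, collecting the p-values into a vector
pvals <- replicate( 1000, one_run( 10, 0.3 ) )

# proportion of simulated p-values below 0.05
mean( pvals < 0.05 )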

2.3.2 A dangerous function

Writing custom functions will prove to be crucial for effectively implementing Monte Carlo simulations. However, designing custom functions does take practice to master. It also requires a degree of care above and beyond what is needed just to use R’s built-in functions.

One of the common mistakes encountered in writing custom functions is to let the function depend on information that is not part of the input arguments. For example, consider the following script, which includes a nonsensical custom function called funky():

secret_ingredient <- 3

funky <- function(input1, input2, input3) {
  
  # do funky stuff
  ratio <- input1 / (input2 + 4)
  funky_output <- input3 * ratio + secret_ingredient
  
  return(funky_output)  
}

funky(3, 2, 5)
## [1] 5.5

The function funky() takes inputs input1, input2, and input3, but its instructions also depend on the quantity secret_ingredient, which is defined outside the function. What happens if we change the value of secret_ingredient?

secret_ingredient <- 100
funky(3, 2, 5)
## [1] 102.5

Even though we give it the same arguments as previously, the output of the function is different. This sort of behavior is confusing. Unless the function involves generating random numbers, we would generally expect it to return the exact same output if we give it the same inputs. Even worse, we get a rather cryptic error if the value of secret_ingredient is not compatible with what the function expects:

secret_ingredient <- "A"
funky(3, 2, 5)
## Error in input3 * ratio + secret_ingredient: non-numeric argument to binary operator

If we are not careful, we will end up with very confusing code that can produce unintended results.

To avoid this issue, it is important for a function to use only information that is explicitly provided to it through its arguments. This is the principle of isolating the inputs. If the result of a function is supposed to depend on a quantity, then we should include that quantity among the input arguments. We can fix our example function by including secret_ingredient as an argument:

secret_ingredient <- 3

funkier <- function(input1, input2, input3, secret_ingredient) {
  
  # do funky stuff
  ratio <- input1 / (input2 + 4)
  funky_output <- input3 * ratio + secret_ingredient
  
  return(funky_output)  
}

funkier(3, 2, 5, 3)
## [1] 5.5

Now the output of the function is always the same, regardless of the value of other objects in R:

secret_ingredient <- 100
funkier(3, 2, 5, 3)
## [1] 5.5

The argument secret_ingredient now takes precedence over any object of the same name in the environment. If we want to use a value of 100, we have to pass it explicitly:

funkier(3, 2, 5, 100)
## [1] 102.5

When writing your own functions, it may not be obvious whether a function depends on external quantities rather than isolating its inputs. In our experience, one of the best ways to detect this issue is to clear the R environment so that you start with a clean slate, run the code to create the function, and then call the function (perhaps more than once) to check that it works as expected. Here is an illustration of what happens when we follow this process with our problematic custom function:

# clear environment
rm(list=ls()) 

# create function
funky <- function(input1, input2, input3) {
  
  # do funky stuff
  ratio <- input1 / (input2 + 4)
  funky_output <- input3 * ratio + secret_ingredient
  
  return(funky_output)  
}

# test the function
funky(3, 2, 5)
## Error in funky(3, 2, 5): object 'secret_ingredient' not found

We get an error because the external quantity is not available. Here is the same process using our corrected function:

# clear environment
rm(list=ls()) 

# create function
funkier <- function(input1, input2, input3, secret_ingredient) {
  
  # do funky stuff
  ratio <- input1 / (input2 + 4)
  funky_output <- input3 * ratio + secret_ingredient
  
  return(funky_output)  
}


# test the function
funkier(3, 2, 5, secret_ingredient = 3)
## [1] 5.5

2.3.3 Using Named Arguments

When calling a function, you can specifically name which values correspond to which arguments. Named arguments greatly enhance the readability and flexibility of function calls. When you specify inputs by name, R matches the values to the arguments based on the names rather than the order in which they appear. This feature is particularly useful in complex functions with many optional arguments.

For example, say you had a function simulate():

simulate <- function( trials, model, seed ) {
  # Simulation code here
}

You could call simulate() with named arguments in any order:

result <- simulate(seed = 123, model = "normal", trials = 500)

In this call, R knows which value to assign to each argument, allowing the user to skip specifying arguments in their defined order.

Without names, you would have to supply the values in the order in which the arguments are defined in the function:

simulate( 500, "normal", 123 )

Getting in the habit of using named arguments will help you avoid errors. If you pass arguments without naming them, and in the wrong order, you can end up with very strange results that are hard to diagnose. Even if you get the order right, if someone later changes the function (say, by adding a new argument in the middle of the list), your code can suddenly break or silently produce incorrect results.
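
As a small illustration with a built-in function (rather than the hypothetical simulate()), swapping the order of unnamed arguments silently changes the meaning of a call to rnorm():

rnorm( 3, 10, 0.5 )              # 3 draws with mean 10 and standard deviation 0.5
rnorm( 3, 0.5, 10 )              # still runs, but now mean 0.5 and standard deviation 10
rnorm( 3, sd = 0.5, mean = 10 )  # with named arguments, the order does not matter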

2.3.4 Argument Defaults

Default arguments allow you to specify typical values for parameters that a user might not need to change every time they use the function. This makes the function easier to use and less error-prone, as the defaults should ensure that the function behaves sensibly even when some arguments are not explicitly provided.

For example, let us revise the simulate() function from above to use default arguments:

simulate <- function(trials = 1000, model = "binomial", seed = NULL) {
  # Simulation code here
}

Now our function has a default for trials of 1000, for model of ‘binomial’, and for seed of NULL. This means a user can run a basic simulation simply by calling simulate() without any inputs. Doing so will perform 1000 trials using a binomial model without first setting a seed value.

Once we have our function with defaults, we can call the function while specifying only the inputs that differ from the default values:

# Customized simulation with a different model
custom_simulation <- simulate( model = "poisson" )

The function will use its default values for trials and seed. Using defaults lets the user call the function more succinctly.
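
To see default arguments in action with a function that actually runs, we could give a variant of our earlier one_run() function a default mean of zero. This variant is our own illustration, not a function used elsewhere in the book:

one_run_d <- function( N, mn = 0 ) {
  vals <- rnorm( N, mean = mn )
  tt <- t.test( vals )
  return( tt$p.value )
}

one_run_d( 50 )            # uses the default mean of 0
one_run_d( 50, mn = 0.4 )  # overrides the default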

Later chapters will have much more to say about—and many further illustrations of—the process of writing custom functions.

2.3.5 Function skeletons

In discussing how to write functions for simulations, we will often present function skeletons. By a skeleton, we mean code that creates a function with a specific set of input arguments, but where the body is left partially or fully unspecified. Here is a cursory example of a function skeleton:

run_simulation <- function( N, J, mu, sigma, tau ) {
  # simulate data
  # apply estimation procedure
  # repeat
  # summarize results
  return(results)
}

In subsequent chapters, we will use function skeletons to outline the organization of code for simulation studies. The skeleton headers make clear what the inputs to the function need to be. Sometimes, we will leave comments in the body of the skeleton to sketch out the general flow of calculations that need to happen. Depending on the details of the simulation, the specifics of these steps might be quite different, but the general structure will often be quite consistent. Finally, the last line of the skeleton indicates the value that should be returned as output of the function. Thus, skeletons are kind of like Mad Libs, but with R code instead of parts of speech.

2.4 %>% (Pipe) dreams

Many of the functions from tidyverse packages are designed to be easy to use in sequence via the %>% symbol, or pipe, provided by the magrittr package. The pipe allows us to compose functions, that is, to write a chain of function calls as a sequence in which the result of each function becomes the first input to the next. In code written with the pipe, the order of the function calls reads like a story book or a cake recipe, making it easier to see what is happening at each step in the sequence.

Consider the hypothetical functions f(), g(), and h(), and suppose we want to do a calculation that involves composing all three functions. One way to write this calculation is

res1 <- f(my_data, a = 4)
res2 <- g(res1, b = FALSE)
result <- h(res2, c = "hot sauce")

We have to store the result of each intermediate step in an object, and it takes a careful read of the code to see that we are using res1 as input to g() and res2 as input to h().

Alternately, we could try to write all the calculations as one line:

result <- h( g( f( my_data, a = 4 ), b = FALSE ), c = "hot sauce" )

This is a mess. It takes very careful parsing to see that the b argument is passed to g() and the c argument to h(), and the order in which the functions appear is the reverse of the order in which they are calculated.

With the pipe we can write the same calculation as

result <- 
  my_data %>%        # initial dataset
  f(a = 4) %>%       # do f() to it
  g(b = FALSE) %>%   # then do g()
  h(c = "hot sauce") # then do h()

This addresses all of the issues with our previous attempts: the order in which the functions appear is the same as the order in which they are executed; the additional arguments are clearly associated with the relevant functions; and there is only a single object holding the results of the calculations. Pipes are a very nice technique for writing clear code that is easy for others to follow. (Chapter 18 of R for Data Science, 1st edition, provides more discussion and examples of how to use %>%.)
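
To make this concrete with real functions, here is a small pipeline of our own using dplyr verbs on the built-in mtcars dataset (the dataset and variable names are purely for illustration):

mtcars %>%                            # built-in example dataset
  filter( hp > 100 ) %>%              # keep cars with more than 100 horsepower
  group_by( cyl ) %>%                 # group by number of cylinders
  summarise( mean_mpg = mean( mpg ),  # average miles per gallon in each group
             n_cars = n() )           # number of cars in each group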

2.5 Recipes versus Patterns

As we will elaborate in subsequent chapters, we follow a modular approach to writing simulations, in which each component of the simulation is represented by its own custom function or its own object in R. This modular approach leads to code that always has the same broad structure and where the process of implementing the simulation follows a set sequence of steps. We start by coding a data-generating process, then write one or more data-analysis methods, then determine how to evaluate the performance of the methods, and finally implement an experimental design to examine the performance of the methods across multiple scenarios. Over the next several chapters, we will walk through this process several times.
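
To give a rough sense of what this modular structure can look like in code, here is a hedged sketch with one skeleton function per component; the function names and arguments are illustrative placeholders rather than the exact conventions developed in later chapters:

gen_data <- function( N, mu, sigma ) {
  # simulate a dataset from the data-generating process
}

analyze_data <- function( dat ) {
  # apply one or more estimation methods and return their results
}

one_rep <- function( N, mu, sigma ) {
  dat <- gen_data( N, mu, sigma )
  analyze_data( dat )
}

# run one_rep() many times, then summarize performance across repetitions
# and across the scenarios of the experimental design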

Although we always follow the same broad process, the case studies that we will present are not intended as a cookbook that must be rigidly followed. In our experience, the specific features of a data-generating model, estimator, or research question sometimes require tweaking the template or switching up how we implement some aspect of the simulation. And sometimes, it might just be a question of style or preference. Because of this, we have purposely constructed the examples presented throughout the book to use different variations of our central theme rather than always following the exact same style and structure. We hope that presenting these variants and adaptations will both expand your sense of what is possible and also help you to recognize the core design principles—in other words, to distinguish the forest from the trees. Of course, we would welcome and encourage you to take any of the code verbatim, tweak and adapt it for your own purposes, and use it however you see fit. Adapting a good example is usually much easier than starting from a blank screen.

References

Fryda, Tomas, Erin LeDell, Navdeep Gill, Spencer Aiello, Anqi Fu, Arno Candel, Cliff Click, et al. 2014. “H2o: R Interface for the ‘H2O’ Scalable Machine Learning Platform.” Comprehensive R Archive Network. https://doi.org/10.32614/CRAN.package.h2o.
