Chapter 17 Organizing a simulation project

Multi-factor simulations can easily grow into quite complex projects with many moving pieces. As a project grows in complexity, the steady accretion of functions, scripts, stored results, and analysis can become overwhelming—to the point that you would dread sharing these work products with anyone. Fortunately, a bit of forethought and advance planning can make these types of projects much more tractable and manageable.

In the next several chapters, we describe some organizing principles and programming practices that will make it easier to handle complex simulation projects. In this chapter, we discuss project organization, file management, and storing simulation results. In Chapter 18, we demonstrate parallel processing methods that will help you to speed up the computation involved in multifactor simulations. Finally, in Chapter 19, we introduce some basic debugging and testing techniques that we have found useful for keeping on top of complex code bases.

17.1 Simulation project structure

In a very simple simulation that does not take long to execute, you might keep all of the code, output, and analysis in a single file. However, if any of the computations take substantial time to execute, keeping everything in one file will become unwieldy. For more complex or more computationally demanding simulations, there are big benefits to organizing your code across more than one file.

Multifactor simulation studies usually involve three distinct phases of work, each involving a different set of processes and different types of code. The first phase involves developing functions for generating data, applying estimators or data analysis procedures to the data, calculating performance measures, and running the full simulation for a single scenario. The end-product of this phase is a set of functions or methods for executing a simulation. The second phase involves running the simulations across multiple scenarios, as discussed in Chapter 10. The end-product of this phase is a set of results, containing estimates of performance measures for one or more methods under each condition. The third phase involves analyzing the simulation results and drawing conclusions about the performance of the method or methods under evaluation. The end-product of this phase is often a set of figures, tables, and text (perhaps taking the form of a memo, blog post, slide deck, or manuscript) that summarizes what you find.

To stay organized when conducting a larger, multifactor simulation, we find it generally useful to keep a clear separation between phases. In practice, this means keeping the code for each phase in a separate file (or set of files) and storing the results from each phase so that code for subsequent phases can be re-run or revised without re-executing the entire code base. This approach is in keeping with a general principle for organizing large computational projects: keep the code for distinct tasks in different files.

In this chapter, we describe strategies for organizing the code and work products involved in a multifactor simulation study. We start by offering guidance and recommendations for how to structure individual files and make use of code that is organized across more than one file. We then describe a directory structure that encourages a clear separation between phases of a project. Finally, we discuss strategies for storing and organizing results when executing a simulation. The advice and recommendations we offer are drawn from our own experience working on large simulation projects (including the many mistakes we have made and lessons we have learned through trial-and-error). Of course, we readily acknowledge that the approach we describe here is not the only way to do things, and we make no claims of optimality. As you gain experience with your own work, you will develop your own systems, habits, and strategies. We offer this guidance as a starting point from which to build.

17.2 Well-structured code files

Conducting a simulation will require writing, testing, and revising code (potentially quite a bit of code!). There are two main types of files that analysts use to hold R code: plain .R scripts and R Markdown (.Rmd) or Quarto (.qmd) notebooks. The former can only contain R syntax and comments, whereas the latter can hold a mix of R code chunks and written text. Notebooks serve as the source code for creating reproducible reports: compiling a notebook automatically runs the embedded code and interweaves the results with the text, producing a formatted document that can display figures, tables, and other forms of output.

For very small projects, it may be possible to store your entire code base in a single notebook containing all of your functions, the code for creating a simulation design and executing the simulations, and the code for analyzing the results. An advantage of using a notebook here is that you can mix plain-text explanations and descriptions of the component pieces in with the actual source code, so that rationales and design decisions are fully documented. Further, you can re-run the entire simulation at the click of a button, simply by re-compiling the notebook. However, it is difficult to work this way if the simulation involves more intensive computation.

For larger or more computationally demanding projects, we find it useful to make more use of the first type of file, the humble .R script. Ideally, you should follow the principles of modular programming for the code in these files: each .R file should hold a collection of code that is related to a single task. Accordingly, we distinguish between two types of .R scripts: those that only contain function definitions, which can be sourced for use elsewhere, and those that carry out numerical calculations. The former type just holds functions, so the only consequence of running such a file from top to bottom is to load some new functions into your workspace. The latter type are traditional scripts that do calculations, store or load data files, and create graphs and other summaries of data.

17.2.1 Putting headers in your .R file

When you write an .R script, it is a good idea to put a header at the top of the file that describes the file’s purpose. Within the file, you can also describe the contents with shorter section headers. For instance, a very simple section header might look like this:

#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#
# Data generating functions ----
#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#

Note the ---- at the end of the middle line. Four trailing dashes mark a section of the code that RStudio’s parser can interpret. If you click on the dropdown at the bottom of the RStudio source pane, you will see a pop-up table of contents that allows you to quickly navigate to different parts of your source file. You can also see the same table of contents by clicking the Outline button in the top right corner of the source pane.
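
A file-level header can follow the same pattern. For instance, the top of a functions-only script might look something like this (the details here are, of course, hypothetical):

#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#
# data_generators.R
#
# Functions for generating simulated datasets for the study.
# This file only defines functions; sourcing it has no other
# side effects.
#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#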

17.2.2 The source command

If your code is organized across more than one file, you will need a way to call one file from another. This is the purpose of the source() command, which effectively cuts-and-pastes the contents of the given file into your R session. For example, the following code runs three separate R scripts in turn:

source( here::here( "R/data_generators.R" ) )
source( here::here( "R/estimators.R" ) )
source( here::here( "R/simulation_support.R" ) )

If the named file has code to run, it will run it. If the named file collects a set of methods, those methods will now be available for use. The here::here() command is a convenience function that allows you to specify a file path relative to your R project root directory, so you can easily find your files.23

The source() command is simple and literal—it runs a specified script from top to bottom, regardless of the contents. In particular, if the sourced script includes further calls of source(), the referenced code will be run. For example, the simulation_support.R script could include calls to source the other two files. In this case, you would only need to source the single simulation_support.R file to load the contents of all three files. Although this might seem convenient, it does create some constraints as well. With this structure, there would be no way to source the contents of simulation_support.R without also sourcing data_generators.R and estimators.R. For this reason, we tend to avoid using source() in scripts that might themselves be sourced.
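
To make the trade-off concrete, a simulation_support.R file written in this nested style might begin as follows (a sketch only, with hypothetical contents):

# simulation_support.R
# Note: sourcing this file also pulls in the data-generating
# and estimation functions.
source( here::here( "R/data_generators.R" ) )
source( here::here( "R/estimators.R" ) )

run_CRT_sim <- function( reps, ... ) {
  # ...simulation driver that calls functions from the sourced files...
}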

One reason for storing code in individual files and using source() is that you can then include testing code in each of your files to both demonstrate the syntax and check the correctness of your functions. Then, when you are not focused on that particular function or component, you don’t have to look at that testing code. Another good reason for this type of modular organization is that it supports developing a broader simulation universe, potentially with a variety of data generating functions. With a whole library of options, you can then readily run multiple simulations that involve some different components and some shared features.

We followed this approach in one recent simulation project examining estimators for an instrumental variable analysis, a common approach to handling non-compliance in randomized experiments. In this project, we wrote several different data generating functions that implemented different types of non-compliance patterns, so our data_generators.R code file contained a number of methods. When we sourced it, we ended up with the following methods available to call:

> ls()
[1] "describe_sim_data"  "make_dat"           "make.dat_1side"     
[4] "make_dat_1side_old" "make_dat_orig"      "make_dat_simple"
[7] "make_dat_tuned"     "rand_exp"           "summarize_sim_data"

The describe_sim_data() and summarize_sim_data() methods printed various statistics about a sample dataset; we used these to examine generated datasets and debug the data generating functions. We also had a variety of different data generating methods, which we developed over the course of the project as we were trying to chase down errors in our estimators and understand strange behavior.

For this project, we stored estimation functions in a separate file from the data-generating functions. Doing so made it easier to use these functions for purposes other than running simulations. The project also involved doing some empirical data analysis, and so we could simply source the file with our estimation functions and apply them to a real dataset. This ensured that our simulation and applied analysis were exactly aligned in terms of the estimators we were using. As we debugged and tweaked our estimators, we could immediately re-run our applied analysis to update the results, without worrying about whether the calculations were up-to-date and consistent with the simulation code.
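
As a sketch of this workflow (with hypothetical file and function names), the applied analysis script only needed to source the estimation functions and apply them to the real data:

library(tidyverse)
source( here::here( "R/estimators.R" ) )

# Hypothetical: fit the same IV estimator used in the simulation to the study data
study_dat <- read_csv( here::here( "data/study_data.csv" ) )
IV_est <- estimate_LATE( study_dat )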

17.2.3 Storing testing code

If you have an extended .R file with a collection of functions, you might also want to store some code that runs each function in turn, so you can easily remind yourself of what it does, or what the output looks like. One way to keep this code around, but not have it run all the time when you run your script, is to put the code inside a “FALSE block,” that might look like so:

# My testing code ----
if ( FALSE ) {
  res <- my_function( 10, 20, 30 )
  res
  # Some notes as to what I want to see.
  
  sd( res )
  # This should be around 20
}

When you open and look at this script, you can then paste the code inside the block into the console when you want to run it. However, if you source() the script, the FALSE block will not run at all, so you can avoid extraneous output or computations that are not actually needed. This is a good way to keep simple demo code or testing code in close proximity to the function it is testing. When you want to work on just the part of the project captured by your script, you can work inside the single file very easily, ignoring the other parts of your project.

In previous chapters, we have emphasized the importance of writing code to validate that your functions work as intended. Sometimes, such validation code will involve more than just a quick call to the function. In such cases, it can be beneficial to organize validation code in its own file, separate from the function or functions that it tests. You would then need to source() the files containing the functions to be tested. This approach does have the drawback that the testing code is not in close proximity to the function being tested, so you need to work across multiple files when developing and running the test code. However, unlike with the FALSE blocks, organizing your test code in a separate file makes it easier to re-run a collection of several tests. This approach becomes more appealing when the validation code involves calls to functions that are themselves stored in separate files, or when it involves invoking packages that are not needed for the main calculations of the simulation. It is also closer to how you might work when developing code for an R package, with unit tests that check the correctness of critical functions.24
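
For example, a testing script stored in its own file might look something like the following sketch (the function and argument names are made up); the same structure could also be expressed with formal testthat::test_that() checks:

# test/test_data_generators.R
source( here::here( "R/data_generators.R" ) )

# Hypothetical checks on a data-generating function
dat <- make_dat( N = 1000, ATE = 0.5 )
stopifnot( nrow( dat ) == 1000 )
stopifnot( all( c( "Y", "Z" ) %in% names( dat ) ) )
stopifnot( !any( is.na( dat$Y ) ) )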

17.3 Principled directory structures

Tidy home, tidy mind, as the saying goes. With any computational project, our home is usually a directory on our computer (or in the cloud). For multifactor simulation studies, we strongly advocate using an organized and clearly labelled directory structure, which will facilitate more intentional—and easier to follow—coding practices. We recommend using an integrated development environment such as RStudio, Positron, or VS Code and building a directory structure as follows:

my_project/
  proj.Rproj
  README.md
  R/
  test/
  data/
  scripts/
  results/

If you have ever looked at the source code of an R package or developed your own, this structure will look familiar.

  • The R/ directory is where to put the core R code for your simulation. It should contain one or more .R scripts that hold the main functions for implementing a tidy simulation, including data-generating functions, analysis functions, performance calculation functions, and a simulation driver that pulls all the pieces together. With all of these functions saved in the same directory, you can then source() the scripts as needed to gain access to your functions. You should not put the scripts that actually execute the simulation or analyze the results in this folder. Instead, the R/ folder is reserved for the building blocks of the simulation.

  • The test/ directory is where to put any code for testing the functions and methods you have developed. You could even write formal unit tests for your methods using a testing framework such as the testthat package, and put those in the test/ directory.

  • The data/ directory is where to put any data files used in executing the simulation. For instance, a simulation of a cluster-randomized experiment might make use of an empirical dataset containing features such as cluster sizes or covariate values, which are sampled when generating artificial datasets. This directory is reserved for source data, which may be read in as part of the simulation. It should not contain any files that are created as part of executing the simulation.

  • The scripts/ directory is where to put scripts that actually run the simulations and analyze simulation results. Some of these scripts will make use of functions saved in the R/ directory. You will likely have at least one script that runs the simulation and at least one script for analyzing the results.

  • The results/ directory is where to save any generated results of your simulation. This directory should only contain files that have been created by running code in the scripts/ directory. Sometimes it might be worth having separate directories raw_results and results, where raw_results holds output created from running the simulation driver, and results holds final results produced from merging and summarizing the raw results.

Using this structure will help you to maintain reproducibility of the full project. If any files in data/, R/, or scripts/ are changed, then some or all of the files in results/ may need to be updated. In principle, it should be possible to delete all of the files in results/ and recreate them in full by re-running the files in the scripts/ directory.
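
To make this concrete, a script in scripts/ that runs the simulation might look roughly like the following sketch (the condition factors and the run_CRT_sim() driver are hypothetical):

# scripts/run_simulation.R
library(tidyverse)
library(simhelpers)

# Load the building blocks from R/
source( here::here( "R/data_generators.R" ) )
source( here::here( "R/estimators.R" ) )
source( here::here( "R/simulation_support.R" ) )

# Define the conditions, run the simulation, and store the raw results
params <- expand_grid( N = c(20, 50, 100), ICC = c(0.05, 0.20) )
res <- evaluate_by_row( params, run_CRT_sim, reps = 1000 )
saveRDS( res, here::here( "results/raw_results.rds" ) )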

17.4 Saving simulation results

Multifactor simulations can be error-prone and time-consuming to run, and analyzing raw simulation results often involves several rounds of iteration and refinement. Because of this, it pays to be cautious and save raw simulation results, so that they can be accessed and analyzed without having to re-run the code that generated them. As a bare minimum for any multifactor simulation, we recommend saving the complete set of results to a file. In many instances, it may make sense to go further by saving the results for each unique condition as soon as it is complete. The latter approach provides an additional level of protection in case you need to re-run parts of the simulation, or in case the code crashes in rare circumstances or under only a subset of the conditions.

17.4.1 File formats

An initial consideration is what format to use when saving simulation results. Two main candidates are the generic .csv format and the R-specific .rds format.

The first option is to save each dataset as a comma-separated value file. For instance, after your simulation has completed, you could save the results in the file results/simulation_CRT.csv like so:

dir.create("results", showWarnings = FALSE)
write_csv( res, "results/simulation_CRT.csv" )

The write_csv() function comes from the readr package, which is part of the tidyverse suite. It takes a dataset and a file name as input and creates a comma-separated value file containing the contents of the dataset.25 Once stored, you can load the dataset back into R for analysis using the read_csv() function:

res <- read_csv( "results/simulation_CRT.csv" )

A key advantage of storing results as .csv is that it is a general format, so the file can be read by others even if they are not familiar with R. The format also makes for convenient viewing in a spreadsheet program. However, storing results in this format will only work for plain, rectangular datasets with no special features (such as list-columns). Further, saving a dataset as .csv and then reading it back into R results in a loss of meta-data, such as variable labels and column data-types.26
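
As a quick illustration of the loss of meta-data, consider round-tripping a factor variable through a .csv file:

library(readr)
dat <- tibble::tibble( grade = factor( c("low", "med", "high") ) )
tmp <- tempfile( fileext = ".csv" )
write_csv( dat, tmp )
dat2 <- read_csv( tmp )
class( dat2$grade )
## [1] "character"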

A second option is to use the saveRDS() and readRDS() methods. For instance, you could save a dataset containing simulation results in the file results/simulation_CRT.rds like so:

dir.create("results", showWarnings = FALSE)
saveRDS( res, "results/simulation_CRT.rds" )

The saveRDS() command is a base R function that takes an R object and a file name as input and creates a file that stores the object in a serialized (usually binary) format. Once stored, you can load the object back into R for analysis using the readRDS() function:

res <- readRDS( "results/simulation_CRT.rds" )

One major advantage of storing results in .rds format is that they can be reloaded in the exact same form in which they were saved. There is no conversion between file formats, and so no loss of meta-data about variable formats, labels, or the like. Another advantage is that .rds can hold any type of R object—whether it be a dataset, a function, or a fitted model. Because of this, it can be used to save tibbles even if they include nested list-columns or other more exotic structures. The main disadvantage is that the contents of the file cannot be interpreted without reading it back into R.

17.4.2 Saving simulations as you go

For some projects involving multifactor simulations with multiple data-generating conditions, it can be useful to save results from each condition in its own file. You then need to read in and combine the results across files to get a final dataset containing all the results. This is a prudent step to take if you are not sure you will have time to run the entire simulation, if you are worried that your R session might crash halfway through because of an error, or if you are trying to debug an error that only occurs under certain conditions. This approach to saving results is also effective when running simulations in parallel, as we will discuss in Chapter 18. With a bit of further machinery, it also makes it possible to selectively delete files and rerun only parts of a larger simulation.27

We will illustrate this approach to storing results by revisiting the simulation study on confidence intervals for Pearson’s correlation coefficient under a bivariate Poisson distribution, building on our development in Chapter 10. This simulation involved four different factors, each with multiple levels, with data-generating conditions created as follows:

params <- expand_grid( 
  N = c(10, 20, 30),
  mu1 = c(4, 8, 12),
  lambda = c(0.5, 1.0),
  rho = seq(0.0, 0.7, 0.1) 
) %>%
  mutate( mu2 = mu1 * lambda )
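
Crossing the levels of these factors yields 3 × 3 × 2 × 8 = 144 distinct conditions:

nrow(params)
## [1] 144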

In Chapter 10, we used bundle_sim() to create a simulation driver for executing the simulation for a given scenario and then used evaluate_by_row() to call the simulation driver for every condition listed in params:

library(simhelpers)

Pearson_sim <- bundle_sim( 
  f_generate = r_bivariate_Poisson, 
  f_analyze = r_and_z, 
  f_summarize = evaluate_CIs 
)

sim_results <- evaluate_by_row( params, Pearson_sim, reps = 100 )

To store the results from each condition in its own file, we need to modify the simulation driver to save its result to file rather than just returning it. One way to do so is to wrap Pearson_sim() inside another function:

Pearson_sim_save <- function(..., filename = NA_character_) {
  res <- safely(Pearson_sim)(...)
  if (is.null(res$error)) saveRDS(res$result, file = filename)
  return(res$error)
}

This function calls the simulation driver, Pearson_sim(), passing along any arguments given in the .... To prevent an error from stopping the entire simulation, we have wrapped the simulation driver in safely() (from the purrr package) to trap errors. As long as Pearson_sim() does not return an error, the wrapper saves the results from the simulation in a file with the user-specified filename. The function returns any errors as output so that they can be examined to figure out what went wrong.

To use the function, we’ll first need to create a file name for every condition in the simulation. We also create a unique seed for each condition, which will get passed to the simulation driver to ensure computational reproducibility.

params <- 
  params %>%
  mutate(
    seed = 20260129 + row_number(),
    filename = paste0("results/pearson-sim/pearson-row", seed, ".rds")
  ) %>%
  select(-lambda)
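
One practical detail: saveRDS() will not create missing directories for us, so we need to make sure the destination folder exists before running the simulation:

# Create the directory that will hold the per-condition result files
dir.create( "results/pearson-sim", recursive = TRUE, showWarnings = FALSE )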

We can then call evaluate_by_row() with Pearson_sim_save() as the simulation driver:

sim_errors <- evaluate_by_row(params, Pearson_sim_save, reps = 100 )

This will create a set of 144 files containing the simulation results from every condition. The object sim_errors will include a column containing the error messages from any conditions that did not complete. If some conditions produce errors, or if we have to stop execution before every condition has been run, the completed results will still be saved, and we will only need to re-run the remaining conditions.

To illustrate how to re-run a subset of results, suppose that we had only been able to run and save results for the first three conditions. The directory containing simulation results will then only contain three files:

completed_rows <- list.files("results/pearson-sim", full.names = TRUE)
completed_rows
## [1] "results/pearson-sim/pearson-row20260130.rds"
## [2] "results/pearson-sim/pearson-row20260131.rds"
## [3] "results/pearson-sim/pearson-row20260132.rds"

To avoid repeating the simulations for these conditions, we can screen them out before running evaluate_by_row() again, as in:

sim_errors <- 
  params %>%
  filter(!(filename %in% completed_rows)) %>%
  evaluate_by_row(Pearson_sim_save, reps = 100 )

Once all the conditions are run, the directory results/pearson-sim will have a set of files containing all the simulation results. The only thing left to do is read these files in and combine them into a single dataset. To do so, we will first get the names of the files with list.files() and check that everything is complete:

completed_rows <- list.files("results/pearson-sim", full.names = TRUE)
setdiff(params$filename, completed_rows)
## character(0)

We can then read in the files using map() to call readRDS() on each file:

sim_results <- 
  params %>%
  # ensure we only read in completed conditions
  filter(filename %in% completed_rows) %>%
  # read in results from every condition
  mutate(
    res = map(filename, readRDS)
  ) %>%
  unnest(res) %>%
  select(-filename)

The dataset sim_results now has all the results from every condition evaluated, organized by stacking the fragments from each of the individual conditions. Because we start from params and match on the stored filename variable, the parameter settings of each condition are available in the same dataset as the results. To avoid having to repeat the process of reading in the files, we could also save the full set of results in a separate .rds file:

saveRDS(sim_results, file = "results/pearson-simulation-all-results.rds")

  23. If you are used to using RStudio projects, you might wonder why you should use here::here(). If you always open your project from the .Rproj file, then the working directory will always start out set to the root directory, but it will not necessarily remain so (if a script calls setwd(), for instance). Further, when .Rmd or .qmd notebooks are compiled, the working directory is temporarily changed to the location of the notebook, so relative file references will work differently than in plain .R scripts. Wrapping file names in here::here() ensures that they will always be evaluated relative to the root directory of the project.↩︎

  24. For more on unit testing, see Chapters 13 through 15 of @Wickham2023packages.↩︎

  25. See Section 11.5 of the R for Data Science textbook [@Wickham2023data] for further details.↩︎

  26. For example, a variable that was initially created with factor() will be converted into a plain character vector, and the ordering of the factor levels will be lost.↩︎

  27. Of course, this requires some care to ensure everything is still reproducible. Before you run off and publish results from the simulation, you might want to rerun everything from scratch to confirm everything is up to date, to avoid potentially embarrassing errors.↩︎