A Coding tidbits
This chapter is not about simulation, but does have a few tips and tricks regarding coding that are worth attending to.
A.1 How to repeat yourself
At the heart of simulation is replication: we want to do the same task over and over. In this book we have showcased a variety of tools to do it. In this section we give a formal presentation of these tools.
A.1.1 Using replicate()
The replicate( n, expr, simplify )
method is a base-R method, where you give it a number n
and an expression expr
to run that many times. You can set simplify = FALSE
to get the results as a list, and if you set simplify = TRUE
then R will try to simplify your results by, as a default, putting it into an array.
For simple tasks where your expression gives you a single number, replicate will give you a list of numbers:
## [1] 0.3333333 0.3333333 1.0000000 2.0000000 0.6666667
If you do not simplify, you then will need to massage your results:
one_run <- function() {
dd = rpois( 3, lambda = 1 )
tibble( mean = mean( dd ), sd = sd( dd ) )
}
rps <- replicate( 2, one_run(), simplify = FALSE )
rps
## [[1]]
## # A tibble: 1 × 2
## mean sd
## <dbl> <dbl>
## 1 1 1
##
## [[2]]
## # A tibble: 1 × 2
## mean sd
## <dbl> <dbl>
## 1 1 1.73
In particular, you will probably stack all your tibbles to make one large dataset:
## # A tibble: 2 × 2
## mean sd
## <dbl> <dbl>
## 1 1 1
## 2 1 1.73
Note that you give replicate a full piece of code that would run on its own. The replicate()
method is good for simple tasks, but for more general use, you will probably want to use map()
.
A.1.2 Using map()
The tidyverse way of repeating oneself is the map()
method. The nice thing about map()
is you map over a list of values, and thus can call a function repeatedly, but with a shifting set of inputs.
one_run_v2 <- function( N ) {
dd = rpois( N, lambda = 1 )
tibble( mean = mean( dd ), sd = sd( dd ) )
}
n_list = c(2, 5)
rps <- map( n_list, one_run_v2 )
rps
## [[1]]
## # A tibble: 1 × 2
## mean sd
## <dbl> <dbl>
## 1 0 0
##
## [[2]]
## # A tibble: 1 × 2
## mean sd
## <dbl> <dbl>
## 1 1.6 0.894
You again would want to stack your results:
## # A tibble: 2 × 2
## mean sd
## <dbl> <dbl>
## 1 0 0
## 2 1.6 0.894
We have a small issue here, however, which is we lost what we gave map()
for each call.
If we know we only get one row back from each call, we can add the column directly:
A better approach is to name your list of input parameters, and then your map
function can add those names for you as a new column when you stack:
## # A tibble: 2 × 3
## n mean sd
## <chr> <dbl> <dbl>
## 1 2 1 0
## 2 5 1.2 0.837
An advantage here is if you are returning multiple rows (e.g., one row for each estimator tested in a more complex simulation), all the rows will get named correctly and automatically.
In older tidyverse worlds, you will see methods such as map_dbl()
or map_dfr()
. These will automatically massage your output into the target type. map_dfr()
will automatically bind rows, and map_dbl()
will try to simplify the output into a list of doubles. Modern tidyverse no longer likes this, which we find somewhat sad.
To read more about map()
, check out out Section 21.5 of R for Data Science (1st edition), which provides a more thourough introduction to mapping.
A.1.3 map with no inputs
If you do not have parameters, but still want to use map()
, you can. E.g.,
## # A tibble: 3 × 2
## mean sd
## <dbl> <dbl>
## 1 0.667 1.15
## 2 1 1.73
## 3 0.667 0.577
The weird “(.)” is a shorthand for a function that takes one argument and then calls one_run()
with no arguments. We are using the 1:3 notation to just make a list of the right length (3 replicates, in this case) to map over. A lot of fuss! Just use replicate()
To make all of this more clear, consider passing arguments that you manipulate on the fly:
## # A tibble: 2 × 2
## mean sd
## <dbl> <dbl>
## 1 0.5 1
## 2 0.96 0.790
Anonymous functions, as these are called, can be useful to connect your pieces of simulation together.
A.1.4 Other approaches for repetition
In the past, there was a tidyverse method called rerun()
, but it is currently out of favor.
Originally, rerun()
did exactly that: you gave it a number and a block of code, and it would rerun the block of code that many times, giving you the results as a list.
rerun()
and replicate()
are near equivalents.
As we saw, replicate()
does what its name suggests—it replicates the result of an expression a specified number of times. Setting simplify = FALSE
returns the output as a list (just like rerun()
.
A.2 Default arguments for functions
To generate code both easy to use and configure, use default arguments. For example,
## [1] 1020
## [1] 520
## [1] 1005
## [1] 105
We can still call my_function()
when we don’t know what the arguments are, but then when we know more about the function, we can specify things of interest.
Lots of R commands work exactly this way, and for good reason.
Especially for code to generate random datasets, default arguments can be a lifesaver as you can then call the method before you know exactly what everything means.
For example, consider the blkvar
package that has some code to generate blocked randomized datasets.
We might locate a promising method, and type it in:
## Error in generate_blocked_data(): argument "n_k" is missing, with no default
That didn’t work, but let’s provide some block sizes and see what happens:
## B Y0 Y1
## 1 B1 1.20450363 6.283162
## 2 B1 0.09543094 5.087278
## 3 B1 -0.64219372 4.315043
## 4 B2 -1.30950461 4.004157
## 5 B2 0.40392886 5.386730
Nice! We see that we have a block ID and the control and treatment potential outcomes. We also don’t see a random assignment variable, so that tells us we probably need some other methods as well. But we can play with this as it stands right away.
Next we can see that there are many things we might tune:
## function (n_k, sigma_alpha = 1, sigma_beta = 0, beta = 5, sigma_0 = 1,
## sigma_1 = 1, corr = 0.5, exact = FALSE)
## NULL
The documentation will tell us more, but if we just need some sample data, we can quickly assess our method before having to do much reading and understanding. Only once we have identified what we need do we have to turn to the documentation itself.
A.3 Testing and debugging code in your scripts
If you have an extended script with a list of functions, you might have a lot of code that runs each function in turn, so you can easily remind yourself of what it does, or what the output looks like. One way to keep this code around, but not have it run all the time when you run your script, is to put the code inside a “FALSE block,” that might look like so:
if ( FALSE ) {
res <- my_function( 10, 20, 30 )
res
# Some notes as to what I want to see.
sd( res )
# This should be around 20
}
You can then, when looking at the script, paste the code inside the block into the console when you want to run it. If you source the script, however, it will not run at all, and thus your code will source faster and not print out any extraneous output.
A.4 Keep multiple files of code
Simulations have two general phases: generate your results and analyze your results. The ending of the first phase should be to save the generated results. The beginning of the second phase should then be to load the results from a file and analyze them. These phases can be in a seperate ‘.R’ files This allows for easily changing how one analyzes an experiment without re-running the entire thing.
A.5 The source command and keeping things organized
Once you have your multifactor simulation, if it is a particularly complex one, you will likely have three general collections of code:
- Code for generating data
- Code for analyzing data
- Code for running a single simulation scenario
If each of these pieces is large and complex, you might consider putting them in three different .R
files.
Then, in your primary simulation, you would source these files.
E.g.,
source( "pack_data_generators.R" )
source( "pack_estimators.R" )
source( "pack_simulation_support.R" )
You might also have pack_simulation_support.R
source the other two files, and then source the single simulation support file in your primay file.
One reason for putting code in individual files is you can then have testing code in each of your files (in False blocks, like described above), testing each of your components. Then, when you are not focused on that component, you don’t have to look at that testing code.
Another good reason for this type of modular organizing is you can then have a variety of data generators, forming a library of options. You can then easily create different simulations that use different pieces, in a larger project.
For example, in one recent simulation project on estimators for an Instrumental Variable analysis, we had several different data generators for generating different types of compliance patterns (IVs are often used to handle noncompliance in randomized experiments). Our data generation code file then had several methods:
> ls()
[1] "describe_sim_data" "make_dat" "make.dat.1side"
[4] "make.dat.1side.old" "make.dat.orig" "make.dat.simple"
[7] "make.dat.tuned" "rand.exp" "summarize_sim_data"
The describe and summarize methods printed various statistics about a sample dataset; these are used to debug and understand how the generated data looks. We also had a variety of different DGP methods because we had different versions that came up as we were trying to chase down errors in our estimators and understand strange behavior.
Putting the estimators in a different file also had a nice additional purpose: we also had an applied data example in our work, and we could simply source that file and use those estimators on our actual data. This ensured our simulation and applied analysis were perfectly aligned in terms of the estimators we were using. Also, as we debugged our estimators and tweaked them, we immediately could re-run our applied analysis to update those results with minimal effort.
Modular programming is key.
A.6 Debugging with browser
Consider the code taken from a simulation:
The browser()
command stops your code and puts you in an interactive console where you can look at different objects and see what is happening.
Having it triggered when something bad happens (in this case when a set of estimates has an unexpected NA) can help untangle what is driving a rare event.