Chapter 18 Parallel Processing

Especially if you take our advice of “when in doubt, go more general,” and if you are running enough replicates to get nice and small Monte Carlo errors, you will quickly come up against the computational limits of your computer. Simulations can be incredibly computationally intensive, and there are a few means of dealing with that. The first is to optimize code by removing extraneous calculation (e.g., by writing methods from scratch rather than using the safety-checking, and thus sometimes slower, methods in R, or by saving calculations that are shared across different estimation approaches) to make it run faster. This approach is usually quite hard, and the benefits often minimal; see Appendix @ref(optimize-code) for further discussion and examples. The second is to use more computing power by making the simulation parallel. This latter approach is the topic of this chapter.

Parallel computation is where you have multiple computers working on a problem at the same time. Simulations are very easy to do in parallel. With a multifactor experiment, it is easy to break the overall job into pieces. For example, each computer could do one of your scenarios in a multifactor simulation. Even without a multifactor experiment, due to the cycle of “generate data, then analyze,” it is easy to have a bunch of computers doing the same thing, as long as they are generating different datasets, with a final collection step where all the individual iterations are combined into a single set at the end. In either case, the total work is the same, but since each computer is doing a piece at the same time, the overall simulation will be completed much sooner.

There are three general ways to do parallel computation. The first is to take advantage of the fact that most modern computers come with multiple cores (i.e., processing units) built in. For this simplest approach, we simply tell R to use more of this processing power. If your computer has eight cores, you can easily get a near eight-fold increase in the speed of your simulation.

The second is to use cloud computing, putting your simulation on a virtual computer that might have 50 or more cores. Other than having to move files back and forth between your machine and the cloud, this is not much different from the first approach: you are still just telling R to use more cores, but now you have more cores to use. Even if the speed-up itself is not massive, this approach takes the computing off your computer entirely, making it easier to set up a job to run for days or weeks without making your day-to-day life any more difficult.

The third way is to use a computing cluster. A cluster is a network of hundreds or thousands of computers, coupled with commands for breaking apart a simulation into pieces and sending the pieces to that vast army. Conceptually, this approach is the same as when you do baby parallel on your desktop: more cores equals more simulations per minute and thus faster simulation overall. But the interface to a cluster can be a bit tricky, and very cluster-dependent.

But once you get a cluster up and running, it can be a very powerful tool. Not only does it take the work off your personal computer, but a cluster can potentially give you hundreds of cores, which means a speed-up of hundreds rather than four or eight.

18.1 Parallel on your computer

Most modern computers have multiple cores, so you can run a parallel simulation right in the privacy of your own home!

To assess how many cores your computer has, you can use the detectCores() function in the parallel package:

parallel::detectCores()
## [1] 4

Normally, unless you tell it to do otherwise, R only uses one core. This is obviously a bit lazy on R’s part. But it is easy to take advantage of multiple cores using the future and furrr packages.

library(future)
library(furrr)

In particular, the furrr package replicates our map functions, but in parallel. We first tell our R session what kind of parallel processing we want using the future package. In general, plan(multisession) is the cleanest: it will start one entire R session per core, and have each session do work for you. The alternative, multicore, does not seem to work well on Windows machines, nor with RStudio in general.

The call is simple:

plan(multisession, workers = parallel::detectCores() - 1 )

The workers parameter specifies how many of your cores you want to use. Using all but one will let your computer still operate mostly normally for checking email and so forth. You are carving out a bit of space for your own adventures.
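When the parallel work is done (or when debugging, where parallel evaluation can obscure errors), you can switch back to ordinary single-core evaluation; a minimal sketch using the future package's built-in plans:

```r
library( future )

# Use all but one core while the simulation runs:
plan( multisession, workers = parallel::detectCores() - 1 )

# ... run the parallel simulation here ...

# Return to ordinary, sequential evaluation when finished:
plan( sequential )
```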

Once you set up your plan, you use future_pmap(); it works just like pmap() but evaluates across all the workers specified in the plan call. Here we run a parallel version of the multifactor experiment discussed in Chapter @ref(exp_design) (see Chapter @ref(case_Cronback) for the simulation itself).

tictoc::tic()
params$res = future_pmap(params,
                         .f = run_alpha_sim,
                         .options = furrr_options(seed = NULL))
tictoc::toc()

Note the .options = furrr_options(seed = NULL) part of the call. This is to silence some warnings. Given how tasks are handed out, R will get upset if you do not do some handholding regarding how it should set seeds for pseudorandom number generation. In particular, if you do not manage the seeds, the multiple sessions could end up having the same starting seed and thus run the exact same simulations (in principle). We have seen before how to set a specific seed for each simulation scenario, but furrr does not know we have done this. Hence the extra argument about seeds: we are being explicit that we are handling seed setting on our own.
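If you have not baked scenario-specific seeds into your simulation, you can instead let furrr manage seeding for you by passing seed = TRUE, which hands each task a statistically independent random-number stream; a minimal sketch (run_alpha_sim is the chapter's simulation function):

```r
library( furrr )

# Let furrr give every task its own independent RNG stream:
params$res = future_pmap( params,
                          .f = run_alpha_sim,
                          .options = furrr_options( seed = TRUE ) )
```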

We can compare the running time to running in serial (i.e., using only one worker):

tictoc::tic()
params$res2 = dplyr::select(params, n:seed) %>%
  pmap(.f = run_alpha_sim)
tictoc::toc()

(The select command drops the res column from the parallel run; it would otherwise be passed as a parameter to run_alpha_sim, which would in turn cause an error due to the unrecognized argument.)

18.2 Parallel on a virtual machine

Your laptop probably has around eight cores, meaning you can get about an eight-fold speed-up. But wouldn’t it be nice to have a computer with 50 cores? Or even more? You can get one!

Cloud services, such as Amazon Web Services (AWS), can give you a virtual machine with many cores (where many is usually 50 or so). Some of these services give you what is effectively an RStudio session, so you can run your scripts and everything just like we have discussed above, with no change.

The Data Colada blog advocates for an alternative to AWS, Kamatera (https://www.kamatera.com). The Data Colada folks say “it is much easier to use, way faster to set up, and flexible (e.g., you can easily update R and R Studio in it).” See Data Colada’s post for more information.

This kind of cloud computing is relatively straightforward, once you understand parallel computation on your own computer. But you can go further, dispatching a very large number of computers to your task. We discuss this last option next.

18.3 Parallel on a cluster

In general, a “cluster” is a system of computers connected up to form a large distributed network that many different people can use to do large computational tasks (e.g., simulations!). These clusters will have some overarching coordinating programs that you, the user, will interact with to set up a “job,” or set of jobs, which is a set of tasks you want some number of the computers on the cluster to do for you in tandem.

These coordinating programs will differ, depending on what cluster you are using, but have some similarities that bear mention. For running simulations, you only need the smallest amount of knowledge about how to engage with these systems, because you do not need the individual computers working on your project to communicate with each other (which is the hard part of distributed computing, in general).

18.3.1 Command-line interfaces and CMD mode in R

In the good ol’ days, when things were simpler, yet more difficult, you would interact with your computer via a “command-line interface.” The easiest way to think about this is as an R console, but in a different language that the entire computer speaks. A command-line interface is designed to do things such as find files with a specific name, or copy entire directories, or, importantly, start different programs. One place you may have used a command-line interface is when working with Git: anything fancy with Git is often done via the command line. People will talk about a “shell” (a generic term for this computer interface) or “bash” or “csh.” You can get access to a shell from within RStudio by clicking on the “Terminal” tab. Try it, if you have never done anything like this before, and type (the “>” is not part of what you type):

> ls

It should list some file names. Note this command does not have parentheses after it, unlike commands in R or most other programming languages. The syntax of a shell is usually mystifying and brutal: it is best to just steal scripts from the internet or some generative AI and try not to think about it too much, unless you want to think about it a lot.

Importantly for us, from the command line interface you can start an R program, telling it to start up and run a script for you. This way of running R is noninteractive: you say “go do this thing,” and R starts up, goes and does it, and then quits. Any output R generates on the way will be saved in a file, and any files you save along the way will also be at your disposal once R has completed.

To see this in action make the following script in a file called “dumb_job.R”:

library( tidyverse )
cat( "Making numbers\n" )
Sys.sleep(30)
cat( "Now I'm ready\n" )
dat = tibble( A = rnorm( 1000 ), B = runif( 1000 ) * A )
write_csv( dat, file="sim_results.csv" )
Sys.sleep(30)
cat( "Finished\n" )

After you save, open the terminal and type ls to list the files, as above. Do you see your dumb_job.R file? If not, your terminal session is in the wrong directory. In your computer system, files are stored in a directory structure, and when you open a terminal, you are somewhere in that structure.

To find out where, you can type

> pwd

for “Print Working Directory”. Save your dumb job file to wherever the above says. You can also change directories using cd, e.g., cd ~/Desktop/temp means “change directory to the temp folder inside Desktop inside my home directory” (the ~ is shorthand for your home directory). One more useful command is cd .. (go up to the parent directory).

Once you are in the directory with your file, type:

> R CMD BATCH dumb_job.R R_output.txt --no-save

The above command says “Run R” (the first part) in batch mode (the “CMD BATCH” part): source the dumb_job.R script as soon as R starts, save all console output in the file R_output.txt (it will be saved in the current directory where you run the program), and do not save the workspace when finished.

This command should take about a minute to complete, because our script sleeps a lot (the sleep represents your script doing a lot of work, like a real simulation would). Once the command completes (you will see your “>” prompt come back), verify that you have the R_output.txt and the data file sim_results.csv by typing ls. If you open up your Finder or Microsoft equivalent, you can actually see the R_output.txt file appear halfway through, while your job is running. If you open it, you will see the usual header of R loading up, the “Making numbers” comment, and so forth. R saves all the output as it works through your script.

Running R in this fashion is the key element of a basic way of setting up a massive job on the cluster: you launch a bunch of R programs that each run a script like this on different computers in the cluster. They will all save their results to files (with different names, so you do not overwrite your work), and then you will gather these files together to get your final set of results.
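The gathering step is then just reading the individual files back in and stacking them; a minimal sketch, assuming each job saved a tibble of results as an .rds file in a raw_results/ directory (the directory and file names here are hypothetical):

```r
library( tidyverse )

# Find every result file the workers produced:
files = list.files( "raw_results", pattern = "\\.rds$", full.names = TRUE )

# Read each file and stack the pieces into one set of results:
results = files %>%
  map( readRDS ) %>%
  bind_rows()
```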

Small Exercise: Try putting an error in your dumb_job.R script. What happens when you run it in batch mode?

18.3.2 Task dispatchers and command line scripts

In the above, you can run a command on the command line, but the command-line interface will pause while it runs. As you saw, when you hit return on the above R command, the program just sat there for a minute, due to the sleep, before you got your command-line prompt back.

When you properly run a big job (program) on a cluster, it does not quite work that way. You will instead set a program to run, but tell the cluster to run it somewhere else (people might say “run it in the background”). This is good because you get your command-line prompt back, and can use it to tell your computer to do other things, all while the program is running.

There are various methods for starting a job, but they usually boil down to a request from you to some sort of managerial process (a dispatcher) that takes requests and assigns some computer, somewhere, to do them. For a sense of this process, imagine a dispatcher at a taxi company. You call up, ask for a ride, and the dispatcher sends you a taxi. The dispatcher is just fielding requests, assigning them to taxis.

One dispatcher on some clusters is called Slurm (which may or may not be on the cluster you are attempting to use; this is where a lot of this information gets very cluster-specific).

You first set up a script that describes the job to be run. It is like a work request. These scripts are plain text files, such as this example (sbatch_runScript.txt):

#!/bin/bash

# Number of cores requested
#SBATCH -n 32

# Ensure that all cores are on one machine
#SBATCH -N 1

# Upper bound on runtime in minutes
#SBATCH -t 480

# Partition to submit to
#SBATCH -p stats

# Memory per cpu in MB
#SBATCH --mem-per-cpu=1000

# Append to output file, don't truncate
#SBATCH --open-mode=append

# Standard out goes to this file
#SBATCH -o /output/directory/out/%j.out

# Standard err goes to this file
#SBATCH -e /output/directory/out/%j.err

# Email notification options : BEGIN, END, FAIL, ALL
#SBATCH --mail-type=ALL
#SBATCH --mail-user=email@gmail.com

# Load modules for the computing environment
source new-modules.sh
module load gcc/7.1.0-fasrc01
module load R

# Set user R library path
export R_LIBS_USER=$HOME/apps/R:$R_LIBS_USER

# Run R script; write output to indexed log file
R CMD BATCH estimator_performance_simulation.R \
                 logs/R_output_${INDEX_VAR}.txt \
                 --no-save --no-restore

This file starts by setting a bunch of variables that describe how sbatch should handle the request. It then has a series of commands that gets the computer environment ready. Finally, it has the R CMD BATCH command that does the work you want.

These scripts can be quite confusing to understand. There are so many options! What do these things even do? The answer is, for researchers early on their journey to do this kind of work, “Who knows?” The general rule is to find a working example file for the system you are on, and then modify it for your own purposes.

Once you have such a file, you could run it on the command line, like this:

sbatch -o stdout.txt \
        --job-name=my_script \
        sbatch_runScript.txt

When you do this, the terminal will not sit there and wait for the job to be done. The sbatch command will instead send the job off to some computer, which will do the work in the background.

Interestingly, your R script could, at this point, do the “one computer” parallel type code listed above and use future_pmap() or similar. Note the script above asks for 32 cores; your single job could then have 32 cores all working away on their individual pieces of the simulation, as before. You would have a 32-fold speedup, in this case.

This is the core element to having your simulation run on a cluster. The next step is to do this a lot, sending off a bunch of these jobs to different computers.

18.3.3 Moving files to and from a cluster

One of the larger issues with working on a cluster is that you will need to get all your code to the cluster and all of your results back. One approach for getting the code to the cluster is to use GitHub: make a project, and then clone the project on the cluster. This has the advantage of making it easier to develop your code on your local computer, where you can easily run RStudio or another IDE. It also allows you to record which version of the code you ran, so you know exactly which code produced which results.

For getting your results off the cluster, it is not recommended to use GitHub, as the total of all of your results may take many gigabytes of space. Results files are also not really in the spirit of a version control system. Instead, use an scp client such as FileZilla to transfer files back to your personal computer.

Overall, downloading your simulation results outside of Git, and keeping your code in Git, is a good rule of thumb.
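If you prefer the command line to a graphical client, the scp program does the same job; an illustrative command, where the username, host, and paths are hypothetical and will differ on your cluster:

```
scp 'myuser@cluster.university.edu:sim_project/raw_results/*.rds' raw_results/
```

The quotes keep your local shell from expanding the *, so the remote machine does the file matching instead.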

18.3.4 Checking on a job

Once your job is working on the cluster, it will keep at it until it finishes (or crashes, or is terminated for taking up too much memory or time). As it chugs away, there will be different ways to check on it. For example, you can, from the console, list the jobs you have running to see what is happening:

sacct -u lmiratrix

except, of course, “lmiratrix” would be changed to whatever your username is. This will list whether your job is running, pending, timed out, etc. If it is pending, that usually means someone else is hogging space on the cluster and your job request is in a queue waiting to be assigned.

The sacct command is customizable, e.g.,

sacct -u lmiratrix --format=JobID,JobName%30,State

will not truncate your job names, so you can find them more easily.

You can check on a specific job, if you know the ID:

squeue -j JOBID

Something fun: you can check who is running jobs on the stats partition by typing:

showq-slurm -p stats -o

You can also look at the log files

tail my_log_file.log

to see if it is logging information as it is working.

The email arguments above cause the system to email you when the job starts and when it completes. The notification options you can choose are BEGIN, END, FAIL, and ALL; ALL is generally good. What are a few more emails?

18.3.5 Running lots of jobs on a cluster

We have seen how to fire off a job (possibly a big job) that can run over days or weeks to give you your results. There is one more piece that can allow you to use even more computing resources to do things even faster, which is to do a whole bunch of job requests like the above, all at once. This multiple dispatching of sbatch commands is the final component for large simulations on a cluster: you are setting in motion a bunch of processes, each set to a specific task.

Asking for multiple, smaller jobs is also kinder to the cluster than having one giant job that goes on for a long time. By dividing a job into smaller pieces, and asking the scheduler to schedule those pieces, you let the scheduler share and allocate resources between you and others more fairly. It can make a list of your jobs and farm them out as it has space. This might even go faster for you: with a really big job, the scheduler cannot allocate it until the needed number of workers is available, while with smaller jobs, you can take a lot of little spaces to get your work done. Especially since simulation is so independent (just doing the same thing over and over), there is rarely any need for one giant process that has to do everything.

To make multiple, related requests, we create a for-loop in the Terminal that makes a whole series of sbatch requests. Then, each sbatch request will do one part of the overall simulation. We can write this program in the shell, just like you can write R scripts in R. A shell script runs a bunch of shell commands for you, and can even have variables and loops and all of that fun stuff.

For example, the following run_full_simulation.sh is a script that fires off a bunch of jobs for a simulation. Note that it makes a variable INDEX_VAR, and sets up a loop so it can run 500 tasks indexed 1 through 500.

The first export line adds a collection of R libraries to the path stored in R_LIBS_USER (a “path” is a list of places where R will look for libraries). The next line sets up a for loop: it will run the indented code once for each number from 1 to 500. The script also specifies where to put log files and names each job with the index so you can know who is generating what file.

export R_LIBS_USER=$HOME/apps/R:$R_LIBS_USER

for INDEX_VAR in $(seq 1 500); do

  #print out indexes
  echo "${INDEX_VAR}"

  #give indexes to R so it can find them.
  export INDEX_VAR 

  #Run R script, and produce output files
  sbatch -o logs/sbout_p${INDEX_VAR}.stdout.txt \
        --job-name=runScr_p${INDEX_VAR} \
        sbatch_runScript.txt
  
  sleep 1 # pause to be kind to the scheduler

done

One question is then: how do the different processes know what part of the simulation they should be working on? E.g., each worker needs its own seed so it does not do exactly the same simulation as a different worker! The workers also need their own filenames so they save results in their own files. The key is the export INDEX_VAR line: this puts a variable in the environment that will be set to a specific number. Inside your R script, you can get that index like so:

index <- as.numeric(as.character(Sys.getenv("INDEX_VAR")))

You can then use the index to make unique filenames when you save your results, so each process has its own filename:

filename = paste0( "raw_results/simulation_results_", index, ".rds" )

You can also modify your seed such as with:

factors = mutate( factors,
                  seed = 1000 * seed + index )

Now even if you have a series of seeds within the simulation script (as we have seen before), each script will have unique seeds not shared by any other script (assuming you have fewer than 1000 separate job requests).

This still does not exactly answer how each worker knows what to work on. Consider the case of our multifactor experiment, where we have a large combination of simulation trials we want to run.

There are two approaches one might use here. One simple approach is the following: generate all the factors with expand_grid() as usual, and then take the row of this grid that corresponds to our index.

sim_factors = expand_grid( ... )
index <- as.numeric(as.character(Sys.getenv("INDEX_VAR")))
filename = paste0( "raw_results/simulation_results_", index, ".rds" )

stopifnot( index >= 1 && index <= nrow(sim_factors ) )
do.call( my_sim_function, sim_factors[ index, ] )

The do.call() command runs the simulation function, passing all the arguments listed in the targeted row. You then need to make sure you have your shell call the right number of workers to run your entire simulation.

One problem with this approach is some simulations might be a lot more work than others: consider your simulation with a huge sample size vs. one with a small sample size. Instead, you can have each worker run a small number of simulations of each scenario, and then stack your results later. E.g.,

sim_factors = expand_grid( ... )
index <- as.numeric(as.character(Sys.getenv("INDEX_VAR")))
sim_factors$seed = 1000000 * index + 17 * 1:nrow(sim_factors)

and then do your usual pmap call with R = 10 (or some other small number of replicates).
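A minimal sketch of this second approach, continuing the names above (run_alpha_sim() is assumed, for illustration, to accept the number of replicates as an argument R):

```r
library( tidyverse )

# Each worker runs a few replicates of every scenario:
sim_factors$res = pmap( sim_factors, run_alpha_sim, R = 10 )

# Save under a worker-specific filename so jobs do not overwrite each other:
saveRDS( sim_factors,
         file = paste0( "raw_results/simulation_results_", index, ".rds" ) )
```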

For saving files and then loading and combining them for analysis, see Section 17.4.

18.3.6 Resources for Harvard’s Odyssey

The above guidance is tailored primarily for Harvard’s computing environment. For that environment in particular, there are many additional resources, such as:

For installing R packages so they are seen by the scripts run by sbatch, see https://www.rc.fas.harvard.edu/resources/documentation/software-on-odyssey/r/

Other clusters should have similar documents giving needed guidance for their specific contexts.

18.3.7 Acknowledgements

Some of the above material is based on tutorials built by Kristen Hunter and Zach Branson, past doctoral students of Harvard’s statistics department.