Chapter 20 Parallel Processing
Especially if you take our advice of “when in doubt, go more general” and if you calculate Monte Carlo standard errors, you will quickly come up against the limits of your computer. Simulations can be incredibly computationally intensive, and there are a few means for dealing with that. The first, touched on at times throughout the book, is to optimize one's code by looking for ways to remove extraneous calculations (e.g., by writing one's own methods rather than using the safety-checking and thus sometimes slower methods in R, or by saving calculations that are shared across different estimation approaches). The second is to use more computing power. This latter approach is the topic of this chapter.
There are two general ways to do parallel calculation. The first is to take advantage of the fact that most modern computers have multiple cores (in effect, several processors) built in. With this approach, we tell R to use more of the processing power of your desktop or laptop. If your computer has eight cores, you can easily get a near eight-fold increase in the speed of your simulation.
The second is to use cloud computing, or to compute on a cluster. A computing cluster is a network of hundreds or thousands of computers, along with tools that let you break a simulation into pieces and send the pieces to your army of computers. Conceptually, this is the same as when you do baby parallel on your desktop: more cores means more simulations per minute and thus a faster simulation overall. But the interface to a cluster can be a bit tricky, and very cluster-dependent.
But once you get it up and running, it can be a very powerful tool. First, it takes the computing off your computer entirely, making it easier to set up a job to run for days or weeks without making your day-to-day life any more difficult. Second, it potentially gives you hundreds of cores, which means a speed-up by a factor of hundreds rather than four or eight.
Simulations are a very natural choice for parallel computation. With a multifactor experiment it is very easy to break the overall simulation apart into pieces. For example, you might send each factor combination to a single machine. Even without multifactor experiments, due to the cycle of “generate data, then analyze,” it is easy to have a bunch of computers doing the same thing, with a final collection step where all the individual iterations are combined at the end.
20.1 Parallel on your computer
Most modern computers have multiple cores, so you can run a parallel simulation right in the privacy of your own home!
To assess how many cores you have on your computer, you can use the detectCores()
method in the parallel
package:
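library( parallel )
detectCores()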
## [1] 12
Normally, unless you tell it to do otherwise, R only uses one core.
This is obviously a bit lazy on R’s part.
But it is easy to take advantage of multiple cores using the future
and furrr
packages.
In particular, the furrr
package replicates our map
functions, but in parallel.
We first tell our R session what kind of parallel processing we want using the future
package.
In general, using plan(multisession)
is the cleanest: it will start one entire R session per core, and have each session do work for you.
The alternative, multicore
, is not supported on Windows and is discouraged when running inside RStudio.
The call is simple:
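library( future )
plan( multisession, workers = parallel::detectCores() - 1 )   # one R session per core, leaving one core free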
The workers
parameter specifies how many of your cores you want to use.
Using all but one will let your computer still operate mostly normally for checking email and so forth.
You are carving out a bit of space for your own adventures.
Once you set up your plan, you use future_pmap()
; it works just like pmap()
but evaluates across all available workers specified in the plan call.
Here we are running a parallel version of the multifactor experiment discussed in Chapter @ref(exp_design) (see Chapter @ref(case_Cronback) for the simulation itself).
tictoc::tic()
params$res = future_pmap( params,
                          .f = run_alpha_sim,
                          .options = furrr_options(seed = NULL) )
tictoc::toc()
Note the .options = furrr_options(seed = NULL)
part of the argument.
This is to silence some warnings.
Given how tasks are handed out, R will get upset if you don’t do some handholding regarding how it should set seeds for pseudorandom number generation.
In particular, if you don’t set the seed, the multiple sessions could end up having the same starting seed and thus run the exact same simulations (in principle).
We have seen before how to set a specific seed for each simulation scenario, but furrr
doesn’t know we have done this.
This is why we include the extra argument about seeds: we are being explicit that we are handling seed setting on our own.
We can compare the running time to running in serial (i.e. using only one worker):
tictoc::tic()
params$res2 = dplyr::select( params, n:seed ) %>%
  pmap( .f = run_alpha_sim )
tictoc::toc()
(The select
command is to drop the res
column from the parallel run; it would otherwise be passed as a parameter to run_alpha_sim
which would in turn cause an error due to the unrecognized parameter.)
20.2 Parallel off your computer
In general, a “cluster” is a system of computers connected to form a large distributed network that many different people can use to do large computational tasks (e.g., simulations!). These clusters will have some overarching coordinating programs that you, the user, will interact with to set up a “job,” or set of jobs, which is a set of tasks you want some number of the computers on the cluster to do for you in tandem.
These coordinating programs will differ, depending on what cluster you are using, but they have some similarities that bear mention. For running simulations, you only need the smallest amount of knowledge about how to engage with these systems, because you don’t need the individual computers working on your project to communicate with each other (which is the hard part of distributed computing, in general).
20.2.1 What is a command-line interface?
In the good ol’ days, when things were simpler, yet more difficult, you would interact with your computer via a “command-line interface.” The easiest way to think about this is as an R console, but in a different language that the entire computer speaks. A command-line interface is designed to do things like find files with a specific name, or copy entire directories, or, importantly, start different programs. Another place you may have used a command-line interface is when working with Git: anything fancy with Git is often done via the command line. People will talk about a “shell” (a generic term for this computer interface) or “bash” or “csh.” You can get access to a shell from within RStudio by clicking on the “Terminal” tab. Try it, if you’ve never done anything like this before, and type
ls
It should list some file names. Note this command does not have parentheses after it, the way function calls do in R and most other programming languages. The syntax of a shell is usually mystifying and brutal: it is best to just steal scripts from the internet and try not to think about it too much, unless you want to think about it a lot.
Importantly for us, from the command line interface you can start an R program, telling it to start up and run a script for you. This way of running R is noninteractive: you say “go do this thing,” and R starts up, goes and does it, and then quits. Any output R generates on the way will be saved in a file, and any files you save along the way will also be at your disposal once R has completed.
To see this in action make the following script in a file called “dumb_job.R”:
library( tidyverse )
cat( "Making numbers\n" )
Sys.sleep(30)
cat( "Now I'm ready\n" )
dat = tibble( A = rnorm( 1000 ), B = runif( 1000 ) * A )
write_csv( dat, file="sim_results.csv" )
Sys.sleep(30)
cat( "Finished\n" )
Then open the terminal and type (the “>” is not part of what you type):
> ls
Do you see your dumb_job.R
file? If not, your terminal session is in the wrong directory.
In your computer system, files are stored in a directory structure, and when you open a terminal, you are somewhere in that structure.
To find out where, you can type
> pwd
for “Print Working Directory”.
Save your dumb job file to wherever the above says.
You can also change directories using cd
, e.g., cd ~/Desktop/temp
means “change directory to the temp folder inside Desktop inside my home directory” (the ~
is shorthand for home directory).
One more useful command is cd ..
(go up to the parent directory).
Once you are in the directory with your file, type:
> R CMD BATCH dumb_job.R R_output.txt --no-save
The above command says “Run R” (the first part) in batch mode (the “CMD BATCH” part), meaning source the dumb_job.R
script as soon as R starts, saving all console output in the file R_output.txt
(it will be saved in the current directory where you run the program), without saving the workspace when R finishes (that is the --no-save flag).
This command should take about a minute to complete, because our script sleeps a lot (the sleep represents your script doing a lot of work, like a real simulation would do).
Once the command completes (you will see your “>” prompt come back), verify that you have the R_output.txt
and the data file sim_results.csv
by typing ls
.
If you open up your Finder or its Microsoft equivalent, you can actually see the R_output.txt
file appear half-way through, while your job is running.
If you open it, you will see the usual header of R telling you what it is loading, the “Making numbers” comment, and so forth.
R is saving everything as it works through your script.
Running R in this fashion is the key element of a basic strategy for setting up a massive job on the cluster: you will have a bunch of R programs all “going and doing something” on different computers in the cluster. They will all save their results to files (make sure the files have different names, or you will not be happy with the end result), and then you will gather these files together to get your final set of results.
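As a preview, here is a minimal sketch of what that collection step might look like, assuming each job saved its results as an .rds file in a raw_results/ folder:
library( tidyverse )
files = list.files( "raw_results", full.names = TRUE )
all_res = map( files, readRDS ) %>%
  bind_rows()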
Small Exercise: Try putting an error in your dumb_job.R
script. What happens when you run it in batch mode?
20.2.2 Running a job on a cluster
In the above, you can run a command on the command line, and the command-line interface will pause while it runs. As you saw, when you hit return with the above R command, the program just sat there for a minute before you got your command-line prompt back, due to the sleep.
When you properly run a big job (program) on a cluster, it doesn’t quite work that way. You will instead set a program to run, but tell the cluster to run it somewhere else (people might say “run in the background”). This is good because you get your command-line prompt back, and can do other things, while the program runs in the background.
There are various methods for doing this, but they usually boil down to a request from you to some sort of managerial process that takes requests and assigns some computer, somewhere, to do them. (Imagine a dispatcher at a taxi company. You call up, ask for a ride, and it sends you a taxi. The dispatcher is just fielding requests, assigning them to taxis.)
For example, one dispatcher is Slurm (which may or may not be on the cluster you are attempting to use; this is where a lot of this information gets very cluster-specific).
You first set up a script that describes the job to be run.
It is like a work request.
This would be a plain text file, such as this example (sbatch_runScript.txt
):
#!/bin/bash
#SBATCH -n 32 # Number of cores requested
#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 480 # Runtime in minutes
#SBATCH -p stats # Partition to submit to
#SBATCH --mem-per-cpu=1000 # Memory per cpu in MB
#SBATCH --open-mode=append # Append to output file, don't truncate
#SBATCH -o /output/directory/out/%j.out # Standard out goes to this file
#SBATCH -e /output/directory/out/%j.err # Standard err goes to this file
#SBATCH --mail-type=ALL # Type of email notification- BEGIN,END,FAIL,ALL
#SBATCH --mail-user=email@gmail.com # Email address
# You might have some special loading of modules in the computing environment
source new-modules.sh
module load gcc/7.1.0-fasrc01
module load R
export R_LIBS_USER=$HOME/apps/R:$R_LIBS_USER
#R file to run, and txt files to produce for output and errors
R CMD BATCH estimator_performance_simulation.R logs/R_output_${INDEX_VAR}.txt --no-save --no-restore
This file starts with a bunch of #SBATCH directives that describe how sbatch should handle the request.
It then has a series of commands that get the computer environment ready.
Finally, it has the R CMD BATCH
command that does the work you want.
These scripts can be quite confusing to understand. There are so many options! What do these things even do? The answer is, for researchers early on their journey to do this kind of work, “Who knows?” The general rule is to find an example file for the system you are working on that works, and then modify it for your own purposes.
Once you have such a file, you could run it on the command line, like this:
sbatch -o stdout.txt \
--job-name=my_script \
sbatch_runScript.txt
When you do this, the terminal will not sit there and wait for the job to be done.
The sbatch
command will instead send the job off to some computer which will do the work in parallel.
Interestingly, your R script could, at this point, do the “one computer” parallel type code listed above.
Note the script above requests 32 cores; your single job could then have 32 cores all working away on their individual pieces of the simulation, as before (e.g., with future_pmap
).
You would have a 32-fold speedup, in this case.
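For example, a minimal sketch of what the R script named in the sbatch file might do (the worker count here is an assumption matching the 32 cores requested above):
library( future )
plan( multisession, workers = 32 )   # match the "#SBATCH -n 32" request above
# ... then call future_pmap() over your parameter grid, exactly as before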
This is the core element to having your simulation run on a cluster. The next step is to do this a lot, sending off a bunch of these jobs to different computers.
Some final tips
Remember to save a workspace or RDS!! Once you tell Odyssey to run an R file, it, well, runs the R file. But, you probably want information after it’s done - like an R object or even an R workspace. For any R file you want to run on Odyssey, remember at the end of the R file to put a command to save something after everything else is done. If you want to save a bunch of R objects, an R workspace might be a good way to go, but those files can be huge. A lot of times I find myself wanting only one or two R objects, and RDS files are a lot smaller.
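For example, a minimal sketch (results here is a stand-in for whatever object your script produces):
# At the very end of your simulation script:
saveRDS( results, file = "sim_results.rds" )
# Later, back on your own computer:
results = readRDS( "sim_results.rds" )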
Moving files from a cluster to your computer. You will need to first upload your files and code to the cluster, and then, once you’ve saved your workspace/RDS, you need those back on your computer. Using an SFTP client such as FileZilla is an easy way to do this file-transfer work. You can also use a Git repo for the code, but checking in the simulation results is not generally advisable: they are big, and not really in the spirit of a version control system. Downloading your simulation results outside of Git, and keeping your code in Git, is a good rule of thumb.
20.2.3 Checking on a job
Once your job is working on the cluster, it will keep at it until it finishes (or crashes, or is terminated for taking up too much memory or time). As it chugs away, there will be different ways to check on it. For example, you can, from the console, list the jobs you have running to see what is happening:
sacct -u lmiratrix
except, of course, “lmiratrix
” would be changed to whatever your username is.
This will list whether your job is running, pending, timed out, etc. If it is pending, that usually means someone else is hogging space on the cluster and your job request is in a queue waiting to be assigned.
The sacct
command is customizable, e.g.,
sacct -u lmiratrix --format=JobID,JobName%30,State
will not truncate your job names, so you can find them more easily.
You can check on a specific job, if you know the ID:
squeue -j JOBID
Something that’s fun is you can check who’s running jobs on the stats server by typing:
showq-slurm -p stats -o
You can also look at the log files
tail my_log_file.log
to see if it is logging information as it is working.
The email arguments, above, cause the system to email you when the job starts and when it finishes (or fails).
The email notifications you can choose are BEGIN
, END
, FAIL
, and ALL
; ALL
is generally good. What is a few more emails?
20.2.4 Running lots of jobs on a cluster
We have seen how to fire off a job (possibly a big job) that can run over days or weeks to give you your results. There is one more piece that can allow you to use even more computing resources to do things even faster, which is to do a whole bunch of job requests like the above, all at once. This multiple dispatching of sbatch commands is the final component for large simulations on a cluster: you are setting in motion a bunch of processes, each set to a specific task.
Asking for multiple, smaller, jobs is also nicer for the cluster than having one giant job that goes on for a long time. By dividing a job into smaller pieces, and asking the scheduler to schedule those pieces, you can let the scheduler share and allocate resources between you and others more fairly. It can make a list of your jobs, and farm them out as it has space. This might go faster for you; with a really big job, the scheduler can’t even allocate it until the needed number of workers is available. With smaller jobs, you can take a lot of little spaces to get your work done. Especially since simulation is so independent (just doing the same thing over and over) there is rarely any need for one giant process that has to do everything.
To make multiple, related, requests, we create a for-loop in the Terminal that makes a whole series of sbatch requests. Then, each sbatch request will do one part of the overall simulation. We can write this program in the shell, just like you can write R scripts in R. A shell script does a bunch of shell commands for you, and can even have variables and loops and all of that fun stuff.
For example, the following run_full_simulation.sh
is a script that fires off a bunch of jobs for a simulation.
Note that it makes a variable INDEX_VAR
, and sets up a loop so it can run 500 tasks indexed 1 through 500.
The first export
line adds a collection of R libraries to the path stored in R_LIBS_USER
(a “path” is a list of places where R will look for libraries).
The next line sets up a for loop: it will run the indented code once for each number from 1 to 500.
The script also specifies where to put log files and names each job with the index so you can know who is generating what file.
export R_LIBS_USER=$HOME/apps/R:$R_LIBS_USER
for INDEX_VAR in $(seq 1 500); do
#print out indexes
echo "${INDEX_VAR}"
#give indexes to R so it can find them.
export INDEX_VAR
#Run R script, and produce output files
sbatch -o logs/sbout_p${INDEX_VAR}.stdout.txt \
--job-name=runScr_p${INDEX_VAR} \
sbatch_runScript.txt
sleep 1 # pause to be kind to the scheduler
done
One question is then how do the different processes know what part of the simulation they should be working on?
E.g., each worker needs to have its own seed so it doesn’t do exactly the same simulation as a different worker!
The workers also need their own filenames so they save things in their own files.
The key is the export INDEX_VAR
line: this puts a variable in the environment that will be set to a specific number.
Inside your R script, you can get that index like so:
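# Read the index exported by the shell script:
index <- as.numeric( as.character( Sys.getenv( "INDEX_VAR" ) ) )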
You can then use the index to make unique filenames when you save your results, so each process has its own filename:
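filename = paste0( "raw_results/simulation_results_", index, ".rds" )   # a filename unique to this job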
You can also modify your seed such as with:
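# One possible scheme (a sketch): offset the seeds already in your parameter
# grid by the job index. Multiplying by 1000 keeps the seeds distinct across
# jobs as long as you have fewer than 1000 job requests.
params$seed = params$seed * 1000 + index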
Now even if you have a series of seeds within the simulation script (as we have seen before), each script will have unique seeds not shared by any other script (assuming you have fewer than 1000 separate job requests).
This still doesn’t exactly answer how to have each worker know what to work on. Consider the case of our multifactor experiment, where we have a large combination of simulation trials we want to run.
There are two approaches one might use here.
One simple approach is the following: we first generate all the factors with expand_grid()
as usual, and then we take the row of this grid that corresponds to our index.
sim_factors = expand_grid( ... )
index <- as.numeric(as.character(Sys.getenv("INDEX_VAR")))
filename = paste0( "raw_results/simulation_results_", index, ".rds" )
stopifnot( index >= 1 && index <= nrow(sim_factors ) )
res = do.call( my_sim_function, sim_factors[ index, ] )
saveRDS( res, file = filename )   # save this job's results to its own file
The do.call()
command runs the simulation function, passing all the arguments listed in the targeted row.
You then need to make sure your shell script launches the right number of jobs (one per row of sim_factors) to run your entire simulation.
One problem with this approach is that some simulation scenarios might be a lot more work than others: consider a scenario with a huge sample size vs. one with a small sample size. Instead, you can have each worker run a small number of replicates of every scenario, and then stack your results later. E.g.,
sim_factors = expand_grid( ... )
index <- as.numeric(as.character(Sys.getenv("INDEX_VAR")))
sim_factors$seed = 1000000 * index + 17 * 1:nrow(sim_factors)
and then do your usual pmap
call with R = 10
(or some other small number of replicates).
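A minimal sketch of that step, assuming run_alpha_sim takes the number of replicates per scenario as an R argument:
sim_factors$res = pmap( sim_factors, .f = run_alpha_sim, R = 10 )
saveRDS( sim_factors, file = filename )   # this worker's own file, as constructed above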
For saving files and then loading and combining them for analysis, see Section 19.
20.2.5 Resources for Harvard’s Odyssey
The above guidance is tailored primarily to Harvard’s computing environment. For that environment in particular, there are many additional resources, such as:
- Odyssey Guide: https://rc.fas.harvard.edu/resources/odyssey-quickstart-guide/
- R on Odyssey: https://rc.fas.harvard.edu/resources/documentation/software/r/
For installing R packages so they are seen by the scripts run by sbatch, see https://www.rc.fas.harvard.edu/resources/documentation/software-on-odyssey/r/
Other clusters should have similar documents giving needed guidance for their specific contexts.