Best practices for R on HPC

Geert Jan Bex, Ehsan Moravveji, Mariana Montes

2024-12-03

Overview

  • Motivation
  • How to run R on HPC systems
    • Open OnDemand
    • Batch jobs
  • Walltime & memory
  • Efficiency
  • Improving performance
    • Profiling
    • Code optimization
  • Wrap up

Motivation

Why R on HPC?

  • Computation takes lots of time
  • Large dataset, not enough RAM
  • Many different scenarios to compute

Or any combination of these

Is it hard?

Warning

There is a learning curve

Tip

It’s not rocket science

  • Training is available
  • Support is there to help

Running R on HPC systems

Open OnDemand

  • Web-based platform for accessing HPC resources
    • RStudio
    • Jupyter Lab
    • Terminal
    • File browser
  • Convenient for interactive development
  • Very inefficient for computations

Batch jobs

Example “computation” script:

hello_world.R
# Get the hostname
hostname <- Sys.info()["nodename"]

# Print the message
cat("Hello from", hostname, "!\n")

No surprises here

Testing the script

Terminal:

$ module load R/4.4.0-gfbf-2023a
$ Rscript hello_world.R
Hello from tier2-login-p-2 !
1. Load the R version you want to use
2. Run the script

Warning

Don’t run computationally intensive tasks on the login node!!!

Module system

What & why?

  • Different versions of software packages
  • Reproducibility

How?

  • Search for modules: module -r spider ^R/
  • Load a module: module load R/4.4.0-gfbf-2023a
  • List loaded modules: module list
  • Unload all modules: module purge

Job script

hello_world.slurm
#!/usr/bin/env -S bash -l
#SBATCH --account=lpt2_sysadmin
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:02:00
#SBATCH --cluster=wice

module purge &> /dev/null
module load R/4.4.0-gfbf-2023a

Rscript ./hello_world.R
1. Initialize the environment
2. Specify the account for credits
3. Number of nodes
4. Number of tasks
5. Number of CPUs per task
6. Maximum runtime of the job
7. Cluster to run on
8. Unload all modules
9. Load the R version
10. Run the script

Slurm speak

Slurm is

  • resource manager
  • job scheduler

User(s) submit jobs to Slurm

  • Job in queue
  • When resources available, job starts

Efficient use of (very expensive) HPC resources

Submitting a job

$ sbatch hello_world.slurm
Submitted batch job 62136541 on cluster wice
1. Submit the job script hello_world.slurm
2. The job ID is 62136541 (incremental)

Checking job status

Show all your jobs on cluster wice:

$ squeue --cluster=wice
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          62136541      wice hello_wo  mmontes  R       0:00      1 p33c20n3

ST is the job state:

  • PD: pending
  • R: running
  • CG: completing

No jobs shown? All done!

Job output

Saved to file slurm-<jobid>.out:

slurm-62136541.out
SLURM_JOB_ID: 62136541
SLURM_JOB_USER: vsc30140
SLURM_JOB_ACCOUNT: lpt2_sysadmin
SLURM_JOB_NAME: hello_world.slurm
SLURM_CLUSTER_NAME: wice
SLURM_JOB_PARTITION: batch
SLURM_NNODES: 1
SLURM_NODELIST: p33c20n3
SLURM_JOB_CPUS_PER_NODE: 1
Date: Tue Aug  6 08:58:09 CEST 2024
Walltime: 00-00:02:00
========================================================================
Hello from p33c20n3 !
1. Info about the job
2. Output of the script

Runtime parameters

Scenario: run R script with different parameters

$ Rscript hello_world_cla.R  --name "Alice"
Hello Alice !
$ Rscript hello_world_cla.R  --name "Bob"  --count 3
Hello Bob !
Hello Bob !
Hello Bob !

optparse package

hello_world_cla.R
library(optparse)

# Define the options
option_list <- list(
  make_option(c("--name"), type="character", default="World",
              help="Name to greet"),
  make_option(c("--count"), type="integer", default=1,
              help="Number of times to greet")
)
opt <- parse_args(OptionParser(option_list=option_list))

for (i in 1:opt$count) {
  cat("Hello", opt$name, "!\n")
}
1. Load the optparse package
2. Define the options
3. Option for the name, character type, default value “World”
4. Option for the count, integer type, default value 1
5. Parse the arguments
6. Use the arguments in list opt

Job script with parameters

hello_world_cla.slurm
#!/usr/bin/env -S bash -l
#SBATCH --account=lpt2_sysadmin
#SBATCH --time=00:02:00
#SBATCH --cluster=wice

module purge &> /dev/null
module load R/4.4.0-gfbf-2023a

Rscript ./hello_world_cla.R  --name "Bob"  --count 3

File I/O

Easiest option: relative paths

$ Rscript ./weather_analysis.R  \
      --input data/London.csv   \
      --quantity temp_max       \
      --output tempmax_london.csv
1. Input file in the data directory, a subdirectory of the script’s directory
2. Output file in the script’s directory

Warning

Paths are relative to the location you run the script from!

Here: the directory that contains weather_analysis.R

R script for weather analysis

weather_analysis.R
library(optparse)
option_list <- list(
    make_option(c("-i", "--input"), type = "character",
                help = "Path to the input CSV file"),
    make_option(c("-q", "--quantity"), type = "character",
                help = "Name of the quantity to analyze (e.g., temp, tempmax, tempmin)"),
    make_option(c("-o", "--output"), type = "character",
                help = "Path to the output CSV file")
)
opt <- parse_args(OptionParser(option_list = option_list))
data <- read.csv(opt$input)
if (!(opt$quantity %in% colnames(data))) {
    stop(paste("The specified quantity", opt$quantity, "is not present in the data"))
}
data$month <- format(as.Date(data$datetime), "%m")
monthly_avg <- aggregate(data[[opt$quantity]], by = list(data$month), FUN = mean)
colnames(monthly_avg) <- c("Month", "Average")
write.csv(monthly_avg, file = opt$output, row.names = FALSE)

Job script for weather analysis

weather_analysis.slurm
#!/usr/bin/env -S bash -l
#SBATCH --account=lpt2_sysadmin
#SBATCH --time=00:02:00
#SBATCH --cluster=wice

# first clean up your environment, then load R
module purge &> /dev/null
module load R/4.4.0-gfbf-2023a

# define output directory and create it if it does not exist
OUTDIR="output"
mkdir -p $OUTDIR

# define the quantity to analyse
QUANTITY="tempmax"

Rscript ./weather_analysis.R  \
    --input data/London.csv   \
    --quantity "$QUANTITY"    \
    --output $OUTDIR/$QUANTITY.csv

But wait, there is no optparse!

  • Many packages are available through the R-bundle-CRAN module

    $ module load R-bundle-CRAN/2023.12-foss-2023a
  • Not all packages are installed though

  • Install packages in VSC_DATA directory

    • Pure R/no dependencies on external libraries: easy
    • Other packages: more difficult, use modules
  • Packages are specific to R major version/cluster/architecture

Installation procedure

Start interactive job on cluster wice

$ srun --account=lpt2_sysadmin --time=00:30:00 \
      --cluster=wice --partition=batch_sapphirerapids \
      --pty /bin/bash -l

Load

  • R module

    $ module load R/4.4.0-gfbf-2023a
  • or R-bundle-CRAN module

    $ module load R-bundle-CRAN/2023.12-foss-2023a

Prepare environment

Define environment variable

$ export R_LIBS_USER=$VSC_DATA/R/$VSC_ARCH_LOCAL/$EBVERSIONR

Create package directory for cluster/architecture/R version

$ mkdir -p $R_LIBS_USER

Only once per cluster/architecture/R version!

Create .Renviron file in $VSC_HOME

$ echo "R_LIBS_USER=$R_LIBS_USER" > $VSC_HOME/.Renviron

Works only for specified cluster/architecture/R version!

Actual installation

Install the package(s)

$ R
> install.packages("optparse")

Tip

Use renv for managing R environments and dependencies
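
A minimal renv workflow might look like this (an illustrative sketch, run
from your project directory):

install.packages("renv")      # once, into your user library
renv::init()                  # set up a project-local package library
install.packages("optparse")  # now installs into the project library
renv::snapshot()              # record the exact versions in renv.lock

# later, e.g., when moving the project to another machine:
renv::restore()               # reinstall the recorded versions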

If you have trouble, contact the HPC helpdesk and specify

  • packages to install
  • R version
  • cluster and architecture

Walltime & memory

How long will your job run?

  • Specify walltime (--time=HH:MM:SS)
  • But how much?
    • Don’t underestimate!
    • Don’t (massively) overestimate
  • Experiment
    • Start with “small” problem
    • Increase problem size gradually
    • Extrapolate to real problem

Running example

R script dgemm.R

  • takes arguments \(N\) and \(p\)
  • generates matrix \(A \in \mathbb{R}^{N \times N}\), elements normally distributed
  • computes \(A^p\)
  • determines minimum and maximum diagonal element
Rscript dgemm.R  --size 1000  --power 15
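
The script itself is not listed in these slides; a minimal sketch of what
dgemm.R might look like (assuming optparse, as in the earlier examples):

library(optparse)

option_list <- list(
  make_option(c("-s", "--size"), type = "integer", default = 1000,
              help = "Matrix dimension N"),
  make_option(c("-p", "--power"), type = "integer", default = 15,
              help = "Power p to raise the matrix to")
)
opt <- parse_args(OptionParser(option_list = option_list))

# N x N matrix with normally distributed elements
A <- matrix(rnorm(opt$size^2), nrow = opt$size)

# compute A^p by repeated matrix multiplication
result <- A
for (i in seq_len(opt$power - 1)) {
    result <- result %*% A
}

# minimum and maximum diagonal element
cat("min:", min(diag(result)), "max:", max(diag(result)), "\n")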

Benchmark job script

memory_and_walltime.slurm
#!/usr/bin/env -S bash -l
#SBATCH --account=lpt2_sysadmin
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00:00
#SBATCH --partition=batch_sapphirerapids
#SBATCH --cluster=wice

module load R/4.4.0-gfbf-2023a
module load GCCcore/12.3.0

Rscript dgemm.R --size $SIZE --power $POWER
Note the SIZE and POWER environment variables

Running benchmark

Submit with a range of values for SIZE and POWER, e.g.,

$ sbatch  --export=ALL,SIZE=5000,POWER=10  memory_and_walltime.slurm
$ sbatch  --export=ALL,SIZE=5000,POWER=20  memory_and_walltime.slurm
$ sbatch  --export=ALL,SIZE=5000,POWER=40  memory_and_walltime.slurm
$ sbatch  --export=ALL,SIZE=10000,POWER=10  memory_and_walltime.slurm
$ sbatch  --export=ALL,SIZE=10000,POWER=20  memory_and_walltime.slurm
$ sbatch  --export=ALL,SIZE=10000,POWER=40  memory_and_walltime.slurm
...

When a job, e.g., 62199204, has finished, use sacct

$ sacct  --cluster=wice  --jobs=62199204.batch  --format=jobid,elapsed

Walltime benchmark results

job ID      \(N\)   \(p\)   walltime
62199204     5000    10     00:02:17
62199205     5000    20     00:04:36
62199206     5000    40     00:08:50
62199207    10000    10     00:15:10
62199208    10000    20     00:30:57
62199209    10000    40     01:00:15
62199210    20000    10     00:04:19
62199211    20000    20     00:04:24
62199212    20000    40     00:04:34
62199213    50000    10     00:00:55
62199214    50000    20     00:00:45
62199215    50000    40     00:00:44

  • \(T(N, p) \propto p\)
  • \(T(N, p) \propto N^3\)
  • What happened for \(N = 10000\), \(p = 40\)?
  • What happened for \(N = 20000\) and \(N = 50000\)?

Out of time

Inspect the output of job 62199209

...
Walltime: 00-01:00:00
========================================================================

Lmod is automatically replacing "cluster/genius/login" with
"cluster/wice/batch_sapphirerapids".

slurmstepd: error: *** JOB 62199209 ON q16c03n1 CANCELLED
AT 2024-08-13T18:28:16 DUE TO TIME LIMIT ***
Oopsie: the job's runtime exceeded the requested walltime

Warning

No output, 1 hour of compute time lost!

What about \(N = 20000\) and \(N = 50000\)?

job ID      \(N\)   \(p\)   walltime
62199210    20000    10     00:04:19
62199211    20000    20     00:04:24
62199212    20000    40     00:04:34
62199213    50000    10     00:00:55
62199214    50000    20     00:00:45
62199215    50000    40     00:00:44

...
Walltime: 00-01:00:00
========================================================================
...
slurmstepd: error: Detected 1 oom_kill event in StepId=62199210.batch.
Some of the step tasks have been OOM Killed.
Oopsie: Out Of Memory (OOM)

Memory

Check the node's RAM in the documentation, typically 256 GB; part of it is reserved for the operating system

  • Total memory for job
    • specify with --mem, e.g.,

      #SBATCH --mem=10G
    • --mem < RAM - 8 GB

  • Memory per CPU
    • specify with --mem-per-cpu, e.g.,

      #SBATCH --mem-per-cpu=5G
    • --mem-per-cpu \(\times\) --cpus-per-task \(\times\) --ntasks < RAM - 8 GB

Units: K, M, G, T
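
For example, a job with --ntasks=1, --cpus-per-task=4 and --mem-per-cpu=5G
requests \(1 \times 4 \times 5\) GB = 20 GB in total, comfortably within the
limit on a 256 GB node.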

How much memory do you need?

  • Specify memory (e.g., --mem=30G)
  • But how much?
    • Don’t underestimate!
    • Don’t (massively) overestimate
  • Experiment
    • Start with “small” problem
    • Increase problem size gradually
    • Extrapolate to real problem

How much was used?

Submit with a range of values for SIZE and POWER, e.g.,

$ sbatch  --export=ALL,SIZE=5000,POWER=10  memory_and_walltime.slurm
$ sbatch  --export=ALL,SIZE=5000,POWER=20  memory_and_walltime.slurm
...

When a job, e.g., 62199204, has finished, use sacct

$ sacct  --cluster=wice  --jobs=62199204.batch  \
         --format=jobid,maxrss,maxvmsize

Tip

Combine the walltime and memory benchmarks!

Memory benchmark results

job ID      \(N\)   \(p\)   \(M\) (GB)
62204487     5000    10      0.82
62204488     5000    20      1.01
62204489     5000    40      0.82
62204490    10000    10      3.81
62204491    10000    20      3.82
62204492    10000    40      3.82
62204493    20000    10     14.61
62204494    20000    20     12.03
62204495    20000    40     12.03

  • \(M(N, p) \propto N^2\)
  • \(M(N, p)\) constant in \(p\)

Summary walltime & memory

Do not underestimate!

Warning

You will waste resources

Benchmark for walltime and memory

Warning

Unless the cost of benchmarking exceeds the cost of the production runs

Efficiency

Implicit parallelism

Some R functions/operations are parallel under the hood

power <- 15
A <- matrix(rnorm(25), nrow = 5)

# compute A^power by repeated matrix multiplication
result <- A
for (i in 1:(power - 1)) {
    result <- result %*% A
}

How do you know?

  • Check the documentation
  • Measure it

Benchmark environment

Experiments done on Sapphire Rapids node

  • 2 Intel Xeon Platinum 8468 CPUs, 48 cores each
  • 256 GB RAM

Module R/4.4.0-gfbf-2023a

Parallel or not?

Time the execution of the R script

$ time Rscript dgemm.R &> /dev/null

real    0m4.791s
user    1m21.693s
sys     0m31.345s
1. Real time, i.e., walltime
2. User time, i.e., CPU time used (summed over all threads)

82 seconds CPU time in 4.8 seconds walltime… parallelism!

%*% uses the Basic Linear Algebra Subprograms (BLAS)

BLAS uses multiple threads, each running on a core
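
You can also check and control the BLAS thread count from within R; a small
sketch, assuming the RhpcBLASctl package is installed:

library(RhpcBLASctl)

blas_get_num_procs()     # number of threads the BLAS will use
blas_set_num_threads(4)  # cap BLAS at 4 threads for this session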

How to exploit that?

Easy: set --cpus-per-task > 1

CPUs per task (\(n\))   walltime (\(T_n\))
  1                     109.6
  2                      64.8
  4                      50.5
  8                      40.2
 12                      34.3
 24                      31.5
 48                      31.0
 96                      33.0

Nice!

Speedup

\(S(n) = \frac{T_1}{T_n}\)

\(n\)   \(T_n\)   \(S(n)\)
  1     109.6     1.00
  2      64.8     1.69
  4      50.5     2.17
  8      40.2     2.73
 12      34.3     3.20
 24      31.5     3.48
 48      31.0     3.54
 96      33.0     3.32

Nice? Maybe…

Parallel efficiency

\(E(n) = \frac{T_1}{n \cdot T_n}\)

\(n\)   \(T_n\)   \(S(n)\)   \(E(n)\)
  1     109.6     1.00       1.00
  2      64.8     1.69       0.85
  4      50.5     2.17       0.54
  8      40.2     2.73       0.34
 12      34.3     3.20       0.27
 24      31.5     3.48       0.14
 48      31.0     3.54       0.07
 96      33.0     3.32       0.03

Nice? Not so much… HPC infrastructure is very expensive!

Lots of matrices…

Often the same script runs concurrently for many parameter settings; \(N\) below is the number of simultaneous R instances

\(N\)   \(T_N\)   \(S(N)\)   \(E(N)\)
  1      14.0     1.00       1.00
  2      33.6     0.42       0.21
  4      90.1     0.16       0.04
  8     184.4     0.08       0.01
 12     252.5     0.06       0.005
 18     635.2     0.02       0.001

Oops!?! Worse than serial?

All R instances use 96 threads, so for \(N = 18\) there are \(18 \times 96 = 1728\) threads \(\gg\) number of cores

Warning

Massive oversubscription

Control OMP_NUM_THREADS!

Set OMP_NUM_THREADS=2

\(N\)   \(T_N\)   \(S(N)\)   \(E(N)\)
  1      42.6      1.00      1.00
  4      43.7      3.90      0.98
  8      49.6      6.87      0.86
 12      65.3      7.82      0.65
 24      74.5     13.7       0.57
 36      85.2     18.0       0.50
 48      94.3     21.6       0.45

\(N = 8\) seems optimal, but what the heck, you have the full node anyway

Better go all the way to \(N = 48\)

Tip

number of R instances \(\times\) OMP_NUM_THREADS \(\le\) number of cores
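
In a job script, a common pattern is to tie the thread count to the CPUs
requested from Slurm; a sketch, assuming --cpus-per-task is set:

# match BLAS/OpenMP threads to the CPUs allocated by Slurm
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

Rscript dgemm.R --size 5000 --power 10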

The good news is…

Applications can be

  • Memory bound 😒
  • CPU bound 😊
  • I/O bound 😗

Warning

Benchmark your code!

Why performance degradation?

It’s the memory architecture, stupid!

  • 48 core Intel Sapphire Rapids CPU
    • 256 GB RAM, 8 memory channels at 4800 MHz
    • L3 cache per socket: 105 MB (shared, 2.2 MB per core)
    • L2 cache per CPU (core): 2 MB
    • L1 cache per CPU (core): 48 kB data
  • 36 core Intel Icelake
    • 256 GB RAM, 8 memory channels at 3200 MHz
    • L3 cache per socket: 54 MB (shared, 1.5 MB per core)
    • L2 cache per CPU (core): 1.25 MB
    • L1 cache per CPU (core): 48 kB data

How to time?

  • In job script: time

  • In job script: hyperfine (see the example after this list)

    • Runs command multiple times
    • Reports mean, median, min, max, etc.
    • Can compare multiple commands
  • For completed job

    $ sacct --cluster=<cluster> --format=JobID,TotalCPU,Elapsed --jobs=<job-id>
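
For instance, a hypothetical hyperfine comparison of two thread counts (each
command is run several times and summary statistics are reported):

$ hyperfine --warmup 1 \
      'OMP_NUM_THREADS=1 Rscript dgemm.R --size 1000 --power 15' \
      'OMP_NUM_THREADS=4 Rscript dgemm.R --size 1000 --power 15'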

And another thing…

Reading/writing many small files: bad idea!

Warning

File systems for HPC optimized for large read/write operations, not for many metadata operations!

Use appropriate file formats, e.g.,

  • HDF5
  • Parquet
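
For example, with the arrow package (assuming it is available as a module or
installed as described earlier), the monthly_avg data frame from the weather
analysis could be written as a single Parquet file instead of a CSV:

library(arrow)

write_parquet(monthly_avg, "tempmax_london.parquet")   # write once
monthly_avg <- read_parquet("tempmax_london.parquet")  # read back later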

Improving performance

Where to start?

  • Gut feeling?
  • Best guess?
  • Easy to optimize?

Tip

To measure is to know: use a profiler

Profiling

  • R has Rprof and summaryRprof
    • Easy to use
    • wrap the R script with source
rprof_wrapper.R
Rprof(interval=0.01)
source("dgemm.R")
Rprof(NULL)
summaryRprof()
1. Start the profiler
2. Run the script to profile
3. Stop the profiler
4. Show the profile

Run the script with the profiler

$ Rscript rprof_wrapper.R -p 1

Note

Command-line arguments (here -p 1) are passed through to the sourced script

Profile

$ Rscript rprof_wrapper.R -p 1

                  self.time self.pct total.time total.pct
"rnorm"                0.81    54.36       0.81     54.36
"%*%"                  0.59    39.60       0.59     39.60
"matrix"               0.04     2.68       0.85     57.05
"lazyLoadDBfetch"      0.03     2.01       0.03      2.01
"frameTypes"           0.02     1.34       0.02      1.34
$ Rscript rprof_wrapper.R -p 10

                  self.time self.pct total.time total.pct
"%*%"                  3.11    76.60       3.11     76.60
"rnorm"                0.86    21.18       0.86     21.18
"matrix"               0.05     1.23       0.91     22.41
"lazyLoadDBfetch"      0.02     0.49       0.02      0.49
"doTryCatch"           0.01     0.25       4.05     99.75
"as.list"              0.01     0.25       0.01      0.25

Spot bottlenecks!

Code optimization: algorithm first!

  • %*% is already highly optimized, not much to gain there
  • However, \(A^p = A \cdot \cdots \cdot A\) computed naively requires
    \(p - 1\) multiplications

Think in terms of the algorithm

\[ A^{13} = A \cdot A^4 \cdot A^8 \]

Binary exponentiation: 5 versus 12 multiplications
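
A minimal R sketch of binary exponentiation (mat_pow is a hypothetical
helper, not part of dgemm.R; A is a square matrix, p a positive integer):

# compute A^p using O(log p) matrix multiplications
mat_pow <- function(A, p) {
    result <- NULL
    base <- A
    while (p > 0) {
        if (p %% 2 == 1) {  # this bit of p is set: multiply in base
            result <- if (is.null(result)) base else result %*% base
        }
        p <- p %/% 2
        if (p > 0) {        # square only while bits remain
            base <- base %*% base
        }
    }
    result
}

For \(p = 13\) this performs exactly the 5 multiplications counted above.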

Wrap up

Conclusions

  • Running R on HPC is easy
  • Running R on HPC efficiently… requires care

How to proceed?

If necessary, take a training course or ask for help:

  • HPC intro
  • Linux intro
  • Contact support

Tip

If you are a programmer, consider parallelization