Use Ramble modifier to fill in allocation variables #195

scheibelp · 2024-04-03T06:25:15Z

Closes #178
Fixes #221

Given an experiment that requests resources (nodes, CPUs, GPUs, etc.) and a system description (CPUs per node, GPUs per node, etc.), this PR aims to generate an appropriate scheduler request for those resources. In some cases that means determining values the experiment did not state, such as how many nodes a given benchmark needs.
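To make the idea concrete, here is a minimal sketch of how a node count could be derived from an experiment's request and a system description. It is not the code in modifiers/allocation/modifier.py; the names `SystemDescription`, `nodes_needed`, `cpus_per_node`, `gpus_per_node`, and `cores_per_rank` are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class SystemDescription:
    # Hypothetical field names; the real values come from a system's variables.yaml.
    cpus_per_node: int
    gpus_per_node: int = 0


def nodes_needed(
    system: SystemDescription,
    n_ranks: Optional[int] = None,
    n_gpus: Optional[int] = None,
    cores_per_rank: int = 1,
) -> int:
    """Return the smallest node count that satisfies both the CPU and GPU request."""
    candidates = []
    if n_ranks is not None:
        total_cores = n_ranks * cores_per_rank
        candidates.append(math.ceil(total_cores / system.cpus_per_node))
    if n_gpus is not None:
        if system.gpus_per_node <= 0:
            raise ValueError("experiment requests GPUs but the system has none")
        candidates.append(math.ceil(n_gpus / system.gpus_per_node))
    if not candidates:
        raise ValueError("experiment must request ranks or GPUs")
    # Always request at least one node.
    return max(1, max(candidates))
```

For example, under these assumptions `nodes_needed(SystemDescription(cpus_per_node=44, gpus_per_node=4), n_ranks=8, n_gpus=8)` is driven by the GPU request rather than the rank count, since 8 GPUs at 4 per node need 2 nodes.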

#178 (comment) brings up some more-interesting examples like these, and this PR is an alternative approach.

This requires a newer Ramble than what Benchpark currently uses by default (right now I'm using https://github.com/GoogleCloudPlatform/ramble/pull/452).

./bin/benchpark setup saxpy/openmp nosite-x86_64 `pwd`/test-saxpy

Remaining work:

- Remove all experiment-specific execute_experiment.tpl files
- Implement scheduler definition function for Sierra and Fugaku
- Update all experiment files and all system config files (currently just experiments/saxpy/openmp and configs/nosite-x86_64 are changed to demonstrate the organization)
- All experiment/*/*/ramble.yaml files have been translated, but need further updates to actually describe system resources (e.g. the number of GPUs on each node)
  - (April 22 2024) All LLNL systems are now updated with the number of CPUs/GPUs per node (GPUs only for systems that have them)
  - (April 23 2024) All systems except Eiger are now updated (note that LUMI and Daint have partitions with different types of nodes, and currently the variables only describe one type)
- (May 14 2024) Update CI to do a ramble workspace setup --dry-run of some configs and experiments: this actually runs the modifier defined here to generate batch scripts etc. with all resource requests filled in

Testing:

You can run any one of the following on any system:

./bin/benchpark setup saxpy/openmp nosite-x86_64 <basedir>
./bin/benchpark setup amg2023/cuda LLNL-Sierra-IBM-power9-V100-Infiniband <basedir>
./bin/benchpark setup amg2023/cuda LLNL-Pascal-Penguin-broadwell-P100-OmniPath <basedir>

Each of these prints a ramble workspace setup command to run next. Append --phases make_experiments to the end of that command to skip the concretize/install steps.

Oddities:

(April 23 2024) LUMI/Daint nodes have different characteristics based on what partition you request. For now, I only describe one type of node. I think we can handle this in the future by creating different configs based on what partition the user wants to submit to.
(April 12 2024) The GROMACS execute_experiment.tpl files differ slightly from the others: they have an extra {experiment_setup}. Everything sets that variable to '' though, so I don't see a problem with removing these files as well.
(EDIT: now resolved) Some values must be defined before the modifier runs, e.g. n_ranks. I've arbitrarily decided the placeholder value for these is "7": they must be positive integers, so I chose a number that is (a) unlikely to be explicitly chosen and (b) small, in case it percolates to actual requests.
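For context on the placeholder oddity, this is a rough sketch of what treating "7" as "not set by the user" could look like. It is only an illustration of the trade-off described above, not the modifier's actual logic; `PLACEHOLDER` and `resolve` are hypothetical names.

```python
# Hypothetical sentinel: a positive integer standing in for "not set by the user".
PLACEHOLDER = 7


def resolve(requested, computed):
    """Prefer an explicit request; fall back to the computed value otherwise."""
    if requested is None or requested == PLACEHOLDER:
        return computed
    return requested
```

The weakness of this kind of sentinel is that an experiment that genuinely asks for 7 ranks is indistinguishable from one that left the value unset, which is why the value had to be one that is unlikely to be chosen explicitly.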

modifiers/allocation/modifier.py

configs/nosite-x86_64/variables.yaml

experiments/saxpy/openmp/execute_experiment.tpl

scheibelp · 2024-04-04T06:08:30Z

Example script generated from experiments/saxpy/openmp and configs/nosite-x86_64 (tweaked to assume slurm, for a more interesting output):

#SBATCH -n 8
#SBATCH -N 1
#SBATCH --time 120

cd <benchpark-prefix>/test-saxpy-oslic-new/saxpy/openmp/nosite-x86_64/workspace/experiments/saxpy/problem/saxpy_512_1_2

rm -f "<benchpark-prefix>/test-saxpy-oslic-new/saxpy/openmp/nosite-x86_64/workspace/experiments/saxpy/problem/saxpy_512_1_2/saxpy_512_1_2.out"
touch "<benchpark-prefix>/test-saxpy-oslic-new/saxpy/openmp/nosite-x86_64/workspace/experiments/saxpy/problem/saxpy_512_1_2/saxpy_512_1_2.out"
export OMP_NUM_THREADS="2";
. <benchpark-prefix>/test-saxpy-oslic-new/spack/share/spack/setup-env.sh
spack env activate <benchpark-prefix>/test-saxpy-oslic-new/saxpy/openmp/nosite-x86_64/workspace/software/saxpy.problem
srun -n 8 -N 1 saxpy -n 512 >> "<benchpark-prefix>/test-saxpy-oslic-new/saxpy/openmp/nosite-x86_64/workspace/experiments/saxpy/problem/saxpy_512_1_2/saxpy_512_1_2.out"
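As a rough illustration of how the directives and launch line in the script above could be produced once the rank and node counts are known, here is a small sketch for the Slurm case only. The function names `slurm_directives` and `slurm_launch_prefix` are hypothetical and are not taken from the modifier; the option spellings simply mirror the generated script.

```python
def slurm_directives(n_ranks, n_nodes, timeout_minutes):
    """Return #SBATCH lines shaped like the ones in the generated script."""
    return [
        f"#SBATCH -n {n_ranks}",
        f"#SBATCH -N {n_nodes}",
        f"#SBATCH --time {timeout_minutes}",
    ]


def slurm_launch_prefix(n_ranks, n_nodes):
    """Return the srun prefix placed in front of the benchmark command."""
    return f"srun -n {n_ranks} -N {n_nodes}"


if __name__ == "__main__":
    print("\n".join(slurm_directives(8, 1, 120)))
    print(slurm_launch_prefix(8, 1), "saxpy -n 512")
```

Keeping the batch directives and the launch prefix in separate functions is one way to let both be filled from the same computed values, so the request and the launch line cannot disagree.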

configs/nosite-x86_64/variables.yaml

modifiers/allocation/modifier.py

… from n_gpus

…of gpus they want

modifiers/allocation/modifier.py

…n to avoid doing concretization/install as part of ramble workspace setup)

scheibelp added 3 commits April 2, 2024 16:18

initial modifier

fe09c79

partial work

4b1ad5e

dont mess with locals()

98eba09

scheibelp marked this pull request as draft April 3, 2024 06:25

github-actions bot added the experiment (New or modified experiment) and configs (New or modified system config) labels Apr 3, 2024

changed variable name

068281e

scheibelp commented Apr 3, 2024

modifiers/allocation/modifier.py (outdated, resolved)

scheibelp added 6 commits April 3, 2024 13:24

Able to proceed with Ramble#452; that uncovered a str-to-int conversion issue

d33bc90

remove debugging statements

a5941a0

remove filled-in variable from experiment name

aa0001d

intermediate work on also getting modifier to generate batch submissions

12f092f

finished up work that allows the modifier to define allocations as well

9642b6f

style fix

4bed118

scheibelp commented Apr 4, 2024

configs/nosite-x86_64/variables.yaml (resolved)

refactor away from context manager

c25ad66

scheibelp commented Apr 4, 2024

experiments/saxpy/openmp/execute_experiment.tpl (outdated, resolved)

scheibelp added 2 commits April 3, 2024 22:53

handle flux directives and timeout

10f7d77

remove unused import

e63a142

add references for clarification

a39dbce