Use Ramble modifier to fill in allocation variables #195

Merged Jun 6, 2024 (70 commits)
Changes from 62 commits
Commits (70)
fe09c79
initial modifier
scheibelp Apr 2, 2024
4b1ad5e
partial work
scheibelp Apr 3, 2024
98eba09
dont mess with locals()
scheibelp Apr 3, 2024
068281e
changed variable name
scheibelp Apr 3, 2024
d33bc90
Able to proceed with Ramble#452; that uncovered a str-to-int conversi…
scheibelp Apr 3, 2024
a5941a0
remove debugging statements
scheibelp Apr 3, 2024
aa0001d
remove filled-in variable from experiment name
scheibelp Apr 3, 2024
12f092f
intermediate work on also getting modifier to generate batch submissions
scheibelp Apr 3, 2024
9642b6f
finished up work that allows the modifier to define allocations as well
scheibelp Apr 4, 2024
4bed118
style fix
scheibelp Apr 4, 2024
c25ad66
refactor away from context manager
scheibelp Apr 4, 2024
10f7d77
handle flux directives and timeout
scheibelp Apr 4, 2024
e63a142
remove unused import
scheibelp Apr 4, 2024
a39dbce
add references for clarification
scheibelp Apr 4, 2024
b8ff318
n_threads is not special; also rename it to n_omp_threads_per_task
scheibelp Apr 4, 2024
67c0c15
Merge branch 'develop' into feature/allocation-modifier
scheibelp Apr 10, 2024
3cdb92d
intermediate work
scheibelp Apr 11, 2024
f43c15e
done with doing placeholder inference based on exceeding max-request …
scheibelp Apr 11, 2024
85c1534
env_var_modification needs mode; not sure 100 percent what that shoul…
scheibelp Apr 12, 2024
18caf43
add n_cores_per_node (different than n_cores_per_rank)
scheibelp Apr 12, 2024
8d4a16d
style edits
scheibelp Apr 12, 2024
327f3f0
there can now be one execute_experiment.tpl
scheibelp Apr 12, 2024
6742b81
removal of all individual execute_experiment.tpl files
scheibelp Apr 12, 2024
1f94ff7
update all system configs except Fugaku and Sierra
scheibelp Apr 12, 2024
2742c4d
update all experiments based on (a) new names and (b) logic that fill…
scheibelp Apr 13, 2024
215ec65
style edit
scheibelp Apr 13, 2024
3534ec5
sierra batch/run cmd options implemented
scheibelp Apr 15, 2024
8e50491
add fugaku batch opt generation logic
scheibelp Apr 15, 2024
2738222
replace variables for Sierra and Fugaku
scheibelp Apr 15, 2024
8d3bd24
consolidate variable accessor logic into single class; add explanator…
scheibelp Apr 15, 2024
f7684c5
syntax error
scheibelp Apr 15, 2024
479b5e2
testing+fixing some issues for fugaku
scheibelp Apr 15, 2024
6fb2e32
typos for sierra
scheibelp Apr 15, 2024
224287d
fix sierra reference errors etc. and recognition of 'queue' as variable
scheibelp Apr 15, 2024
0021951
style fix
scheibelp Apr 15, 2024
1d579f5
apply real values to sys_cpus_per_node/sys_gpus_per_node for LLNL sys…
scheibelp Apr 19, 2024
05dd482
the scheduler used for Sierra is called 'lsf', so use that name
scheibelp Apr 19, 2024
51db624
add basic alias substitution logic (omp_num_threads can be used inste…
scheibelp Apr 19, 2024
cca0288
fix alias issue and add comments
scheibelp Apr 20, 2024
3f60df0
style fix
scheibelp Apr 20, 2024
b4a0f05
set appropriate schedulers
scheibelp Apr 20, 2024
3bac54e
scheduler on Fugaku is called 'pjm'
scheibelp Apr 20, 2024
ba68b83
all experiments need to use the allocation modifier
scheibelp Apr 23, 2024
ed025e7
amg2023 benchmark should not be doing requesting any number of ranks/…
scheibelp Apr 23, 2024
92c614c
logic to set n_ranks based on n_gpus (if the latter is set and the fo…
scheibelp Apr 23, 2024
747e473
handle the most common case of gpu specification for Flux, not using …
scheibelp Apr 23, 2024
0d72a13
add docstring
scheibelp Apr 23, 2024
a63fa5a
syntax error
scheibelp Apr 23, 2024
cc46e80
style fix
scheibelp Apr 23, 2024
d9f0b1b
Fugaku system description
scheibelp Apr 23, 2024
e1ff889
LUMI system description
scheibelp Apr 23, 2024
1b9ccb0
add reference link
scheibelp Apr 23, 2024
6122960
Piz Daint system description
scheibelp Apr 23, 2024
53a081c
add reference link
scheibelp Apr 23, 2024
301ba79
partial description of Eiger/Alps
scheibelp Apr 23, 2024
44fdb95
proper detection of unset vars; fixed error w/ calculation of n_nodes…
scheibelp Apr 23, 2024
7a71927
Both flux and lsf want gpus_per_rank
scheibelp Apr 23, 2024
564aba8
style fix
scheibelp Apr 23, 2024
59ec13d
more style fixes
scheibelp Apr 23, 2024
a7950b7
restore default nosite config
scheibelp Apr 23, 2024
7f4e1c5
missed converting input param name
scheibelp May 14, 2024
ca86837
saxpy/raja-perf cuda/rocm experiments should just specify the number …
scheibelp May 14, 2024
efb2f3c
add CI checks to exercise the allocation modifier logic (use --dry-ru…
scheibelp May 14, 2024
3f4113b
sys_cpus_per_node -> sys_cores_per_node
scheibelp May 15, 2024
30d7aa0
intercept divide-by-zero error
scheibelp May 15, 2024
5b7a521
clarify we currently only support lrun and not jsrun
scheibelp May 15, 2024
14681da
style fix
scheibelp May 15, 2024
13a0afa
Merge branch 'develop' into feature/allocation-modifier
pearce8 May 26, 2024
b4ac027
Merge branch 'develop' into feature/allocation-modifier
pearce8 Jun 5, 2024
ae8ec2d
Merge branch 'develop' into feature/allocation-modifier
pearce8 Jun 5, 2024
4 changes: 4 additions & 0 deletions bin/benchpark
@@ -383,6 +383,10 @@ def benchpark_setup_handler(args):
ramble_spack_experiment_configs_dir,
include_fn,
)
os.symlink(
source_dir / "experiments" / "universal-resources" / "execute_experiment.tpl",
ramble_configs_dir / "execute_experiment.tpl",
)

spack_location = experiments_root / "spack"
ramble_location = experiments_root / "ramble"
19 changes: 11 additions & 8 deletions configs/CSC-LUMI-HPECray-zen3-MI250X-Slingshot/variables.yaml
@@ -6,12 +6,15 @@
variables:
gtl_flag: '' # to be overwritten by tests that need GTL
rocm_arch: 'gfx90a'
batch_time: '02:00'
mpi_command: 'srun -N {n_nodes} -n {n_ranks}'
batch_submit: 'sbatch {execute_experiment}'
batch_nodes: '#SBATCH -N {n_nodes}'
batch_ranks: '#SBATCH -n {n_ranks}'
batch_timeout: '#SBATCH -t {batch_time}:00'
cpu_partition: '#SBATCH -p small'
gpu_partition: '#SBATCH -p small-g'
timeout: '120'
scheduler: "slurm"
# This describes the LUMI-G partition: https://docs.lumi-supercomputer.eu/hardware/lumig/
sys_cpus_per_node: "64"
sys_gpus_per_node: "8"
sys_mem_per_node: "512"
max_request: "1000" # n_ranks/n_nodes cannot exceed this
n_ranks: '1000001' # placeholder value
n_nodes: '1000001' # placeholder value
batch_submit: "placeholder"
mpi_command: "placeholder"
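Every `variables.yaml` in this PR follows the same pattern: concrete `#SBATCH`/`mpi_command` lines are replaced by hardware facts (`sys_cpus_per_node`, `max_request`) plus the sentinel `1000001`, which the allocation modifier later overwrites. A sketch of the kind of inference the modifier performs, assuming a ceil-division rule and the divide-by-zero guard from commit 30d7aa0 (the function name and exact behavior are mine):

```python
import math

PLACEHOLDER = 1000001  # sentinel meaning "not set by the experiment"

def derive_n_nodes(n_ranks: int, sys_cores_per_node: int, max_request: int) -> int:
    """Infer a node count from a rank count and per-node core count (sketch)."""
    if sys_cores_per_node <= 0:
        # commit 30d7aa0: intercept divide-by-zero when the system fact is unset
        raise ValueError("sys_cores_per_node is unset or zero")
    if n_ranks >= PLACEHOLDER:
        raise ValueError("n_ranks was never filled in")
    n_nodes = math.ceil(n_ranks / sys_cores_per_node)
    if n_nodes > max_request:
        raise ValueError(f"{n_nodes} nodes exceeds max_request={max_request}")
    return n_nodes
```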

17 changes: 11 additions & 6 deletions configs/CSCS-Daint-HPECray-haswell-P100-Infiniband/variables.yaml
@@ -4,12 +4,17 @@
# SPDX-License-Identifier: Apache-2.0

variables:
batch_time: '02:00'
mpi_command: 'srun -N {n_nodes} -n {n_ranks}'
batch_submit: 'sbatch {execute_experiment}'
batch_nodes: '#SBATCH -N {n_nodes}'
batch_ranks: '#SBATCH -n {n_ranks}'
batch_timeout: '#SBATCH -t {batch_time}:00'
default_cuda_version: '11.2.0'
cuda_arch: '60'
enable_mps: '/usr/tcetmp/bin/enable_mps'
timeout: '120'
scheduler: "slurm"
# This describes the XC50 compute nodes: https://www.cscs.ch/computers/piz-daint
sys_cpus_per_node: "12"
sys_gpus_per_node: "1"
sys_mem_per_node: "64"
max_request: "1000" # n_ranks/n_nodes cannot exceed this
n_ranks: '1000001' # placeholder value
n_nodes: '1000001' # placeholder value
batch_submit: "placeholder"
mpi_command: "placeholder"
16 changes: 10 additions & 6 deletions configs/CSCS-Eiger-HPECray-zen2-Slingshot/variables.yaml
@@ -4,9 +4,13 @@
# SPDX-License-Identifier: Apache-2.0

variables:
batch_time: '00:30'
mpi_command: 'srun -N {n_nodes} -n {n_ranks}'
batch_submit: 'sbatch {execute_experiment}'
batch_nodes: '#SBATCH -N {n_nodes}'
batch_ranks: '#SBATCH -n {n_ranks}'
batch_timeout: '#SBATCH -t {batch_time}:00'
timeout: '30'
scheduler: "slurm"
sys_cpus_per_node: "128"
# sys_gpus_per_node unset
# sys_mem_per_node unset
max_request: "1000" # n_ranks/n_nodes cannot exceed this
n_ranks: '1000001' # placeholder value
n_nodes: '1000001' # placeholder value
batch_submit: "placeholder"
mpi_command: "placeholder"
14 changes: 8 additions & 6 deletions configs/LLNL-Magma-Penguin-icelake-OmniPath/variables.yaml
@@ -4,9 +4,11 @@
# SPDX-License-Identifier: Apache-2.0

variables:
batch_time: '02:00'
mpi_command: 'srun -N {n_nodes} -n {n_ranks}'
batch_submit: 'sbatch {execute_experiment}'
batch_nodes: '#SBATCH -N {n_nodes}'
batch_ranks: '#SBATCH -n {n_ranks}'
batch_timeout: '#SBATCH -t {batch_time}:00'
timeout: "120"
scheduler: "slurm"
sys_cpus_per_node: "96"
max_request: "1000" # n_ranks/n_nodes cannot exceed this
n_ranks: '1000001' # placeholder value
n_nodes: '1000001' # placeholder value
batch_submit: "placeholder"
mpi_command: "placeholder"
@@ -7,9 +7,12 @@ variables:
gtl_flag: '' # to be overwritten by tests that need GTL
cuda_arch: '60'
default_cuda_version: '11.8.0'
batch_time: '02:00'
mpi_command: 'srun -N {n_nodes} -n {n_ranks}'
batch_submit: 'sbatch {execute_experiment}'
batch_nodes: '#SBATCH -N {n_nodes}'
batch_ranks: '#SBATCH -n {n_ranks} -G {n_ranks}'
batch_timeout: '#SBATCH -t {batch_time}:00'
timeout: "120"
scheduler: "slurm"
sys_cpus_per_node: "36"
sys_gpus_per_node: "2"
max_request: "1000" # n_ranks/n_nodes cannot exceed this
n_ranks: '1000001' # placeholder value
n_nodes: '1000001' # placeholder value
batch_submit: "placeholder"
mpi_command: "placeholder"
16 changes: 10 additions & 6 deletions configs/LLNL-Sierra-IBM-power9-V100-Infiniband/variables.yaml
@@ -5,11 +5,15 @@

variables:
gtl_flag: '' # to be overwritten by tests that need GTL
batch_time: '02:00'
mpi_command: '/usr/tcetmp/bin/lrun -n {n_ranks} -T {processes_per_node} {gtl_flag}'
batch_submit: 'bsub -q pdebug {execute_experiment}'
batch_nodes: '#BSUB -nnodes {n_nodes}'
batch_ranks: ''
batch_timeout: '#BSUB -W {batch_time}'
default_cuda_version: '11.8.0'
cuda_arch: '70'
timeout: "120"
scheduler: "lsf"
queue: "pdebug"
sys_cpus_per_node: "44"
sys_gpus_per_node: "4"
max_request: "1000" # n_ranks/n_nodes cannot exceed this
n_ranks: '1000001' # placeholder value
n_nodes: '1000001' # placeholder value
batch_submit: "placeholder"
mpi_command: "placeholder"
15 changes: 9 additions & 6 deletions configs/LLNL-Tioga-HPECray-zen3-MI250X-Slingshot/variables.yaml
@@ -6,9 +6,12 @@
variables:
gtl_flag: '' # to be overwritten by tests that need GTL
rocm_arch: 'gfx90a'
batch_time: '120m'
mpi_command: 'flux run -N {n_nodes} -n {n_ranks}'
batch_submit: 'flux batch {execute_experiment}'
batch_nodes: '# flux: -N {n_nodes}'
batch_ranks: '# flux: -n {n_ranks}'
batch_timeout: '# flux: -t {batch_time}'
timeout: "120"
scheduler: "flux"
sys_cpus_per_node: "64"
sys_gpus_per_node: "4"
max_request: "1000" # n_ranks/n_nodes cannot exceed this
n_ranks: '1000001' # placeholder value
n_nodes: '1000001' # placeholder value
batch_submit: "placeholder"
mpi_command: "placeholder"
16 changes: 9 additions & 7 deletions configs/RCCS-Fugaku-Fujitsu-A64FX-TofuD/variables.yaml
@@ -4,13 +4,15 @@
# SPDX-License-Identifier: Apache-2.0

variables:
batch_time: '02:00'
mpi_command: 'mpiexec'
batch_submit: 'pjsub {execute_experiment}'
batch_nodes: '#PJM -L "node={n_nodes}"'
batch_ranks: '#PJM --mpi proc={n_ranks}'
batch_timeout: '#PJM -L "elapse={batch_time}:00" -x PJM_LLIO_GFSCACHE="/vol0001:/vol0002:/vol0003:/vol0004:/vol0005:/vol0006"'
default_fj_version: '4.8.1'
default_llvm_version: '15.0.3'
default_gnu_version: '12.2.0'

timeout: "120"
scheduler: "pjm"
sys_cpus_per_node: "48"
sys_mem_per_node: "32"
max_request: "1000" # n_ranks/n_nodes cannot exceed this
n_ranks: '1000001' # placeholder value
n_nodes: '1000001' # placeholder value
batch_submit: "placeholder"
mpi_command: "placeholder"
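Across these configs the `scheduler` variable now selects which batch dialect the modifier must emit: `slurm` (`#SBATCH`), `flux` (`# flux:`), `lsf` (`#BSUB`), and `pjm` (`#PJM`) all appear. A sketch of directive generation using formats lifted from the lines this PR removes (the lookup table and function are illustrative, not the modifier's real API; timeout formatting is simplified):

```python
# Directive templates per scheduler, abridged from the removed config lines.
DIRECTIVE_FORMATS = {
    "slurm": ["#SBATCH -N {n_nodes}", "#SBATCH -n {n_ranks}", "#SBATCH -t {timeout}"],
    "flux":  ["# flux: -N {n_nodes}", "# flux: -n {n_ranks}", "# flux: -t {timeout}m"],
    "lsf":   ["#BSUB -nnodes {n_nodes}", "#BSUB -W {timeout}"],
    "pjm":   ['#PJM -L "node={n_nodes}"', "#PJM --mpi proc={n_ranks}"],
    "mpi":   [],  # no batch system: run the experiment directly
}

def batch_directives(scheduler: str, **alloc_vars) -> list:
    """Render a batch-script header from filled-in allocation variables."""
    return [fmt.format(**alloc_vars) for fmt in DIRECTIVE_FORMATS[scheduler]]
```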
15 changes: 9 additions & 6 deletions configs/nosite-AWS_PCluster_Hpc7a-zen4-EFA/variables.yaml
@@ -4,9 +4,12 @@
# SPDX-License-Identifier: Apache-2.0

variables:
batch_time: '02:00'
mpi_command: 'srun -N {n_nodes} -n {n_ranks} --mpi=pmix --export=ALL,FI_EFA_USE_DEVICE_RDMA=1,FI_PROVIDER="efa",OMPI_MCA_mtl_base_verbose=100'
batch_submit: 'sbatch {execute_experiment}'
batch_nodes: '#SBATCH -N {n_nodes}'
batch_ranks: '#SBATCH -n {n_ranks}'
batch_timeout: '#SBATCH -t {batch_time}:00'
timeout: "120"
scheduler: "slurm"
sys_cpus_per_node: "1"
# sys_gpus_per_node unset
max_request: "1000" # n_ranks/n_nodes cannot exceed this
n_ranks: '1000001' # placeholder value
n_nodes: '1000001' # placeholder value
batch_submit: "placeholder"
mpi_command: "placeholder"
15 changes: 9 additions & 6 deletions configs/nosite-HPECray-zen3-MI250X-Slingshot/variables.yaml
@@ -6,9 +6,12 @@
variables:
gtl_flag: '' # to be overwritten by tests that need GTL
rocm_arch: 'gfx90a'
batch_time: '02:00'
mpi_command: 'srun -N {n_nodes} -n {n_ranks}'
batch_submit: 'sbatch {execute_experiment}'
batch_nodes: '#SBATCH -N {n_nodes}'
batch_ranks: '#SBATCH -n {n_ranks}'
batch_timeout: '#SBATCH -t {batch_time}:00'
timeout: "120"
scheduler: "slurm"
sys_cpus_per_node: "1"
# sys_gpus_per_node unset
max_request: "1000" # n_ranks/n_nodes cannot exceed this
n_ranks: '1000001' # placeholder value
n_nodes: '1000001' # placeholder value
batch_submit: "placeholder"
mpi_command: "placeholder"
14 changes: 8 additions & 6 deletions configs/nosite-x86_64/variables.yaml
@@ -4,9 +4,11 @@
# SPDX-License-Identifier: Apache-2.0

variables:
batch_time: ''
mpi_command: 'mpirun -n {n_nodes} -c {n_ranks} --oversubscribe'
batch_submit: '{execute_experiment}'
batch_nodes: ''
batch_ranks: ''
batch_timeout: ''
scheduler: "mpi"
sys_cpus_per_node: "1"
# sys_gpus_per_node unset
max_request: "1000" # n_ranks/n_nodes cannot exceed this
n_ranks: '1000001' # placeholder value
n_nodes: '1000001' # placeholder value
batch_submit: "placeholder"
mpi_command: "placeholder"
9 changes: 5 additions & 4 deletions experiments/amg2023/cuda/ramble.yaml
@@ -15,12 +15,14 @@ ramble:
install: '--add --keep-stage'
concretize: '-U -f'

modifiers:
- name: allocation

applications:
amg2023:
workloads:
problem1:
variables:
n_ranks: '{processes_per_node} * {n_nodes}'
p: 2
px: '{p}'
py: '{p}'
@@ -32,11 +34,10 @@
gtl: ['gtl', 'nogtl']
gtlflag: ['-M"-gpu"', '']
experiments:
amg2023_cuda_problem1_{gtl}_{n_nodes}_{px}_{py}_{pz}_{nx}_{ny}_{nz}:
amg2023_cuda_problem1_{gtl}_{px}_{py}_{pz}_{nx}_{ny}_{nz}:
variables:
env_name: amg2023
processes_per_node: '4'
n_nodes: '2'
n_gpus: '8'
zips:
gtl_info:
- gtl
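The cuda experiment above now specifies only `n_gpus: '8'`; per commit 92c614c, the rank count is derived from the GPU count when the former is unset. A sketch of that rule (signature and ceil-division node derivation are my assumptions):

```python
from typing import Optional, Tuple

def infer_from_gpus(n_gpus: int, n_ranks: Optional[int],
                    sys_gpus_per_node: int) -> Tuple[int, int]:
    """Fill n_ranks from n_gpus and derive n_nodes (illustrative sketch)."""
    if n_ranks is None:
        n_ranks = n_gpus  # commit 92c614c: one rank per GPU when ranks unset
    if sys_gpus_per_node <= 0:
        raise ValueError("no GPUs configured for this system")
    n_nodes = -(-n_gpus // sys_gpus_per_node)  # ceil division
    return n_ranks, n_nodes
```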
13 changes: 0 additions & 13 deletions experiments/amg2023/openmp/execute_experiment.tpl

This file was deleted.

14 changes: 6 additions & 8 deletions experiments/amg2023/openmp/ramble.yaml
@@ -15,15 +15,14 @@ ramble:
install: '--add --keep-stage'
concretize: '-U -f'

modifiers:
- name: allocation

applications:
amg2023:
workloads:
problem1:
env_vars:
set:
OMP_NUM_THREADS: '{omp_num_threads}'
variables:
n_ranks: '{processes_per_node} * {n_nodes}'
p: 2
px: '{p}'
py: '{p}'
@@ -32,18 +31,17 @@
nx: '{n}'
ny: '{n}'
nz: '{n}'
processes_per_node: ['8', '4']
n_ranks_per_node: ['8', '4']
n_nodes: ['1', '2']
threads_per_node_core: ['4', '6', '12']
omp_num_threads: '{threads_per_node_core} * {n_nodes}'
n_threads_per_proc: ['4', '6', '12']
experiments:
amg2023_omp_problem1_{n_nodes}_{omp_num_threads}_{px}_{py}_{pz}_{nx}_{ny}_{nz}:
variables:
env_name: amg2023-omp
matrices:
- size_threads:
- n
- threads_per_node_core
- n_threads_per_proc
spack:
concretized: true
packages:
13 changes: 0 additions & 13 deletions experiments/amg2023/rocm/execute_experiment.tpl

This file was deleted.

9 changes: 5 additions & 4 deletions experiments/amg2023/rocm/ramble.yaml
@@ -15,12 +15,14 @@ ramble:
install: '--add --keep-stage'
concretize: '-U -f'

modifiers:
- name: allocation

applications:
amg2023:
workloads:
problem1:
variables:
n_ranks: '{processes_per_node} * {n_nodes}'
p: 2
px: '{p}'
py: '{p}'
@@ -30,12 +32,11 @@
ny: '{n}'
nz: '{n}'
experiments:
'{env_name}_problem1_{n_nodes}_{px}_{py}_{pz}_{nx}_{ny}_{nz}':
'{env_name}_problem1_{px}_{py}_{pz}_{nx}_{ny}_{nz}':
variables:
gtl: ["gtl", "no-gtl"]
env_name: 'amg2023-gpu-{gtl}'
processes_per_node: ['8', '4']
n_nodes: ['1', '2']
n_gpus: '8'
matrices:
- size_gtl:
- n
16 changes: 0 additions & 16 deletions experiments/gromacs/cuda/execute_experiment.tpl

This file was deleted.
