[Fix] SLURM distributed training in containers where scontrol is not available
#1527
Motivation
Distributed training with SLURM does not work when it is run in a containerized environment such as Docker, enroot, or Apptainer/Singularity on an HPC system. By checking whether `scontrol` is available and retrieving the master address without it if necessary, training with SLURM is now possible within a containerized environment on the HPC system of the research center Jülich in Germany (https://apps.fz-juelich.de/jsc/hps/juwels/booster-overview.html). This should generalize to all systems where containers do not have `scontrol` available.
Additionally, this PR fixes these issues:
open-mmlab/mmcv#1970
open-mmlab/mmcv#700
Modification
Two functions are added to `mmengine/dist/utils.py`:
- `_slurm_extract_first_node(slurm_nodelist)`: replaces `scontrol` to determine the address of the master node if needed (see the sketch below).
- `_is_scontrol_available()`: checks whether `scontrol` is available, i.e. whether the fallback above has to be used.
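
For illustration, a minimal sketch of what these two helpers could look like is shown below. The function names come from this PR, but the bodies are an assumption and may differ from the actual implementation in `mmengine/dist/utils.py`.

```python
import re
import shutil


def _is_scontrol_available() -> bool:
    """Check whether the ``scontrol`` binary is on PATH.

    Inside containers (Docker, enroot, Apptainer/Singularity) this is
    typically not the case, so the node list has to be parsed manually.
    """
    return shutil.which('scontrol') is not None


def _slurm_extract_first_node(slurm_nodelist: str) -> str:
    """Return the first hostname from a compressed SLURM node list.

    Handles common forms such as ``'node[001-004,007]'`` and
    ``'nodeA,nodeB'`` without calling ``scontrol show hostnames``.
    """
    # Bracketed range: prefix plus the first id inside the brackets.
    match = re.match(r'([\w.-]+)\[([\d-]+)', slurm_nodelist)
    if match:
        prefix = match.group(1)
        first_id = match.group(2).split('-')[0]
        return prefix + first_id
    # Plain comma-separated list: the first entry is the first node.
    return slurm_nodelist.split(',')[0]
```

The SLURM initialization logic can then fall back to `_slurm_extract_first_node(os.environ['SLURM_NODELIST'])` for the master address whenever `_is_scontrol_available()` returns `False`.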
BC-breaking (Optional)
No breaking changes are introduced.
Use cases (Optional)
HPC training within containerized environments.
Checklist
✔️