[Fix] SLURM distributed training in containers where scontrol is not available
#1527
Motivation
Distributed training with SLURM does not work when it is run in a containerized environment such as Docker, enroot, or Apptainer/Singularity on an HPC system. By checking whether `scontrol` is available and retrieving the master address without it if necessary, training with SLURM is now possible within a containerized environment on the HPC system of the research center Jülich in Germany (https://apps.fz-juelich.de/jsc/hps/juwels/booster-overview.html). This should generalize to all systems where containers do not have `scontrol` available.
Additionally, this PR fixes these issues:
open-mmlab/mmcv#1970
open-mmlab/mmcv#700
Modification
Two functions are added to `mmengine/dist/utils.py`:
- `_slurm_extract_first_node(slurm_nodelist)`: replaces `scontrol` to determine the address of the master node if needed (see the sketch below).
- `_is_scontrol_available()`: checks whether `scontrol` is available, i.e. whether the fallback above has to be used.
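
For illustration, a minimal sketch of what these two helpers could look like is shown below. The function names come from this PR, but the bodies are an assumption and may differ from the actual implementation in `mmengine/dist/utils.py`.

```python
import re
import shutil


def _is_scontrol_available() -> bool:
    """Check whether the ``scontrol`` binary is on PATH.

    Inside containers (Docker, enroot, Apptainer/Singularity) this is
    typically not the case, so the node list has to be parsed manually.
    """
    return shutil.which('scontrol') is not None


def _slurm_extract_first_node(slurm_nodelist: str) -> str:
    """Return the first hostname from a compressed SLURM node list.

    Handles common forms such as ``'node[001-004,007]'`` and
    ``'nodeA,nodeB'`` without calling ``scontrol show hostnames``.
    """
    # Bracketed range: prefix plus the first id inside the brackets.
    match = re.match(r'([\w.-]+)\[([\d-]+)', slurm_nodelist)
    if match:
        prefix = match.group(1)
        first_id = match.group(2).split('-')[0]
        return prefix + first_id
    # Plain comma-separated list: the first entry is the first node.
    return slurm_nodelist.split(',')[0]
```

The SLURM initialization logic can then fall back to `_slurm_extract_first_node(os.environ['SLURM_NODELIST'])` for the master address whenever `_is_scontrol_available()` returns `False`.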
BC-breaking (Optional)
No breaking changes are introduced.
Use cases (Optional)
HPC training within containerized environments.
Checklist
✔️