[Fix] SLURM distributed training in containers where scontrol is not available #1527

Open: 2 commits to merge into base: main


@R-Fehler R-Fehler commented Apr 3, 2024

Motivation

Distributed training with SLURM does not work when it runs in a containerized environment such as Docker, enroot, or Apptainer/Singularity on an HPC system.

This PR checks whether scontrol is available and, if it is not, retrieves the master address without it. With this change, training with SLURM works within a containerized environment on the HPC system of Research Center Jülich in Germany (https://apps.fz-juelich.de/jsc/hps/juwels/booster-overview.html). This should generalize to all systems where containers do not have scontrol available.

Additionally, this PR fixes these issues:

open-mmlab/mmcv#1970
open-mmlab/mmcv#700

Modification

Two functions are added to mmengine/dist/utils.py:

_slurm_extract_first_node(slurm_nodelist): replaces scontrol for setting the address of the master node when scontrol is not available.
_is_scontrol_available(): checks whether scontrol can be used in the current environment.

BC-breaking (Optional)

No breaking changes are introduced.

Use cases (Optional)

HPC training within containerized environments.

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has potential influence on downstream projects, this PR should be tested with downstream projects, like MMDetection or MMPretrain.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

✔️


CLAassistant commented Apr 3, 2024

CLA assistant check
All committers have signed the CLA.
