Simply copy this example.slurm and adapt it to your needs.
In this doc we will use an example setup with these 2 partition names:
dev
prod
To find out the hostname of the nodes and their availability, use:
sinfo -p dev
sinfo -p prod
The Slurm configuration is at /opt/slurm/etc/slurm.conf.
To see the configuration of all partitions:
scontrol show partition
squeue -u `whoami` --start
will show when any pending jobs are scheduled to start.
They may start sooner if others cancel their reservations before the end of the reservation.
To schedule a new job when one of the currently scheduled jobs ends (regardless of whether it is still running or hasn't started yet), use the dependency mechanism, by telling sbatch to start the new job once the currently running job succeeds, using:
sbatch --dependency=CURRENTLY_RUNNING_JOB_ID tr1-13B-round1.slurm
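If you want to be explicit about the dependency type, sbatch supports several; for example (the job id here is hypothetical):
sbatch --dependency=afterok:123456 tr1-13B-round1.slurm   # start only if job 123456 completes successfully
sbatch --dependency=afterany:123456 tr1-13B-round1.slurm  # start once job 123456 ends, regardless of exit status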
Using --dependency may lead to shorter wait times than using --begin, since if the time passed to --begin allows even for a few minutes of delay after the stopping of the last job, the scheduler may already start some other jobs, even if their priority is lower than our job. That's because the scheduler ignores any jobs with --begin until the specified time arrives.
To postpone making the allocation for a given time, use:
salloc --begin HH:MM MM/DD/YY
Same for sbatch.
It will simply put the job into the queue at the requested time, as if you were to execute this command at this time. If resources are available at that time, the allocation will be given right away. Otherwise it'll be queued up.
Sometimes a relative begin time is useful, and other formats can be used as well. Examples:
--begin now+2hours
--begin=16:00
--begin=now+1hour
--begin=now+60 # seconds by default
--begin=2010-01-20T12:34:00
The time-units can be seconds (default), minutes, hours, days, or weeks.
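The same relative formats work with sbatch, e.g. re-using the script from the dependency example above:
sbatch --begin=now+2hours tr1-13B-round1.slurm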
Allocating a node once and reusing it is very useful for running repetitive interactive experiments, so one doesn't need to wait for a new allocation each time. The strategy is to allocate the resources once for an extended period of time and then run interactive srun jobs using this allocation.
Set --time to the desired window (e.g. 6h):
salloc --partition=dev --nodes=1 --ntasks-per-node=1 --cpus-per-task=96 --gres=gpu:8 --time=6:00:00 bash
salloc: Pending job allocation 1732778
salloc: job 1732778 queued and waiting for resources
salloc: job 1732778 has been allocated resources
salloc: Granted job allocation 1732778
Now use this reserved node to run a job multiple times, by passing the job id of salloc:
srun --jobid $SLURM_JOBID --pty bash
--jobid will already be set correctly via $SLURM_JOBID if this is run from inside the bash shell started via salloc. It can also be started from another shell, but then you have to set --jobid explicitly.
If this srun job times out or is exited manually, you can re-start it again on the same reserved node.
srun can, of course, call the real training command directly and not just bash.
Important: when allocating a single node, the allocated shell is not on the node (it never is). You have to find out the hostname of the node (reported when the allocation is granted, or via squeue) and ssh to it.
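For example, using the job id from the salloc example above:
squeue -j 1732778 -o "%N"
and then ssh to the printed hostname.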
When finished, to release the resources, either exit the shell started in salloc or run scancel JOBID.
This reserved node will be counted towards hours usage the whole time it's allocated, so release as soon as done with it.
Actually, if this is just one node, then it's even easier not to use salloc but to use srun in the first place, which will both allocate the resources and give you the shell to use:
srun --pty --partition=dev --nodes=1 --ntasks=1 --cpus-per-task=96 --gres=gpu:8 --time=60 bash
By default, if the cpu has Hyper-Threads (HT) enabled, SLURM will use it. If you don't want to use HT you have to specify --hint=nomultithread.
footnote: HT is Intel-specific naming, the general concept is simultaneous multithreading (SMT)
For example, for a cluster with 2 cpus per node, each with 24 cores and 2 hyper-threads per core, there is a total of 96 hyper-threads or 48 cpu-cores available. Therefore to utilize the node fully you'd need to configure either:
#SBATCH --cpus-per-task=96
or if you don't want HT:
#SBATCH --cpus-per-task=48
#SBATCH --hint=nomultithread
This last approach will allocate one thread per core and in this mode there are only 48 cpu cores to use.
Note that depending on your application there can be quite a performance difference between these 2 modes. Therefore try both and see which one gives you a better outcome.
On some setups like AWS the all-reduce throughput degrades dramatically when --hint=nomultithread is used! Whereas on some other setups the opposite is true - the throughput is worse without HT!
To check if your instance has HT enabled, run:
$ lscpu | grep Thread
Thread(s) per core: 2
If it's 2 then HT is enabled, if it's 1 then it isn't.
It is sometimes useful to reuse an existing allocation, e.g. when wanting to run various jobs on an identical node allocation.
In one shell:
salloc --partition=prod --nodes=16 --ntasks=16 --cpus-per-task=96 --gres=gpu:8 --time=3:00:00 bash
echo $SLURM_JOBID
In another shell:
export SLURM_JOBID=<JOB ID FROM ABOVE>
srun --jobid=$SLURM_JOBID ...
You may need to set --gres=gpu:0 to run some diagnostics job on the nodes. For example, let's check the shared memory of all the hosts:
srun --jobid 631078 --gres=gpu:0 bash -c 'echo $(hostname) $(df -h | grep shm)'
To exclude specific nodes (useful when you know some nodes are broken, but are still in IDLE state):
sbatch --exclude nodeA,nodeB
or via: #SBATCH --exclude ...
To use specific nodes:
sbatch --nodelist=nodeA,nodeB
You can also use the short -w instead of --nodelist.
The administrator could also define a feature=example in slurm.conf and then a user could ask for that subset of nodes via --constraint=example.
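To sketch what that looks like end to end (the node names, feature name and script name here are made up for illustration):
# slurm.conf (admin side): tag the nodes with the feature
NodeName=node-[1-4] ... Feature=example
# user side: request only nodes carrying that feature
sbatch --constraint=example my-job.slurm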
Since each SLURM run has a limited time span, it can be configured to send a signal of choice to the program a desired amount of time before the end of the allocated time.
--signal=[[R][B]:]<sig_num>[@<sig_time>]
TODO: need to experiment with this to help training finish gracefully and not start a new cycle after saving the last checkpoint.
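A minimal sketch of how this could be wired up inside a slurm script (the signal choice, timing and handler below are just an illustration, not a verified recipe):
#SBATCH --signal=B:USR1@120   # ask SLURM to send SIGUSR1 to the batch shell 120s before the time limit

# on SIGUSR1 flip a flag file the training loop could poll (hypothetical mechanism and script name)
trap 'echo "USR1 received - save checkpoint and exit"; touch stop-training' USR1

srun python train.py &        # run in the background so bash can process the trap while waiting
wait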
While most useful information is present in various SLURM_* env vars, sometimes some information is missing. In such cases use:
scontrol show -d job $SLURM_JOB_ID
and then parse out what's needed.
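For example, a one-liner in the same perl style used elsewhere in this doc, to pull the number of allocated nodes out of that output (relying on the NumNodes= field scontrol prints):
scontrol show -d job $SLURM_JOB_ID | perl -ne 'm|NumNodes=(\d+)| && print $1'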
For a job that finished its run use:
sacct -j JOBID
This command is also useful to discover if you have any srun jobs already running on that allocation (including those that have finished or been cancelled). For example, you could kill some run-away srun step via scancel <jobid>.<step-id>, and you'd find that <step-id> via the above command. The main job will continue running if it's an interactive job, even if you cancelled all step jobs.
To see more details:
sacct -ojobid,start,end,state,exitcode --format nodelist%300 -j JOBID
sacct -j JOBID --long
Or to see all jobs with their sub-steps while limiting the listing to a specific partition and only for your own user:
sacct -u `whoami` --partition=dev -ojobid,start,end,state,exitcode --format nodelist%300
sacct -u `whoami` --partition=prod -ojobid,start,end,state,exitcode --format nodelist%300
To see how a particular job was launched and all of its srun sub-step command lines:
sacct -j JOBID -o submitline -P
Show only my jobs:
squeue -u `whoami`
Show jobs by job id:
squeue -j JOBID
Show jobs of a specific partition:
squeue --partition=dev
Handy aliases
alias myjobs='squeue -u `whoami` -o "%.16i %9P %26j %.8T %.10M %.8l %.6D %.20S %R"'
alias groupjobs='squeue -u foo,bar,tar -o "%.16i %u %9P %26j %.8T %.10M %.8l %.6D %.20S %R"'
alias myjobs-pending="squeue -u `whoami` --start"
alias idle-nodes="sinfo -p prod -o '%A'"
If there are any zombies left behind across nodes, send one command to kill them all.
srun pkill python
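To target the nodes of a specific allocation, pass its job id as with the other srun invocations in this doc:
srun --jobid=$SLURM_JOBID pkill python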
sacct displays accounting data for all jobs and job steps in the Slurm job accounting log or Slurm database. So this is a great tool for analysing past events.
For example, to see which nodes were used to run recent gpu jobs:
sacct -u `whoami` --partition=dev -ojobid,start,end,state,exitcode --format nodelist%300
%300 here tells it to use a 300-char width for the output, so that it's not truncated. See man sacct for more fields and formatting info.
To cancel a job:
scancel [jobid]
To cancel all of your jobs:
scancel -u <userid>
To cancel all of your jobs on a specific partition:
scancel -u <userid> -p <partition>
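You can also cancel by job name, which is handy when you reuse a naming scheme (the name here is hypothetical):
scancel -u <userid> --name=my-training-job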
- if you see that your salloc'ed interactive job is scheduled to run much later than you need, try cancelling the job and asking for a shorter period - often there might be a closer window for a shorter time allocation.
If we need to separate the logs into different log files per node, add %N (for short hostname) so that we have:
#SBATCH --output=%x-%j-%N.out
That way we can tell if a specific node misbehaves - e.g. has a corrupt GPU. This is because currently pytorch doesn't log which node / gpu rank triggered an exception.
Hopefully it'll become a built-in feature of pytorch (pytorch/pytorch#63174) and then one won't need to complicate things on the logging side.
sinfo -p PARTITION
A very useful command is:
sinfo -s
and look for the main stat, e.g.:
NODES(A/I/O/T) "allocated/idle/other/total".
597/0/15/612
So here 597 out of 612 nodes are allocated. 0 idle and 15 are not available for whatever other reasons.
sinfo -p gpu_p1 -o "%A"
gives:
NODES(A/I)
236/24
so you can see if any nodes are available on the 4x v100-32g partition (gpu_p1).
To check a specific partition:
sinfo -p gpu_p1 -o "%A"
See the table at the top of this document for which partition is which.
- idle: no jobs running
- alloc: nodes are allocated to jobs that are currently executing
- mix: the nodes have some of the CPUs allocated, while others are idle
- drain: the node is unavailable due to an administrative reason
- drng: the node is running a job, but will after completion not be available due to an administrative reason
The node state could be followed by a single character which has a special meaning. It is one of:
* : The node is presently not responding and will not be allocated any new work. If the node remains non-responsive, it will be placed in the DOWN state (except in the case of COMPLETING, DRAINED, DRAINING, FAIL, FAILING nodes).
~ : The node is presently powered off.
# : The node is presently being powered up or configured.
! : The node is pending power down.
% : The node is presently being powered down.
$ : The node is currently in a reservation with a flag value of "maintenance".
@ : The node is pending reboot.
^ : The node reboot was issued.
- : The node is planned by the backfill scheduler for a higher priority job.
CD | Completed: The job has completed successfully.
CG | Completing: The job is finishing but some processes are still active.
F | Failed: The job terminated with a non-zero exit code and failed to execute.
PD | Pending: The job is waiting for resource allocation. It will eventually run.
PR | Preempted: The job was terminated because of preemption by another job.
R | Running: The job currently is allocated to a node and is running.
S | Suspended: A running job has been stopped with its cores released to other jobs.
ST | Stopped: A running job has been stopped with its cores retained.
To see all drained nodes and the reason for drainage (edit %50E to make the reason field longer/shorter):
% sinfo -R -o "%50E %12U %19H %6t %N"
or just -R if you want it short:
% sinfo -R
To run a sequence of jobs, so that the next slurm job is scheduled as soon as the currently running one is over (e.g. in 20h), we use a job array.
Let's start with just 10 such jobs:
sbatch --array=1-10%1 array-test.slurm
%1 limits the number of simultaneously running tasks from this job array to 1. Without it, it will try to run all the jobs at once, which we may want sometimes (in which case remove %1), but when training we need one job at a time.
Alternatively, as always this param can be part of the script:
#SBATCH --array=1-10%1
Here is a toy slurm script, which can be used to see how it works:
#!/bin/bash
#SBATCH --job-name=array-test
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1 # crucial - only 1 task per dist per node!
#SBATCH --cpus-per-task=1 # number of cores per tasks
#SBATCH --time 00:02:00 # maximum execution time (HH:MM:SS)
#SBATCH --output=%x-%j.out # output file name
#SBATCH --error=%x-%j.out # error file name (same to watch just one file)
#SBATCH --partition=dev
echo $SLURM_JOB_ID
echo "I am job ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}"
date
sleep 10
date
Note that $SLURM_ARRAY_JOB_ID is the same as $SLURM_JOB_ID, and $SLURM_ARRAY_TASK_ID is the index of the job.
To see the jobs running:
$ squeue -u `whoami` -o "%.10i %9P %26j %.8T %.10M %.6D %.20S %R"
JOBID PARTITION NAME STATE TIME NODES START_TIME NODELIST(REASON)
591970_[2- dev array-test PENDING 0:00 1 2021-07-28T20:01:06 (JobArrayTaskLimit)
Now job 2 is running.
To cancel the whole array, cancel the job id as normal (the number before _):
scancel 591970
To cancel a specific job:
scancel 591970_2
If it's important to have the log-file contain the array id, add %A_%a:
#SBATCH --output=%x-%j.%A_%a.log
More details: https://slurm.schedmd.com/job_array.html
In this recipe we accomplish 2 things:
- Allow modification to the next job's slurm script
- Allow suspending and resuming job arrays w/o losing the place in the queue when not being ready to continue running a job
SLURM is a very unforgiving environment where a small mistake can cost days of waiting time. But there are strategies to mitigate some of this harshness.
SLURM jobs have a concept of "age" in the queue which, besides project priority, governs when a job gets scheduled to run. If you have just scheduled a new job it has no "age" and will normally be put to run last compared to jobs that entered the queue earlier. Unless of course this new job comes from a high priority project, in which case it'll progress faster.
So here is how one can keep the "age" and not lose it when needing to fix something in the running script or for example to switch over to another script.
The idea is this:
- sbatch a long job array, e.g., --array=1-50%1
- inside the slurm script don't have any code other than source another-script.slurm - so now you can modify the target script or symlink to another script before the next job starts
- if you need to stop the job array train - don't cancel it, but suspend it without losing your place in the queue
- when ready to continue - unsuspend the job array - only the time while it was suspended is not counted towards its age, but all the previous age is retained.
The number of nodes, time, hardware and partition of a running job cannot be modified, but you can change pending jobs in the job array by scontrol update jobid=<desired_job_id> numnodes=<new number> partition=<new partition>.
If you do have sudo access then you can change the job time of the current job as well.
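For example, something along these lines should work if you have the rights (the job id and the new limit are placeholders):
sudo scontrol update JobId=<jobid> TimeLimit=30:00:00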
Here is a full example of this recipe:
Create a job script:
$ cat train-64n.slurm
#!/bin/bash
#SBATCH --job-name=tr8-104B
#SBATCH --nodes=64
#SBATCH --ntasks-per-node=1 # crucial - only 1 task per dist per node!
#SBATCH --cpus-per-task=96 # number of cores per tasks
#SBATCH --gres=gpu:8 # number of gpus
#SBATCH --time 20:00:00 # maximum execution time (HH:MM:SS)
#SBATCH --output=%x-%j.out # output file name
#SBATCH --partition=dev
source tr8-104B-64.slurm
Start it as:
sbatch --array=1-50%1 train-64n.slurm
Now you can easily edit tr8-104B-64.slurm before the next job run and either let the current job finish if that's desired, or if you need to abort it, just kill the currently running job, e.g. 1557903_5 (not the job array 1557903), and have the train pick up where it left off, but with the edited script.
The nice thing is that this requires no changes to the original script (tr8-104B-64.slurm
in this example), and the latter can still be started on its own.
Now, what if something is wrong and you need 10min or 10h to fix something? In this case we suspend the train using:
scontrol hold <jobid>
with <jobid> being either a "normal" job id, the id of a job array, or the id of a job array step,
and then when ready to continue release the job:
scontrol release <jobid>
If you allocated a node like so:
salloc --partition=dev --nodes=1 --ntasks-per-node=1 --time=1:00:00 bash
and you exited the shell, or your ssh connection got dropped, the allocation will be lost.
If you want to open an allocation that should survive exiting the shell, use --no-shell and no bash, like so:
salloc --no-shell --partition=dev --nodes=1 --ntasks-per-node=1 --time=1:00:00
and now if you need to join the session see How to rejoin the allocated node interactively.
But beware that if you ssh to the allocated node, launch something normally and then close the connection, that job will be lost, as the connecting shell will send SIGHUP to its child processes. To avoid that and to keep the job running use nohup while putting the program into a background process. Example:
nohup my-program &
nohup will ignore SIGHUP and will redirect stderr to stdout and append stdout to a special file nohup.out. If you want to control where the std streams should be written, use the normal stdout redirect > or >>, e.g.:
nohup my-program >> some-file.txt &
As mentioned earlier, the program is also sent into the background with &.
Now you can safely disconnect and the program will continue running when you come back.
This solution will prevent the program from exiting, but you won't be able to interact with it normally when you connect again, as the std streams will be redirected. You can of course still kill the program via its pid, change its nice state, etc., like you'd do with any other process.
But if you want something where you can disconnect, reconnect and continue using the program normally, you'd have to use a terminal multiplexer program like tmux or GNU screen, which run a daemon on the node and allow you to regain normal control over the program on reconnection. There are also mosh and other similar tools which further aid this process.
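For example, a minimal tmux workflow on the allocated node (the session name is arbitrary):
tmux new -s mysession     # start a named session and launch your program inside it
# detach with Ctrl-b d - now it's safe to disconnect
tmux attach -t mysession  # later: reattach and interact with the program normally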
To have multiple interactive shells into the same job, --overlap should be used.
For example, in console A, let's allocate a single node:
$ salloc --partition=dev --nodes=1 --ntasks-per-node=1 --cpus-per-task=26 --gres=gpu:1 --time=2:00:00 bash
salloc: Granted job allocation 1916
salloc: Nodes my-node-1 are ready for job
In console B:
$ srun --overlap --pty --jobid 1916 bash
and the above can be repeated in as many consoles as wanted.
If it's the first pseudo terminal shell you don't even need --overlap, but you need it for the additional shells.
It works the same if you initially allocated the node via srun --pty
srun --pty -p dev --gpus 8 --time=2:00:00 bash
You can, of course, also access the node via ssh, but if your SLURM has been set up to do all kinds of virtualizations (e.g. give only a few GPUs to each user, or virtualize /tmp/ or /scratch with auto-cleanup on exit), the view from ssh won't be the same. For example, if a job allocated 2 GPUs, the ssh shell will show all of the GPUs and not just the 2 - so if you're sharing the node with others this won't work well.
This works for multi-node allocations too, and by default you will get an interactive shell on the first node of the allocation. If you want to enter a specific node use -w to specify it. For example, say you got node-[1-4] allocated and you want to enter node-3, then specify:
srun --pty -p dev --gpus 8 --time=2:00:00 -w node-3 bash
and if it fails with:
srun: error: Unable to create step for job 1930: Invalid generic resource (gres) specification
add back the --gres=gpu:8 setting. You won't need to do it if your original allocation command used this flag already.
When using SLURM with a multi-node setup it's crucial that this is set correctly: "--machine_rank \$SLURM_PROCID". It must not be interpolated ahead of time, since if it's written as "--machine_rank $SLURM_PROCID" the variable gets expanded on the submission node before srun runs and the launcher will hang.
It's best to isolate the launcher from the program like so:
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=3333
ACCELERATE_CONFIG_FILE=path/to/accelerate.config.yaml # edit me
LAUNCHER="python -u -m accelerate.commands.launch \
--rdzv_conf "rdzv_backend=c10d,rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT" \
--config_file $ACCELERATE_CONFIG_FILE \
--main_process_ip $MASTER_ADDR \
--main_process_port $MASTER_PORT \
--machine_rank \$SLURM_PROCID \
--role \$(hostname -s|tr -dc '0-9'): --tee 3 \
"
PROGRAM="myprogram.py"
CMD="$LAUNCHER $PROGRAM"
SRUN_ARGS=" \
--wait=60 \
--kill-on-bad-exit=1 \
--unbuffered \
--jobid $SLURM_JOBID \
"
srun $SRUN_ARGS bash -c "$CMD" 2>&1 | tee -a main_log.txt
Now the launcher will always work and the users will only need to tweak the PROGRAM
variable.
With torchrun:
export GPUS_PER_NODE=8
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=3333
LAUNCHER="python -u -m torch.distributed.run \
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--node_rank \$SLURM_PROCID \
--rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
--rdzv_backend c10d \
--max_restarts 0 \
--role `hostname -s`: --tee 3 \
"
See Single and Multi-node Launchers with SLURM for complete working examples.
If the pytorch launcher fails it often means that the number of SLURM nodes and the launcher nodes are mismatching, e.g.:
grep -ir nodes= tr123-test.slurm
#SBATCH --nodes=40
NNODES=64
This won't work. They have to match.
You can add a sanity check to your script:
#!/bin/bash
#SBATCH --job-name=test-mismatch
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1 # crucial - only 1 task per dist per node!
#SBATCH --cpus-per-task=96 # number of cores per tasks
#SBATCH --gres=gpu:8 # number of gpus
#SBATCH --time 0:05:00 # maximum execution time (HH:MM:SS)
#SBATCH --output=%x-%j.out # output file name
#SBATCH --partition=prod
[...]
NNODES=2
# sanity check for having NNODES and `#SBATCH --nodes` match, assuming you use NNODES variable
if [ "$NNODES" != "$SLURM_NNODES" ]; then
echo "Misconfigured script: NNODES=$NNODES != SLURM_NNODES=$SLURM_NNODES"
exit 1
fi
[...]
or you could just do:
#SBATCH --nodes=2
[...]
NNODES=$SLURM_NNODES
and then it will always be correct.
Sometimes a node is broken, which prevents one from training, especially since restarting the job often hits the same set of nodes. So one needs to be able to isolate the bad node(s) and exclude them from sbatch.
To find a faulty node, write a small script that reports back the status of the desired check.
For example to test if cuda is available on all nodes:
python -c 'import torch, socket; print(f"{socket.gethostname()}: {torch.cuda.is_available()}")'
and to only report the nodes that fail:
python -c 'import torch, socket; torch.cuda.is_available() or print(f"Broken node: {socket.gethostname()}") '
Of course, the issue could be different - e.g. gpu can't allocate memory, so change the test script to do a small allocation on cuda. Here is one way:
python -c "import torch; torch.ones(1000,1000).cuda()"
But since we need to run the test script on all nodes and not just the first node, the slurm script needs to run it via srun. So our first diagnostics script can be written as:
srun --jobid $SLURM_JOBID bash -c 'python -c "import torch, socket; print(socket.gethostname(), torch.cuda.is_available())"'
I slightly changed it, due to an issue with quotes.
You can always convert the one liner into a real script and then there is no issue with quotes.
$ cat << EOT >> test-nodes.py
#!/usr/bin/env python
import torch, socket
print(socket.gethostname(), torch.cuda.is_available())
EOT
$ chmod a+x ./test-nodes.py
Now let's create a driver slurm script. Request just a few minutes of time for this test, so that SLURM schedules it faster:
#!/bin/bash
#SBATCH --job-name=test-nodes
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1 # crucial - only 1 task per dist per node!
#SBATCH --cpus-per-task=96 # number of cores per tasks
#SBATCH --gres=gpu:8 # number of gpus
#SBATCH --time 0:05:00 # maximum execution time (HH:MM:SS)
#SBATCH --output=%x-%j.out # output file name
#SBATCH --partition=prod
source $six_ALL_CCFRWORK/start-prod
srun --jobid $SLURM_JOBID ./test-nodes.py
Once it runs, check the logs to see if any nodes reported False - those are the nodes you want to exclude.
Now once the faulty node(s) is found, feed it to sbatch:
sbatch --exclude=hostname1,hostname2 ...
and sbatch will exclude the bad nodes from the allocation.
Additionally, please report the faulty nodes to #science-support so that they get replaced.
Here are a few more situations and how to find the bad nodes in those cases:
If you're testing something that requires distributed setup, it's a bit more complex. Here is a slurm script that tests that NCCL works. It sets up NCCL and checks that barrier works:
#!/bin/bash
#SBATCH --job-name=test-nodes-nccl
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1 # crucial - only 1 task per dist per node!
#SBATCH --cpus-per-task=96 # number of cores per tasks
#SBATCH --gres=gpu:8 # number of gpus
#SBATCH --time 0:05:00 # maximum execution time (HH:MM:SS)
#SBATCH --output=%x-%j.out # output file name
#SBATCH --partition=prod
source $six_ALL_CCFRWORK/start-prod
NNODES=2
GPUS_PER_NODE=4
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=6000
export LAUNCHER="python -u -m torch.distributed.launch \
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT \
"
export SCRIPT=test-nodes-nccl.py
cat << EOT > $SCRIPT
#!/usr/bin/env python
import torch.distributed as dist
import torch
import socket
import os
import fcntl
def printflock(*msgs):
""" print """
with open(__file__, "r") as fh:
fcntl.flock(fh, fcntl.LOCK_EX)
try:
print(*msgs)
finally:
fcntl.flock(fh, fcntl.LOCK_UN)
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.init_process_group("nccl")
header = f"{socket.gethostname()}-{local_rank}"
try:
dist.barrier()
printflock(f"{header}: NCCL {torch.cuda.nccl.version()} is OK")
except:
printflock(f"{header}: NCCL {torch.cuda.nccl.version()} is broken")
raise
EOT
echo $LAUNCHER --node_rank $SLURM_PROCID $SCRIPT
srun --jobid $SLURM_JOBID bash -c '$LAUNCHER --node_rank $SLURM_PROCID $SCRIPT'
The script uses printflock to solve the interleaved print outputs issue.
This tests if each GPU on the allocated nodes can successfully allocate 77GB (e.g. to test 80GB A100s; we have to subtract a few GBs for cuda kernels).
import torch, os
import time
import socket
hostname = socket.gethostname()
local_rank = int(os.environ["LOCAL_RANK"]);
gbs = 77
try:
torch.ones((gbs*2**28)).cuda(local_rank).contiguous() # alloc on cpu, then move to gpu
print(f"{local_rank} {hostname} is OK")
except:
print(f"{local_rank} {hostname} failed to allocate {gbs}GB DRAM")
pass
time.sleep(5)
Yet another issue with a node is when its network is broken and other nodes fail to connect to it.
You're likely to experience it with an error similar to:
work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1616554793803/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
Here is how to debug this issue:
- Add:
export NCCL_DEBUG=INFO
before the srun command and re-run your slurm script.
- Now study the logs. If you find:
r11i6n2:486514:486651 [1] include/socket.h:403 NCCL WARN Connect to 10.148.3.247<56821> failed : Connection refused
Let's see which node refuses to accept connections. We get the IP address from the error above and reverse resolve it to its name:
nslookup 10.148.3.247
247.3.148.10.in-addr.arpa name = r10i6n5.ib0.xa.idris.fr.
Add --exclude=r10i6n5 to your sbatch command and report it to JZ admins.
When dealing with hanging, here is how to automatically log py-spy traces for each process.
Of course, this same process can be used to run some command on all nodes of a given job, i.e. it can be used to run something during the normal run - e.g. dump the memory usage of each process via nvidia-smi or whatever other program needs to be run.
cd ~/prod/code/tr8b-104B/bigscience/train/tr11-200B-ml/
salloc --partition=prod --nodes=40 --ntasks-per-node=1 --cpus-per-task=96 --gres=gpu:8 --time 20:00:00
bash 200B-n40-bf16-mono.slurm
In another shell get the JOBID for the above salloc:
squeue -u `whoami` -o "%.16i %9P %26j %.8T %.10M %.8l %.6D %.20S %R"
Adjust the jobid per the above and the nodes count (XXX: probably can remove --nodes=40 altogether and rely on the salloc config):
srun --jobid=2180718 --gres=gpu:0 --nodes=40 --tasks-per-node=1 --output=trace-%N.out sh -c 'ps aux | grep python | egrep -v "grep|srun" | grep `whoami` | awk "{print \$2}" | xargs -I {} py-spy dump --native --pid {}' || echo "failed"
Now all py-spy traces go into the trace-$nodename.out files under cwd.
The key is to use --gres=gpu:0, or otherwise the second srun will block waiting for the first one to release the gpus.
Also the assumption is that some conda env that has py-spy installed gets activated in ~/.bashrc. If yours doesn't already do that, add the instruction to load the env to the above command, before the py-spy command - it'll fail to find py-spy otherwise.
Don't forget to manually release the allocation when this process is done.
Some multi-node launchers require a hostfile - here is how to generate one:
# autogenerate the hostfile for deepspeed
# 1. deals with: SLURM_JOB_NODELIST in either of 2 formats:
# r10i1n8,r10i2n0
# r10i1n[7-8]
# 2. and relies on SLURM_STEP_GPUS=0,1,2... to get how many gpu slots per node
#
# usage:
# makehostfile > hostfile
function makehostfile() {
perl -le '$slots=split /,/, $ENV{"SLURM_STEP_GPUS"}; $_=$ENV{"SLURM_JOB_NODELIST"}; if (/^(.*?)\[(\d+)-(\d+)\]/) { print map { "$1$_ slots=$slots\n" } $2..$3} elsif (/,/) { print map { "$1$_ slots=$slots\n" } split /,/ } '
}
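For example, assuming SLURM_JOB_NODELIST=r10i1n[7-8] and SLURM_STEP_GPUS=0,1,2,3,4,5,6,7, the generated hostfile should look roughly like:
r10i1n7 slots=8
r10i1n8 slots=8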
You can always do:
export SOMEKEY=value
from the slurm script to get a desired environment variable passed to the program launched from it.
And you can also add to the top of the slurm script:
#SBATCH --export=ALL
The launched program will see all the environment variables visible in the shell where it was launched from.
One of the most important Unix tools is the crontab, which is essential for being able to schedule various jobs. However, it is usually absent from SLURM environments, so one must emulate it. Here is how.
For this presentation we are going to use $WORK/cron/ as the base directory, and assume that you have an exported environment variable WORK pointing to some location on your filesystem - if you use Bash you can set it up in your ~/.bash_profile, or in whatever the equivalent startup file is for your shell.
We will use the $WORK/cron/scheduler dir for scheduler jobs, $WORK/cron/cron.daily for daily jobs and $WORK/cron/cron.hourly for hourly jobs:
$ mkdir -p $WORK/cron/scheduler
$ mkdir -p $WORK/cron/cron.daily
$ mkdir -p $WORK/cron/cron.hourly
Now copy these two slurm scripts into $WORK/cron/scheduler, after editing them to fit your specific environment's account and partition information:
Now you can launch the crontab scheduler jobs:
$ cd $WORK/cron/scheduler
$ sbatch cron-hourly.slurm
$ sbatch cron-daily.slurm
That's it - these jobs will now self-perpetuate and usually you don't need to think about them again, unless there is an event that makes SLURM lose all of its jobs.
Now whenever you want some job to run once a day, you simply create a slurm job and put it into the $WORK/cron/cron.daily dir.
Here is an example job that runs daily to update the mlocate file index:
$ cat $WORK/cron/cron.daily/mlocate-update.slurm
#!/bin/bash
#SBATCH --job-name=mlocate-update # job name
#SBATCH --ntasks=1 # number of MP tasks
#SBATCH --nodes=1
#SBATCH --hint=nomultithread # we get physical cores not logical
#SBATCH --time=1:00:00 # maximum execution time (HH:MM:SS)
#SBATCH --output=%x-%j.out # output file name
#SBATCH --partition=PARTITION # edit me
#SBATCH --account=GROUP@PARTITION # edit me
set -e
date
echo "updating mlocate db"
/usr/bin/updatedb -o $WORK/lib/mlocate/work.db -U $WORK --require-visibility 0
This builds an index of the files under $WORK, which you can then quickly query with:
/usr/bin/locate -d $WORK/lib/mlocate/work.db pattern
To stop running this job, just move it out of the $WORK/cron/cron.daily dir.
The same principle applies to jobs placed into the $WORK/cron/cron.hourly dir. These are useful for running something every hour.
Please note that this crontab implementation is approximate timing-wise: due to various delays in SLURM scheduling, the jobs will run approximately every hour and every day. You can recode these to ask SLURM to start something at a more precise time if you have to, but most of the time the method just presented works fine.
Additionally, you can code your own variations to meet specific needs of your project, e.g., every-30min or every-12h jobs.
Finally, since every cron launcher job will leave behind a log file (which is useful if for some reason things don't work), you want to create a cronjob to clean up these logs. Otherwise you may run out of inodes - these log files are tiny, but there could be tens of thousands of them. You could use something like this in a daily job:
find $WORK/cron -name "*.out" -mtime +7 -exec rm -f {} +
Please note that it's set to only delete files that are older than 7 days, in case you need the latest logs for diagnostics.
The scheduler runs with the Unix permissions of the person who launched the SLURM cron scheduler job, and so do all the other SLURM scripts launched by that cron job.
The same approach used in building a scheduler can be used for creating stand-alone self-perpetuating jobs.
For example:
#!/bin/bash
#SBATCH --job-name=watchdog # job name
#SBATCH --ntasks=1 # number of MP tasks
#SBATCH --nodes=1
#SBATCH --time=0:30:00 # maximum execution time (HH:MM:SS)
#SBATCH --output=%x-%j.out # output file name
#SBATCH --partition=PARTITION # edit me
# ensure to restart self first 1h from now
RUN_FREQUENCY_IN_HOURS=1
sbatch --begin=now+${RUN_FREQUENCY_IN_HOURS}hour watchdog.slurm
... do the watchdog work here ...
and you launch it once with:
sbatch watchdog.slurm
This then will immediately schedule itself to be run 1 hour from the launch time and then the normal job work will be done. Regardless of whether the rest of the job succeeds or fails, this job will continue relaunching itself approximately once an hour. This is imprecise due to scheduler job starting overhead and node availability issues. But if there is at least one spare node available and the job itself is quick to finish, the requirement to run at an approximate frequency should be satisfied.
As the majority of SLURM environments provide much cheaper CPU-only nodes in addition to the expensive GPU nodes, you should choose a CPU-only SLURM partition for any jobs that don't require GPUs to run.
From within the slurm file one can access information about the current job's allocations.
Getting allocated hostnames and useful derivations based on that:
export HOSTNAMES=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
export NUM_NODES=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | wc -l)
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
Sometimes SLURM tools will give you a string like node-[42,49-51], which requires some coding to expand into node-42,node-49,node-50,node-51, but there is a special tool to deal with that:
$ scontrol show hostnames node-[42,49-51]
node-42
node-49
node-50
node-51
Voila!
case study: this is for example useful if you want to get a list of nodes that were drained because a job was too slow to exit, but there is no real problem with the nodes. This one-liner will give you the list of such nodes in an expanded format, which you can then loop over in a script to undrain these nodes, after perhaps checking that the processes have died by this time:
sinfo -R | grep "Kill task failed" | perl -lne '/(node-.*[\d\]]+)/ && print $1' | xargs -n1 scontrol show hostnames
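For example, to undrain one of those nodes once you've confirmed it's healthy (the node name is hypothetical):
scontrol update NodeName=node-42 State=RESUME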
SLURM runs on Unix, but surprisingly its designers haven't adopted the concept of group ownership with regards to SLURM jobs. So if a member of your team started an array of 10 jobs of 20h each and went on vacation, then unless you have sudo access you can't do anything to stop those jobs if something is wrong.
I'm yet to find out why this is so, but so far we have been using a kill-switch workaround. You have to code it into your framework. For example, see how it was implemented in Megatron-Deepspeed (Meg-DS). The program periodically polls a path that is pre-configured at start up, and if it finds a file there, it exits.
So if we start Meg-DS with --kill-switch-path $WORK/tmp/training17-kill-switch and then at any point we need to kill the SLURM job, we simply do:
touch $WORK/tmp/training17-kill-switch
and the next time the program gets to check for this file it'll detect the event and will exit voluntarily. If you have a job array, well, you will have to wait until each job starts, detects the kill switch and exits.
Of course, don't forget to remove it when you're done stopping the jobs.
rm $WORK/tmp/training17-kill-switch
Now, this doesn't always work. If the job is hanging, it'll never get to the point of checking for the kill switch, and the only solution then is to contact the sysadmins to kill the job for you. Sometimes, if the hang is a simple case, pytorch's distributed setup will auto-exit after the preset 30min timeout, but it doesn't always work.
There are several ways to gracefully handle time- and QoS-based SLURM pre-emption, which are covered in depth in this section: Dealing with forced job preemption.
To figure out how many gpus are used by an already running job, parse the JOB_GRES=gpu: entry in the scontrol show job -d output. For example, if the job was started with:
srun --pty --partition=dev --nodes=2 --ntasks-per-node=1 --gres=gpu:8 --time=8:00:00 bash
that is we allocated 16 GPUs, we can now get that number back programmatically via:
$ TOTAL_JOB_GPUS=$(scontrol show job -d $SLURM_JOBID | perl -ne 'm|JOB_GRES=gpu:(\d+)| && print $1')
$ echo $TOTAL_JOB_GPUS
16
Replace $SLURM_JOBID with the actual SLURM job id if it's not already set in the shell you run the command from (you can find it via squeue).
While normally squeue will show you the duration of the currently running job, in order to see how long a job ran for once it has finished, you need to know the job id and then you can query it like so:
$ sacct -j 22171 --format=JobID,JobName,State,Elapsed
JobID JobName State Elapsed
------------ ---------- ---------- ----------
22171 example COMPLETED 00:01:49
so we know the job finished running in under 2min.
Many SLURM clusters use the FairShare system, where the more someone uses the cluster, the less priority they get to run jobs, or, if pre-emption is in place, the more likely their jobs are to get pre-empted.
To see your FairShare scores run:
sshare
Example:
Account User RawShares NormShares RawUsage EffectvUsage FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
root 0.000000 711506073 1.000000
all 1 0.500000 711506073 1.000000
all stas 1 0.022727 14106989 0.019827 0.288889
If your FairShare score is more than 0.5, it means you have been using the cluster less than what you have been allocated; if it's less than 0.5, it means you have been using more than what was allocated.
As time passes this score decays, so if you had a very low score and have since been using the cluster much less, your score will rise over time.
To see the score of a specific user:
sshare -u username
To see everybody's scores, sorted by FairShare:
sshare --all | sort -nk7 -r
This is the most important output, since it doesn't really matter what your score is on its own. What matters is your score relative to all other users. Everybody who has a higher score than you will have a higher chance of getting their job scheduled first and a lower chance of getting their job preempted.
Besides FairShare, the priorities are typically configured based on a combination of multiple metrics, usually including the length of time a job has been waiting in the queue, job size, Quality of Service (QOS) setting, partition specifics, etc. The specifics depend on how SLURM has been configured by your sysadmin.