Quick guide to setting up and running jobs on a Slurm cluster (specifically at ozerlabs).
This guide assumes you already have access and can connect to the cluster remotely. Run the following command to connect, replacing <username> with your username and <cluster_ip> with the IP address of the cluster.
ssh <username>@<cluster_ip>
You can either request an interactive session or submit a job to the cluster. An interactive session gives you a terminal in which your commands run on the requested resources. A submitted job runs in the background once the requested resources are acquired.
- Interactive session
sinteractive -N 1 -n 1 --nodelist=nodename --gres=gpu:1 -J int_jobs_name
- Submit a job
sbatch <script_name>.sh
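The script passed to sbatch is an ordinary shell script whose leading #SBATCH comment lines carry the resource request. A minimal sketch follows, mirroring the resources of the interactive example. The job name, time limit, and log file name are placeholder assumptions, and this is not a script taken from this repository.

```shell
#!/bin/bash
# Hypothetical job script, e.g. saved as example_job.sh and submitted with:
#   sbatch example_job.sh
# All #SBATCH lines must appear before the first executable command.

# Name shown in the queue (placeholder).
#SBATCH --job-name=example_job
# One node and one task, like -N 1 -n 1 in the interactive example.
#SBATCH --nodes=1
#SBATCH --ntasks=1
# One GPU.
#SBATCH --gres=gpu:1
# Wall-clock limit (assumed value; adjust to your cluster's policy).
#SBATCH --time=01:00:00
# Log file; %j expands to the job ID.
#SBATCH --output=example_job-%j.out

# Everything below runs on the allocated node.
# SLURM_JOB_ID is set by Slurm inside a job; "unknown" is printed elsewhere.
msg="Job ${SLURM_JOB_ID:-unknown} started on $(hostname)"
echo "$msg"

# Replace this line with your actual workload, e.g. python train.py
echo "workload goes here"
```

Options given on the sbatch command line take precedence over the matching #SBATCH lines, so the same script can be reused with different resources.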
Several modules are already installed on the cluster, including anaconda and cuda, and can be loaded into your environment.
You can list the available modules and load them with:
# check available modules
module avail
# load a module
module load <module_name>
Although some modules are installed, they might not be sufficient for the task at hand. For example, different tasks usually need separate Python environments. We therefore download and install Miniconda, which makes working with environments easier and more flexible.
- Download and install Miniconda
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
- Initialize conda
~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh
- Create a new environment
# conda create -n <env_name> python=3.7 ...
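In a batch job the shell is non-interactive, so the setup written by conda init may not have been loaded and conda activate can fail. A minimal sketch of one common workaround follows, assuming Miniconda lives in ~/miniconda3 as installed above. ENV_NAME is a hypothetical variable standing in for <env_name>, and it falls back to the base environment here.

```shell
#!/bin/bash
# Assumption: Miniconda was installed to ~/miniconda3.
CONDA_SH="$HOME/miniconda3/etc/profile.d/conda.sh"

# Hypothetical variable: set ENV_NAME to the environment you created.
ENV_NAME="${ENV_NAME:-base}"

if [ -f "$CONDA_SH" ]; then
    # Defines the conda shell function in this non-interactive shell.
    source "$CONDA_SH"
    conda activate "$ENV_NAME"
    # Show which interpreter the job would use.
    command -v python
else
    echo "conda.sh not found at $CONDA_SH; adjust the path to your install" >&2
fi
```

Placing these lines near the top of a job script, before the workload, keeps the job independent of whatever your login shell happened to load.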
CifarTuts is a simple PyTorch example that trains a CNN on the CIFAR-10 dataset. Follow the steps below to create a new environment and run train.py, test.py and predict.py on the cluster.
- Environment setup
# include python so that pip3 installs into this environment
conda create -n cifarTuts python
conda activate cifarTuts
# install pytorch
pip3 install torch torchvision torchaudio
# install other dependencies
# pip3 install -r requirements.txt
- Submit the job
cd cifarTuts
sbatch slurmJob.sh
You can run any of the examples provided in this repository by cd'ing into its directory and submitting the job to the cluster:
sbatch <script_name>.sh
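Once a job is submitted, it helps to capture its ID so you can watch or cancel it. The sketch below shows one way to do that with standard Slurm commands. JOB_SCRIPT is a hypothetical variable defaulting to slurmJob.sh, and the log file name assumes the job script does not set its own --output.

```shell
#!/bin/bash
# Run on the cluster, from the directory that contains the job script.
JOB_SCRIPT="${JOB_SCRIPT:-slurmJob.sh}"

if command -v sbatch >/dev/null 2>&1 && [ -f "$JOB_SCRIPT" ]; then
    # --parsable prints only the job ID (followed by ";cluster" on
    # multi-cluster setups, which cut removes).
    JOB_ID=$(sbatch --parsable "$JOB_SCRIPT" | cut -d';' -f1)
    echo "Submitted job $JOB_ID"

    # List your pending and running jobs.
    squeue -u "$USER"

    # By default Slurm writes the job's output to slurm-<jobid>.out
    # in the directory the job was submitted from.
    echo "Follow the log with: tail -f slurm-${JOB_ID}.out"
    echo "Cancel the job with: scancel ${JOB_ID}"
else
    echo "sbatch or $JOB_SCRIPT not found; run this on the cluster" >&2
fi
```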