Queue Configuration

SLURM (Simple Linux Utility for Resource Management):

Slurm is a highly configurable open-source workload and resource manager. In its simplest configuration, it can be installed and configured in a few minutes. Optional plugins provide the functionality needed to satisfy the needs of demanding HPC centers with diverse job types, policies, and workflows. Advanced configurations use plugins to provide features such as accounting, resource-limit management by user or bank account, and support for sophisticated scheduling algorithms.

Commands for the User:

See also: Difference Between PBS and SLURM

1.  SBATCH – Submit a batch script to SLURM

sbatch <SCRIPT NAME> 

2. SQUEUE – View information about jobs in the SLURM scheduling queue; it reports job and job-step information for jobs managed by SLURM.

squeue -j <JOB ID>

3. SINFO – View information about SLURM nodes and partitions for a system running SLURM.

sinfo 

4. SCANCEL – Used to signal jobs or job steps that are under the control of SLURM.

scancel <JOB ID>
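
Taken together, a typical session looks like the following. This is a minimal sketch; the script name job.sh and the job ID 12345 are placeholders:

# Submit the batch script; SLURM prints the assigned job ID
sbatch job.sh

# Check the job's state in the queue (replace 12345 with the ID printed by sbatch)
squeue -j 12345

# Check partition and node availability
sinfo

# Cancel the job if it is no longer needed
scancel 12345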

SMALL QUEUES:

  • Queue Name: q1m_2h-1G (Wall time = 2 Hrs)

        This queue is meant for production runs on CUDA cores with 1 GPU.

        Walltime: 1 Min to 2 Hrs

  • Queue Name: q1m_2h-2G (Wall time = 2 Hrs)

        This queue is meant for production runs on CUDA cores with 2 GPUs.

        Walltime: 1 Min to 2 Hrs

  • Queue Name: q1m_2h-4G (Wall time = 2 Hrs)

        This queue is meant for production runs on CUDA cores with 4 GPUs (an example request for this queue follows this list).

        Walltime: 1 Min to 2 Hrs
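
A job targeting one of these queues requests the matching partition and GPU count in its batch script. Below is a minimal sketch for the 4-GPU small queue, assuming the same #SBATCH conventions as the sample job script at the end of this section; the job name, output file, and payload command are placeholders:

#!/bin/sh
#SBATCH --job-name=small_4gpu_test    # Hypothetical job name
#SBATCH --ntasks=1                    # Single task
#SBATCH --gres=gpu:4                  # Request 4 GPUs to match the queue
#SBATCH --time=02:00:00               # Stay within the 2-hour wall time
#SBATCH --partition=q1m_2h-4G         # 4-GPU small queue
#SBATCH --output=small_4gpu_%j.out    # %j expands to the job ID
nvidia-smi                            # Placeholder payload: list the GPUs allocated to the job

The 1-GPU and 2-GPU small queues are requested the same way, with --gres=gpu:1 or --gres=gpu:2 and the corresponding partition name.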

MEDIUM QUEUES:

  • Queue Name: q2h_12h-1G (Wall time = 12 Hrs)

        This queue is meant for production runs on CUDA cores with 1 GPU.

        Walltime: 2 Hrs to 12 Hrs

  • Queue Name: q2h_12h-2G (Wall time = 12 Hrs)

        This queue is meant for production runs on CUDA cores with 2 GPUs (a sample request follows this list).

        Walltime: 2 Hrs to 12 Hrs
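
The medium queues are requested the same way; only the partition name and the time limit change. A sketch for the 2-GPU, 12-hour queue is shown below; only the directives that change are listed, and the output file name is a placeholder (the rest of the script follows the sample at the end of this section):

#SBATCH --gres=gpu:2                  # 2 GPUs to match q2h_12h-2G
#SBATCH --time=12:00:00               # Upper end of the 2-12 hour window
#SBATCH --partition=q2h_12h-2G
#SBATCH --output=medium_2gpu_%j.out   # Hypothetical output file

The large queues listed below (q12h_24h-1G and q24h_48h-1G) follow the same pattern, with --gres=gpu:1 and --time of up to 24:00:00 or 48:00:00, respectively.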

LARGE QUEUES:

  • Queue Name: q12h_24h-1G (Wall time = 24 Hrs)

        This queue is meant for production runs on CUDA cores with 1 GPU.

        Walltime: 12 Hrs to 24 Hrs

  • Queue Name: q24h_48h-1G (Wall time = 48 Hrs)

        This queue is meant for production runs on CUDA cores with 1 GPU.

        Walltime: 24 Hrs to 48 Hrs

Reservation Queue:

  • Queue Name: qreserve

        2 days of reservation time, every 25 days: for jobs with a longer execution time and/or a larger number of GPUs than the queues above (a sample request follows this list).

        Walltime: Up to 48 Hrs
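
Jobs granted reservation time are submitted to the qreserve partition like any other job. The sketch below assumes the same #SBATCH conventions as the other queues; the GPU count and reservation name are placeholders, and sbatch's --reservation option is only needed if the administrators issue a named reservation:

#SBATCH --gres=gpu:<N>                # Number of GPUs agreed for the reservation
#SBATCH --time=48:00:00               # Up to the 48-hour limit
#SBATCH --partition=qreserve
##SBATCH --reservation=<NAME>         # Uncomment if a named reservation is provided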

HPC Job-Script example

Sample job script for DGX-1

#!/bin/sh
#SBATCH --job-name=serial_job_test    # Job name
#SBATCH --ntasks=1         # Run on a single CPU
#SBATCH --time=00:05:00  # Time limit hrs:min:sec
#SBATCH --output=serial_test_%j.out # Standard output and error log
#SBATCH --gres=gpu:1
#SBATCH --partition=q1m_2h-1G 
# Print the working directory, node name, and start time (the date output is also written to ./result)
pwd; hostname; date | tee result

# Run the PyTorch NGC container on the allocated GPU as the submitting user;
# the passwd/group/shadow mounts make the --user mapping resolvable inside the container.
# ${USE_TTY} is assumed to be set (or empty) in the submitting environment.
nvidia-docker run -t ${USE_TTY} --name $SLURM_JOB_ID \
    --user $(id -u $USER):$(id -g $USER) --rm \
    -v </home or /localscratch path>:/workspace \
    -v /etc/passwd:/etc/passwd -v /etc/group:/etc/group -v /etc/shadow:/etc/shadow \
    nvcr.io/nvidia/pytorch:18.07-py3 \
    python -c 'import torch; print(torch.__version__)'
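
To run it, save the script (the name dgx_test.sh below is a placeholder), submit it with sbatch, and monitor it with squeue; standard output and error are written to serial_test_<JOB ID>.out as specified by the --output directive:

sbatch dgx_test.sh
squeue -j <JOB ID>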