Queue Configuration

SLURM (Simple Linux Utility for Resource Management):

Slurm is a highly configurable open-source workload and resource manager. In its simplest configuration, Slurm can be installed and configured in a few minutes. Optional plugins provide the functionality needed by demanding HPC centers with diverse job types, policies, and workflows. Advanced configurations use plugins to provide features such as accounting, resource limit management by user or bank account, and support for sophisticated scheduling algorithms.

Commands for the User:

  Difference Between PBS and SLURM

1.  SBATCH – Submit a batch script to SLURM

sbatch <SCRIPT NAME> 

2. SQUEUE – View job and job step information for jobs in the SLURM scheduling queue. Without options it lists all jobs; use -j to select a specific job.

squeue -j <JOB ID>

3. SINFO – View information about the nodes and partitions of a system running SLURM.

sinfo 

4. SCANCEL – Signal or cancel jobs or job steps that are under the control of SLURM.

scancel <JOB ID>

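5. Kill Job – To terminate a job, pass its SLURM job ID to scancel. The shell's kill command expects a process ID, not a SLURM job ID.

scancel <JOB ID>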
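
The user commands above are typically used together: submit a script, look the job up in the queue, and cancel it if needed. The following is a minimal sketch, assuming the Slurm client tools are on the PATH. The function names submit_job, show_job, and cancel_job and the script name job.sh are hypothetical and only serve the illustration.

```shell
#!/bin/sh
# Sketch only: thin wrappers around sbatch, squeue, and scancel.
# The function names are hypothetical helpers, not part of Slurm.

# Submit a batch script and print only the numeric job ID.
# With --parsable, sbatch prints "jobid" or "jobid;cluster"
# instead of "Submitted batch job <id>".
submit_job() {
    out=$(sbatch --parsable "$1") || return 1
    echo "${out%%;*}"   # drop the optional ";cluster" suffix
}

# Show the queue entry for a single job ID.
show_job() {
    squeue -j "$1"
}

# Cancel a job by its Slurm job ID.
cancel_job() {
    scancel "$1"
}
```

A typical session might then be: job_id=$(submit_job job.sh), followed by show_job "$job_id", and cancel_job "$job_id" if the job has to be stopped.
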
Queue Configuration:
There are six queues configured on the NVIDIA DGX.

1. Queue Name: q_1day_1G
   This queue is meant for production runs on CUDA cores with 1 GPU.
   Walltime: 24 Hrs

2. Queue Name: q_2day_1G
   This queue is meant for production runs on CUDA cores with 1 GPU.
   Walltime: 48 Hrs

3. Queue Name: q_1day_2G
   This queue is meant for production runs on CUDA cores with 2 GPUs.
   Walltime: 24 Hrs

4. Queue Name: q_2day_2G
   This queue is meant for production runs on CUDA cores with 2 GPUs.
   Walltime: 48 Hrs

5. Queue Name: q_1day_4G
   This queue is meant for production runs on CUDA cores with 4 GPUs.
   Walltime: 24 Hrs

6. Queue Name: q_2day_4G
   This queue is meant for production runs on CUDA cores with 4 GPUs.
   Walltime: 48 Hrs
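
The queue names listed above follow a regular pattern: the wall time in days, then the GPU count. As a sketch of that naming scheme, the hypothetical helper pick_queue below builds a queue name from a GPU count and a number of days. It is an illustration under that assumption, not a site-provided tool.

```shell
#!/bin/sh
# Sketch only: pick_queue is a hypothetical helper.
# Usage: pick_queue GPUS DAYS
# Prints the matching queue name, or fails for unsupported values.
pick_queue() {
    gpus=$1
    days=$2
    case "$gpus" in
        1|2|4) ;;
        *) echo "unsupported GPU count: $gpus (use 1, 2, or 4)" >&2; return 1 ;;
    esac
    case "$days" in
        1|2) ;;
        *) echo "unsupported wall time: $days day(s) (use 1 or 2)" >&2; return 1 ;;
    esac
    echo "q_${days}day_${gpus}G"
}

pick_queue 2 1   # prints q_1day_2G
```

The result can be passed on the command line, for example sbatch --partition="$(pick_queue 4 2)" --gres=gpu:4 --time=48:00:00 job.sh, where job.sh is a placeholder script name.
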

HPC Job-Script example

Sample job script for DGX-1

#!/bin/sh
#SBATCH --job-name=serial_job_test    # Job name
#SBATCH --ntasks=1                    # Run a single task (one CPU)
#SBATCH --time=24:00:00               # Time limit hrs:min:sec
#SBATCH --output=serial_test_job.out  # Standard output and error log
#SBATCH --gres=gpu:1                  # Request one GPU
#SBATCH --partition=q_1day_1G         # Queue (partition) name
# Print working directory, host name, and date (only the date is also written to ./result)
pwd; hostname; date | tee result
# Run the container without -it: a batch job has no terminal attached.
nvidia-docker run ${USE_TTY} --name "$SLURM_JOB_ID" \
  --user "$(id -u "$USER"):$(id -g "$USER")" --rm \
  -v </home or /localscratch path>:/workspace \
  -v /etc/passwd:/etc/passwd -v /etc/group:/etc/group -v /etc/shadow:/etc/shadow \
  nvcr.io/nvidia/pytorch:18.07-py3 \
  python -c 'import torch; print(torch.__version__)'
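
When adapting the sample job script to another queue, the --partition, --gres, and --time directives presumably need to agree with the queue's GPU count and wall time. The sketch below derives the three lines from a queue name. The helper queue_directives is hypothetical, and it assumes the q_<days>day_<gpus>G naming used on this page.

```shell
#!/bin/sh
# Sketch only: queue_directives is a hypothetical helper.
# Usage: queue_directives QUEUE
# Prints #SBATCH lines consistent with the given queue name.
queue_directives() {
    queue=$1
    # Accept only the six queue names documented on this page.
    case "$queue" in
        q_[12]day_[124]G) ;;
        *) echo "unknown queue: $queue" >&2; return 1 ;;
    esac
    days=${queue#q_}        # e.g. "2day_4G"
    days=${days%%day_*}     # e.g. "2"
    gpus=${queue##*_}       # e.g. "4G"
    gpus=${gpus%G}          # e.g. "4"
    hours=$((days * 24))
    printf '#SBATCH --partition=%s\n' "$queue"
    printf '#SBATCH --gres=gpu:%s\n' "$gpus"
    printf '#SBATCH --time=%s:00:00\n' "$hours"
}

queue_directives q_2day_4G
```

The helper rejects a misspelled name such as q_1day-1G (hyphen instead of underscore) rather than printing directives for it.
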