Queue Configuration

SLURM (Simple Linux Utility for Resource Management):

Slurm is a highly configurable open-source workload and resource manager. In its simplest configuration, it can be installed and configured in a few minutes. Optional plugins provide the functionality needed to satisfy the needs of demanding HPC centers with diverse job types, policies, and workflows. Advanced configurations use plugins to provide features such as accounting, resource-limit management by user or bank account, and support for sophisticated scheduling algorithms.

Commands for the User:

See also: Difference Between PBS and SLURM

1.  SBATCH – Submit a batch script to SLURM

sbatch <SCRIPT NAME> 

2. SQUEUE – View information about jobs in the SLURM scheduling queue; it reports job and job-step information for jobs managed by SLURM.

squeue -j <JOB ID>

3. SINFO – View information about SLURM nodes and partitions for a system running SLURM.

sinfo 

4. SCANCEL – Used to signal jobs or job steps that are under the control of SLURM.

scancel <JOB ID>
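
Taken together, a typical session looks like the following. This is a minimal sketch; the script name job.sh and the job ID 12345 are placeholders:

# Submit the batch script; SLURM prints the assigned job ID
sbatch job.sh

# Check the job's state in the queue (replace 12345 with the ID printed by sbatch)
squeue -j 12345

# Check partition and node availability
sinfo

# Cancel the job if it is no longer needed
scancel 12345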

SMALL QUEUES:

  • Queue Name: q1m_2h-1G (Wall time = 2 Hrs)

        This queue is meant for production runs on CUDA cores with 1 GPU.

        Walltime: 1 Min to 2 Hrs

  • Queue Name: q1m_2h-2G (Wall time = 2 Hrs)

        This queue is meant for production runs on CUDA cores with 2 GPUs.

        Walltime: 1 Min to 2 Hrs

  • Queue Name: q1m_2h-4G (Wall time = 2 Hrs)

        This queue is meant for production runs on CUDA cores with 4 GPUs (an example request for this queue follows this list).

        Walltime: 1 Min to 2 Hrs
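
A job targeting one of these queues requests the matching partition and GPU count in its batch script. Below is a minimal sketch for the 4-GPU small queue, assuming the same #SBATCH conventions as the sample job script at the end of this section; the job name, output file, and payload command are placeholders:

#!/bin/sh
#SBATCH --job-name=small_4gpu_test    # Hypothetical job name
#SBATCH --ntasks=1                    # Single task
#SBATCH --gres=gpu:4                  # Request 4 GPUs to match the queue
#SBATCH --time=02:00:00               # Stay within the 2-hour wall time
#SBATCH --partition=q1m_2h-4G         # 4-GPU small queue
#SBATCH --output=small_4gpu_%j.out    # %j expands to the job ID
nvidia-smi                            # Placeholder payload: list the GPUs allocated to the job

The 1-GPU and 2-GPU small queues are requested the same way, with --gres=gpu:1 or --gres=gpu:2 and the corresponding partition name.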

MEDIUM QUEUES:

  • Queue Name: q2h_12h-1G (Wall time = 12 Hrs)

        This queue is meant for production runs on CUDA cores with 1 GPU.

        Walltime: 2 Hrs to 12 Hrs

  • Queue Name: q2h_12h-2G (Wall time = 12 Hrs)

        This queue is meant for production runs on CUDA cores with 2 GPUs (a sample request follows this list).

        Walltime: 2 Hrs to 12 Hrs
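
The medium queues are requested the same way; only the partition name and the time limit change. A sketch for the 2-GPU, 12-hour queue is shown below; only the directives that change are listed, and the output file name is a placeholder (the rest of the script follows the sample at the end of this section):

#SBATCH --gres=gpu:2                  # 2 GPUs to match q2h_12h-2G
#SBATCH --time=12:00:00               # Upper end of the 2-12 hour window
#SBATCH --partition=q2h_12h-2G
#SBATCH --output=medium_2gpu_%j.out   # Hypothetical output file

The large queues listed below (q12h_24h-1G and q24h_48h-1G) follow the same pattern, with --gres=gpu:1 and --time of up to 24:00:00 or 48:00:00, respectively.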

LARGE QUEUES:

  • Queue Name: q12h_24h-1G (Wall time = 24 Hrs)

        This queue is meant for production runs on CUDA cores with 1 GPU.

        Walltime: 12 Hrs to 24 Hrs

  • Queue Name: q24h_48h-1G (Wall time = 48 Hrs)

        This queue is meant for production runs on CUDA cores with 1 GPU.

        Walltime: 24 Hrs to 48 Hrs

Reservation Queue:

  • Queue Name: qreserve

        2 days of reservation time, every 25 days: for jobs with a longer execution time and/or a larger number of GPUs than the queues above (a sample request follows this list).

        Walltime: Up to 48 Hrs
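
Jobs granted reservation time are submitted to the qreserve partition like any other job. The sketch below assumes the same #SBATCH conventions as the other queues; the GPU count and reservation name are placeholders, and sbatch's --reservation option is only needed if the administrators issue a named reservation:

#SBATCH --gres=gpu:<N>                # Number of GPUs agreed for the reservation
#SBATCH --time=48:00:00               # Up to the 48-hour limit
#SBATCH --partition=qreserve
##SBATCH --reservation=<NAME>         # Uncomment if a named reservation is provided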

HPC Job-Script example

Sample job script for DGX-1

#!/bin/sh
#SBATCH --job-name=serial_job_test    # Job name
#SBATCH --ntasks=1         # Run on a single CPU
#SBATCH --time=00:05:00  # Time limit hrs:min:sec
#SBATCH --output=serial_test_%j.out # Standard output and error log
#SBATCH --gres=gpu:1
#SBATCH --partition=q1m_2h-1G 
# Print the working directory, node name, and start time (the date output is also written to ./result)
pwd; hostname; date | tee result

# Run the PyTorch NGC container on the allocated GPU as the submitting user;
# the passwd/group/shadow mounts make the --user mapping resolvable inside the container.
# ${USE_TTY} is assumed to be set (or empty) in the submitting environment.
nvidia-docker run -t ${USE_TTY} --name $SLURM_JOB_ID \
    --user $(id -u $USER):$(id -g $USER) --rm \
    -v </home or /localscratch path>:/workspace \
    -v /etc/passwd:/etc/passwd -v /etc/group:/etc/group -v /etc/shadow:/etc/shadow \
    nvcr.io/nvidia/pytorch:18.07-py3 \
    python -c 'import torch; print(torch.__version__)'
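
To run it, save the script (the name dgx_test.sh below is a placeholder), submit it with sbatch, and monitor it with squeue; standard output and error are written to serial_test_<JOB ID>.out as specified by the --output directive:

sbatch dgx_test.sh
squeue -j <JOB ID>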