SLURM (Simple Linux Utility for Resource Management):
Slurm is a highly configurable open source workload and resource manager. In its simplest configuration, Slurm can be installed and configured in a few minutes. Optional plugins provide the functionality needed by demanding HPC centers with diverse job types, policies and workflows. Advanced configurations use plugins to provide features such as accounting, resource limit management by user or bank account, and support for sophisticated scheduling algorithms.
Commands for the User:
1. SBATCH – Submit a batch script to SLURM
sbatch <SCRIPT NAME>
2. SQUEUE – View job and job step information for jobs in the SLURM scheduling queue. Without options, squeue lists all jobs; use -j to show a specific job.
squeue -j <JOB ID>
3. SINFO – View partition and node information for a system running SLURM.
sinfo
4. SCANCEL – Used to signal jobs or job steps that are under the control of SLURM.
scancel <JOB ID>
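5. Kill Job – Force-terminate a job by sending it the KILL signal through scancel. The shell's kill command expects a process ID, not a SLURM job ID.
scancel --signal=KILL <JOB ID>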
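The commands above are usually combined into a submit, monitor, cancel cycle. The following is a minimal sketch of that cycle, not a site-provided script: the file name job.sh and the helper job_id_from_sbatch_output are hypothetical, and the sketch assumes sbatch prints its usual "Submitted batch job <JOB ID>" confirmation line.

```shell
#!/bin/sh
# Sketch of a submit / monitor / cancel cycle.
# "job.sh" is a hypothetical job script name used for illustration only.

# Extract the job ID from sbatch's confirmation line,
# which normally reads "Submitted batch job <JOB ID>".
job_id_from_sbatch_output() {
    printf '%s\n' "$1" | awk '/^Submitted batch job/ { print $4 }'
}

# Only talk to the scheduler when the Slurm client tools are installed
# and the hypothetical job script exists.
if command -v sbatch >/dev/null 2>&1 && [ -f job.sh ]; then
    jobid=$(job_id_from_sbatch_output "$(sbatch job.sh)")
    if [ -n "$jobid" ]; then
        echo "Submitted job $jobid"
        squeue -j "$jobid"   # show the state of this job only
        sinfo                # show partitions and node availability
        scancel "$jobid"     # cancel the job again
    fi
fi
```

On a machine without Slurm the guarded block is skipped, so the sketch can be read and sourced safely before trying it on the cluster.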
High-priority queue:
1. Queue Name: hpq_2day_4G
This queue is meant for high-priority production runs on CUDA cores with 4 GPUs.
Walltime: 48 Hrs
Production queues:
1. Queue Name: q_1day-1G
This queue is meant for production runs on CUDA cores with 1 GPU.
Walltime: 24 Hrs
2. Queue Name: q_2day-1G
This queue is meant for production runs on CUDA cores with 1 GPU.
Walltime: 48 Hrs
3. Queue Name: q_1day-2G
This queue is meant for production runs on CUDA cores with 2 GPUs.
Walltime: 24 Hrs
4. Queue Name: q_2day-2G
This queue is meant for production runs on CUDA cores with 2 GPUs.
Walltime: 48 Hrs
5. Queue Name: q_1day-4G
This queue is meant for production runs on CUDA cores with 4 GPUs.
Walltime: 24 Hrs
6. Queue Name: q_2day-4G
This queue is meant for production runs on CUDA cores with 4 GPUs.
Walltime: 48 Hrs
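The production queue names appear to follow the pattern q_<days>day-<gpus>G. As a rough sketch under that assumption, a small helper can build the queue name from the GPU count and walltime you need. The function name queue_for is hypothetical, and the high-priority queue is not covered because its name is formatted differently.

```shell
#!/bin/sh
# Build a production queue name from a GPU count and a walltime in days.
# Assumes the q_<days>day-<gpus>G naming pattern inferred from the queue list.
queue_for() {
    gpus=$1
    days=$2
    # Only the combinations present in the queue list are accepted.
    case "$gpus" in
        1|2|4) ;;
        *) echo "unsupported GPU count: $gpus" >&2; return 1 ;;
    esac
    case "$days" in
        1|2) ;;
        *) echo "unsupported walltime (days): $days" >&2; return 1 ;;
    esac
    echo "q_${days}day-${gpus}G"
}

queue_for 2 1    # prints q_1day-2G
```

The result can be passed to the --partition option of sbatch or written into the #SBATCH --partition line of a job script.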
HPC Job-Script example
Sample-Jobscript for DGX-1
#!/bin/sh
#SBATCH --job-name=serial_job_test    # Job name
#SBATCH --ntasks=1                    # Run a single task (one CPU core)
#SBATCH --time=24:00:00               # Time limit hrs:min:sec
#SBATCH --output=serial_test_job.out  # Standard output and error log
#SBATCH --gres=gpu:1                  # Request one GPU
#SBATCH --partition=q_1day-1G         # Queue (partition) to submit to
# Print the working directory and node name; print the date and save it to "result"
pwd; hostname; date | tee result
# Allocate a TTY only when stdout is a terminal (it is not in a batch job)
USE_TTY=""; test -t 1 && USE_TTY="-t"
nvidia-docker run -i ${USE_TTY} --rm --name $SLURM_JOB_ID \
  --user $(id -u $USER):$(id -g $USER) \
  -v <home or /localscratch path>:/workspace \
  -v /etc/passwd:/etc/passwd -v /etc/group:/etc/group -v /etc/shadow:/etc/shadow \
  nvcr.io/nvidia/pytorch:18.07-py3 \
  python -c 'import torch; print(torch.__version__)'
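The --gres, --time and --partition lines in a job script need to agree with the queue you choose. The sketch below derives the three lines from a production queue name, assuming the q_<days>day-<gpus>G naming pattern and a walltime of 24 hours per day. The function name sbatch_header_for is hypothetical, and the high-priority queue name is deliberately rejected because it does not follow this pattern.

```shell
#!/bin/sh
# Print #SBATCH resource lines that match a queue name of the form
# q_<days>day-<gpus>G (pattern assumed from the queue list).
sbatch_header_for() {
    queue=$1
    # Pull the day count and GPU count out of the queue name.
    days=$(printf '%s\n' "$queue" | sed -n 's/^q_\([0-9][0-9]*\)day-[0-9][0-9]*G$/\1/p')
    gpus=$(printf '%s\n' "$queue" | sed -n 's/^q_[0-9][0-9]*day-\([0-9][0-9]*\)G$/\1/p')
    if [ -z "$days" ] || [ -z "$gpus" ]; then
        echo "unrecognised queue name: $queue" >&2
        return 1
    fi
    echo "#SBATCH --partition=$queue"
    echo "#SBATCH --gres=gpu:$gpus"
    echo "#SBATCH --time=$((days * 24)):00:00"
}

sbatch_header_for q_2day-2G
```

For q_2day-2G this prints a partition line, a request for two GPUs and a 48:00:00 time limit, which can be pasted into the header of a job script in place of the 1-GPU, 24-hour values.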