SLURM (Simple Linux Utility for Resource Management):
Slurm is a highly configurable, open-source workload and resource manager. In its simplest configuration, it can be installed and set up in a few minutes. Optional plugins provide the functionality needed by demanding HPC centers with diverse job types, policies, and workflows. Advanced configurations use plugins to add features such as accounting, per-user or per-account resource limits, and support for sophisticated scheduling algorithms.
Commands for the User:
1. SBATCH – Submit a batch script to SLURM.
sbatch <SCRIPT NAME>
2. SQUEUE – View information about jobs in the SLURM scheduling queue; it shows job and job-step information for jobs managed by SLURM.
squeue
squeue -j <JOB ID>
3. SINFO – View information about SLURM nodes and partitions for a system running SLURM.
sinfo
4. SCANCEL – Signal or cancel jobs and job steps that are under the control of SLURM.
scancel <JOB ID>
5. Kill Job – to terminate a running job, use scancel with the job ID (the plain kill command acts on process IDs, not SLURM job IDs).
scancel <JOB ID>
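As an illustration, a typical job life cycle with these commands might look like the sketch below; the script name and the job ID 12345 are placeholders.
sbatch test_script.sh    # submit the batch script; SLURM prints "Submitted batch job 12345"
squeue -j 12345          # check the state of job 12345
sinfo                    # list partitions (queues) and node availability
scancel 12345            # cancel job 12345 if it is no longer needed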
Available Queues:
Queue Name    GPUs per Job    Walltime
q_1day-1G     1               24 Hrs
q_2day-1G     1               48 Hrs
q_1day-2G     2               24 Hrs
q_2day-2G     2               48 Hrs
q_1day-4G     4               24 Hrs
q_2day-4G     4               48 Hrs
All of these queues are meant for production runs on CUDA cores with the indicated number of GPUs.
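For example, to run a 2-GPU job for up to 24 hours you would pick the q_1day-2G queue and request a matching number of GPUs. The #SBATCH lines below are a minimal sketch of just the queue-related options; the values are illustrative.
#SBATCH --partition=q_1day-2G   # queue (partition) with a 24-hour walltime and 2 GPUs
#SBATCH --gres=gpu:2            # GPU count must match the chosen queue
#SBATCH --time=24:00:00         # requested time must not exceed the queue's walltime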
HPC Job Script Example
Sample job script for DGX-1
#!/bin/sh
#SBATCH --job-name=serial_job_test # Job name
#SBATCH --ntasks=1 # Run on a single CPU
#SBATCH --time=24:00:00 # Time limit hrs:min:sec
#SBATCH --output=serial_test_job.out # Standard output log
#SBATCH --error=serial_test_job.err # Standard error log
#SBATCH --gres=gpu:1 # Request 1 GPU
#SBATCH --partition=q_1day-1G # Queue with a 24-hour walltime and 1 GPU
pwd; hostname; date |tee result
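# In the docker run command below (placeholders in <> must be filled in):
#   --gpus '"device='$CUDA_VISIBLE_DEVICES'"'  restricts the container to the GPUs SLURM allocated to this job
#   --name $SLURM_JOB_ID                       names the container after the SLURM job ID
#   --ipc=host --shm-size=20G                  share host IPC and enlarge shared memory (used by many deep-learning data loaders)
#   --user $(id -u $USER):$(id -g $USER)       run the container as your own user rather than root
#   --rm                                       remove the container automatically when the job ends
#   -v /localscratch/<uid>:...                 mount your local scratch directory inside the container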
docker run -t --gpus '"device='$CUDA_VISIBLE_DEVICES'"' --name $SLURM_JOB_ID --ipc=host --shm-size=20G --user $(id -u $USER):$(id -g $USER) --rm -v /localscratch/<uid>:/workspace/localscratch/<uid> <preferred_docker_image name>:<tag> bash -c 'cd /workspace/localscratch/<uid>/<path to desired folder>/ && python <script to be run.py>' | tee -a log_out.txt
## Example of the above command (do not include these two commented example lines in your script):
##docker run -t --gpus '"device='$CUDA_VISIBLE_DEVICES'"' --name $SLURM_JOB_ID --ipc=host --shm-size=20G --user $(id -u $USER):$(id -g $USER) --rm -v /localscratch/secdsan:/workspace/localscratch/secdsan secdsan_cuda:latest bash -c 'cd /workspace/localscratch/secdsan/gputestfolder/ && python gputest.py' | tee -a log_out.txt
Job Submission Instructions:
- All jobs must be submitted via SLURM.
- If jobs are run without SLURM, your professor will be notified and your account will be blocked.
- Save the sbatch script shown above in a file and submit the job with the command below.
sbatch <SCRIPT NAME>
example: sbatch test_script.sh
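Once the job is submitted, you can track it and inspect its output as sketched below; the output and error file names come from the sample script above.
squeue -u $USER            # list all of your jobs in the queue
cat serial_test_job.out    # standard output (name set by --output in the sample script)
cat serial_test_job.err    # error log (name set by --error in the sample script)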