PBSPro Batch Scheduler – Cray

Introduction

The CRAY XC40 facility uses PBS (Portable Batch System) to schedule jobs. Writing a submission script is typically the most convenient way to submit your job to the job submission system. Interactive jobs are also available and can be particularly useful for developing and debugging applications. 

In view of the increasing complaints about the reduced time available to the regular queues due to the high-priority HIP queues, and to achieve an equitable share of resources across queues of different priorities, a job starvation policy has been in place since last week. Under this policy, the allocation for high-priority queues will be no more than 1.5 days per week. Hence, you might notice the high-priority queues being closed after reaching this threshold during a week.

The job submission system on the CRAY XC40 works differently from those you may have used on other SERC HPC facilities. CRAY XC40 job submission scripts do not run directly on the compute nodes. Instead, they run on Job Launcher Nodes (also called MOM Nodes), which are CRAY XC40 Service Nodes that have permission to issue the aprun command. The aprun command launches jobs on the compute nodes. This contrasts with most HPC job submission systems, where the job submission script runs directly on the first compute node selected for the job. Therefore, running jobs on the CRAY XC40 requires care: avoid placing any memory- or CPU-intensive commands in job submission scripts, as these could cause problems for other users who are sharing the Job Launcher Nodes. CPU- and memory-intensive commands should be run as serial jobs on the pre- and post-processing nodes (see below for details on how to do this).

There are three key commands used to interact with the PBS on the command line:

  • qsub
  • qstat
  • qdel

To execute the above commands, please load the following module:

  • module load pbs

Check the PBS man page for more advanced commands:

man pbs

The qsub command

The qsub command submits a job to PBS:

  • qsub job_script.pbs
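A typical submit-and-check sequence looks like the following (the job ID 1234567.sdb is purely illustrative, and username stands for your own login name):

module load pbs
qsub job_script.pbs        # PBS prints the ID of the new job, e.g. 1234567.sdb
qstat -u username          # check the status of your jobs
qdel 1234567.sdb           # remove the job from the queue if required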

 

High priority queues 
These queues have higher priority than the regular queues. More information on these queues can be found on this page.

 

Regular Queues:

Queue Configuration for IISc users:

The following queues are configured on the Cray XC40:

idqueue: This queue is meant for testing codes. Users can use this queue if the number of processors is between 24 and 256. The walltime limit for this queue is 1 hour.

small: Meant for production runs with core counts ranging from 24 to 1032, with a maximum job walltime of 24 hrs. Per user, up to 3 jobs can be in the queue, of which at most two can be in the running state.

small72: Meant for production runs with core counts ranging from 24 to 1032, with a maximum job walltime of 72 hrs. Per user, only 1 job can be in the queue, whether queued or running.

medium: Meant for production runs with core counts ranging from 1033 to 8208, with a maximum job walltime of 24 hrs. Per user, only 1 job can be in the queue, whether queued or running.

large: Meant for production runs by users with demonstrably scalable parallel codes. Allows core counts from 8209 to 24000, with a maximum job walltime of 24 hrs. This queue is controlled by ACLs and will only allow jobs from authorised users. Per user, only 1 job can be in the queue, whether queued or running.

gpu: Meant for production runs of CUDA codes, with core counts ranging from 1 to 12 and one GPU per node, with a maximum job walltime of 24 hrs. The queue also permits multi-node jobs.

mgpu: Meant for production runs of CUDA codes, with core counts ranging from 1 to 12 and one GPU per node, with a maximum job walltime of 24 hrs. This queue is dedicated to multi-node jobs using between 5 and 24 GPU nodes.

knl: Meant for production runs of Xeon Phi codes. Users can request up to 12 nodes, with a maximum job walltime of 24 hrs. Each KNL node has 64 physical cores and 256 logical cores. This queue permits multi-node jobs on 1 to 12 nodes.

batch: This is the default queue in which all jobs are placed when submitted. Its purpose is to route each job to the appropriate queue based on the parameters specified in the job script.

Note: Users cannot directly submit jobs to a particular queue; all jobs are routed through batch. Users may please note that the Intel Xeon Phi (KNC) nodes have been disabled after the CLE6 upgrade.
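For illustration, and assuming the routing simply follows the core-count and walltime limits listed above, resource requests such as the following (example values only) would be routed from batch as indicated:

# 2 nodes x 24 cores = 48 cores with a 24-hour walltime: expected to route to small
#PBS -l select=2:ncpus=24
#PBS -l walltime=24:00:00

# 50 nodes x 24 cores = 1200 cores with a 24-hour walltime: expected to route to medium
#PBS -l select=50:ncpus=24
#PBS -l walltime=24:00:00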

 

Queue Configuration for Non-IISc users:

There are three queues configured on the Cray XC40:

extcpu: Meant for production runs with core counts ranging from 240 to 1032, with a maximum job walltime of 24 hrs. Per user, only 1 job can be in the queue, whether queued or running.

extgpu: Meant for production runs of CUDA codes, with core counts ranging from 1 to 48 and one GPU per node, with a maximum job walltime of 24 hrs. The queue also permits multi-node jobs. Per user, only 1 job can be in the queue, whether queued or running.

extknl: Meant for production runs of Xeon Phi codes. Users can request up to 4 nodes, with a maximum job walltime of 24 hrs. Each KNL node has 64 physical cores and 256 logical cores. This queue permits multi-node jobs on 1 to 4 nodes. Per user, only 1 job can be in the queue, whether queued or running.
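As a rough sketch (assuming the same 24-core compute nodes as in the IISc queues above), a non-IISc CPU job sized for the extcpu range could reserve 10 nodes, i.e. 10 x 24 = 240 cores, the stated minimum:

#PBS -l select=10:ncpus=24
#PBS -l walltime=24:00:00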

Example Job Script

A sample script file looks like this:
#!/bin/sh
#PBS -N jobname
# Reserve 10 compute nodes with 24 cores each
#PBS -l select=10:ncpus=24
# Maximum walltime for the job to run
#PBS -l walltime=24:00:00
#PBS -l place=scatter
# Add the accelerator_type line below only for the idqueue, small, small72 and medium queues
#PBS -l accelerator_type="None"
# Allow NIS users to submit/launch the job
#PBS -S /bin/sh@sdb -V
. /opt/modules/default/init/sh
cd /path/of/executable
# Launch the parallel job using 240 MPI processes and 24 MPI processes per node
aprun -j 1 -n 240 -N 24 ./name_of_executable
  

In the case of the CRAY XC40, job-specific parameters are defined at two levels: the #PBS directives are used for reserving nodes, and the aprun command options are used for initiating job execution. The amount of computing resources to be used is therefore specified in two ways (see the worked sketch after this list):

  • Reserve a certain number of nodes (each having 24 computing cores) with the PBS Pro option -l select=<number of nodes>.
  • Specify the total number of cores to be used with the aprun option -n <number of nodes x number of cores per node>.
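As a quick worked sketch (the node count and executable name are illustrative, taken from the sample script above), reserving 10 nodes of 24 cores each gives 10 x 24 = 240 cores, which the aprun options must match:

# Reserve 10 nodes with 24 cores each: 10 x 24 = 240 cores in total
#PBS -l select=10:ncpus=24
# Launch 240 MPI processes in total (-n 240), placing 24 processes on each node (-N 24)
aprun -n 240 -N 24 ./name_of_executable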

Note: To ensure the minimum wait time for your job, specify a walltime as short as is safely possible (i.e. if your job is going to run for 3 hours, do not specify 24 hours). On average, the longer the walltime you specify, the longer you will wait in the queue. If not specified, the walltime defaults to 12 hrs.
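For instance, a job expected to finish within about 3 hours (the example from the note above) could request just:

#PBS -l walltime=03:00:00

rather than the 24-hour maximum of the queue.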

Important Note:

Handling Output and Error files of your jobs on XC40

On large-scale distributed systems like the Cray XC40, the output and error files generated by your jobs need to be explicitly specified in your job scripts. Please use full path names to write these files to your area on the scratch space /mnt/lustre. Using relative paths or bare file names in your job scripts pushes these files to the PBSPro system spool area and fills that space. As a result, the PBSPro scheduler can no longer dispatch jobs for execution and the batch scheduler stalls; every user is affected and no job gets dispatched.

The general guideline for handling output and error files for your jobs is as follows:

1. Try using specific file descriptors other than stdout and stderr in your executables. These file descriptors can then be set to write to the /mnt/lustre space directly in your source code, or through the appropriate directives or environment variables of your open-source and licensed codes.

2. The following #PBS directives should not be used/included in the job submission scripts:

#PBS -o /mnt/lustre/..../output.log
#PBS -e /mnt/lustre/..../error.log
#PBS -j oe

3. While using the input/output redirection symbols (<, >), please make sure that the absolute paths of your input, output and executable file locations are specified and point to the /mnt/lustre space.
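As an illustration, assuming your scratch area is /mnt/lustre/username (replace username and the file names with your own), the aprun line in the job script could redirect input and output explicitly to absolute paths:

aprun -j 1 -n 240 -N 24 /mnt/lustre/username/name_of_executable < /mnt/lustre/username/input.dat > /mnt/lustre/username/output.log 2> /mnt/lustre/username/error.log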

A quick way to check the status of your SahasraT jobs from your local machine (Linux)

Users can use the following commands to check the job status quickly from their local machines (Linux).
From the machines which have passwordless login enabled:
ssh -t username@sahasrat.serc.iisc.ernet.in "module load pbs && qstat -u username"

From the machines which don’t have passwordless login enabled:

sshpass -p 'password' ssh -t username@sahasrat.serc.iisc.ernet.in "module load pbs && qstat -u username"

Too lazy to execute the above commands every time? Add the following line to your ~/.cshrc or ~/.bashrc:

alias jobstat='ssh -t username@sahasrat.serc.iisc.ernet.in "module load pbs && qstat -u username"'

Use the command "jobstat" to get the details of your jobs.

Note: Users can edit the command according to their requirements. For example, use module load pbs && qstat -T -u username to get the estimated start time of your jobs.
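For example, a second alias along the same lines (the name jobstart is arbitrary; replace username with your own login, and note that csh aliases in ~/.cshrc omit the '=' sign) could report estimated start times:

alias jobstart='ssh -t username@sahasrat.serc.iisc.ernet.in "module load pbs && qstat -T -u username"'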

 

Sample Job Scripts (For IISc Users):

Commonly used PBSPro commands

1. To check the status of the job

qstat -a or apstat

Gives details of the job such as the job number, the queue through which it was submitted, etc.

2. To remove a job from the queue

qdel <jobid>

3. To know about the available queues

qstat -q

Documentation:

Report Problems to:

If you encounter any problem in using PBS Pro, please report it to the SERC helpdesk at the email address helpdesk.serc@auto.iisc.ac.in or contact the System Administrators in #103, SERC.