Introduction
The RNC facility uses OpenPBS (Portable Batch System) to schedule jobs. Writing a submission script is typically the most convenient way to submit your job to the job submission system. Interactive jobs are also available and can be particularly useful for developing and debugging applications.
OpenPBS is distributed workload management software that provides a unified batch queuing and job management interface to a set of computing resources. It handles resource management and job scheduling for supercomputers and for parallel, message-passing and distributed high-performance computing workloads.
OpenPBS provides many features and benefits to both the computer system user and to the organization.
Parallel Job Support works with parallel programming libraries such as MPI, PVM and HPF. Applications can be scheduled to run within a single multi-processor computer or across multiple systems.
Job-Interdependency enables the user to define a wide range of inter-dependencies between jobs. Such dependencies include execution order, and execution conditioned on the success or failure of another specific job (or set of jobs).
Automatic Load-Leveling provides numerous ways to distribute the workload across a cluster of machines, based on hardware configuration, resource availability, keyboard activity, and local scheduling policy.
Distributed Clustering allows users to utilize physically distributed systems and clusters, even across wide-area networks.
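As a rough illustration of the Job-Interdependency feature described above, the sketch below builds the argument for the `-W depend=` option of qsub, which makes one job wait on others (`afterok` for success, `afternotok` for failure). The helper name `depend_arg`, the job IDs and the script names in the commented usage are hypothetical.

```shell
# Build a "depend=<type>:<jobid>[:<jobid>...]" string for `qsub -W`.
# depend_arg is a hypothetical helper, not part of OpenPBS.
depend_arg() {
    dep_type="$1"
    shift
    arg="depend=${dep_type}"
    for id in "$@"; do
        arg="${arg}:${id}"
    done
    printf '%s\n' "$arg"
}

# Example with made-up job IDs: run only after both jobs finish successfully.
depend_arg afterok 101 102

# Typical usage on the cluster (commented out so the sketch runs anywhere):
#   first=$(qsub step1.pbs)
#   qsub -W "$(depend_arg afterok "$first")" step2.pbs
```

Keeping the string construction in a small function makes it easy to chain more than two jobs without retyping the colon-separated list by hand.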
How OpenPBS Works
OpenPBS consists of two major components: commands and system processes (the Server, the Job Executor and the Scheduler).
Commands: OpenPBS supplies both command line programs and a graphical interface. These are used to submit, monitor, modify, and delete jobs. There are three command classifications: user commands, which any authorized user can use; operator commands; and manager (or administrator) commands, which require administrative privileges.
Server: The server process is the central component of OpenPBS. The server's main job is to provide the basic batch services such as receiving or creating a batch job, modifying the job, protecting the job against system crashes, and running the job. Typically there is one server managing a given set of resources.
Job Executor (MOM): This process actually places the job into execution. It is called MOM because it is the mother of all executing jobs. MOM places a job into execution when it receives a copy of the job from a server. MOM is also responsible for returning the job's output to the user when directed to do so by the server. One MOM runs on each computer that will execute PBS jobs.
Scheduler: The scheduler implements the policy controlling when each job is run and on which resources. The Scheduler communicates with the various MOMs to query the state of system resources and with the Server to learn about the availability of jobs to execute.
How to use OpenPBS
Users can log on to RNC and use the required PBS commands to submit their jobs.
Environmental setup
For C shell users
Add the following lines to your .cshrc file:
set path=($path /opt/pbs/bin)
set path=($path /opt/pbs/sbin)
Then run the command: source .cshrc
For bash shell users
Add the following lines to your .bashrc file:
export PATH=$PATH:/opt/pbs/bin
export PATH=$PATH:/opt/pbs/sbin
Then run the command: source .bashrc
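For bash users, a slightly more defensive variant of the same setup is sketched below. It appends each PBS directory only once, so sourcing .bashrc repeatedly does not keep growing PATH, and then reports whether qsub can be found. The helper name `add_to_path` is hypothetical; the two directories are the ones given above.

```shell
# Append a directory to PATH only if it is not already present.
# add_to_path is a hypothetical helper name.
add_to_path() {
    case ":${PATH}:" in
        *":$1:"*) ;;                 # already on PATH, nothing to do
        *) PATH="${PATH}:$1" ;;
    esac
}

add_to_path /opt/pbs/bin
add_to_path /opt/pbs/sbin
export PATH

# Report whether the PBS client commands can now be found.
if command -v qsub >/dev/null 2>&1; then
    echo "qsub found at $(command -v qsub)"
else
    echo "qsub not found; check that /opt/pbs/bin exists on this machine"
fi
```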
Queue Configuration:
There are eight regular queues configured on the RNC:
The PBS queue policies are listed below. Job limits are given per user and overall, for running and queued jobs.
Sl. No. | Queue Name | Description | Walltime | User Running | User Queue | Overall Running | Overall Queue |
1 | qreg_1day_small | Production runs on 1-127 cores | 24 Hrs | 5 | 5 | 75 | 25 |
2 | qreg_1day_med | Production runs on 96-128 cores | 24 Hrs | 1 | 1 | 10 | 10 |
3 | qreg_1day_large | Production runs on 96-128 cores | 24 Hrs | 1 | 1 | 10 | 10 |
4 | qreg_3day_small | Production runs on 1-128 cores | 72 Hrs | 2 | 2 | 10 | 10 |
5 | qgpu_1day | Production runs on 1 node with GPUs | 24 Hrs | 1 | 1 | 5 | 5 |
6 | qhm_1day | Production runs on High Memory nodes | 24 Hrs | 1 | 1 | 5 | 5 |
7 | qssd_1day_small | Production runs on 1-127 cores | 24 Hrs | 1 | 1 | 10 | 10 |
8 | qssd_1day_large | Production runs on 96-128 cores | 24 Hrs | 1 | 1 | 10 | 10 |
A sample job script is available for each queue.
Minimum: 1-core executions
Maximum: 128-core executions
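To make the queue settings concrete, here is a minimal sketch of a job script for the qreg_1day_small queue with its 24-hour walltime. The job name, the core count, the `select=` resource syntax and the executable `./my_program` are assumptions for illustration; the facility's own sample job scripts should take precedence where they differ.

```shell
# Write a minimal job script to sample_job.pbs (file name is arbitrary).
cat > sample_job.pbs <<'EOF'
#!/bin/bash
#PBS -N sample_job
#PBS -q qreg_1day_small
#PBS -l select=1:ncpus=4
#PBS -l walltime=24:00:00
#PBS -j oe

# Start in the directory the job was submitted from.
cd "$PBS_O_WORKDIR" || exit 1

# Placeholder for the real application (hypothetical name).
./my_program
EOF

# Check the script for shell syntax errors before submitting it.
bash -n sample_job.pbs && echo "sample_job.pbs passed the syntax check"

# Submit on the cluster with:
#   qsub sample_job.pbs
```

The `#PBS` directives are kept together at the top because the scheduler stops reading them at the first executable line.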
Node Classification
Sl. No. | Node Name | Queues that can be used in the script | Maximum No. of Nodes to be used in the Job script |
1 | Compute node | qreg_1day_small, qreg_1day_med, qreg_1day_large, qreg_3day_small | 29 |
2 | High Memory Node | qhm_1day | 1 |
3 | SSD Node | qssd_1day_small, qssd_1day_large | 8 |
4 | GPU Node | qgpu_1day | 1 |
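The node limits in the table above can be checked before submission with a small lookup. This is only a sketch: the function names `max_nodes_for_queue` and `check_node_request` are hypothetical, and the numbers are copied from the table, so they would need updating if the policy changes.

```shell
# Map a queue name to the maximum number of nodes a job script may request.
# Values are taken from the Node Classification table.
max_nodes_for_queue() {
    case "$1" in
        qreg_1day_small|qreg_1day_med|qreg_1day_large|qreg_3day_small) echo 29 ;;
        qssd_1day_small|qssd_1day_large) echo 8 ;;
        qhm_1day|qgpu_1day) echo 1 ;;
        *) echo "unknown queue: $1" >&2; return 1 ;;
    esac
}

# Report whether a requested node count fits the limit for the queue.
check_node_request() {
    limit=$(max_nodes_for_queue "$1") || return 1
    if [ "$2" -le "$limit" ]; then
        echo "ok: $2 node(s) fits within the $limit-node limit for $1"
    else
        echo "too many: $1 allows at most $limit node(s), requested $2"
        return 1
    fi
}

check_node_request qssd_1day_small 4
```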
Job scheduler commands:-
- Submit the job to the scheduler :- $ qsub <script_name>
- Check the status of jobs :- $ qstat
- Check where the jobs are running :- $ qstat -n
- Check full information on a job :- $ qstat -f <job_id>
- Delete the job from the queue :- $ qdel <job_id> (it may take 5 to 10 seconds)
- Force-delete the job :- $ qdel -W force <job_id>
- Check the queue information :- $ qstat -Q
- List all jobs and their states :- $ qstat -a
- List all running jobs :- $ qstat -r
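The commands above are often combined into a submit-and-wait workflow. The sketch below is one possible shape, not a facility-provided tool: `job_number`, `wait_for_job` and the `JOB_SCRIPT` variable are hypothetical names, and the polling loop assumes that qstat exits with a non-zero status once the job has left the queue.

```shell
# Strip the server suffix from a job ID such as "12345.server".
# job_number and wait_for_job are hypothetical helper names.
job_number() {
    printf '%s\n' "${1%%.*}"
}

# Poll qstat until the job is no longer known to the server.
# Assumes qstat exits non-zero once the job has left the queue.
# The optional second argument is the polling interval in seconds.
wait_for_job() {
    while qstat "$1" >/dev/null 2>&1; do
        sleep "${2:-30}"
    done
    echo "job $1 is no longer in the queue"
}

# Only talk to the scheduler when qsub is installed and JOB_SCRIPT
# (a hypothetical variable holding the script path) points to a file.
if command -v qsub >/dev/null 2>&1 && [ -f "${JOB_SCRIPT:-}" ]; then
    job_id=$(qsub "$JOB_SCRIPT")
    echo "submitted job $(job_number "$job_id") as $job_id"
    qstat -f "$job_id"
    wait_for_job "$job_id"
else
    echo "qsub or JOB_SCRIPT not available; nothing submitted"
fi
```

For an interactive session, `qsub -I` accepts the same `-q` and `-l` options as a batch submission.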
Get information on your cluster:-
- List offline and down nodes in the cluster :- $ pbsnodes -ln
- List information on every node in the cluster :- $ pbsnodes -a
- List information on all nodes with available and used resources :- $ pbsnodes -aSj
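When the full `pbsnodes -a` listing is too long to read, a short filter can summarise it. The sketch below counts nodes per state; `count_node_states` is a hypothetical helper, and it assumes each node record contains an indented line of the form `state = <value>`.

```shell
# Summarise node states from `pbsnodes -a` output read on standard input.
# count_node_states is a hypothetical helper; it assumes each node record
# contains a line of the form "state = <value>".
count_node_states() {
    awk '/^[ \t]*state = / { n[$3]++ }
         END { for (s in n) print s, n[s] }' | sort
}

if command -v pbsnodes >/dev/null 2>&1; then
    pbsnodes -a | count_node_states
else
    echo "pbsnodes not found; run this on a cluster login node"
fi
```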
PBS Environment Variables:-
Environment Variable | Description |
PBS_JOBNAME | User specified job name |
PBS_ARRAYID | Job array index for this job |
PBS_GPUFILE | File listing the GPUs allocated to the job, one per line, in the form <host>-gpu<number> |
PBS_O_WORKDIR | Job’s submission directory |
PBS_TASKNUM | Number of tasks requested |
PBS_O_HOME | Home directory of the submitting user |
PBS_JOBID | Unique pbs job id |
PBS_NUM_NODES | Number of nodes allocated to the job |
PBS_NUM_PPN | Number of procs per node allocated to the job |
PBS_O_HOST | Host on which the qsub command was run (the submission host) |
PBS_QUEUE | Job queue |
PBS_NODEFILE | File containing a line-delimited list of nodes allocated to the job |
PBS_O_PATH | Path variable used to locate executables within job script |
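A job script typically reads a few of these variables at start-up. The sketch below uses PBS_O_WORKDIR, PBS_JOBID and PBS_NODEFILE from the table; the helper name `describe_allocation` and the fallback values used when the variables are unset (for example, when testing outside a job) are assumptions.

```shell
# Summarise the current allocation from PBS environment variables.
# describe_allocation is a hypothetical helper; the fallbacks let it run
# outside a PBS job, where these variables are unset.
describe_allocation() (
    jobid="${PBS_JOBID:-none}"
    nodefile="${PBS_NODEFILE:-}"
    slots=1
    nodes=1
    if [ -n "$nodefile" ] && [ -r "$nodefile" ]; then
        slots=$(grep -c . "$nodefile")             # non-empty lines in the node file
        nodes=$(sort -u "$nodefile" | grep -c .)   # distinct host names
    fi
    echo "job=$jobid slots=$slots nodes=$nodes"
)

# Inside a job script, change to the submission directory first.
cd "${PBS_O_WORKDIR:-$PWD}" || exit 1
describe_allocation
```

The function body runs in a subshell, so its working variables do not leak into the rest of the job script.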
How to use module:-
- Check available modules :- $ module avail
- Check loaded modules :- $ module list
- Load a module :- $ module load <module_name>
- Unload one module :- $ module unload <module_name>
- Unload all modules :- $ module purge
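Inside a job script it is common to start from a clean module environment and then load exactly what the job needs. The sketch below wraps the commands listed above; `load_modules` is a hypothetical helper, and the module names passed to it are whatever `module avail` reports on the cluster.

```shell
# Purge the module environment, then load the named modules in order.
# load_modules is a hypothetical helper built on the module commands above.
load_modules() {
    if ! type module >/dev/null 2>&1; then
        echo "module command not available in this shell" >&2
        return 1
    fi
    module purge
    for m in "$@"; do
        module load "$m" || return 1
    done
    module list
}

# Usage inside a job script (module names are placeholders):
#   load_modules <module_name_1> <module_name_2>
```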
Documentation:
OpenPBS Quick Tutorial
Report Problems to:
If you encounter any problem in using OpenPBS, please report it to the SERC helpdesk at the email address helpdesk.serc@auto.iisc.ac.in or raise a ticket on the Helpdesk portal.
NOTE :
“Running jobs without PBS will lead to blocking of the computational account”
“Please note that the /localscratch/ space is meant for saving your job outputs for a temporary period only. Data in localscratch older than 14 days (2 weeks) will be deleted. SERC does not maintain any backups of localscratch data and hence will not be responsible for any data loss after deletion.”