Torque Configuration on Fermi

Introduction

The Fermi cluster is configured to run parallel batch jobs built to use GPGPUs. It consists of five GPU nodes: fermi1, fermi2, fermi3, fermi4 and fermi5. Nodes fermi1 to fermi4 each have one Nvidia Tesla C2070 GPU card with 448 CUDA cores, while fermi5 has 3 Nvidia Tesla M2090 GPU cards with 512 CUDA cores on each card. Batch jobs on the cluster are managed by the open-source Torque software; Torque version 3.0.2 is installed on the fermi cluster. Users intending to run GPU-based parallel codes are expected to submit jobs to this cluster. The GPU requirement of a job must be specified as a node attribute in the job script, as illustrated in the sample job scripts below. At present, each job can use a maximum of one GPU card (448 CUDA cores) per node on fermi1 – fermi4, or a maximum of 3 GPU cards (each with 512 CUDA cores) on fermi5. Users who need multiple GPU cards must request the requisite number of nodes or GPUs accordingly in their job scripts.

How to use Torque

Users can log on to fermi1 and use the qsub command to submit their jobs.

Environmental setup

For C shell users:

Add the following lines to your .cshrc file:

set path=(/usr/bin $path)
set path=(/usr/sbin $path)

Then run the command: source .cshrc

For bash shell users:

Add the following lines to your .bashrc file:

export PATH=/usr/bin:$PATH
export PATH=/usr/sbin:$PATH

Then run the command: source .bashrc
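To verify that the Torque client commands now resolve from your PATH, a quick check (the expected locations are those added above):

which qsub     # should print /usr/bin/qsub
which qstat    # should print /usr/bin/qstat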

Queue Configuration

There are two queues configured on the fermi cluster: routeq and batch.

routeq: This is the default queue in which all jobs are placed when submitted. Its purpose is to route each job to the appropriate execution queue based on the parameters specified in the job script.

batch: Once a job is placed in this queue, it is sent for execution depending on the free resources available on the cluster. The default walltime for this queue is 24 hours.

Note: Users cannot directly submit jobs to a particular queue; all jobs are routed through routeq.
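The queue attributes, including the 24-hour default walltime on batch, can be inspected with the standard Torque query below; the exact attributes listed depend on the site configuration:

qstat -Qf batch    # full listing of the batch queue's attributes, e.g. resources_default.walltime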

How to submit jobs

To submit jobs on the cluster, users must use the qsub command. The job executable is expected to be built using the CUDA libraries installed on this cluster. The general procedure is to first build your CUDA application, then create a job submission script and submit it using qsub.
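For illustration, a minimal build step might look like the following, assuming a CUDA source file named hello.cu (the file name is hypothetical; it produces the hello executable used in the sample scripts below):

nvcc -o hello hello.cu    # compile the CUDA source into the executable 'hello'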

A sample job submission script for fermi1 – fermi4 is shown below:

#!/bin/bash
#PBS -N testn                                  # job name
#PBS -l nodes=1:ppn=x:gpus=1                   # x = number of processors per node
cd /path_of_executable                         # directory containing the executable
./hello > /localscratch/login-id/hello.out     # write output to local scratch

mv /localscratch/login-id/hello.out /path/hello.out    # copy the output back to your path

A sample job submission script for fermi5 is shown below:

#!/bin/bash
#PBS -N testn                                  # job name
#PBS -l nodes=1:ppn=x:gpus=y:M2090             # x = processors per node, y = GPU cards (1 to 3)
cd /path_of_executable                         # directory containing the executable
./hello > /localscratch/login-id/hello.out     # write output to local scratch
mv /localscratch/login-id/hello.out /path/hello.out    # copy the output back to your path

Here,

hello: The executable file
login-id: Your login-id
hello.out: Name of your output file
path: The path at which to place the output file

The user can change the values of ppn and gpus; a filled-in example follows the list below.

nodes: number of nodes on which to execute the job. For a simple CUDA application using one GPU card, nodes=1; if an application intends to use multiple GPU cards, the nodes value can be between 2 and 4 on this cluster (fermi1 – fermi4).
gpus: gpus=1 (for fermi1 – fermi4) and gpus=1 to 3 (for fermi5).
ppn: number of processors per node; ppn=1 to 4 (for fermi1 – fermi4) and ppn=1 to 16 (for fermi5).
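As a concrete illustration of these limits, a hypothetical request for all three M2090 cards and all 16 processors on fermi5 could look like this (the values are examples only):

#!/bin/bash
#PBS -N testn
#PBS -l nodes=1:ppn=16:gpus=3:M2090    # maximum resources available on fermi5
cd /path_of_executable
./hello > /localscratch/login-id/hello.out
mv /localscratch/login-id/hello.out /path/hello.out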

Note:
Local scratch space at /localscratch/<login-id> is available for job runtime use. Files older than 10 days in this area will be deleted. Please do not install any software in this area.
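A typical pattern is to create your scratch directory at the start of the job and move results out before the cleanup; a minimal sketch, assuming $USER matches your login-id:

mkdir -p /localscratch/$USER                        # create your scratch area if it does not exist
./hello > /localscratch/$USER/hello.out             # write output to local scratch during the run
mv /localscratch/$USER/hello.out $HOME/hello.out    # move results out before the 10-day cleanup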

CUDA Compilation Path for nvcc:

For C shell users:

Add the following lines to your .cshrc file:

set path=(/usr/local/cuda/bin $path)
set path=(/usr/local/cuda-5.0/bin $path)

Then run the command: source .cshrc

For bash shell users:

Add the following lines to your .bashrc file:

export PATH=/usr/local/cuda/bin:$PATH
export PATH=/usr/local/cuda-5.0/bin:$PATH

Then run the command: source .bashrc
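To confirm that nvcc is picked up from the CUDA installation, you can check:

which nvcc        # should print /usr/local/cuda/bin/nvcc or /usr/local/cuda-5.0/bin/nvcc
nvcc --version    # prints the CUDA compiler release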

To Submit the Job Using Torque:

qsub <submit.sh>

The user can rename submit.sh to any other script name.
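qsub prints the identifier of the newly created job on standard output, so it can be captured for later status checks; a small sketch (script name as in the example above):

JOBID=$(qsub submit.sh)      # qsub prints the new job's id
echo "Submitted job $JOBID"
qstat $JOBID                 # check the status of just this job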

Torque/PBS Commands

1. To check the status of the job

qstat -a

This gives details of the job, such as the job id, the queue through which it is running, etc.; a combined example is shown after this list.

2. To remove a job from the queue

qdel <jobid>

3. To know about the available queues

qstat -q
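For example, to list only your own jobs and then cancel one of them (the job id and server suffix shown are hypothetical):

qstat -u $USER      # show only the jobs belonging to your login-id
qdel 1234.fermi1    # cancel job 1234; the server suffix may differ on this cluster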

If you encounter any problems using Torque, please report them to the SERC helpdesk at helpdesk.serc@iisc.ac.in or contact the System Administrators in Room 109 (SERC).