Load Sharing Facility (LSF)

  • The LIM on each server host monitors its host’s load and exchanges load information with other LIMs. The LIM on one host in the cluster acts as the master, collects information for all hosts and provides that information to the applications.
Jobs may be suspended to prevent hosts from being overloaded. When the host is no longer overloaded, the suspended jobs resume. When a job is suspended by its owner and the owner then resumes it, the job does not start running immediately, to avoid overloading the host. Instead, the job’s state changes from USUSP (user suspended) to SSUSP (system suspended). The job is resumed when the host load levels are within the scheduling thresholds, just as if it had been suspended because of high load.
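The suspension behaviour described above can be pictured as a small state table. The following is a minimal sketch, not LSF's implementation: it models only the transitions mentioned in this paragraph, and the event names (owner_suspends, host_overloaded, owner_resumes, load_within_thresholds) are hypothetical labels chosen for illustration.

```python
from enum import Enum


class JobState(Enum):
    RUN = "RUN"      # job is running
    USUSP = "USUSP"  # suspended by the job owner
    SSUSP = "SSUSP"  # suspended by the system


# (current state, event) -> next state.
# Event names are hypothetical labels, not LSF identifiers.
TRANSITIONS = {
    (JobState.RUN, "owner_suspends"): JobState.USUSP,
    (JobState.RUN, "host_overloaded"): JobState.SSUSP,
    # An owner resume does not run the job at once; it becomes SSUSP.
    (JobState.USUSP, "owner_resumes"): JobState.SSUSP,
    (JobState.SSUSP, "load_within_thresholds"): JobState.RUN,
}


def next_state(state, event):
    """Return the state after `event`; unlisted pairs leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

In this model a user-suspended job reaches RUN only in two steps: the owner resumes it (USUSP to SSUSP), and then the host load falls within the thresholds (SSUSP to RUN).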

For further assistance, please contact helpdesk@serc.iisc.in by E-mail or contact system administrators in SERC#109.

LSF stands for LOAD SHARING FACILITY. LSF manages, monitors, and analyzes the workload for a heterogeneous network of computers and it unites a group of computers into a single system to make better use of the resources on a network. Hosts from various vendors can be integrated into a seamless system.

LSF is based on clusters. A cluster is a group of hosts. The clusters are configured in such a way that LSF uses some of the hosts in the cluster as batch server hosts and some others as client hosts.

At SERC, LSF is installed on the Compaq AlphaServer ES40 systems.

Configuration Information of LSF at SERC

LSF Version : LSF 4.1

lsf-common-compalpes40: Includes four Compaq AlphaServer ES40 systems. To use LSF on these systems, users must log on to the server alphas4 and then submit jobs through LSF.

The paths to add in order to access the LSF binaries and man pages on the Compaq AlphaServer ES40 systems are:

PATH : /usr/lsf4.1/bin

MANPATH : /usr/lsf4.1/mnt/man

Queue Information
Jobs are submitted through queues. The queues configured on the Compaq AlphaServer ES40 systems are:

8hr : Jobs that require 8 hours or less of CPU time can be submitted to this queue.
16hr : Jobs that require 16 hours or less of CPU time can be submitted to this queue.
32hr : Jobs that require 32 hours or less of CPU time can be submitted to this queue.
64hr : Jobs that require 64 hours or less of CPU time can be submitted to this queue.
128hr : Jobs that require 128 hours or less of CPU time can be submitted to this queue.
256hr : Jobs that require 256 hours or less of CPU time can be submitted to this queue.
Unlimited : Jobs that require more than 256 hours of CPU time should be submitted to this queue.
g98_q : All Gaussian jobs must be submitted through this queue.
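The queue list above amounts to a simple lookup: pick the smallest CPU-time queue that covers the job's estimate. The sketch below illustrates that rule. The queue names come from the list above, but the helper names pick_queue and bsub_args are hypothetical and not part of LSF. The helper only builds an argument list for bsub -q and does not run anything.

```python
# CPU-time limit in hours for each queue, smallest first (names from the list above).
CPU_HOUR_QUEUES = [
    (8, "8hr"),
    (16, "16hr"),
    (32, "32hr"),
    (64, "64hr"),
    (128, "128hr"),
    (256, "256hr"),
]


def pick_queue(cpu_hours, gaussian=False):
    """Return the smallest queue that covers the estimated CPU hours."""
    if gaussian:
        return "g98_q"  # Gaussian jobs have their own queue
    for limit, name in CPU_HOUR_QUEUES:
        if cpu_hours <= limit:
            return name
    return "Unlimited"  # more than 256 hours of CPU time


def bsub_args(cpu_hours, command, gaussian=False):
    """Build (but do not execute) a bsub argument list for the chosen queue."""
    return ["bsub", "-q", pick_queue(cpu_hours, gaussian), command]
```

For a job estimated at 20 CPU hours, bsub_args(20, "./a.out") yields the arguments for the 32hr queue. The list could be passed to subprocess.run on a host where LSF is installed.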

LSF Tools
LSF provides a set of tools for users to get information about the system.

  • lsid – gives the version of LSF, the name of the load sharing cluster and the current master host
  • lsinfo – displays:
    • Available resource names in the system
    • Available host types
    • Available host models
  • lshosts – displays configuration information about hosts
  • lsload – prints out current load information
    • lsload -l – shows the load thresholds
  • lsmon – displays load information, updated continuously
  • xlsmon – provides a Motif graphic display of host status and load levels in your LSF cluster

Some Basic LSF Commands
    • To get information about the hosts that are batch server hosts

bhosts

    • To submit a job

bsub "<job to be submitted>"

    • To submit a job in a specific queue

bsub -q <name of the queue> "<job to be submitted>"

    • To get details of the queues configured

bqueues

    • To get the status of the jobs submitted

bjobs

  • To get the status of the jobs submitted and jobs that finished recently

bjobs -a

  • To kill a job

bkill <JOBID>
Jobs can also be submitted using the GUI application “xbsub”.
The other GUI applications are:

  • xlsbatch: used to monitor host, job and queue status. It can also be used to control your jobs.
  • xlsmon: displays host status, load levels, load history and LSF cluster configuration information.
User Manuals
    • Man pages (accessed with man command) are available for all commands on the respective systems.
    • Online help is available through the Help menu for the xlsbatch, xmod, xbsub, xlsadmin applications.
Detailed Information on LSF

LSF stands for LOAD SHARING FACILITY. LSF requires a UNIX operating system with Internet Protocol (IP) networking. It is a general purpose distributed computing system. LSF is a suite of workload management products. LSF manages, monitors, and analyses the workload for a heterogeneous network of computers. It unites a group of computers into a single system to make better use of the resources on a network.

Load sharing in LSF is based on clusters. A cluster is a group of hosts that provide shared computing resources. A cluster can contain a mixture of host types. Each cluster has at least one LSF administrator who has permission to change the LSF configuration and perform other maintenance functions. LSF allows the user to use these hosts transparently, so applications that run on only one host type are available to the entire cluster.

LSF is designed for networks in which all hosts share file systems. It can also be used in networks without file sharing, but with reduced fault tolerance.

LSF can automatically select hosts in a heterogeneous environment based on the current load conditions and the resource requirements of the applications.
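As an illustration of selecting a host from load conditions and resource requirements, the sketch below filters hosts by a job's requirements and then picks the least loaded one. It is a simplified model under stated assumptions, not LSF's actual placement algorithm. The Host fields and the select_host function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class Host:
    # Hypothetical record combining static and dynamic host information.
    name: str
    host_type: str
    cpu_load: float   # current CPU load (lower is better)
    free_mem_mb: int  # available real memory


def select_host(hosts, required_type, min_free_mem_mb):
    """Return the least-loaded host meeting the requirements, or None."""
    candidates = [
        h for h in hosts
        if h.host_type == required_type and h.free_mem_mb >= min_free_mem_mb
    ]
    if not candidates:
        return None  # no suitable host; the job would have to wait
    return min(candidates, key=lambda h: h.cpu_load)
```

Filtering first and ranking second keeps the two concerns separate: resource requirements decide which hosts are eligible, and the current load decides which eligible host is preferred.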

LSF can run batch jobs automatically when required resources become available, or when systems are lightly loaded. LSF maintains full control over the jobs, including the ability to suspend and resume the jobs based on load conditions.

LSF supports sequential and parallel applications running as batch jobs. New distributed applications can be developed using a C programming library and a toolkit of programs for writing shell scripts.

LSF treats each UNIX process queue as a separate machine. A multiprocessor computer with a single process queue is considered a single machine. A box full of processors that each have their own process queues is treated as a group of separate machines.

LSF allows fair share policies to be defined at the queue level, so that different queues may have different sharing policies. The policy applies to all hosts used by the queue. Fair share scheduling is an alternative to the default first-come, first-served scheduling. It divides the processing power of the LSF cluster among users and groups so that all jobs in a queue get access to resources.
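To show how fair share differs from first-come, first-served ordering, here is a minimal sketch of one generic fair share rule: jobs whose owners have consumed the least CPU time relative to their share go first. This is an assumption made for illustration, since the text does not give LSF's actual priority formula. The function fair_share_order and its arguments are hypothetical names.

```python
def fair_share_order(pending, shares, used_cpu):
    """Order pending jobs by each owner's usage-to-share ratio.

    pending  : list of (job_id, user) tuples in submission order
    shares   : dict mapping user -> assigned share (positive number)
    used_cpu : dict mapping user -> CPU time already consumed
    """
    def usage_ratio(user):
        return used_cpu.get(user, 0.0) / shares[user]

    # sorted() is stable, so jobs of users with equal ratios (including
    # several jobs of the same user) keep their first-come order.
    return sorted(pending, key=lambda job: usage_ratio(job[1]))
```

With equal shares, a user who has already consumed a lot of CPU time is placed behind a user who has consumed little, even if the heavier user submitted first.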

Most applications can use the load sharing utilities to access LSF. They do not communicate directly with LSF and do not need to be modified to work with LSF. Nearly all UNIX commands and third party applications can be load shared using LSF utilities.

With LSF, users can submit their jobs and leave it to the system to find the best host on which to run their programs. Users are no longer limited to the resources on their own workstations. They only need to learn a few simple commands to have the resources of the entire network within their reach, without rewriting or changing their programs. Users can transparently run software that is not available on their local hosts. For example, a CAD tool available on an HP host can be run by a user on a SUN workstation without any difficulty. Users can also write their own load sharing applications, both as shell scripts using the lstools programs and as compiled programs using the LSF application programming libraries.

LSF provides comprehensive resource and load information about all hosts in the network.

Resource Information:

  • Number of processors on each host
  • Total physical memory available to user jobs
  • Host type
  • Host model
  • Relative speed of each host
  • Special resources available on each host
  • The time windows when a host is available for load sharing

Dynamic Load Information:

  • CPU load
  • Available real memory
  • Available swap memory
  • Paging activity
  • I/O activity
  • Space in the /tmp directory
  • Arbitrary site defined load indices

LSF divides jobs into two kinds – interactive and batch.

  • Interactive jobs are not supported here on any system.
  • Batch processing can provide more efficient execution of resource-intensive jobs. Batch jobs are kept on a list of jobs called a queue. The batch system runs jobs from the queue when the appropriate resources are available. Because every job gets the resources it needs, resource-intensive jobs can be processed more efficiently. LSF also gives the batch queues access to all the hosts in your network, so your job can run as soon as any suitable host becomes available. You do not need to search the network for a suitable, idle host.
LSF has a number of features to support fault tolerance. It is designed to continue operating even if some of the hosts in the cluster are unavailable. LSF services are available as long as any host in the cluster is up. When a host crashes, all jobs running on that host are lost, but no other jobs are affected. When the host comes up again, the jobs that were running on it are assumed to have exited and an email is sent to the user. Pending jobs remain as they are and are scheduled as hosts become available. Important jobs can be submitted to lsbatch with an option to restart automatically if the job is lost because of host failure.
A server host is a host that runs load-shared jobs. The Load Information Manager (LIM) runs on every server host. The LIM interfaces directly with the underlying operating system and provides users with a uniform, host-independent environment.