Semantic Fault Localization

As computers become pervasive, dependence on computing systems grows with them. More and more tasks are being automated and deployed on systems that manage large enterprise business processes. As organizations grow rapidly and spread across geographical locations, an enterprise setup that enables day-to-day collaboration among these distributed entities is becoming a pressing need. Managing large systems and the business process environments they support manually is an enormous task, all the more so when it comes to identifying faults at runtime and reacting to them so that services remain uninterrupted. The other dimension of this scenario is identifying the affected services, sessions, and requests and rebuilding or restarting them with the aim of making failures transparent. The key to this is building a system or service that adapts to faults in the underlying setup.

To build fault adaptive systems, we need mechanisms that deal with faults proactively. This requires:

A mechanism to predict and identify faults occurring at runtime.
A mechanism to expose the predicted or identified faults in a uniform way.
A framework that can use the information above to enable building of fault adaptive systems.

In a running system, the symptoms or signatures of faults are reflected in the error event messages generated by the system. These messages can be very numerous and, in distributed environments, scattered across many locations. To make runtime decisions for localizing faults, it is first necessary to identify the set of error event messages that correspond to the fault. We believe that a fault is identified when the respective component is exercised in a specific way that is related to its designed function. Inter-component interactions are likewise governed by the way a system is designed for its intended functionality. While a system is in use, the usage can trigger detection of faulty behavior in the system or its components, which is exhibited in the form of groups of related error event messages. It is these groups that we need to identify in order to localize faults, and for this we plan to exploit the functional semantics among the various interacting components.
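The grouping idea can be illustrated with a minimal sketch. It is not the method developed in this work: the `ErrorEvent` record, the `interactions` list of component pairs, and the `window` time threshold are all hypothetical names introduced here, and using a simple time window as the notion of "related" is an assumption made only for illustration. The sketch treats two error events as related when they occur close together in time and come from the same component or from two components that are declared to interact.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    """Hypothetical error event record (not a format defined by the source)."""
    timestamp: float   # seconds
    component: str     # component that emitted the message
    message: str


def group_related_events(events, interactions, window=5.0):
    """Group events that are close in time and come from interacting components.

    `interactions` is an assumed list of (component, component) pairs that
    stands in for the functional semantics of the system design.
    """
    adjacent = defaultdict(set)
    for a, b in interactions:
        adjacent[a].add(b)
        adjacent[b].add(a)

    ordered = sorted(events, key=lambda e: e.timestamp)
    parent = list(range(len(ordered)))  # union-find forest over event indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, first in enumerate(ordered):
        for j in range(i + 1, len(ordered)):
            second = ordered[j]
            if second.timestamp - first.timestamp > window:
                break  # later events are even further away in time
            related = (first.component == second.component
                       or second.component in adjacent[first.component])
            if related:
                parent[find(j)] = find(i)

    groups = defaultdict(list)
    for i, event in enumerate(ordered):
        groups[find(i)].append(event)
    return sorted(groups.values(), key=lambda g: g[0].timestamp)


# Illustrative data only: component names and messages are made up.
events = [
    ErrorEvent(0.0, "disk", "I/O error"),
    ErrorEvent(1.0, "fs", "write failed"),
    ErrorEvent(1.5, "net", "link flap"),
    ErrorEvent(2.0, "db", "commit aborted"),
    ErrorEvent(100.0, "disk", "I/O error"),
]
interactions = [("disk", "fs"), ("fs", "db")]
groups = group_related_events(events, interactions)
for group in groups:
    print([e.component for e in group])
```

In this example the disk, file system, and database messages fall into one group because they are linked through declared interactions, while the network message and the much later disk message each form their own group. A candidate fault location could then be sought within each group rather than across the whole message stream.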

In this effort we propose to explore faults occurring in the hardware of a system and in its various software layers, from the operating system through middleware to the application layer, in a distributed environment. Our approach to localizing faults exploits knowledge of the functional semantics of the various interacting components. At each layer, we identify the interacting components and the functional semantics of the associated interactions, and use this information in fault localization.

Current Research:

Hardware Fault Localization in a single system.
Software Fault Localization with respect to an application within a single system.
Software Fault Localization with respect to an application running on distributed systems.
Framework for building fault adaptive systems.

Publications:

K C Nainwal, J Lakshmi, S K Nandy, Ranjani Narayan and K Varadarajan, “A framework for QoS Adaptive Grid Meta Scheduling”, Proceedings of HADIS 2005 (First International Workshop on High Availability of Distributed Systems), Copenhagen, Denmark, August 2005.

J Lakshmi, S K Nandy, Ranjani Narayan and Keshavan Varadarajan, “Framework for Enabling Highly Available Distributed Applications for Utility Computing”, Parallel and Distributed Processing and Applications: 4th International Symposium, ISPA 2006, Sorrento, Italy, December 4-6, 2006, Proceedings, pp. 549-560.

Technical Reports:

J Lakshmi, S K Nandy, Ranjani Narayan and Keshavan Varadarajan, “Aids to Pro-active Management of Distributed Resources through Dynamic Fault-Localization and Availability Prognosis”, FaultLocalization-TR01-CADlab, May 2006 (PDF).