Self-Recovering Software

Archana Ganapathi

Automatic configuration management is a key property that software must possess in order to self-recover. The software must be aware of its "working configuration sets" and, in particular, of dynamic configuration changes that lead to erroneous states; periodic prophylactic configuration checks help ensure correct system behavior. Sometimes communication with peers is necessary to monitor environmental configurations and avoid incompatibility.

Upon rollout of new software, a learning period is in effect while the prior version remains functional. This period is a good opportunity to monitor the functional properties of peers and to note valid and invalid configurations for the system and its constituent components. Once all configuration-relevant peer data has been obtained, the new software replaces its earlier counterpart. Updated configuration information and additional constraints, if any, must be communicated to other system components. This process ensures that dependencies are explicitly communicated to all peer software.

What happens when system software components have conflicting configuration requirements? We perform a copy-on-write [2]. Each software component can maintain its own copy of the configuration instead of relying on the "community copy". This shadow copy prevents constant ping-pong during reconfigurations between conflicting software. If software A updates its configuration set and software B finds the update incompatible, B sends a message informing A of the incompatibility; consequently, A creates its own copy. B's version remains identical to the "community copy", preserving the functionality of the rest of the system's community. If many other nodes later express the need to upgrade to A's version of a configuration, the "community copy" is updated and B creates its own local copy.

Each node incorporates a monitoring tool that considers event flow within the system.
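As an aside on the copy-on-write scheme for conflicting configuration requirements described above, the following is a minimal sketch rather than the author's implementation. The names Community, Component, accepts, propose_update and promote are hypothetical, and the "tls" setting is an invented example. It assumes each component can judge locally whether a candidate configuration is compatible.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Config = Dict[str, str]


class Community:
    """Holds the shared "community copy" of the configuration."""

    def __init__(self, config: Config) -> None:
        self.config: Config = dict(config)


@dataclass
class Component:
    name: str
    community: Community
    # Hypothetical compatibility check: True if this component can run
    # with the given configuration.
    accepts: Callable[[Config], bool]
    local: Optional[Config] = None  # private shadow copy, if any

    def effective_config(self) -> Config:
        """Use the shadow copy if one exists, else the community copy."""
        return self.local if self.local is not None else self.community.config

    def fork(self, changes: Config) -> None:
        """Copy on write: the changes go to a private copy only."""
        self.local = {**self.effective_config(), **changes}


def propose_update(author: Component, peers: List[Component],
                   changes: Config) -> bool:
    """Update the community copy unless a peer reports an incompatibility.

    On an objection the author forks a shadow copy and the community copy
    is left untouched. Returns True if the community copy was updated.
    """
    candidate = {**author.community.config, **changes}
    if all(peer.accepts(candidate) for peer in peers):
        author.community.config = candidate
        return True
    author.fork(changes)
    return False


def promote(author: Component, objectors: List[Component]) -> None:
    """Adopt the author's shadow copy as the community copy.

    Objectors first take a private copy of the old community version.
    """
    if author.local is None:
        return
    old = dict(author.community.config)
    for peer in objectors:
        if peer.local is None:
            peer.local = dict(old)
    author.community.config = dict(author.local)
    author.local = None


community = Community({"tls": "1.0"})
a = Component("A", community, accepts=lambda cfg: True)
b = Component("B", community, accepts=lambda cfg: cfg["tls"] == "1.0")

propose_update(a, [b], {"tls": "1.2"})  # B objects, so A forks
promote(a, [b])                         # later, the community follows A
```

After the first call only A sees the new setting. After promote, the community copy carries A's version and B keeps the old one privately, so neither component has to overwrite the other's configuration repeatedly.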
Events are messages for communication between various system components. For example, messages include exceptions, transactions, and remote procedure calls [4]. The monitor checks rules that specify legal and illegal event sequences and matches patterns that identify erroneous configurations in order to perform the corresponding recovery action. For example, consider the following rule:

    EventA followed_by (EventB && !EventC) ==> Call ExceptionHandlerX(S.a, S.b)

Upon matching the pattern on the left-hand side, ExceptionHandlerX is invoked for error recovery. When a sequence of events is identified as dangerous, the monitor issues a warning to all relevant nodes in the system. If the sequence has not appeared before, the tool forces an undo to the checkpoint where the faulty sequence can be corrected.

When a misconfiguration or failure occurs, each node that notices the failure records the flawed sequence of events that led to the erroneous configuration. The node invokes an undo operation [1] to revert system state to a pre-failure state. Pertinent information about this event sequence is communicated to all peers in the system. During the repair/replay stage, the undo function remembers this faulty sequence to ensure that it never recurs. Upon post-mortem analysis of the error sequence, the rules in the monitoring tool are updated to include the new event sequence, and a plausible recovery action is associated with it. Statistical machine learning techniques are an indispensable tool for the post-mortem analysis as well as for the initial learning phase of the software.

References:

[1] A. Brown and D. A. Patterson. Rewind, Repair, Replay: Three R's to Dependability. 10th ACM SIGOPS European Workshop, Saint-Emilion, France, September 2002.
[2] Donald D. Chamberlin et al. A History and Evaluation of System R. Communications of the ACM, Vol. 24, No. 10, October 1981.
[3] Dawson Engler, Benjamin Chelf, Andy Chou, and Seth Hallem. Checking system rules using system-specific, programmer-written compiler extensions. In Proc. 4th USENIX Symp. on Operating Systems Design and Implementation, San Diego, CA, Oct 2000.
[4] Johan Moe and David A. Carr. Understanding Distributed Systems via Execution Trace Data. In Proc. Ninth International Workshop on Program Comprehension, Toronto, Canada, May 12-13, 2001.
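
Illustrative sketch of the monitoring rule. The rule "EventA followed_by (EventB && !EventC) ==> Call ExceptionHandlerX(S.a, S.b)" from the monitoring discussion can be read in more than one way. The sketch below assumes one reading: after EventA is seen, the rule fires on the next EventB unless an EventC arrives first. The class FollowedByRule, the function exception_handler_x and the dictionary standing in for the state S are hypothetical names introduced only for illustration.

```python
from typing import Callable, List, Tuple


class FollowedByRule:
    """Fires when `first` is later followed by `then` with no `unless` between.

    This is an assumed reading of
        EventA followed_by (EventB && !EventC) ==> Call ExceptionHandlerX(...)
    """

    def __init__(self, first: str, then: str, unless: str,
                 handler: Callable[[], None]) -> None:
        self.first = first
        self.then = then
        self.unless = unless
        self.handler = handler
        self.armed = False  # True once `first` has been seen

    def observe(self, event: str) -> bool:
        """Feed one event; return True if the rule fired on it."""
        if event == self.first:
            self.armed = True
        elif self.armed and event == self.unless:
            self.armed = False  # the cancelling event clears the pending match
        elif self.armed and event == self.then:
            self.armed = False
            self.handler()  # run the recovery action
            return True
        return False


# Hypothetical system state S and recovery handler.
state = {"a": 1, "b": 2}
recoveries: List[Tuple[int, int]] = []


def exception_handler_x(a: int, b: int) -> None:
    recoveries.append((a, b))


rule = FollowedByRule(
    "EventA", "EventB", "EventC",
    handler=lambda: exception_handler_x(state["a"], state["b"]),
)

for event in ["EventA", "EventD", "EventB"]:
    rule.observe(event)  # the handler runs on the final EventB
```

Unrelated events such as EventD leave the rule armed, while an intervening EventC disarms it until the next EventA. A real monitor would hold many such rules and would also issue warnings and trigger undo, which this sketch leaves out.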