Automatic generation of fault-loads for testing recovery code

Archana Ganapathi

On a fault-injection platform, it is beneficial to generate fault-loads automatically. A call-graph can be constructed from static analysis of the recovery code. Every function that belongs to the recovery code appears as a node in this graph, and a directed edge runs from each caller to its callee. The caller is the parent, at the tail of the arrow, and the callee is its child, at the head. A parent can have several children. The graph is expected to be acyclic, since recursion, both direct and indirect, is expected to be absent from recovery code.

For example, consider the following fragment of recovery code:

    Recover() {
        if (error_case_1)
            Recover_using_handler1()
        else
            Recover_using_handler2()
    }

    Recover_using_handler1() {
        Recover_resource_X()
    }

The call-graph is as follows:

    Recover --> Recover_using_handler1 --> Recover_resource_X
       |
       L---> Recover_using_handler2

Fault injection proceeds from a leaf node (a node that has no children) towards the root (a node that has no parent). In this leaf-to-root traversal, fault containment for a parent is the composition of fault containment for its children. At each node, we use static analysis to identify the critical data that is read or updated by the function (ideally this set is the MOD (modified) and USE (used) data, in interprocedural-analysis parlance [1]). We also track ownership by considering constraints constructed using static program analysis, as in [4]. Interprocedural data-flow analysis on the call-graph then identifies potential corruption points.
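
The call-graph and the leaf-to-root traversal described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph is written out by hand from the example fragment rather than extracted by static analysis, and the names CALL_GRAPH, leaves, roots and leaf_to_root_order are hypothetical helpers, not part of any tool cited here. The traversal also rejects recursive call-graphs, matching the expectation that the graph is acyclic.

```python
from typing import Dict, List

# Caller -> callees, transcribed by hand from the example fragment.
# (Assumption: a real tool would derive this from static analysis.)
CALL_GRAPH: Dict[str, List[str]] = {
    "Recover": ["Recover_using_handler1", "Recover_using_handler2"],
    "Recover_using_handler1": ["Recover_resource_X"],
    "Recover_using_handler2": [],
    "Recover_resource_X": [],
}


def leaves(graph: Dict[str, List[str]]) -> List[str]:
    """Nodes with no children (functions that call nothing)."""
    return [node for node, kids in graph.items() if not kids]


def roots(graph: Dict[str, List[str]]) -> List[str]:
    """Nodes with no parent (functions that nothing calls)."""
    called = {child for kids in graph.values() for child in kids}
    return [node for node in graph if node not in called]


def leaf_to_root_order(graph: Dict[str, List[str]]) -> List[str]:
    """Post-order DFS: every callee is listed before its caller.

    Raises ValueError if the graph contains direct or indirect recursion.
    """
    order: List[str] = []
    state: Dict[str, str] = {}  # node -> "visiting" or "done"

    def visit(node: str) -> None:
        if state.get(node) == "done":
            return
        if state.get(node) == "visiting":
            raise ValueError(f"recursion detected at {node}")
        state[node] = "visiting"
        for child in graph[node]:
            visit(child)
        state[node] = "done"
        order.append(node)

    # Start from every node so that a cycle with no root is still detected.
    for node in graph:
        visit(node)
    return order


if __name__ == "__main__":
    print("leaves:", leaves(CALL_GRAPH))
    print("roots:", roots(CALL_GRAPH))
    print("injection order:", leaf_to_root_order(CALL_GRAPH))
```

For the example graph, the injection order is Recover_resource_X, Recover_using_handler1, Recover_using_handler2, Recover, so each function is exercised only after all of its callees.
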
Critical data is then corrupted at various levels of abstraction using random bit flips, op-code corruption, false positives (a function returns success although it failed to perform its action completely or correctly), false negatives (a function returns failure although it completed successfully), and simulated resource exhaustion (e.g., insufficient memory to complete the action) [5, 3].

We construct a fault library transitively: if the call chain is A->B->C, then the library of faults for A contains the library of faults for its child B as well as the library of faults for the execution paths in A that do not pass through B; similar library entries are constructed for B and C. Exhaustive fault-load generation for A considers the power set of A's immediate descendants in the call-graph, to accommodate simultaneous or cascading child-node failures. A does not consider its grand-descendants' failures, because such events are reflected in A's immediate descendant: either A's child handles its children's failures completely, or it fails in the process.

As in FIG [2], functions in the fault-injection library intercept the corresponding function calls from the recovery code. The library members are invoked exhaustively for thorough testing. These tests are performed when the system is not operational, to avoid clobbering real-time recovery efforts.

A major challenge in testing recovery code is that it requires recreating the "unusual states" that result from system failures. After injecting and observing a fault, we ensure that we recreate the exact unusual state under which the recovery code was invoked before injecting another fault. Each fault-injection effort must be isolated from the others (they adhere to the superposition/isolation principle), i.e., one fault-injection task leaves no side effects that are visible while another is being injected (unless simultaneous or cascading faults are being modeled).

References:

[1] Aho, A. V., R. Sethi and J. D. Ullman. Compilers: Principles, Techniques, and Tools.
[2] Broadwell, P., N. Sastry and J. Traupman. FIG: A Prototype Tool for Online Verification of Recovery Mechanisms. Workshop on Self-Healing, Adaptive and self-MANaged Systems (SHAMAN), New York, NY, June 2002.
[3] Hsueh, M.-C., T. K. Tsai and R. K. Iyer. Fault Injection Techniques and Tools. IEEE Computer, April 1997.
[4] Oplinger, J. and M. S. Lam. Enhancing Software Reliability using Speculative Threads. Proc. Conf. on Architectural Support for Programming Languages & Operating Systems, October 2002.
[5] Stanford CS444a lecture notes on Fault Injection. http://cs444a.stanford.edu/slides/Injection.pdf