The Theory, Performance, and Limitations of Generic Recovery

Dave Lowell
Compaq Western Research Lab

Dealing with failures of the hardware, OS, or application is unpleasant for both the user and the programmer. As a result, the lure of operating systems that transparently handle all failures is quite powerful. In this talk, I will explore the OS abstraction of "failure transparency," in which the OS generates the illusion of failure-free operation. To provide this illusion, the operating system must handle all hardware, software, and application failures to keep them from affecting what the user sees. Furthermore, the OS must do so without help from the programmer and without unduly slowing failure-free operation. During my exploration of failure transparency, I will describe a theory of recovery that yields the two fundamental invariants at work behind transparent, application-generic recovery. The "Save-work" invariant governs when an application must preserve its work in order to conceal failures from users. For failures that affect application state, the "Lose-work" invariant governs how much work the application must lose to avoid failing again during recovery. I will present experiments showing that real applications can achieve failure transparency in the presence of stop failures with overhead of 0-12%. I will also describe less encouraging experiments that suggest applications may violate one invariant while upholding the other in 90% of application faults and 3-15% of OS faults, rendering failure transparency impossible in these cases.