The Theory, Performance, and Limitations of Generic Recovery
Dave Lowell
Compaq Western Research Lab
Dealing with failures of the hardware, OS, or application is unpleasant
for both users and programmers. As a result, the lure of operating
systems that transparently handle all failures is quite powerful. In this
talk, I will explore the OS abstraction of "failure transparency," in which
the OS generates the illusion of failure-free operation. To provide this
illusion, the operating system must handle all hardware, software, and
application failures to keep them from affecting what the user sees.
Furthermore, the OS must do so without help from the programmer and
without unduly slowing failure-free operation. During my exploration of
failure transparency, I will describe a theory of recovery that yields the
two fundamental invariants at work behind transparent, application-generic
recovery. The "Save-work" invariant governs when an application must
preserve its work in order to conceal failures from users. For failures
that affect application state, the "Lose-work" invariant governs how much
work the application must lose to avoid failing again during recovery. I
will present experiments showing that real applications can achieve failure
transparency in the presence of stop failures with an overhead of 0-12%. I
will also describe less encouraging experiments, which suggest that
applications may violate one invariant while upholding the other in 90% of
application faults and 3-15% of OS faults, making failure transparency
impossible in those cases.