After 15 years of successfully improving cost-performance, it's time for
new challenges for the systems research community.
As a result of the focus on cost-performance, the fabled five 9s of
availability looks to be much easier to achieve on billboards than in computers, and managing systems with
state can be ten times the cost of the equipment. In a PostPC Era of wireless gadgets using services on the
Internet, one new challenge is building services that really are dependable and much less expensive to maintain.
Traditional Fault Tolerant Computing concentrates on tolerating hardware
and operating system faults, ignoring faults by human operators and even applications. Recovery
Oriented Computing (ROC) aims at improving Mean Time To Recover to both lower the cost of management and
improve the availability of the whole system, including the people who operate it. We look to civil
engineering and studies of disasters to inspire principles for ROC design.
This talk outlines our tentative research agenda and proposed principles
of ROC design, plus some concrete results in the area of availability benchmarking and a first-pass
design of the hardware of a server built along these lines, called ROC-I.