Human Error Support for Complex Internet Systems

By: Archana Ganapathi

 

As Spiderman’s much-loved Uncle Ben said, “With great power comes great responsibility” [4]. The human operator has enormous power and control, and with them the ability to wreak havoc; he must be meticulous when dealing with complex Internet systems [2]. In addition to making costly, usually accidental, mistakes, the operator has the opportunity to abuse his power with malicious intent. In contrast, the software used by Internet services is less likely to be untrustworthy. Moreover, once a software bug is discovered, a patch can be administered using dynamic feedback. Unfortunately, no similar patch exists for operator errors. At best, one can maintain a record of “problem-causing operator actions” that encourages future operators to review such pitfalls a priori.

 

Fieldwork analyzing failures at three different Internet services reveals that operator-induced component failures are far more likely to lead to system failure than component failures due to hardware or software [6]. Human error causes a significant number of user-visible failures and is also the hardest kind of failure to mask. In the analysis of time-to-repair, human error again stands out as a significant contributor to repair latency. Many hardware and software errors respond quickly to simple solutions such as a reboot, while human errors tend to produce deeper, more persistent problems that require more elaborate recovery activities.

 

The operator-induced service failures were primarily of two types: those resulting from process failures, where the operator did not follow correct procedures, and those stemming from system mis-configurations. The former requires better training; the latter requires better interfaces. Perhaps automatic usability analysis can provide additional insights [5]. Among operator errors, mis-configurations are very common. On occasion, mis-configurations combined with software bugs or broken hardware to produce more severe service failures [3]. When web services are composed, each constituent service must account for its internal configuration, including the implications of its dependence on external services. It is important that operators understand the configuration, dependencies, and state of a service in order to realize the implications of performing a potentially destructive operation.
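
To make the dependency point concrete, the following is a minimal sketch of how a tool might show an operator which services would be affected before a destructive operation is carried out. The service names, the DEPENDS_ON map, and the impacted_by function are all hypothetical and chosen only for illustration; they are not drawn from the services studied in [3] or [6].

```python
from collections import deque

# Hypothetical dependency map (an assumption for illustration):
# each service lists the services it depends on directly.
DEPENDS_ON = {
    "frontend": ["auth", "catalog"],
    "auth": ["user-db"],
    "catalog": ["inventory-db"],
    "user-db": [],
    "inventory-db": [],
}


def impacted_by(target, depends_on):
    """Return, sorted, every service that directly or transitively depends on target."""
    # Invert the map so each service points at the services that depend on it.
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk upward from the target through its dependents.
    seen = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(seen)


# Services an operator would disrupt by taking "user-db" offline.
print(impacted_by("user-db", DEPENDS_ON))  # ['auth', 'frontend']
```

Presenting such a list before the operation proceeds is one possible way to expose the implications of a change; a real service would need to derive the dependency map from its actual configuration.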

 

Human-error research reveals an irony of automation: automation does not cure human error [1, 7]. Automation, by design, attempts to reduce human intervention. However, it addresses relatively easy tasks that are skill- and rule-based, leaving humans to solve the complex and rare knowledge-based tasks that fall beyond the scope of automation. Furthermore, it reduces opportunities for operators to gain practical experience in building mental production rules and troubleshooting models. Lack of system visibility, increased system complexity, and decreased opportunities for interaction make operators more fallible, especially when stressed during service recovery.

 

Future research should address the following:

Instructional (and visual) aids for configuration that enable operators to understand the correct configuration state; this state includes the initial configuration as well as explicit representations of non-erroneous state transitions. Interactive tools are necessary to help human operators avoid errors. Visualization of configuration changes as well as dynamic state transitions serves as an introspective step in this direction.
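
As an illustration of what an explicit representation of non-erroneous state transitions could look like, the sketch below checks a planned sequence of configuration changes against a table of known-good transitions. The state names, the ALLOWED table, and the check_plan function are assumptions made for this example, not part of any existing tool.

```python
# Hypothetical configuration states and the transitions considered
# non-erroneous between them (assumptions for illustration only).
ALLOWED = {
    "initial": {"staged"},
    "staged": {"active", "initial"},
    "active": {"draining"},
    "draining": {"initial"},
}


def check_plan(start, steps, allowed):
    """Walk a planned sequence of states; return (ok, message)."""
    state = start
    for number, next_state in enumerate(steps, 1):
        # Reject the plan at the first transition that is not known to be good.
        if next_state not in allowed.get(state, set()):
            return False, f"step {number}: {state} -> {next_state} is not a known-good transition"
        state = next_state
    return True, f"plan ends in state {state}"


print(check_plan("initial", ["staged", "active"], ALLOWED))
print(check_plan("initial", ["active"], ALLOWED))
```

An interactive aid could run a check of this kind before the operator commits a change and display the offending step; the same table could also drive a visualization of the permitted paths.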

 

References

[1]           Brown, A. and Patterson, D. A.

                To Err is Human,

                First EASY Workshop, Göteborg, Sweden, July 1, 2001.

 

[2]           Brown, A. and Patterson, D. A.

                Undo for Operators: Building an Undoable E-mail Store,

                Proc. USENIX Annual Technical Conference, June 2003.

 

[3]           Ganapathi, A.

                Failure Analysis of Internet Services,

                UC Berkeley Technical Report UCB//CSD-03-1255, December 2002.

 

[4]          http://www.hollywood.com/movies/detail/movie/370277

 

[5]           Ivory, M. and Hearst, M.

                The State of the Art in Automating Usability Evaluation,

                ACM Computing Surveys, 33(4):470–516, December 2001.

 

[6]           Oppenheimer, D., Ganapathi, A. and Patterson, D.

                Why do Internet services fail, and what can be done about it?

                4th USENIX Symposium on Internet Technologies & Systems, Seattle, Washington, March 2003.

 

[7]           Reason, J. T.

                Human Error,

                Cambridge University Press, New York, 1990.