Human Error Support for Complex Internet Systems
By: Archana Ganapathi
As Spider-Man’s much-loved
Uncle Ben said, “With great power comes great responsibility” [4]. The human
operator has enormous power, control and the ability to wreak havoc; he must be
very meticulous when dealing with complex Internet systems [2]. In addition to
making costly mistakes, usually accidental, the operator has the opportunity to
abuse his power with malicious intent. In contrast, the software used by
Internet services is relatively less likely to be untrustworthy.
Moreover, a software bug can be patched once it is
discovered, using dynamic feedback. Unfortunately, no comparable patch
exists for operator errors. At best, one can maintain a record of
“problem-causing operator actions” and encourage future operators to review
such pitfalls a priori.
Fieldwork analyzing failures
at three different Internet services reveals that manually induced component
failures are far more likely to lead to system failure than component failures
due to hardware or software [6]. Human
intervention causes a significant number of user-visible failures and is also
the hardest failure to mask. In the analysis of time-to-repair, human
error again stands out as a significant contributor to repair latency because
many hardware and software errors respond quickly to simple solutions like
reboot, while human errors tend to produce deeper, more persistent problems,
requiring more elaborate recovery activities.
The operator-induced service
failures were primarily of two types: those resulting from process failures,
where the operator did not follow correct procedures, and those stemming from
system misconfigurations. The former requires better
training; the latter requires better interfaces. Perhaps automatic usability
analysis can provide additional insights [5]. Among operator errors,
misconfigurations are very common. On occasion, misconfigurations combined
with software bugs or broken hardware to produce higher-impact service
failures [3]. Composing web services makes it imperative for each constituent
service to understand its internal configuration, including the implications
of its dependencies on external services.
It is important that operators understand the configuration, dependencies, and
state of a service, to realize the implications of performing a potentially
destructive operation.
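As a minimal sketch of how a tool might surface these implications before a destructive operation, the following walks a dependency map to list every service that transitively depends on the one about to be stopped. The service names, the `DEPENDS_ON` map, and both function names are hypothetical and chosen only for illustration; a real service would have to derive its dependency map from its own configuration.

```python
from collections import deque

# Hypothetical dependency map: each service lists the services it depends on.
DEPENDS_ON = {
    "web-frontend": ["auth", "catalog"],
    "auth": ["user-db"],
    "catalog": ["catalog-db"],
    "user-db": [],
    "catalog-db": [],
}


def affected_by(target, depends_on):
    """Return the set of services that transitively depend on `target`."""
    # Invert the map so we can walk from a service to its dependents.
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk over dependents, starting from the target.
    seen = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def describe_impact(target, depends_on):
    """Build the warning an operator would see before stopping `target`."""
    impact = sorted(affected_by(target, depends_on))
    if impact:
        return f"Stopping {target} would also affect: {', '.join(impact)}"
    return f"Stopping {target} affects no other known service"


if __name__ == "__main__":
    print(describe_impact("user-db", DEPENDS_ON))
```

Showing the operator this list before the action runs is one way to make the dependencies visible at the moment they matter; it does not by itself prevent the error.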
Human-error research reveals an irony of automation: automation does not cure human error [1, 7]. Automation, by design, attempts to reduce human intervention. However, it addresses relatively easy tasks that are skill- and rule-based, leaving humans to solve the complex and rare knowledge-based tasks that fall beyond its scope. Furthermore, it reduces opportunities for operators to gain practical experience in building mental production rules and troubleshooting models. Lack of system visibility, increased system complexity, and fewer opportunities for interaction make operators more fallible, especially when stressed during service recovery.
Future research should
address the following:
Instructional
(and visual) aids for configurations that enable operators to understand the
correct configuration state; this state includes the initial configuration as
well as explicit representations of non-erroneous state transitions.
Interactive tools are necessary to help human operators avoid errors.
Visualization of configuration changes and of dynamic state transitions
is an introspective step in this direction.
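One way to picture an explicit representation of non-erroneous state transitions is a small table of allowed moves that a planned sequence of configuration changes is checked against before it is applied. The sketch below assumes a hypothetical set of states (`offline`, `standby`, `serving`, `draining`) and a hypothetical `check_plan` helper; neither comes from an existing tool.

```python
# Hypothetical configuration states and the transitions considered non-erroneous.
INITIAL_STATE = "offline"
ALLOWED = {
    "offline": {"standby"},
    "standby": {"serving", "offline"},
    "serving": {"draining"},
    "draining": {"standby"},
}


def check_plan(plan, initial=INITIAL_STATE, allowed=ALLOWED):
    """Check a sequence of target states; return (ok, message)."""
    state = initial
    for step, target in enumerate(plan, start=1):
        valid_next = allowed.get(state, set())
        if target not in valid_next:
            # Tell the operator which moves would have been acceptable.
            options = ", ".join(sorted(valid_next)) or "none"
            return False, (
                f"step {step}: {state} -> {target} is not allowed "
                f"(valid next states: {options})"
            )
        state = target
    return True, f"plan is valid; final state is {state}"


if __name__ == "__main__":
    print(check_plan(["standby", "serving"]))
    print(check_plan(["serving"]))
```

Reporting the valid next states alongside the rejection is a deliberate choice here: it turns the check into the kind of instructional aid described above rather than a bare refusal.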
References
[1] Brown, A. and Patterson, D. A. To Err is Human. First EASY Workshop,
[2] Brown, A. and Patterson, D. A. Undo for Operators: Building an Undoable E-mail Store. Proc. USENIX Annual Technical Conference, June 2003.
[3] Ganapathi, A. Failure Analysis of Internet Services. UC Berkeley Technical Report UCB//CSD-03-1255, December 2002.
[4] http://www.hollywood.com/movies/detail/movie/370277
[5] Ivory, M. and Hearst, M. The State of the Art in Automating Usability Evaluation. ACM Computing Surveys, 33(4):470–516, December 2001.
[6] Oppenheimer, D., Ganapathi, A. and Patterson, D. Why do Internet services fail, and what can be done about it? 4th USENIX Symposium on Internet Technologies & Systems,
[7] Reason, J. T. Human Error,