CS 294-8 Homework 1

Failure Discovered in the Space Shuttle Control Software in October 1981

by Dan Bonachea

Redundancy is arguably the most important technique for achieving hardware and software fault tolerance in mission-critical systems. In software systems, redundancy is achieved by running several copies of the software simultaneously on independent hardware. In general, these systems use some form of checkpointing to compare outputs, or external process monitoring to detect failures, and then combine the outputs deemed "correct" (for example, by voting) to produce a more reliable answer. The efficacy of this technique rests on the so-called "Heisenbug Hypothesis", which postulates that most bugs in production systems are transient bugs (Heisenbugs) triggered by timing fluctuations and race conditions.
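To make the idea of combining outputs concrete, here is a minimal sketch of majority voting over the answers reported by redundant copies. It is only an illustration: the function name majority_vote and the sample values are my own assumptions and are not taken from any real flight system.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value reported by a strict majority of copies, or None."""
    if not outputs:
        return None
    # most_common(1) yields the single most frequent value and its count
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None

# Hypothetical outputs from four copies; one suffered a transient fault
print(majority_vote([42, 42, 17, 42]))  # 42
```

A transient fault in one copy is outvoted by the others, which is exactly the situation the Heisenbug Hypothesis assumes to be the common one.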

Experience shows that software redundancy can indeed be effective against this source of software faults. However, there is an important class of software problems, the more systematic "Bohr bugs", that can easily make such redundancy techniques useless. Suppose that a running piece of software does contain a systematic bug: it is then entirely possible that all redundant copies will return the same incorrect answer, meaning that the seemingly wonderful N-way redundant software is no more reliable than a single, non-redundant copy. The report of the NASA space shuttle control problem discovered in 1981 highlights the fact that systematic "Bohr" bugs do exist, even in carefully written and painstakingly tested mission-critical software.
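The following sketch shows why identical copies offer no protection against a systematic bug. The buggy function and its inputs are invented purely for illustration: every copy runs the same deterministic code, so every copy produces the same wrong answer, and the agreement looks like confirmation.

```python
from collections import Counter

def buggy_average(values):
    # Hypothetical systematic (Bohr) bug: divides by a fixed 4
    # instead of len(values), so it is wrong for any other length.
    return sum(values) / 4

def run_redundant(func, inputs, copies=4):
    """Run the same code on every copy; return the top answer and its vote count."""
    outputs = [func(inputs) for _ in range(copies)]
    return Counter(outputs).most_common(1)[0]

value, count = run_redundant(buggy_average, [10, 20])
# All four copies agree on 7.5, although the correct average is 15
print(value, count)  # 7.5 4
```

The vote is unanimous, so nothing in the redundancy layer has any reason to suspect the result.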

The story goes as follows. The planned lift-off of the space shuttle Columbia on Oct. 9, 1981, was delayed due to a minor fuel spill and a few missing tiles, so the crew decided to put in some extra time on the shuttle mission simulator. They chose to simulate a "trans-atlantic abort sequence", a backup plan used when the shuttle can't achieve orbit: the mission is aborted and the shuttle is redirected to an early landing in Spain. When the crew issued the mission abort command, all four of the redundant flight computers simultaneously locked up and became completely unresponsive. All displays went blank ("showing a big X") and could not be revived. Had this occurred during a real flight, it is unlikely the shuttle could have been landed safely.

Closer inspection revealed that the problem occurred in the routine responsible for dumping excess fuel in preparation for the early landing. The code contained an uninitialized counter used in a "computed GOTO" command, which caused all four machines to branch simultaneously to a memory address containing no code. This in turn led to simultaneous operating system crashes on each redundant system.
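The failure mode can be modelled loosely as follows. This is not the shuttle code, which was not written in Python; the jump table, routine names, and the stale value are all hypothetical. The sketch only shows how a branch index that was never assigned can send every identical copy to the same invalid target.

```python
class NoCodeAtAddress(Exception):
    """Raised when control transfers to a target that holds no code."""

# Hypothetical jump table: branch index -> routine name
JUMP_TABLE = {0: "open_dump_valves", 1: "monitor_flow", 2: "close_dump_valves"}

def computed_goto(counter):
    """Select the next routine from the counter, like a computed GOTO."""
    if counter not in JUMP_TABLE:
        raise NoCodeAtAddress(f"branch index {counter} has no target")
    return JUMP_TABLE[counter]

def run_abort_routine(stale_memory_value):
    # Modelled bug: the counter is never assigned by the routine, so it
    # holds whatever value happened to be left in memory.
    counter = stale_memory_value
    return computed_goto(counter)

def run_on_all_computers(stale_memory_value, computers=4):
    """Every computer runs the same code on the same memory image."""
    results = []
    for _ in range(computers):
        try:
            results.append(run_abort_routine(stale_memory_value))
        except NoCodeAtAddress:
            results.append("CRASH")
    return results

print(run_on_all_computers(9731))  # ['CRASH', 'CRASH', 'CRASH', 'CRASH']
```

Because the stale value is the same on every machine, the bad branch is perfectly reproducible, which is what makes it a Bohr bug rather than a Heisenbug.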

A subsequent investigation of the software, using the specific knowledge gathered from this incident, led to the discovery of 17 other similar systematic bugs in the flight control software, one of which could also have caused a catastrophic failure. Both critical problems were fixed before the rescheduled launch on Nov. 12, 1981.

Fortunately, the damage caused by this problem in the production software was minimal because it was discovered during an (unplanned) simulation, but it could just as easily have arisen during a real mission and caused a disaster costly in both money and human life.

I think the important lesson to draw from this report is that redundancy is not a perfect solution for software fault tolerance, and there are many cases where it provides no help at all. Most complex software systems contain a large number of "corner cases" (unusual combinations of input that the programmer may not have considered). Bohr bugs are quite likely to be lurking along rarely run code paths, where they can cause systematic failures across all redundant software copies when the system encounters unusual or unexpected situations. In small systems, it may be possible for testers to cover the vast majority of input combinations and verify correct behavior in all corner cases, but in more complicated systems such as the shuttle, comprehensive testing is simply intractable and the existence of such bugs is unavoidable.

Proponents of software redundancy claim the answer to the Bohr-bug problem is to hoist redundancy higher in the software development process – namely, develop more than one version of the software using independent development teams, designers, analysts, etc. However, the fact that the final production versions of the software must cooperate significantly once running places serious limits on how far this idea can be pushed. Specifically, a failure to recognize a design problem in the basic interface during requirements specification could lead to independent versions of the software that still fail in a systematic manner.
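A small sketch of that last point, under invented assumptions: two "teams" implement the same hypothetical specification ("return the mean of the samples") in different ways, but the specification never says what to do with an empty sample list. The implementations are genuinely independent, yet both fail on the same input because the gap is in the shared requirements.

```python
import statistics

def mean_team_a(samples):
    # Team A (hypothetical): direct arithmetic
    return sum(samples) / len(samples)

def mean_team_b(samples):
    # Team B (hypothetical): standard library routine
    return statistics.mean(samples)

def run_both(samples):
    """Run both independently written versions and collect their results."""
    results = []
    for version in (mean_team_a, mean_team_b):
        try:
            results.append(version(samples))
        except (ZeroDivisionError, statistics.StatisticsError):
            results.append("FAILED")
    return results

print(run_both([]))  # ['FAILED', 'FAILED']
```

On ordinary inputs the two versions agree, so the shared blind spot stays hidden until the unusual case actually occurs.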

 

Related Links:

Case study of the Space Shuttle Primary Control System

NASA Mission Summary from STS-2

Abstracts of technology-related air safety incidents