CS 252 Final Project

Watchdog Designs for TinyOS Motes
Group Members:

Ali Lakhia
Hayley Iben
Rachel Rubin
Final Results:

Paper (pdf) (ps)
Presentation (ppt)
Movie 4.3 MB

Project Proposal:

We propose to implement a watchdog architecture for the Mica mote running TinyOS that would self-diagnose a problem and take an appropriate action. We need to address several issues to carry out this project. First, we need to determine the kind of hardware support needed for the watchdog. After deciding on a design, we will integrate the hardware with a mote. Then, we will determine what kinds of failures the watchdog can detect and resolve.

To approach this problem, we would create a cooperative watchdog that the application signals periodically to indicate that it is still running. If this signal is not received in time, the watchdog would perform an appropriate action, such as killing the application. This watchdog is not foolproof, since a faulty application could keep contacting the watchdog without doing anything useful. We would also try to determine ways to detect this kind of failure. Additionally, we would implement solutions that address specific failures and try to degrade gracefully instead of simply failing.
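As a rough illustration of the cooperative scheme described above, the sketch below models the watchdog as a tick counter that the application must clear before a deadline. This is not the project's implementation. The names watchdog_t, wd_init, wd_kick and wd_tick are hypothetical, and the code is plain C rather than TinyOS code. It assumes a periodic timer calls wd_tick and that the caller performs the recovery action when wd_tick returns true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cooperative watchdog state; all names are illustrative. */
typedef struct {
    uint32_t timeout_ticks;    /* max ticks allowed between two kicks */
    uint32_t ticks_since_kick; /* ticks elapsed since the last kick */
    bool     expired;          /* latched once a deadline is missed */
} watchdog_t;

/* Arm the watchdog with a deadline expressed in timer ticks. */
void wd_init(watchdog_t *wd, uint32_t timeout_ticks) {
    assert(timeout_ticks > 0);
    wd->timeout_ticks = timeout_ticks;
    wd->ticks_since_kick = 0;
    wd->expired = false;
}

/* Called by the application to signal that it is still running.
 * Kicks that arrive after expiry are ignored. */
void wd_kick(watchdog_t *wd) {
    if (!wd->expired) {
        wd->ticks_since_kick = 0;
    }
}

/* Called from a periodic timer. Returns true exactly once, on the tick
 * at which the deadline is first missed; the caller then takes the
 * recovery action (for example, killing the application). */
bool wd_tick(watchdog_t *wd) {
    if (wd->expired) {
        return false; /* recovery was already requested */
    }
    wd->ticks_since_kick++;
    if (wd->ticks_since_kick >= wd->timeout_ticks) {
        wd->expired = true;
        return true;
    }
    return false;
}
```

With a timeout of 3 ticks, an application that calls wd_kick at least once every two ticks never triggers recovery. As noted above, a faulty application that keeps calling wd_kick would still go undetected by this sketch.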


Details:

Watchdog Architecture Design
Failure Handling
Scenario Ideas
Analysis Ideas
Project Timeline
Future Directions

Annotated Bibliography:

Atmel Corporation, AT45DB041B Data Sheet, 2001.

This is the specification document for the DataFlash chip that is on the Mica motes.

Atmel Corporation, ATmega103(L) Data Sheet, 2001.

This is the specification document for the Atmel main processor chip that is on the Mica motes.

Atmel Corporation, AT90S2323/LS2323/S2343/LS2343 Data Sheet, 2001.

This is the specification document for the Atmel coprocessor chip that is on the Mica motes.

Mario Dal Cin, Wolfgang Hohl, Volkmar Sieh, "Hardware-Supported Fault Tolerance for Multiprocessors."

The paper discusses hardware fault-tolerance measures for massively parallel multiprocessors. It covers three techniques: 1) self-checking, 2) central checking, and 3) distributed system-wide checking. The authors also mention checkpointing, which lets a temporarily failed node resume after a failure. Lastly, the paper discusses how to evaluate a fault-tolerance mechanism by fault injection. Overall, the paper is mostly a summary and offers no useful results or conclusions.

V. De Florio, G. Deconinck and R. Lauwereins, "An Algorithm for Tolerating Crash Failures in Distributed Systems."

This paper discusses a toolset that handles application-level fault tolerance through error detection, isolation and recovery. The tools provided in the TIRAN toolset include watchdog timers and trap handlers. Monitoring is achieved by adding a special component, called a backbone, that handles communication between the toolset and the user application. Since this is for distributed systems, this component is included in every system. The backbone receives the error detection and fault masking events and relays them to the tools for handling. Of particular interest in the paper is the self-check task, which is run through a distributed algorithm that is described in detail. Several agents in the distributed system communicate to perform the self-check. Essentially, each agent sends a message to indicate that it is alive, and the others monitor that this message is received within a valid period. If not, then a message is broadcast that the lapsed entity is faulty.
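The "alive" message check can be pictured with the minimal sketch below. It is a generic heartbeat monitor, not the TIRAN algorithm from the paper. The type liveness_monitor_t, the functions lm_init, lm_on_alive and lm_check, and the fixed system size NUM_AGENTS are all assumptions made for illustration. Where a real system would broadcast that an agent is faulty, this sketch only sets a flag and returns a count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_AGENTS 4 /* hypothetical system size */

/* Hypothetical per-node view of the other agents' liveness. */
typedef struct {
    uint32_t last_alive[NUM_AGENTS]; /* time of the last "alive" message */
    bool     faulty[NUM_AGENTS];     /* agents already declared faulty */
    uint32_t valid_period;           /* max allowed gap between messages */
} liveness_monitor_t;

/* Start monitoring at time `now`; every agent is assumed alive. */
void lm_init(liveness_monitor_t *m, uint32_t valid_period, uint32_t now) {
    m->valid_period = valid_period;
    for (int i = 0; i < NUM_AGENTS; i++) {
        m->last_alive[i] = now;
        m->faulty[i] = false;
    }
}

/* Record an "alive" message from `agent` received at time `now`.
 * Messages from agents already declared faulty are ignored. */
void lm_on_alive(liveness_monitor_t *m, int agent, uint32_t now) {
    assert(agent >= 0 && agent < NUM_AGENTS);
    if (!m->faulty[agent]) {
        m->last_alive[agent] = now;
    }
}

/* Mark every agent whose last message is older than the valid period.
 * Returns how many agents were newly declared faulty; a real system
 * would broadcast a "faulty" notice for each of them. */
int lm_check(liveness_monitor_t *m, uint32_t now) {
    int newly_faulty = 0;
    for (int i = 0; i < NUM_AGENTS; i++) {
        if (!m->faulty[i] && now - m->last_alive[i] > m->valid_period) {
            m->faulty[i] = true;
            newly_faulty++;
        }
    }
    return newly_faulty;
}
```

In this sketch, time is an abstract counter supplied by the caller, so the same code works with ticks or milliseconds as long as the units are consistent.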

Jason Hill and David Culler, "A wireless embedded sensor architecture for system-level optimization." 2001.

This paper describes the architecture used for a wireless network sensor. It specifically describes the Mica motes' architecture, which is of particular interest to us. It also describes the protocols used in the wireless communication and discusses the reasons for the design decisions.

M. Imaizumi, K. Yasui and T. Nakagawa, "An Optimal Number of Microprocessor Units with Watchdog Processor," Mathematical and Computer Modelling 31, 183-189, (2000).

This paper addresses the topic of increasing system reliability by having a redundant system with N microprocessor units, each containing an independent watchdog processor (WDP) that does not fail. Of particular interest to our project is the watchdog processor. This processor detects errors by monitoring the microprocessor’s control flow and memory access behavior. The WDP detects errors with probability p, called the coverage of a WDP. If the watchdog does not detect errors, the whole system can fail. If the watchdog detects an error, it resets the microprocessor to an initial state and increments a counter for the number of resets. At some predefined time T, each reset counter is checked to see if it is greater than some K. If so, the microprocessor is determined to be permanently faulty, at which time a standby microprocessor unit is activated. To further increase system reliability, the coverage of a WDP needs to be improved.
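The reset-counting policy described above can be sketched as follows. This is only an illustration of the rule "more than K resets by time T means permanently faulty", not the model analyzed in the paper. The names mpu_state_t, mpu_init, wdp_on_error and wdp_check_at_T are hypothetical, and the sketch does not model the coverage probability p or the actual reset of the microprocessor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one microprocessor unit (MPU). */
typedef struct {
    uint32_t reset_count;        /* resets triggered by the watchdog so far */
    bool     permanently_faulty; /* set once the unit is retired */
} mpu_state_t;

void mpu_init(mpu_state_t *u) {
    assert(u != NULL);
    u->reset_count = 0;
    u->permanently_faulty = false;
}

/* The WDP detected an error: count the reset. A real system would also
 * restore the MPU to its initial state here. */
void wdp_on_error(mpu_state_t *u) {
    u->reset_count++;
}

/* Run at the predefined time T. Returns true if the unit's reset count
 * exceeds k, meaning a standby unit should be activated. */
bool wdp_check_at_T(mpu_state_t *u, uint32_t k) {
    if (u->reset_count > k) {
        u->permanently_faulty = true;
    }
    return u->permanently_faulty;
}
```

With K = 2, a unit that has been reset twice is still considered healthy at the check, and a third reset before the check marks it permanently faulty.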

J. G. Kuhl and S. M. Reddy, "Distributed fault-tolerance for large multiprocessor systems." International Conference on Computer Architecture, Conference proceedings of the seventh annual symposium on Computer Architecture. 1980, La Baule, France.

This paper presents a look at fault tolerance within a large distributed network system. A model of a large multiprocessor system is discussed and techniques are given by which each processing element can correctly diagnose failures in all other processing elements in the system.

A. Pataricza, I. Majzik, W. Hohl, and J. Honig, "Watchdog Processors in Parallel Systems," EUROMICRO'93, 19th Symposium on Microprocessing and Microprogramming, Barcelona, Spain, 1993.

This paper describes an algorithm called Signature Encoded Instruction Stream (SEIS) that produces signatures to be used by watchdog processors (WDP). The WDP often performs control-flow checking using the assigned-signature approach. It compares the program control-flow graph (CFG), which is extracted by the preprocessor from the high-level programming language code, to the execution of the program. In previous implementations, the CFG was transferred to the watchdog processor before execution. The SEIS algorithm uses a different encoding of the program CFG signatures such that only a portion of the signature subgraph is stored in the watchdog processor. This reduces the hardware and time complexity of the watchdog processor. Additionally, the watchdog processor supports error recovery. This is achieved by using a backward recovery strategy where intermediate checkpoints are saved during computation, along with the WDP state at that point. When program recovery is needed, the WDP state is restored as well.
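To make the idea of assigned-signature control-flow checking concrete, the sketch below shows the simpler scheme the summary attributes to previous implementations, where the watchdog holds the whole CFG. It is not the SEIS encoding. The four-block CFG, the table allowed, and the names cfc_t, cfc_init and cfc_report are invented for illustration. Each basic block is assumed to report its signature (here just its index) on entry, and the watchdog checks that the transition is an edge of the CFG.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 4 /* hypothetical number of basic blocks */

/* allowed[i][j] is true if block j may directly follow block i.
 * Hypothetical CFG: 0 -> 1, 1 -> 2, 1 -> 3, 2 -> 1, and 3 is the exit. */
static const bool allowed[NUM_BLOCKS][NUM_BLOCKS] = {
    {false, true,  false, false},
    {false, false, true,  true },
    {false, true,  false, false},
    {false, false, false, false},
};

/* Hypothetical watchdog state for control-flow checking. */
typedef struct {
    uint8_t current; /* signature of the block currently executing */
    bool    error;   /* latched after the first illegal transition */
} cfc_t;

/* Execution is assumed to start in block 0. */
void cfc_init(cfc_t *w) {
    w->current = 0;
    w->error = false;
}

/* Called when the program enters the block with signature `sig`.
 * Returns true if the transition is an edge of the CFG; otherwise
 * latches the error flag and returns false. */
bool cfc_report(cfc_t *w, uint8_t sig) {
    assert(sig < NUM_BLOCKS);
    if (w->error || !allowed[w->current][sig]) {
        w->error = true;
        return false;
    }
    w->current = sig;
    return true;
}
```

Storing the full transition table costs memory proportional to the square of the number of blocks, which is the kind of overhead that storing only part of the signature graph in the watchdog is meant to reduce.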