To approach this problem, we would create a cooperative watchdog that the application signals periodically to indicate that it is still running. If this signal is not received, the watchdog would take an appropriate action, such as killing the application. This is not a foolproof watchdog, since a faulty application could keep contacting the watchdog without doing anything useful. We would also try to determine ways to detect this kind of failure. Additionally, we would implement solutions that specifically address the failures and try to degrade gracefully instead of simply failing.
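As a rough illustration of the cooperative watchdog described above, the following Python sketch shows one possible shape for it. The class name, the `kick`/`poll` methods, the injectable clock, and in particular the progress counter are all our own assumptions for illustration. The progress counter is only one candidate way to catch an application that keeps signaling without doing useful work; it is not a settled design.

```python
import time


class CooperativeWatchdog:
    """Hypothetical cooperative watchdog (illustrative names, not a real API)."""

    def __init__(self, timeout, on_expire, clock=time.monotonic):
        self.timeout = timeout        # longest allowed gap between accepted signals
        self.on_expire = on_expire    # action to take, e.g. kill or restart the app
        self.clock = clock            # injectable so the logic can be tested
        self.last_kick = clock()
        self.last_progress = None
        self.expired = False

    def kick(self, progress):
        """Called by the application to signal that it is still running.

        Assumption: `progress` is a counter of useful work. A signal that
        reports no new progress is ignored, so an application that is alive
        but stuck will still time out.
        """
        if progress != self.last_progress:
            self.last_progress = progress
            self.last_kick = self.clock()

    def poll(self):
        """Called periodically by the watchdog side; fires on_expire once."""
        if not self.expired and self.clock() - self.last_kick > self.timeout:
            self.expired = True
            self.on_expire()
        return self.expired
```

In this sketch the application would call `kick(n)` from its main loop with an increasing work counter, while a timer would call `poll()`. On a mote the expiry action would more likely be a reset than a process kill.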
This is the specification document for the Data Flash chip that is on the Mica motes.
This is the specification document for the Atmel main processor chip that is on the Mica motes.
This is the specification document for the Atmel coprocessor chip that is on the Mica motes.
The paper discusses hardware fault tolerance measures for massively parallel multiprocessors. It covers three techniques: 1) self-checking, 2) central checking, and 3) distributed system-wide checking. The authors also mention checkpointing, which lets a temporarily failed node resume after a failure. Lastly, the paper discusses how to evaluate a fault-tolerance mechanism by fault injection. Overall, the paper is mostly a summary and offers no useful results or conclusions.
This paper discusses a toolset that handles application-level fault tolerance through error detection, isolation, and recovery. The tools provided in the TIRAN toolset include watchdog timers and trap handlers. Monitoring is achieved by adding a special component, called a backbone, that acts as the communication channel between the toolset and the user application. Since this is for distributed systems, this component is included in every system. The backbone receives the error detection and fault masking events and relays them to the tools for handling. Of particular interest in the paper is the self-check task, which is run through a distributed algorithm that is described in detail. Several agents in the distributed system communicate to perform the self-check. Essentially, each agent sends a message to indicate that it is alive, and the others monitor that this message is received within a valid period. If not, a message is broadcast stating that the lapsed entity is faulty.
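The alive-message check summarized above can be sketched roughly as follows. This is not the TIRAN algorithm itself, only a simplified single-observer illustration; the class `LivenessMonitor`, its methods, and the `broadcast` callback are hypothetical names we introduce here.

```python
class LivenessMonitor:
    """Simplified sketch: track alive messages and flag agents that lapse."""

    def __init__(self, agents, period, start=0.0):
        self.period = period                      # valid period for alive messages
        self.last_seen = {a: start for a in agents}
        self.faulty = set()

    def alive(self, agent, now):
        """Record an alive message from `agent` received at time `now`."""
        if agent not in self.faulty:
            self.last_seen[agent] = now

    def check(self, now, broadcast):
        """Declare every agent whose last message is too old as faulty.

        `broadcast` is a hypothetical callback that would notify the other
        agents; each lapsed agent is announced only once.
        """
        for agent, seen in self.last_seen.items():
            if agent not in self.faulty and now - seen > self.period:
                self.faulty.add(agent)
                broadcast(agent)
        return set(self.faulty)
```

In a real distributed setting every agent would run such a monitor for its peers, and the choice of period would have to allow for message delay.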
This paper describes the architecture used for a wireless network sensor. It specifically describes the Mica motes' architecture, which is of particular interest to us. It also describes the protocols used in the wireless communication and discusses the reasons for the design decisions.
This paper addresses the topic of increasing system reliability by having a redundant system with N microprocessor units, each containing an independent watchdog processor (WDP) that does not fail. Of particular interest to our project is the watchdog processor. This processor detects errors by monitoring the microprocessor’s control flow and memory access behavior. The WDP detects errors with probability p, called the coverage of a WDP. If the watchdog does not detect an error, the whole system can fail. If the watchdog detects an error, it resets the microprocessor to an initial state and increments a counter of the number of resets. At some predefined time T, each reset counter is checked to see if it is greater than some threshold K. If so, the microprocessor is determined to be permanently faulty, and a standby microprocessor unit is activated. To further increase the system reliability, the coverage of a WDP needs to be improved.
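The reset-counter policy described above can be illustrated with a small sketch. The names `Unit`, `watchdog_detects_error`, and `periodic_check` are hypothetical, and the behavior when no standby unit is left (the faulty unit is simply dropped) is our own assumption, since the summary does not say what happens in that case.

```python
class Unit:
    """A microprocessor unit with its reset counter (illustrative model)."""

    def __init__(self, name):
        self.name = name
        self.reset_count = 0
        self.state = "initial"


def watchdog_detects_error(unit):
    """On a detected error, reset the unit and count the reset."""
    unit.state = "initial"
    unit.reset_count += 1


def periodic_check(active, standby, k):
    """Run at the predefined time T.

    Units with more than k resets are treated as permanently faulty and are
    replaced by a standby unit if one is available (assumption: otherwise
    the faulty unit is just removed). Returns the new list of active units.
    """
    result = []
    for unit in active:
        if unit.reset_count > k:
            if standby:
                result.append(standby.pop(0))
        else:
            result.append(unit)
    return result
```

A unit with exactly K resets stays active in this sketch, since the check is for a count strictly greater than K.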
This paper presents a look at fault tolerance within a large distributed network system. A model of a large multiprocessor system is discussed, and techniques are given by which each processing element can correctly diagnose failures in all other processing elements in the system.
This paper describes an algorithm called Signature Encoded Instruction Stream (SEIS) that produces signatures to be used by watchdog processors (WDP). The WDP often performs control-flow checking using the assigned-signature approach. It compares the program control-flow graph (CFG), which is extracted by the preprocessor from the high-level programming language code, to the execution of the program. In previous implementations, the CFG was transferred to the watchdog processor before execution. The SEIS algorithm uses a different encoding of the program CFG signatures such that only a portion of the signature subgraph is stored in the watchdog processor. This reduces the hardware and time complexity of the watchdog processor. Additionally, the watchdog processor supports error recovery. This is achieved by using a backward recovery strategy in which intermediate checkpoints are saved during computation, along with the WDP state at that point. When program recovery is needed, the WDP state is restored as well.
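To make the idea of signature-based control-flow checking concrete, here is a minimal sketch. It is not SEIS: it follows the simpler earlier scheme in which the watchdog holds the whole CFG, and it assumes one signature per basic block. The class `SignatureWatchdog` and its methods are hypothetical names, and the checkpoint here saves only the watchdog state, not the program state.

```python
class SignatureWatchdog:
    """Sketch of a watchdog that checks block signatures against a full CFG."""

    def __init__(self, cfg, entry):
        # cfg maps each block signature to the set of allowed successor signatures
        self.cfg = cfg
        self.current = entry

    def observe(self, signature):
        """Check one signature emitted by the running program.

        Returns True for a legal transition; on an illegal one the watchdog
        state is left unchanged and False is returned.
        """
        legal = signature in self.cfg.get(self.current, set())
        if legal:
            self.current = signature
        return legal

    def checkpoint(self):
        """Save the watchdog state alongside a program checkpoint."""
        return self.current

    def restore(self, saved):
        """Restore the watchdog state during backward recovery."""
        self.current = saved
```

After an illegal transition is reported, the program would roll back to its last checkpoint and call `restore` with the state saved at that point, so that checking resumes from a consistent position in the graph.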