Advanced Topics in Computer Systems

10/15/01

Anthony Joseph & Joe Hellerstein

 

Experience With Processes and Monitors in Mesa

Motivation

·        Putting theory to practice

·        Focus of this paper: lightweight processes (threads in today’s terminology) and how they synchronize with each other.

 

History

·        2nd system; followed the Alto.

·        Planned to build a large system using many programmers. (Some thoughts about commercializing.)

·        Advent of things like server machines and networking introduced applications that are heavy users of concurrency.

·        Chose to build a single address space system:

o       Single user system, so protection not an issue. (Safety was to come from the language.)

o       Wanted global resource sharing.

·        Large system, many programmers, many applications:

o       Module-based programming with information hiding.

·        Since they were starting “from scratch”, they could integrate the hardware, the runtime software, and the language with each other.

 

Programming model for inter-process communication: shared memory (monitors) vs. message passing.

·        Lauer & Needham claimed the two models are duals of each other.

·        Chose shared memory model because they thought they could fit it into Mesa as a language construct more naturally.

 

How to synchronize processes?

·        Non-preemptive scheduler: tends to yield very delicate systems. Why?

o       Have to know whether or not a yield might be called for every procedure you call. Violates information hiding.

o       Prohibits multiprocessor systems.

o       Need a separate preemptive mechanism for I/O anyway.

o       Can’t do multiprogramming across page faults.

·        Simple locking (e.g. semaphores): too little structuring discipline, e.g. no guarantee that locks will be released on every code path; wanted something that could be integrated into a Mesa language construct.

·        Chose preemptive scheduling of lightweight processes and monitors.

 

Lightweight processes:

·        Easy forking and synchronization

·        Shared address space

·        Fast performance for creation, switching, and synchronization; low storage overhead.

 

Monitors:

·        Monitor lock (for synchronization)

·        Tied to module structure of the language: makes it clear what’s being monitored.

·        Language automatically acquires and releases the lock.

·        Tied to a particular invariant, which helps users think about the program

·        Condition variable (for scheduling)

·        Dangling process references, analogous to dangling pointers. There are also language-based solutions that would prohibit these kinds of errors, such as do-across, which is just a parallel control structure. It eliminates dangling processes because the syntax defines the points of the fork and the join.
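A minimal sketch of these monitor ideas, using Python's threading primitives as a stand-in for Mesa's language support (the class and the bounded-buffer example are illustrative, not from the paper):

```python
import threading

class BoundedBuffer:
    """Monitor sketch: the lock protects the invariant
    0 <= len(items) <= cap, which must hold whenever the lock is free."""
    def __init__(self, cap):
        self.cap = cap
        self.items = []
        self.lock = threading.Lock()                    # the monitor lock
        self.nonfull = threading.Condition(self.lock)   # condition variables
        self.nonempty = threading.Condition(self.lock)  # (for scheduling)

    def put(self, x):
        # An "entry" procedure: the with-statement plays the role of the
        # language acquiring and releasing the monitor lock.
        with self.lock:
            while len(self.items) == self.cap:  # Mesa semantics: re-check
                self.nonfull.wait()             # the condition after waking
            self.items.append(x)
            self.nonempty.notify()              # notify is only a hint

    def get(self):
        with self.lock:
            while not self.items:
                self.nonempty.wait()
            x = self.items.pop(0)
            self.nonfull.notify()
            return x
```

Note the `while` (not `if`) around each wait: because notify is only a hint, a woken process must re-establish that its condition actually holds.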

 

Changes made to design and implementation issues encountered:

·        3 types of procedures in a monitor module:

o       Entry (acquires and releases lock).

o       Internal (no locking done): can’t be called from outside the module.

o       External (no locking done): externally callable. Why is this useful?

§         Allows grouping of related things into a module.

§         Allows doing some of the work outside the monitor lock.

§         Allows controlled release and reacquisition of monitor lock.

·        Notify semantics:

o       Cede lock to waking process: too many context switches. Why would this approach be desirable?
(Waiting process knows the condition it was waiting on is guaranteed to hold.)

o       Notifier keeps lock; waking process gets put at the front of the monitor queue. Doesn’t work in the presence of priorities.

o       Notifier keeps lock, wakes process with no guarantees => waking process must recheck its condition.
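The three procedure kinds above can be sketched in Python, with the with-statement standing in for the lock handling the Mesa language did automatically (the module and its names are hypothetical):

```python
import threading

class StatsModule:
    """Sketch of the three procedure kinds in a monitor module."""
    def __init__(self):
        self._lock = threading.Lock()   # the monitor lock
        self._count = 0

    def record(self):
        # Entry procedure: acquires and releases the monitor lock.
        with self._lock:
            self._bump()        # fine: we already hold the lock

    def _bump(self):
        # Internal procedure: no locking; callable only with the lock
        # held, and (by convention) not from outside the module.
        self._count += 1

    def read(self):
        # Entry procedure.
        with self._lock:
            return self._count

    def report(self):
        # External procedure: no locking; groups related work into the
        # module, and does the formatting outside the monitor lock.
        n = self.read()          # briefly enters the monitor...
        return "count=%d" % n    # ...then works outside it
```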

 

What other kinds of notification does this approach enable?

·        Timeouts, broadcasts, aborts.

·        Deadlocks: Wait only releases the lock of the current monitor, not any nested calling monitors. This is a general problem with modular systems and synchronization:

o       Synchronization requires global knowledge about locks, which violates the information hiding paradigm of modular programming.

·        Why is monitor deadlock less onerous than the yield problem for non-preemptive schedulers?

o       Want to generally insert as many yields as possible to provide increased concurrency; only use locks when you want to synchronize.

o       Yield bugs are difficult to find (symptoms may appear far after the bogus yield)

·         Basic deadlock rule: no recursion, direct or mutual

o       Alternatives? Impose ordering on acquisition

·        Lock granularity: introduced monitored records so that the same monitor code could handle multiple instances of something in parallel.
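The ordering alternative can be made concrete with a short sketch (the `Account`/`transfer` example is invented for illustration): impose one global order on lock acquisition so that no cycle of waiters can form.

```python
import threading

class Account:
    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()

def transfer(src, dst, amount):
    # Impose a global acquisition order (here, by object id): whichever
    # thread runs first, both locks are always taken in the same order,
    # so two concurrent opposite-direction transfers cannot deadlock.
    first, second = sorted((src, dst), key=id)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount
```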

 

Interrupts:

·        Interrupt handler can’t afford to wait to acquire a monitor lock.

·        Introduced naked notifies: notifies done without holding the monitor lock.

·        Had to worry about a timing race: the notify could occur between a monitor’s condition check and its call on Wait. Added a wakeup-waiting flag to condition variables.

·        What happens with active messages that need to acquire a lock? (move handler to its own thread)
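A sketch of the wakeup-waiting idea (illustrative only: Python's `Condition` requires holding the lock to notify, so this shows the flag's role in closing the check-then-wait race, not a true lock-free naked notify; the class name is invented):

```python
import threading

class WakeupWaitingCondition:
    """A notify that arrives between a monitor's condition check and its
    call on wait is remembered in a flag, so the next wait returns
    immediately instead of sleeping through the lost wakeup."""
    def __init__(self, lock):
        self._cond = threading.Condition(lock)
        self._wakeup_waiting = False

    def notify_from_interrupt(self):
        # Stand-in for a naked notify from an interrupt handler.
        with self._cond:
            self._wakeup_waiting = True
            self._cond.notify()

    def wait(self):
        # Caller holds the monitor lock.
        if self._wakeup_waiting:        # a notify already happened:
            self._wakeup_waiting = False
            return                      # don't sleep
        self._cond.wait()
        self._wakeup_waiting = False
```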

 

Priority Inversion

·        High-priority processes may block on lower-priority processes

·        A solution: temporarily increase the priority of the holder of the monitor to that of the highest priority blocked process (somewhat tricky -- what happens when that high-priority process finishes with the monitor? You have to know the priority of the next highest ⇒ keep them sorted or scan the list on exit)
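The bookkeeping behind that solution can be sketched without a real scheduler (`Task` and `PIMutex` are invented names; this tracks priorities only):

```python
class Task:
    def __init__(self, name, base_priority):
        self.name = name
        self.base_priority = base_priority
        self.priority = base_priority    # effective (possibly boosted)

class PIMutex:
    """Priority-inheritance bookkeeping: the holder runs at the maximum
    of its own priority and those of all blocked waiters, and drops back
    to its base priority when it exits the monitor."""
    def __init__(self, holder):
        self.holder = holder
        self.waiters = []

    def block(self, task):
        self.waiters.append(task)
        # Boost the holder to the highest blocked priority.
        self.holder.priority = max([self.holder.base_priority] +
                                   [w.priority for w in self.waiters])

    def exit(self):
        self.holder.priority = self.holder.base_priority  # restore on exit
        if self.waiters:
            # Hand the mutex to the highest-priority waiter; as noted
            # above, keep the list sorted or scan it on exit.
            self.waiters.sort(key=lambda t: t.priority, reverse=True)
            self.holder = self.waiters.pop(0)
        else:
            self.holder = None
```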

 

The Mars Pathfinder mission was widely proclaimed as "flawless" in the early days after its July 4th, 1997 landing on the Martian surface.

·        Successes included its unconventional "landing" -- bouncing onto the Martian surface surrounded by airbags, deploying the Sojourner rover, and gathering and transmitting voluminous data back to Earth, including the panoramic pictures that were such a hit on the Web.

·        A few days later, just after Pathfinder started gathering meteorological data…

o       The spacecraft began experiencing total system resets, each resulting in losses of data. The press reported these failures in terms such as "software glitches" and "the computer was trying to do too many things at once".

·        Internally there was an “information bus”

o       Shared memory area used for passing information between different components of the spacecraft.

·        Tasks:

o       Bus management task ran frequently with high priority to move certain kinds of data in and out of the information bus. Access to the bus was synchronized with mutual exclusion locks (mutexes).

o       Meteorological data gathering task ran as an infrequent, low priority thread, and used the information bus to publish its data. When publishing its data, it would acquire a mutex, do writes to the bus, and release the mutex. If an interrupt caused the information bus thread to be scheduled while this mutex was held, and if the information bus thread then attempted to acquire this same mutex in order to retrieve published data, this would cause it to block on the mutex, waiting until the meteorological thread released the mutex before it could continue.

o       Communications task that ran with medium priority.

·        Most of the time this combination worked fine. However, very infrequently it was possible for an interrupt to occur that caused the (medium priority) communications task to be scheduled during the short interval while the (high priority) information bus thread was blocked waiting for the (low priority) meteorological data thread. In this case, the long-running communications task, having higher priority than the meteorological task, would prevent it from running, consequently preventing the blocked information bus task from running. After some time had passed, a watchdog timer would go off, notice that the data bus task had not been executed for some time, conclude that something had gone drastically wrong, and initiate a total system reset.

o       This scenario is a classic case of priority inversion.
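The scheduling interaction reduces to a few lines (a toy fixed-priority scheduler; the priority numbers here are made up for illustration):

```python
def pick_next(tasks, blocked):
    """Toy fixed-priority scheduler: run the highest-priority task that
    is not blocked on the mutex."""
    runnable = [t for t in tasks if t["name"] not in blocked]
    return max(runnable, key=lambda t: t["prio"])["name"]

# Made-up priorities for the three tasks; "bus" is blocked on the mutex
# that "met" holds.
tasks = [{"name": "bus",  "prio": 3},   # high: information bus
         {"name": "comm", "prio": 2},   # medium: communications
         {"name": "met",  "prio": 1}]   # low: meteorological, holds mutex
```

Without inheritance the scheduler keeps picking `comm` over `met`, so `met` never releases the mutex, `bus` starves, and the watchdog eventually resets the system; boosting `met` to `bus`'s priority lets it run and release the mutex.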

 

Really remote debugging

·        Pathfinder used VxWorks

·        VxWorks can be run in a mode where it records a total trace of all interesting system events, including context switches, uses of synchronization objects, and interrupts. After the failure, JPL engineers spent hours and hours running the system on the exact spacecraft replica in their lab with tracing turned on, attempting to replicate the precise conditions under which they believed that the reset occurred.

·        Early in the morning, after all but one engineer had gone home, the engineer finally reproduced a system reset on the replica. Analysis of the trace revealed the priority inversion.

 

Remote bug fixing

·        When created, a VxWorks mutex object accepts a boolean parameter that indicates whether priority inheritance should be performed by the mutex. The mutex in question had been initialized with the parameter off; had it been on, the low-priority meteorological thread would have inherited the priority of the high-priority data bus thread blocked on it while it held the mutex, causing it to be scheduled with higher priority than the medium-priority communications task, thus preventing the priority inversion. Once diagnosed, it was clear to the JPL engineers that using priority inheritance would prevent the resets they were seeing.

·        VxWorks contains a C language interpreter intended to allow developers to type in C expressions and functions to be executed on the fly during system debugging. The JPL engineers fortuitously decided to launch the spacecraft with this feature still enabled. By coding convention, the initialization parameters for the mutex in question (and for two others that could have caused the same problem) were stored in global variables, whose addresses were in symbol tables also included in the launch software and available to the C interpreter. A short C program was uploaded to the spacecraft which, when interpreted, changed the values of these variables from FALSE to TRUE. No more system resets occurred.

 

Temporal Logic Assertions for the Detection of Priority Inversion

·        Let:

o        HIGHPriorityTaskBlocked() represent a situation where the information bus thread is blocked by the low priority meteorological data gathering task.

o       HIGHPriorityTaskInMutex() represent a situation where the information bus thread holds the mutex.

o       LOWPriorityTaskInMutex() represent a situation where the meteorological thread holds the mutex.

o       MEDPriorityTaskRunning() represent a situation where the communications task is running.

·        Assertions:

o        Not Eventually ({HIGHPriorityTaskBlocked()} And {MEDPriorityTaskRunning()})

o       This formula specifies that never should it be that HIGHPriorityTaskBlocked() And MEDPriorityTaskRunning().

o       Always({LOWPriorityTaskInMutex()} Implies Not {MEDPriorityTaskRunning()} Until {HIGHPriorityTaskInMutex()} )

o       This formula specifies that always, if LOWPriorityTaskInMutex() then MEDPriorityTaskRunning() does not occur until a later time when HIGHPriorityTaskInMutex().

·        Using such assertions (written as comments in the Pathfinder code), the Temporal Rover would generate code that announces success and/or failure of any assertion during testing.

·        Interestingly enough, the JPL engineers actually created a priority inversion situation during testing

o       1-2 system resets during months of pre-flight testing. Not reproducible or explainable, so “it was probably caused by a hardware glitch”.

·        Did not manage to analyze their recorded data well enough to conclude that priority inversion was indeed a bug in their system. In other words, their test runs were sufficient, but their analysis tools were not.

o       Part of it too was the engineers' focus. They were extremely focused on ensuring the quality and flawless operation of the landing software.

o       Should it have failed, the mission would have been lost. It is entirely understandable for the engineers to discount occasional glitches in the less-critical landed-mission software, particularly given that a spacecraft reset was a viable recovery strategy at that phase of the mission.
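The two temporal assertions above can be checked mechanically over a finite recorded trace (like the VxWorks event log). A sketch, with states encoded as dicts of booleans and hypothetical predicate functions (the encodings are invented here):

```python
def not_eventually_both(trace, p, q):
    """Not Eventually (p And q): no state in the trace satisfies both."""
    return not any(p(s) and q(s) for s in trace)

def always_implies_until(trace, p, q_absent, r):
    """Always (p Implies (Not q Until r)), over a finite trace: from
    every state where p holds, q_absent must hold until some later (or
    the same) state where r holds."""
    for i in range(len(trace)):
        if p(trace[i]):
            satisfied = False
            for j in range(i, len(trace)):
                if r(trace[j]):         # the "until" goal is reached
                    satisfied = True
                    break
                if not q_absent(trace[j]):
                    return False        # q occurred too early
            if not satisfied:
                return False            # trace ended before r held
    return True
```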

 

Exceptions:

·        Must restore monitor invariant as you unwind the stack.

o       But, requires explicit UNWIND handlers, otherwise lock is not released

·        What does Java do?

o       Release lock, no UNWIND primitive
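The Java behavior maps onto Python's with-statement (a sketch: as in Java, the lock is released as the exception unwinds, but restoring the monitor invariant before raising is still the programmer's job; the names are illustrative):

```python
import threading

lock = threading.Lock()
data = []   # invariant: data is empty whenever the lock is free

def entry_procedure():
    with lock:
        data.append("partial")
        data.pop()                    # restore the invariant by hand...
        raise ValueError("failed")    # ...before unwinding; the with-
                                      # statement releases the lock for us

try:
    entry_procedure()
except ValueError:
    pass   # lock already released by the unwind
```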

 

Hints vs. Guarantees:

·        Notify is only a hint.

·        ⇒ Don’t have to wake up the right process

·        ⇒ Don’t have to change the notifier if we slightly change the wait condition (the two are decoupled).

·        ⇒ Easier to implement, because it’s always OK to wake up too many processes. If we get lost, we could even wake up everybody (broadcast)

o       Can we use broadcast everywhere there is a notify? Yes

o       Can we use notify everywhere there is a broadcast? No: the awakened process A might not have its wait condition satisfied, even though another waiter B does.

·        Enables timeouts and aborts

·        General principle: use hints for performance when they have little or, better yet, no effect on correctness.

o       Many commercial systems use hints for fault tolerance: if the hint is wrong, things timeout and use a backup strategy

o       ⇒ performance hit for an incorrect hint, but no errors.

 

Performance:

·        Assumes simple machine architecture

o       Single execution unit, non-pipelined

o       What about multi-processors?

·        Context switch is very fast: 2 procedure calls.

·        Ended up not mattering much for performance:

o       Ran only on uniprocessor systems.

o       Concurrency mostly used for clean structuring purposes.

·        Procedure calls are slow: 30 instructions (RISC proc. calls are 10x faster). Why?

o       Due to heap allocated procedure frames. Why did they do this?

§         Didn’t want to worry about colliding process stacks.

o       Mental model was “any procedure call might be a fork”: transfer was the basic control-transfer primitive.

·        Process creation: ~ 1100 instructions

o       Good enough most of the time.

o       Fast-fork package implemented later that keeps around a pool of “available” processes.

 

3 key features about the paper:

·        Describes the experiences designers had with designing, building and using a large system that aggressively relies on light-weight processes and monitor facilities for all its software concurrency needs.

·        Describes various subtle issues of implementing a threads-with-monitors design in real life for a large system.

·        Discusses the performance and overheads of various primitives and three representative applications, but doesn’t give a big picture of how important various things turned out to be.

 

Some flaws:

·        Glosses over how hard it is to program with locks and exceptions at times. (Not clear whether there are better ways.)

·        Performance discussion doesn’t give the big picture.

o       Tries to be machine-independent, but assumes particular model.

 

A lesson: The light-weight threads-with-monitors programming paradigm can be used to successfully build large systems, but there are subtle points that have to be correct in the design and implementation in order to do so.