Advanced Topics in Computer Systems
10/15/01
Anthony Joseph & Joe Hellerstein
· Putting theory to practice
· Focus of this paper: lightweight processes (threads in today’s terminology) and how they synchronize with each other.
· 2nd system; followed the Alto.
· Planned to build a large system using many programmers. (Some thoughts about commercializing.)
· Advent of things like server machines and networking introduced applications that are heavy users of concurrency.
· Chose to build a single-address-space system:
o Single-user system, so protection not an issue. (Safety was to come from the language.)
o Wanted global resource sharing.
· Large system, many programmers, many applications:
o Module-based programming with information hiding.
· Since they were starting “from scratch”, they could integrate the hardware, the runtime software, and the language with each other.
· Needham & Lauer claimed the two models (message passing vs. shared memory) are duals of each other.
· Chose the shared-memory model because they thought they could fit it into Mesa as a language construct more naturally.
· Non-preemptive scheduler: tends to yield very delicate systems. Why?
o Have to know whether or not a yield might be called for every procedure you call. Violates information hiding.
o Prohibits multiprocessor systems.
o Need a separate preemptive mechanism for I/O anyway.
o Can’t do multiprogramming across page faults.
· Simple locking (e.g. semaphores): too little structuring discipline, e.g. no guarantee that locks will be released on every code path; wanted something that could be integrated into a Mesa language construct.
· Chose preemptive scheduling of lightweight processes and monitors.
· Easy forking and synchronization
· Shared address space
· Fast performance for creation, switching, and synchronization; low storage overhead.
· Monitor lock (for synchronization)
o Tied to the module structure of the language: makes it clear what’s being monitored.
o Language automatically acquires and releases the lock.
o Tied to a particular invariant, which helps users think about the program.
· Condition variable (for scheduling)
· Dangling process references, similar to dangling pointers. There are also language-based solutions that would prohibit these kinds of errors, such as do-across, which is just a parallel control structure: it eliminates dangling processes because the syntax defines the point of the fork and the join.
· 3 types of procedures in a monitor module:
o Entry (acquires and releases lock).
o Internal (no locking done): can’t be called from outside the module.
o External (no locking done): externally callable. Why is this useful?
§ Allows grouping of related things into a module.
§ Allows doing some of the work outside the monitor lock.
§ Allows controlled release and reacquisition of the monitor lock.
· Notify semantics:
o Cede the lock to the awakened process: too many context switches. Why would this approach be desirable? (The waiting process knows the condition it was waiting on is guaranteed to hold.)
o Notifier keeps the lock; the awakened process is put at the front of the monitor queue. Doesn’t work in the presence of priorities.
o Notifier keeps the lock and wakes a process with no guarantees => the awakened process must recheck its condition.
· Timeouts, broadcasts, aborts.
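The “recheck your condition” rule of Mesa-style notify survives in most modern thread libraries. A minimal Python sketch (my illustration, not the paper’s Mesa code) of a monitor-style bounded buffer, where the while-loop recheck is exactly what Mesa semantics require (a plain if would only be safe under Hoare-style cede-the-lock semantics):

```python
import threading

class BoundedBuffer:
    """Monitor-style buffer: the Lock plays the monitor lock,
    the Conditions support Wait/Notify."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []
        self.lock = threading.Lock()
        self.not_empty = threading.Condition(self.lock)
        self.not_full = threading.Condition(self.lock)

    def put(self, item):
        with self.lock:                              # entry procedure: acquire monitor lock
            while len(self.items) >= self.capacity:  # Mesa semantics: recheck in a loop
                self.not_full.wait()
            self.items.append(item)
            self.not_empty.notify()                  # notify is only a hint

    def get(self):
        with self.lock:
            while not self.items:                    # "while", not "if": the condition may
                self.not_empty.wait()                # no longer hold when we are awakened
            item = self.items.pop(0)
            self.not_full.notify()
            return item
```

Because every waiter rechecks, a spurious or superfluous wakeup costs only a context switch, never correctness.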
· Deadlocks: Wait only releases the lock of the current monitor, not those of any nested calling monitors. This is a general problem with modular systems and synchronization:
o Synchronization requires global knowledge about locks, which violates the information-hiding paradigm of modular programming.
· Why is monitor deadlock less onerous than the yield problem for non-preemptive schedulers?
o Want to generally insert as many yields as possible to provide increased concurrency; only use locks when you want to synchronize.
o Yield bugs are difficult to find (symptoms may appear far after the bogus yield).
· Basic deadlock rule: no recursion, direct or mutual.
o Alternatives? Impose a global ordering on lock acquisition.
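The lock-ordering alternative can be sketched in a few lines of Python (my illustration; the account/transfer names are hypothetical, not from the paper). Both locks are always taken in the same global order, so two opposite-direction transfers cannot form a cycle of waiters:

```python
import threading

class Account:
    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()

def transfer(src, dst, amount):
    # Acquire the two locks in a globally consistent order (here: by id()),
    # so concurrent opposite-direction transfers cannot deadlock.
    first, second = sorted((src, dst), key=id)
    with first.lock, second.lock:
        src.balance -= amount
        dst.balance += amount

a, b = Account(100), Account(50)
t1 = threading.Thread(target=transfer, args=(a, b, 30))
t2 = threading.Thread(target=transfer, args=(b, a, 10))
t1.start(); t2.start()
t1.join(); t2.join()
# Net effect: a loses 20, b gains 20.
```

Any total order works (address, creation index, name); what matters is that every code path uses the same one, which again requires global knowledge across modules.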
· Lock granularity: introduced monitored records so that the same monitor code could handle multiple instances of something in parallel.
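In an object-oriented language the monitored-record idea maps onto a per-instance lock; a tiny Python sketch (my illustration, not Mesa code):

```python
import threading

class MonitoredRecord:
    """The same 'monitor code' (this class) serves many record
    instances; each record carries its own lock, so operations on
    different records can proceed in parallel."""
    def __init__(self):
        self._lock = threading.Lock()   # per-record monitor lock
        self.count = 0

    def increment(self):                # 'entry procedure' on one record
        with self._lock:
            self.count += 1

r1, r2 = MonitoredRecord(), MonitoredRecord()
threads = [threading.Thread(target=r.increment) for r in (r1, r2) for _ in range(3)]
for t in threads: t.start()
for t in threads: t.join()
```

Contention is now per record rather than per module, which is the granularity win.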
· Interrupt handler can’t afford to wait to acquire a monitor lock.
· Introduced naked notifies: notifies done without holding the monitor lock.
· Had to worry about a timing race: the notify could occur between a monitor’s condition check and its call on Wait. Added a wakeup-waiting flag to condition variables.
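The wakeup-waiting flag behaves much like a binary event: a notify that arrives before the wait is remembered rather than lost. A Python sketch of the race being fixed (my illustration; Python’s threading.Event plays the role of the flag):

```python
import threading

wakeup = threading.Event()   # the wakeup-waiting flag
log = []

def interrupt_handler():
    # Naked notify: no monitor lock held; just set the flag.
    wakeup.set()

def waiting_process():
    # Even if the notify landed between our condition check and this
    # wait, the flag remembers it and we return immediately instead
    # of sleeping forever.
    wakeup.wait()
    log.append("woken")

interrupt_handler()          # the notify fires BEFORE the wait...
waiting_process()            # ...and is not lost
```

Without the flag (a bare condition variable), a notify delivered in that window would simply vanish and the waiter would sleep until the next one.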
· What happens with active messages that need to acquire a lock? (Move the handler to its own thread.)
· High-priority processes may block on lower-priority processes.
· A solution: temporarily increase the priority of the holder of the monitor to that of the highest-priority blocked process. (Somewhat tricky -- what happens when that high-priority process finishes with the monitor? You have to know the priority of the next highest => keep them sorted or scan the list on exit.)
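The bookkeeping described above can be sketched without a real scheduler. A Python sketch (my illustration; task names echo the Pathfinder story below) of a mutex that tracks the boost and recomputes it on release by keeping waiters sorted:

```python
class Task:
    def __init__(self, name, priority):
        self.name = name
        self.priority = priority      # base priority; higher = more urgent

class PIMutex:
    """Priority-inheritance bookkeeping only (no real scheduling).
    The holder's effective priority is boosted to the highest blocked
    priority; on release the boost must be recomputed from the
    remaining waiters -- hence keep them sorted, or scan on exit."""
    def __init__(self):
        self.holder = None
        self.waiters = []             # blocked tasks, highest priority first

    def block(self, task):
        self.waiters.append(task)
        self.waiters.sort(key=lambda t: -t.priority)

    def effective_priority(self):
        boost = self.waiters[0].priority if self.waiters else 0
        return max(self.holder.priority, boost)

met = Task("meteo", 1)                # low-priority holder
bus = Task("bus", 3)                  # high-priority task
m = PIMutex()
m.holder = met
m.block(bus)                          # high-priority task blocks on the mutex
```

While bus is blocked, the low-priority holder runs at priority 3, so no medium-priority task can starve it; when bus is removed from the waiter list, the boost falls back to the next-highest waiter (or the holder’s own priority).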
· Pathfinder’s successes included its unconventional "landing" -- bouncing onto the Martian surface surrounded by airbags, deploying the Sojourner rover, and gathering and transmitting voluminous data back to Earth, including the panoramic pictures that were such a hit on the Web.
· A few days later, just after Pathfinder started gathering meteorological data…
o The spacecraft began experiencing total system resets, each resulting in losses of data. The press reported these failures in terms such as "software glitches" and "the computer was trying to do too many things at once".
· Internally there was an “information bus”:
o A shared memory area used for passing information between different components of the spacecraft.
· Tasks:
o A bus management task ran frequently with high priority to move certain kinds of data in and out of the information bus. Access to the bus was synchronized with mutual exclusion locks (mutexes).
o A meteorological data gathering task ran as an infrequent, low-priority thread and used the information bus to publish its data. When publishing, it would acquire the mutex, write to the bus, and release the mutex. If an interrupt caused the information bus thread to be scheduled while this mutex was held, and the information bus thread then attempted to acquire the same mutex to retrieve published data, it would block on the mutex, waiting until the meteorological thread released it.
o A communications task ran with medium priority.
· Most of the time this combination worked fine. However, very infrequently it was possible for an interrupt to occur that caused the (medium-priority) communications task to be scheduled during the short interval while the (high-priority) information bus thread was blocked waiting for the (low-priority) meteorological data thread. In this case, the long-running communications task, having higher priority than the meteorological task, would prevent it from running, consequently preventing the blocked information bus task from running. After some time had passed, a watchdog timer would go off, notice that the data bus task had not been executed for some time, conclude that something had gone drastically wrong, and initiate a total system reset.
o This scenario is a classic case of priority inversion.
· Pathfinder used VxWorks.
· VxWorks can be run in a mode where it records a total trace of all interesting system events, including context switches, uses of synchronization objects, and interrupts. After the failure, JPL engineers spent hours and hours running the system on the exact spacecraft replica in their lab with tracing turned on, attempting to replicate the precise conditions under which they believed the reset occurred.
· Early in the morning, after all but one engineer had gone home, that engineer finally reproduced a system reset on the replica. Analysis of the trace revealed the priority inversion.
· When created, a VxWorks mutex object accepts a boolean parameter indicating whether the mutex should perform priority inheritance. The mutex in question had been initialized with the parameter off; had it been on, the low-priority meteorological thread would have inherited the priority of the high-priority data bus thread blocked on it while it held the mutex, causing it to be scheduled with higher priority than the medium-priority communications task and thus preventing the priority inversion. Once diagnosed, it was clear to the JPL engineers that enabling priority inheritance would prevent the resets they were seeing.
· VxWorks contains a C-language interpreter intended to allow developers to type in C expressions and functions to be executed on the fly during system debugging. The JPL engineers fortuitously decided to launch the spacecraft with this feature still enabled. By coding convention, the initialization parameters for the mutex in question (and for two others that could have caused the same problem) were stored in global variables, whose addresses were in symbol tables also included in the launch software and available to the C interpreter. A short C program was uploaded to the spacecraft which, when interpreted, changed the values of these variables from FALSE to TRUE. No more system resets occurred.
· Let:
o HIGHPriorityTaskBlocked() represent a situation where the information bus thread is blocked by the low-priority meteorological data gathering task.
o HIGHPriorityTaskInMutex() represent a situation where the information bus thread is in the mutex.
o LOWPriorityTaskInMutex() represent a situation where the meteorological thread is in the mutex.
o MEDPriorityTaskRunning() represent a situation where the communications task is running.
· Assertions:
o Not Eventually({HIGHPriorityTaskBlocked()} And {MEDPriorityTaskRunning()})
o This formula specifies that it should never be the case that HIGHPriorityTaskBlocked() And MEDPriorityTaskRunning() hold at the same time.
o Always({LOWPriorityTaskInMutex()} Implies Not {MEDPriorityTaskRunning()} Until {HIGHPriorityTaskInMutex()})
o This formula specifies that always, if LOWPriorityTaskInMutex() then MEDPriorityTaskRunning() does not occur until a later time when HIGHPriorityTaskInMutex().
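In standard linear temporal logic notation, the two assertions above can be written as follows (my transcription, using shortened names for the four predicates):

```latex
% Never is the high-priority task blocked while the medium one runs:
\Box\,\neg\big(\mathit{HighBlocked} \land \mathit{MedRunning}\big)

% Whenever the low-priority task holds the mutex, the medium task
% does not run until the high-priority task has entered the mutex:
\Box\,\big(\mathit{LowInMutex} \Rightarrow
      (\neg\mathit{MedRunning}\;\, \mathcal{U}\;\, \mathit{HighInMutex})\big)
```

Here \(\Box\) is "always", and \(\mathcal{U}\) is the "until" operator.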
· Using such assertions (written as comments in the Pathfinder code), the Temporal Rover would generate code that announces success and/or failure of any assertion during testing.
· Interestingly enough, the JPL engineers actually created a priority inversion situation during testing:
o 1-2 system resets during months of pre-flight testing. Not reproducible or explainable, so “it was probably caused by a hardware glitch”.
· They did not manage to analyze their recorded data well enough to conclude that priority inversion was indeed a bug in their system. In other words, their test runs were sufficient, but their analysis tools were not.
o Part of it too was the engineers’ focus: they were extremely focused on ensuring the quality and flawless operation of the landing software.
o Should it have failed, the mission would have been lost. It is entirely understandable for the engineers to discount occasional glitches in the less-critical land-mission software, particularly given that a spacecraft reset was a viable recovery strategy at that phase of the mission.
· Must restore the monitor invariant as you unwind the stack.
o But this requires explicit UNWIND handlers; otherwise the lock is not released.
· What does Java do?
o Releases the lock; has no UNWIND primitive.
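The automatic-release behavior is easy to demonstrate; a Python sketch (my illustration, using the with-statement as the analogue of Java’s synchronized block or a Mesa entry procedure):

```python
import threading

lock = threading.Lock()

def entry_procedure():
    # The with-statement releases the lock even when an exception
    # unwinds out of the body -- like Java's synchronized. But note:
    # nothing here restores the monitor invariant; that is what Mesa's
    # explicit UNWIND handlers were for, and Java has no equivalent hook.
    with lock:
        raise RuntimeError("unwinding out of the monitor")

try:
    entry_procedure()
except RuntimeError:
    pass
```

So in both Java and this sketch the lock is safe, but a half-updated shared structure can be exposed to the next caller unless the programmer adds a finally-style cleanup by hand.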
· Notify is only a hint.
o => Don’t have to wake up the right process.
o => Don’t have to change the notifier if we slightly change the wait condition (the two are decoupled).
o => Easier to implement, because it’s always OK to wake up too many processes. If we get lost, we could even wake up everybody (broadcast).
o Can we use broadcast everywhere there is a notify? Yes.
o Can we use notify everywhere there is a broadcast? No: the condition to proceed might be satisfied for waiter B but not for waiter A, and notify might wake only A.
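The broadcast-is-always-safe direction follows directly from the recheck loop; a Python sketch (my illustration) with two waiters waiting for different conditions on one condition variable:

```python
import threading

lock = threading.Lock()
cond = threading.Condition(lock)
state = {"a_ready": False, "b_ready": False}
results = []

def waiter(key):
    with lock:
        while not state[key]:   # Mesa recheck: a wrong wakeup just loops
            cond.wait()
        results.append(key)

ta = threading.Thread(target=waiter, args=("a_ready",))
tb = threading.Thread(target=waiter, args=("b_ready",))
ta.start(); tb.start()

with lock:
    state["a_ready"] = True
    state["b_ready"] = True
    cond.notify_all()           # broadcast: everyone rechecks, so always safe.
                                # A single notify() could wake only the waiter
                                # whose condition does NOT hold, which then
                                # sleeps again while the other never wakes.
ta.join(); tb.join()
```

The comment captures the asymmetry: broadcast costs extra wakeups at worst, while a lone notify can strand the one waiter whose condition actually became true.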
· Enables timeouts and aborts.
· General principle: use hints for performance that have little or, better yet, no effect on correctness.
o Many commercial systems use hints for fault tolerance: if the hint is wrong, things time out and use a backup strategy.
o => Performance hit for an incorrect hint, but no errors.
· Assumes a simple machine architecture:
o Single execution unit, non-pipelined.
o What about multiprocessors?
· Context switch is very fast: 2 procedure calls.
· Ended up not mattering much for performance:
o Ran only on uniprocessor systems.
o Concurrency mostly used for clean structuring purposes.
· Procedure calls are slow: 30 instructions (RISC procedure calls are 10x faster). Why?
o Due to heap-allocated procedure frames. Why did they do this?
§ Didn’t want to worry about colliding process stacks.
o Mental model was “any procedure call might be a fork”: transfer was the basic control-transfer primitive.
· Process creation: ~1100 instructions.
o Good enough most of the time.
o A fast-fork package implemented later keeps around a pool of “available” processes.
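The fast-fork idea of keeping a pool of ready processes is the ancestor of today’s thread pools; a Python sketch (my illustration) using the standard-library executor:

```python
from concurrent.futures import ThreadPoolExecutor

# A pool of already-created worker threads amortizes creation cost
# across many forks, the same idea as Mesa's fast-fork package
# keeping a pool of "available" processes.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(lambda x: x * x, i) for i in range(8)]
    results = [f.result() for f in futures]
```

Each submit reuses an idle worker instead of paying the ~1100-instruction creation cost of a fresh process.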
· Describes the experience the designers had designing, building, and using a large system that aggressively relies on lightweight processes and monitor facilities for all its software concurrency needs.
· Describes various subtle issues of implementing a threads-with-monitors design in real life for a large system.
· Discusses the performance and overheads of various primitives and three representative applications, but doesn’t give a big picture of how important various things turned out to be.
· Glosses over how hard it is sometimes to program with locks and exceptions. (Not clear if there are better ways.)
· Performance discussion doesn’t give the big picture.
o Tries to be machine-independent, but assumes a particular machine model.
A lesson: The light-weight threads-with-monitors programming paradigm can be used to successfully build large systems, but there are subtle points that have to be correct in the design and implementation in order to do so.