CS262B: Shared Memory Multiprocessors
Eric A. Brewer and Joe Hellerstein
March 1, 2001
Background
- Amdahl's law: speedup = T(1)/T(p) = (T_s + T_p(1)) / (T_s + T_p(p)), where T(p) is the time on p processors, T_s is the serial part of the program, and T_p(p) is the time on p processors for the parallel part (at best T_p(p) = T_p(1)/p). T_s bounds the speedup: even with T_p(p) -> 0, speedup <= T(1)/T_s. This was considered an argument against massively parallel computing. Is it wrong?
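  Worked example (illustrative numbers, not from the lecture): if the serial part is 10% of the uniprocessor time, T_s = 0.1*T(1), then even with the parallel part reduced to nothing the speedup is at most T(1)/T_s = 10; with p = 64 and perfect parallelization of the rest, speedup = 1 / (0.1 + 0.9/64) ~ 8.8.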
- Strided references: a sequence of non-contiguous memory references with a regular stride. Very common for reading a row of a matrix when it is stored by columns (or a column when stored by rows); see the sketch after the next bullet.
- Gather: move a sparse group of locations into a contiguous vector (such as after a strided read); scatter puts them back.
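  A minimal C sketch of the strided-read and gather/scatter patterns above (array and function names are illustrative):

    /* Reading column j of an n x m row-major matrix is a strided access:
       one element per row, stride m. */
    double sum_column(const double *a, long n, long m, long j) {
        double s = 0.0;
        for (long i = 0; i < n; i++)
            s += a[i * m + j];      /* touches a[j], a[j+m], a[j+2m], ... */
        return s;
    }

    /* Gather: pull sparse/strided locations into a contiguous vector;
       scatter: put the (possibly updated) values back. */
    void gather(double *dst, const double *src, const long *idx, long k) {
        for (long i = 0; i < k; i++) dst[i] = src[idx[i]];
    }
    void scatter(double *dst, const double *src, const long *idx, long k) {
        for (long i = 0; i < k; i++) dst[idx[i]] = src[i];
    }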
Cray T3E
- 3D torus; 64 nodes implies 4x4x4, which implies average hops = 3 (about 1 per dimension: with wraparound links, the distances from a node to the four positions in a 4-node ring are 0, 1, 2, 1, averaging 1).
- Shared memory: large global address space, but no caching
- Goal: high bandwidth, low overhead, low latency
- Latency tolerance (instead of caching)
- Loads/stores used to enable very fast message passing
- Barrier/Eureka network in hardware, but not a separate network, just special packets. Eureka is used to indicate that at least one node has met a condition (versus all nodes for a barrier).
- E-registers:
  - enable a larger address space (128MB vs 8MB)
  - translation on both sides (into and out of global virtual addresses)
  - translation includes a centrifuge for complex data layouts
  - enable atomic memory operations
  - full/empty bit on each one. Note: "RAW hazard" means a read-after-write conflict, i.e. reading something whose correct value is still being written
  - Little's law determines how many to have: given a target bandwidth and the (current implementation's) latency => how many you need to keep the network busy
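  Worked example (illustrative numbers): Little's law says data in flight = bandwidth x latency. To sustain 500 MB/s of remote traffic at a 1 us round-trip latency you need 500 bytes in flight, i.e. about 63 outstanding 8-byte E-register transfers.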
- Message passing: remote enqueue with either interrupts or polling
- Fuzzy (split-phase) barriers: arrival and waiting are separate operations, so useful work can overlap the barrier
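  A minimal sketch of the split-phase idea using pthreads (the names and the mutex/condvar structure are illustrative, not the T3E's hardware mechanism):

    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;   /* init with PTHREAD_MUTEX_INITIALIZER */
        pthread_cond_t  cond;   /* init with PTHREAD_COND_INITIALIZER */
        int total;              /* participating threads */
        int arrived;            /* arrivals in the current phase (init 0) */
        int phase;              /* generation counter (init 0) */
    } fuzzy_barrier_t;

    /* Phase 1: announce arrival, remember which phase we joined. */
    int fb_arrive(fuzzy_barrier_t *b) {
        pthread_mutex_lock(&b->lock);
        int my_phase = b->phase;
        if (++b->arrived == b->total) {   /* last arriver completes the phase */
            b->arrived = 0;
            b->phase++;
            pthread_cond_broadcast(&b->cond);
        }
        pthread_mutex_unlock(&b->lock);
        return my_phase;
    }

    /* Phase 2: block only if the phase has not completed yet. */
    void fb_wait(fuzzy_barrier_t *b, int my_phase) {
        pthread_mutex_lock(&b->lock);
        while (b->phase == my_phase)
            pthread_cond_wait(&b->cond, &b->lock);
        pthread_mutex_unlock(&b->lock);
    }

  A thread calls fb_arrive(), does unrelated local work, and only then calls fb_wait(), hiding barrier latency behind that work.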
- Does not exploit locality (much); this hurts irregular apps such as MP3D (in the Alewife paper)
- In some cases caching hurts, such as producer/consumer: the producer's remote write first brings the data into its own cache, and the consumer must then pull it back
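  A C11 sketch of that handoff pattern (conceptual only; the comments describe standard invalidation-based coherence behavior, and the one-slot flag protocol is illustrative):

    #include <stdatomic.h>

    typedef struct {
        long data;
        atomic_int ready;   /* 0 = empty, 1 = data present */
    } slot_t;

    void produce(slot_t *s, long v) {
        s->data = v;        /* store migrates the cache line to the producer,
                               invalidating the consumer's copy */
        atomic_store_explicit(&s->ready, 1, memory_order_release);
    }

    long consume(slot_t *s) {
        while (!atomic_load_explicit(&s->ready, memory_order_acquire))
            ;               /* load pulls the line back to the consumer */
        return s->data;
    }

  With uncached remote stores (T3E-style), the producer can instead deposit the data directly into the consumer's memory, avoiding the extra round trip.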
Alewife
- 2D grid (not a torus)
- Shared memory address space with aggressive caching
- Caching:
  - directory based (a list of nodes with a copy; single writer, multiple readers). Hardware holds up to four entries in the list, the rest in software
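  A rough C sketch of such a limited directory entry (field names, states, and the four-pointer limit follow the notes; the encoding is illustrative, not Alewife's actual format):

    #define HW_PTRS 4

    typedef struct {
        int state;               /* e.g. UNCACHED, READ_SHARED, WRITE_OWNED */
        int sharers[HW_PTRS];    /* node ids holding a copy */
        int nsharers;
        int overflow;            /* >HW_PTRS sharers: software keeps the rest */
    } dir_entry_t;

    /* Record a new sharer; returns 1 if hardware handled it, 0 if it must
       trap to software to extend the sharer list. */
    int dir_add_sharer(dir_entry_t *d, int node) {
        if (d->nsharers < HW_PTRS) {
            d->sharers[d->nsharers++] = node;
            return 1;
        }
        d->overflow = 1;
        return 0;
    }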
- Message passing:
  - low-latency atomic message launch, with interrupts or polling
  - cache-coherent DMA (hard)
- Synchronization:
  - full/empty bits on every word => a problem for floating point! (30-bit words)
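  A conceptual C11 sketch of full/empty synchronization (the same idea behind the T3E E-registers' full/empty bits and the RAW hazard noted there; real hardware keeps the bit alongside the word and stalls or traps rather than spinning):

    #include <stdatomic.h>

    typedef struct {
        atomic_int full;   /* 0 = empty: the value is still being produced */
        long value;
    } ef_word_t;

    void ef_write(ef_word_t *w, long v) {   /* write the value, then mark full */
        w->value = v;
        atomic_store_explicit(&w->full, 1, memory_order_release);
    }

    long ef_read(ef_word_t *w) {            /* block until the word is full */
        while (!atomic_load_explicit(&w->full, memory_order_acquire))
            ;   /* waiting here is exactly the read-after-write conflict */
        return w->value;
    }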
- Latency tolerance:
  - block multithreading: four contexts, single-cycle context switch
  - switch on a cache miss or a full/empty error
  - scheduler decides which four threads to load (load/unload = a traditional context switch)
  - livelock case: switch on a cache miss, then another thread throws out the value before you switch back
  - prefetching (hint only)
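  A small illustration of a software prefetch hint, using the GCC/Clang __builtin_prefetch intrinsic as a stand-in for Alewife's prefetch (the look-ahead distance of 64 elements is arbitrary):

    /* Request data a few iterations ahead while working on the current
       element; like Alewife's prefetch, this is only a hint and may be
       dropped by the hardware. */
    void scale(double *a, long n, double k) {
        for (long i = 0; i < n; i++) {
            if (i + 64 < n)
                __builtin_prefetch(&a[i + 64], 0 /* read */, 1 /* low reuse */);
            a[i] *= k;
        }
    }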
Message Passing vs Shared Memory:
- Message passing:
  - more portable?
  - must explicitly partition the program
  - simpler parallelism: encapsulation provided by local memory (i.e. separate namespaces)
  - easier to optimize automatically
  - easier to detect parallelism
- Shared memory:
  - easier migration of legacy programs, but they need heavy tuning
  - simpler model for programmers, but hard to optimize (must explicitly partition to optimize)
  - very hard to recover from faults: every load/store could raise an exception
  - hard to predict performance (cached/local/remote?)