CS262B: Shared Memory Multiprocessors
Eric A. Brewer and Joe Hellerstein
March 1, 2001
Background
- Amdahl's law: speedup = T(1)/T(p) = (T_s + T_p(1)) / (T_s + T_p(p)), where T(p) is the time on p processors, T_s is the serial part of the program, and T_p(p) is the time on p processors for the parallel part (at best T_p(p) = T_p(1)/p). T_s bounds the speedup: even with T_p(p) -> 0, speedup <= T(1)/T_s. This was considered an argument against massively parallel computing. Is it wrong?
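  Worked example (illustrative numbers, not from the lecture): if the serial part is 10% of the uniprocessor time, T_s = 0.1*T(1), then even with the parallel part reduced to nothing the speedup is at most T(1)/T_s = 10; with p = 64 and perfect parallelization of the rest, speedup = 1 / (0.1 + 0.9/64) ~ 8.8.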
- Strided references: a sequence of non-contiguous memory references with a regular stride. Very common for reading a row of a matrix when it is stored by columns (or a column when stored by rows); see the sketch after the next bullet.
- Gather: move a sparse group of locations into a contiguous vector (such as after a strided read); scatter puts them back.
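  A minimal C sketch of the strided-read and gather/scatter patterns above (array and function names are illustrative):

    /* Reading column j of an n x m row-major matrix is a strided access:
       one element per row, stride m. */
    double sum_column(const double *a, long n, long m, long j) {
        double s = 0.0;
        for (long i = 0; i < n; i++)
            s += a[i * m + j];      /* touches a[j], a[j+m], a[j+2m], ... */
        return s;
    }

    /* Gather: pull sparse/strided locations into a contiguous vector;
       scatter: put the (possibly updated) values back. */
    void gather(double *dst, const double *src, const long *idx, long k) {
        for (long i = 0; i < k; i++) dst[i] = src[idx[i]];
    }
    void scatter(double *dst, const double *src, const long *idx, long k) {
        for (long i = 0; i < k; i++) dst[idx[i]] = src[i];
    }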
Cray T3E
- 3D torus; 64 nodes implies 4x4x4, which implies average hops = 3 (about 1 per dimension: with wraparound links, the distances from a node to the four positions in a 4-node ring are 0, 1, 2, 1, averaging 1).
- Shared memory: large global address space, but no caching
- Goal: high bandwidth, low overhead, low latency
- Latency tolerance (instead of caching)
- Loads/stores used to enable very fast message passing
- Barrier/Eureka network in hardware, but not a separate network, just special packets. Eureka is used to indicate that at least one node has met a condition (versus all nodes for a barrier).
- E-registers:
  - enable a larger address space (128MB vs 8MB)
  - translation on both sides (into and out of global virtual addresses)
  - translation includes a centrifuge for complex data layouts
  - enable atomic memory operations
  - full/empty bit on each one. Note: "RAW hazard" means a read-after-write conflict, i.e. reading something whose correct value is still being written
  - Little's law determines how many to have: given a target bandwidth and the (current implementation's) latency => how many you need to keep the network busy
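  Worked example (illustrative numbers): Little's law says data in flight = bandwidth x latency. To sustain 500 MB/s of remote traffic at a 1 us round-trip latency you need 500 bytes in flight, i.e. about 63 outstanding 8-byte E-register transfers.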
- Message passing: remote enqueue with either interrupts or polling
- Fuzzy (split-phase) barriers: arrival and waiting are separate operations, so useful work can overlap the barrier
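  A minimal sketch of the split-phase idea using pthreads (the names and the mutex/condvar structure are illustrative, not the T3E's hardware mechanism):

    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;   /* init with PTHREAD_MUTEX_INITIALIZER */
        pthread_cond_t  cond;   /* init with PTHREAD_COND_INITIALIZER */
        int total;              /* participating threads */
        int arrived;            /* arrivals in the current phase (init 0) */
        int phase;              /* generation counter (init 0) */
    } fuzzy_barrier_t;

    /* Phase 1: announce arrival, remember which phase we joined. */
    int fb_arrive(fuzzy_barrier_t *b) {
        pthread_mutex_lock(&b->lock);
        int my_phase = b->phase;
        if (++b->arrived == b->total) {   /* last arriver completes the phase */
            b->arrived = 0;
            b->phase++;
            pthread_cond_broadcast(&b->cond);
        }
        pthread_mutex_unlock(&b->lock);
        return my_phase;
    }

    /* Phase 2: block only if the phase has not completed yet. */
    void fb_wait(fuzzy_barrier_t *b, int my_phase) {
        pthread_mutex_lock(&b->lock);
        while (b->phase == my_phase)
            pthread_cond_wait(&b->cond, &b->lock);
        pthread_mutex_unlock(&b->lock);
    }

  A thread calls fb_arrive(), does unrelated local work, and only then calls fb_wait(), hiding barrier latency behind that work.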
- Does not exploit locality (much); this hurts irregular apps such as MP3D (in the Alewife paper)
- In some cases caching hurts, such as producer/consumer: the producer's remote write first brings the data into its own cache, and the consumer must then pull it back
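  A C11 sketch of that handoff pattern (conceptual only; the comments describe standard invalidation-based coherence behavior, and the one-slot flag protocol is illustrative):

    #include <stdatomic.h>

    typedef struct {
        long data;
        atomic_int ready;   /* 0 = empty, 1 = data present */
    } slot_t;

    void produce(slot_t *s, long v) {
        s->data = v;        /* store migrates the cache line to the producer,
                               invalidating the consumer's copy */
        atomic_store_explicit(&s->ready, 1, memory_order_release);
    }

    long consume(slot_t *s) {
        while (!atomic_load_explicit(&s->ready, memory_order_acquire))
            ;               /* load pulls the line back to the consumer */
        return s->data;
    }

  With uncached remote stores (T3E-style), the producer can instead deposit the data directly into the consumer's memory, avoiding the extra round trip.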
Alewife
- 2D grid (not a torus)
- Shared memory address space with aggressive caching
- Caching:
  - directory based (a list of nodes with a copy; single writer, multiple readers). Hardware holds up to four entries in the list, the rest in software
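  A rough C sketch of such a limited directory entry (field names, states, and the four-pointer limit follow the notes; the encoding is illustrative, not Alewife's actual format):

    #define HW_PTRS 4

    typedef struct {
        int state;               /* e.g. UNCACHED, READ_SHARED, WRITE_OWNED */
        int sharers[HW_PTRS];    /* node ids holding a copy */
        int nsharers;
        int overflow;            /* >HW_PTRS sharers: software keeps the rest */
    } dir_entry_t;

    /* Record a new sharer; returns 1 if hardware handled it, 0 if it must
       trap to software to extend the sharer list. */
    int dir_add_sharer(dir_entry_t *d, int node) {
        if (d->nsharers < HW_PTRS) {
            d->sharers[d->nsharers++] = node;
            return 1;
        }
        d->overflow = 1;
        return 0;
    }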
- Message passing:
  - low-latency atomic message launch, with interrupts or polling
  - cache-coherent DMA (hard)
- Synchronization:
  - full/empty bits on every word => a problem for floating point! (30-bit words)
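  A conceptual C11 sketch of full/empty synchronization (the same idea behind the T3E E-registers' full/empty bits and the RAW hazard noted there; real hardware keeps the bit alongside the word and stalls or traps rather than spinning):

    #include <stdatomic.h>

    typedef struct {
        atomic_int full;   /* 0 = empty: the value is still being produced */
        long value;
    } ef_word_t;

    void ef_write(ef_word_t *w, long v) {   /* write the value, then mark full */
        w->value = v;
        atomic_store_explicit(&w->full, 1, memory_order_release);
    }

    long ef_read(ef_word_t *w) {            /* block until the word is full */
        while (!atomic_load_explicit(&w->full, memory_order_acquire))
            ;   /* waiting here is exactly the read-after-write conflict */
        return w->value;
    }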
- Latency tolerance:
  - block multithreading: four contexts, single-cycle context switch
  - switch on a cache miss or a full/empty error
  - scheduler decides which four threads to load (load/unload = a traditional context switch)
  - livelock case: switch on a cache miss, then another thread throws out the value before you switch back
  - prefetching (hint only)
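  A small illustration of a software prefetch hint, using the GCC/Clang __builtin_prefetch intrinsic as a stand-in for Alewife's prefetch (the look-ahead distance of 64 elements is arbitrary):

    /* Request data a few iterations ahead while working on the current
       element; like Alewife's prefetch, this is only a hint and may be
       dropped by the hardware. */
    void scale(double *a, long n, double k) {
        for (long i = 0; i < n; i++) {
            if (i + 64 < n)
                __builtin_prefetch(&a[i + 64], 0 /* read */, 1 /* low reuse */);
            a[i] *= k;
        }
    }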
Message Passing vs Shared Memory:
- Message passing:
  - more portable?
  - must explicitly partition the program
  - simpler parallelism: encapsulation provided by local memory (i.e. separate namespaces)
  - easier to optimize automatically
  - easier to detect parallelism
- Shared memory:
  - easier migration of legacy programs, but they need heavy tuning
  - simpler model for programmers, but hard to optimize (must explicitly partition to optimize)
  - very hard to recover from faults: every load/store could raise an exception
  - hard to predict performance (cached/local/remote?)