Parallelism & Gamma
Background
Parallelism research happened along multiple tracks. The OS/compilers/scientific-computing
community was one track in the '80s; parallel DBMSs were another.
It's worth noting that mainframes were parallel computers long before it
was sexy...though the parallelism was used mostly just for multitasking.
A pattern in parallel systems research?
- Explore specialized hardware to give better performance through parallelism.
- Move parallelism ideas into software running over commodity hardware.
- Along the way, better understanding of algorithms, pounding down of architectural bottlenecks.
- Drive for performance and scale leads to reliability/maintainability research.
The database machine
(Boral/DeWitt '83, "Database Machines: An Idea Whose Time Has Passed?")
All mixed up: trying to make storage devices faster, and do "more of
the work". Something of a hodgepodge of extra processors and novel
storage devices, and combinations of the two.
- Processor Per Track (PPT). Examples: CASSM, RAP, RARES. Goal: each processor sees random-access storage, no seeks or indexes (super-parallelism). Too expensive. Wait for bubble memory, charge-coupled devices (CCDs)? Still waiting...
- Processor Per Head (PPH). Examples: DBC, SURE. Helps with selection queries; avoids transferring data across the I/O channel. Also cylinder-per-revolution for maximum parallelism -- parallel-readout disks (read all heads at once). This has become increasingly difficult over time ("settle" is increasingly tricky as disks get faster and smaller).
- Off-the-disk DB machines. Examples: DIRECT, RAP.2, RDBM, DBMAC, INFOPLEX. Precursors of today's parallel DBs. DIRECT: a controller CPU, plus special-purpose query-processing CPUs with shared disks and memory (shared everything!).
- 1981: Britton-Lee Database Machine. A Sun box with a special processor. They had their own OS and compiler for that processor.
  "There was 1 6MHz Z8000 database processor, up to 4 Z8000 communications processors, and up to 4 'intelligent' disk controllers (they could optimize their own sector schedule, I think). I think you could put up to 16 .5-gig drives on the thing and probably 4-6 megs of memory. About a year later we upped the DBP to 10MHz (yielding a mighty 1 MIPS)." -- Mike Ubell
All failed. Why?
- These don't help much with sort, join, etc. -- only with the easy stuff. Not clear whether they addressed the performance bottlenecks that existed in the hardware of the time; even less clear today.
- Important lesson: special-purpose hardware is a losing proposition:
  - prohibitively expensive (no economy of scale)
  - slow to evolve
  - requires its own tool set
Is it time to revisit all this??
- IDISK @ Berkeley tried recently: a processor + big cache per disk. Similarly, Jim Gray talks about "cyberbricks". But isn't it all just "shared-nothing"? Wait and see below...
Parallel DB 101
Performance metrics (a worked example follows this list):
- Speedup = old_time / new_time: a bigger system, same problem.
- Scaleup = small_system_elapsed_time(small_problem) / big_system_elapsed_time(big_problem).
- Transaction scaleup: N times as many TPC-C transactions for N machines.
- Batch scaleup: an N-times-bigger query for N machines.
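A quick worked example of the two basic metrics (all numbers are made up, purely for illustration):

```python
# Speedup: same problem, bigger system.
old_time = 100.0   # elapsed seconds on 1 node
new_time = 12.5    # elapsed seconds on 8 nodes
speedup = old_time / new_time      # 8.0 -> linear speedup on 8 nodes

# Scaleup: N-times-bigger problem on an N-times-bigger system.
small_sys_small_prob = 100.0       # 1 node, problem of size S
big_sys_big_prob = 110.0           # 8 nodes, problem of size 8*S
scaleup = small_sys_small_prob / big_sys_big_prob   # ~0.91; 1.0 is ideal

print(f"speedup={speedup:.2f}, scaleup={scaleup:.2f}")
```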
2 kinds of data parallelism:
- pipelined (most pipelines are short in traditional query processing)
- partitioned
3 barriers to linearity:
- startup overheads
- interference: usually the result of unpredictable communication delays (comm cost, empty pipelines)
- skew
3 basic architectures:
- shared-memory
- shared-disk
- shared-nothing
Ask yourself about:
- ease of programming
- cost of equipment (and size of its user base)
- reliability/availability (both MTTF and MTTR)
- who controls resources, and how
- performance goals, esp. latency vs. bandwidth. Where's your system bottleneck?
- maintenance: utilities, DB design, admin wizardry
For an entertaining early take on all this, see Stonebraker's quickie overview "The Case for Shared Nothing" (HPTS '85). Interestingly, it doesn't cover the "batch scaleup" problem (probably because of the charter of the HPTS workshop).
DeWitt et al., GAMMA DB Machine
- Gamma: a "shared-nothing" multiprocessor system.
- Other shared-nothing systems: Bubba (MCC), Volcano (Colorado), Teradata (now owned by NCR), Tandem, Informix, IBM DB2 Parallel Edition.
- Gamma Version 1.0 (1985):
  - shared nothing: a token ring connecting 20 VAX 11/750s
  - eight of the processors had a disk each
  - architectural tuning issues:
    - the token ring packet size was 2K, so they used 2K disk blocks. Mistake.
    - bottleneck at the Unibus (1/20 the bandwidth of the network itself -- slower than disk!)
    - the network interface was also a bottleneck: it only buffered 2 incoming packets
    - only 2M of RAM per processor, and no virtual memory
  - This becomes a pattern in the literature: architectural system balance.
- Version 2.0 (1989):
  - Intel iPSC/2 hypercube (32 386s, 8M RAM each, one 333M disk each).
  - Networking provides 8 full-duplex reliable channels at a time.
  - Small messages are datagrams; larger ones form a virtual circuit which goes away at EOT.
  - Usual multiprocessor story: tuned for scientific apps.
    - The OS supports a few heavyweight processes.
    - Solution: write a new OS! (NOSE) -- more like a thread package.
  - The SCSI controller transferred only 1K blocks.
- Other complaints:
  - They want disk-to-memory DMA. As it stands, 10% of cycles are wasted copying from the I/O buffer, and the CPU is constantly interrupted to do so (13 times per 8K block!).
- "The Future", circa 1990: CM-5, Intel Touchstone Sigma Machine.
- "The Future", circa 1995: a cluster with a "SAN" (Myrinet, Gigabit Ethernet, etc.).
- "The Future", circa 2000: ??
- How does this differ from a distributed DBMS?
  - no notion of site autonomy
  - centralized schema
  - all queries start at a "host"
  - assumption of very high bandwidth
- Storage organization: all relations are "horizontally partitioned" across all disk drives, in one of four ways (see the sketch after this list):
  - round robin: the default; used for all query outputs
  - hashed: randomize on key attributes
  - range partitioned (specified distribution), stored in a range table for the relation
  - range partitioned (uniform distribution): sort, then cut into equal pieces
- Within a site, store however you wish.
- Indices are created at all sites.
- A better idea: heat (Bubba?)
  - hotter relations get partitioned across more sites
  - why is this better?
- Note: all this adds to the administration of a DBMS. Yuck.
- A "primary multiprocessor index" identifies where a tuple resides.
Query Execution
- An operator tree is constructed, and each node is assigned one or more operator processes at each site.
- Thread-pool architecture: operator processes never die.
- Hash-based algorithms only, for join & grouping.
- "Left-deep" trees (I call this "right-deep"; think of it as "building-deep"):
  - pipelines at most 2 joins deep (probe lower, build higher)
  - this was done because of lack of RAM
  - "right-deep" (my "left-deep", or "probing-deep") is better -- with sufficient spindles and k*sqrt(N) RAM, you get full parallelism in the build, and a one-pipeline probe during hybrid hash bucket 0 above
- Doubly-pipelined hash join was invented for pipelined parallelism in main memory, and recently extended to the out-of-core case.
- Life of a query:
  - parsed and optimized at the host (Query Manager)
  - if it's a single-site query, sent to that site for execution
  - otherwise sent to a dispatcher process (admission control)
  - the dispatcher process gives the query to a scheduler process
  - the scheduler process passes pieces to operator processes at the sites
  - results are sent back to the scheduler, who passes them to the Query Manager for display
- More detail:
  - the example query in the paper
  - a split table is used to route result tuples to the appropriate processors. Three types: Hashed, Range-Partitioned, and Round-Robin -- the same rules as the storage partitioning above.
  - selection: uses indices, compiled predicates, prefetching
  - join: the basic idea is to use hybrid hash, with one bucket per processor
    - tuples corresponding to each logical bucket should fit in the aggregate memory of the participating processors
    - each logical bucket after the first is split across all disks (there may be diskless processors)
  - aggregation: done piecewise; groups are accumulated at individual nodes for the final result
    - Question: can all aggs be done piecewise? (See the sketch after this list.)
  - updates: as usual, unless the update requires moving a tuple (when?)
  - control messages: 3x as many as operators in the plan tree
    - scheduler: Initiate
    - operator: ID of the port to talk to
    - operator: Done, can reuse the thread
  - That's all the coordination needed!!
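On the "can all aggs be done piecewise?" question: the standard answer is that an aggregate parallelizes this way only if it decomposes into small local state plus a global combine step. SUM, COUNT, MIN, and MAX do; AVG does via (SUM, COUNT); holistic aggregates like MEDIAN do not. A sketch with made-up per-node data:

```python
# Each node computes a local (sum, count) over its partition;
# a final site combines the partial states.
partitions = [[3, 5, 8], [1, 9], [4, 4, 4, 4]]   # hypothetical per-node data

local_states = [(sum(p), len(p)) for p in partitions]   # at each node
total_sum = sum(s for s, _ in local_states)             # at the combiner
total_count = sum(c for _, c in local_states)
avg = total_sum / total_count                           # 4.666...

# MEDIAN has no constant-size local state: per-node medians
# cannot be combined into the global median, so it is not piecewise.
```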
Concurrency and Recovery
- CC: 2PL with bi-granularity locking (file & page); centralized deadlock detection. (Another distinction from distributed DBMSs.) A sketch of bi-granularity locking follows below.
- ARIES-based recovery, with static assignment of processors to log sites.
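For concreteness, a sketch of bi-granularity locking using the textbook intention-lock compatibility matrix (this is the standard multi-granularity scheme restricted to two levels, not Gamma's actual lock manager):

```python
# To S-lock a page, first take IS on its file; to X-lock a page,
# first take IX on its file. File-level S/X locks cover all pages.
COMPAT = {
    "IS": {"IS": True,  "IX": True,  "S": True,  "X": False},
    "IX": {"IS": True,  "IX": True,  "S": False, "X": False},
    "S":  {"IS": True,  "IX": False, "S": True,  "X": False},
    "X":  {"IS": False, "IX": False, "S": False, "X": False},
}

def can_grant(requested, held_modes):
    # Grant a file-level lock only if it is compatible with every
    # mode already held on that file by other transactions.
    return all(COMPAT[requested][h] for h in held_modes)

assert can_grant("IX", ["IS", "IX"])   # two writers touching different pages
assert not can_grant("X", ["IS"])      # a file-level X excludes readers
```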
Availability, Fault Tolerance: Chained Declustering
- nodes belong to relation clusters
- relations are declustered within a relation cluster
- backups are declustered "one disk off" (see the sketch after this list)
- tolerates the failure of a single disk or processor
- discussion in the paper vs. interleaved declustering
- these kinds of coding issues have been beaten to death since
- note that RAID does this for files (sequences); declustering does it for relations (sets). Sets are much nicer to work with.
- glosses over how indexes are handled in case of failure (another paper)
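A sketch of the "one disk off" placement and the failover read choice (the cluster size and policy details are illustrative; the paper's scheme also rebalances the survivors' loads down the chain, which this sketch omits):

```python
N = 4  # hypothetical number of nodes in the relation cluster

def placement(fragment):
    # Chained declustering: the backup copy of each fragment lives
    # one disk off from its primary.
    primary = fragment % N
    backup = (fragment + 1) % N
    return primary, backup

def read_site(fragment, failed=None):
    # With all nodes up, read the primary. If the primary's node is
    # down, the backup (one disk off) serves the fragment instead.
    primary, backup = placement(fragment)
    return backup if primary == failed else primary

# Any single failure leaves every fragment readable:
assert all(read_site(f, failed=2) != 2 for f in range(2 * N))
```

Because the full scheme shifts fractions of each survivor's primary work to its neighbor, each of the N-1 survivors ends up with roughly N/(N-1) of a normal load, rather than one node absorbing a 2x hit.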
Performance Results
- Simplified picture: Gamma gets performance by...
  - running multiple small queries in parallel at (hopefully) disjoint sites, and
  - running big queries over multiple sites -- the idea here is:
    - logically partition the problem so that each subproblem is independent, and
    - run the partitions in parallel, one per processor.
    - Examples: hash joins, sorting, ...
- Performance results:
  - a side benefit of declustering is that small relations mean fewer & smaller seeks (?? WHAT??)
  - big queries get linear speedup
    - not perfect, but amazingly close
  - pretty constant scaleup
  - some other observations:
    - hash join goes a bit faster if the data is already partitioned on the join attribute
    - but not much! Redistribution of tuples isn't too expensive -- the bottleneck is not in communication.
    - as you add processors, you lose the benefits of short-circuited messages, in addition to incurring a slight overhead for the additional processes
Missing Research Issues (biggies!):
- query optimization (scheduling, query rewriting for subqueries)
- load balancing: inter-query parallelism combined with intra-query parallelism
- disk striping, reliability, etc.
- online admin utilities
- database design
- skew handling for non-standard data types
Some Themes in Parallel DBs
...that distinguish them from other parallel programming tasks:
- Hooray for the relational model:
  - apps don't change when you parallelize the system (physical data independence!); you can tune and scale the system without informing apps
  - ability to partition records arbitrarily, without synchronization
  - lack of pointers means no need for low-latency transfer of data
    - instead of pointer-chasing, batch partitioning + joins... THIS IS GENERALIZABLE!
  - essentially no synchronization except setup & teardown
    - no barriers, cache coherence, etc.
- DB transactions work fine in parallel:
  - data is updated in place, with 2-phase-locking transactions
  - replicas are managed only at EOT via 2-phase commit
  - coarser grain, higher overhead than the cache-coherency stuff
- Bandwidth is much more important than latency:
  - you often pump (1 - 1/n) of a table through the network
  - aggregate network BW should match aggregate disk BW
  - bus BW should match about 3x disk BW (NW send, NW receive, disk)
  - Latency, schmatency. Its insignificance makes a BIG difference in what architectures are needed.
- Shared memory helps with skew
  - but distributed work queues can solve this (?) (River)
Exchange: Encapsulation of Parallelism
Exchange also encapsulates communication, connecting push-based producers to pull-based consumers.
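A toy sketch of the exchange idea: the operators on either side keep their ordinary single-site iterator interface, and exchange alone hides partitioning, the process boundary, and the push/pull seam. The threading model and names below are illustrative assumptions, not Volcano's actual code:

```python
import queue, threading

class Exchange:
    """Volcano-style exchange: producer threads push tuples into
    per-consumer queues, routed by a split function (hash, range,
    or round-robin); consumers keep an ordinary pull interface."""
    def __init__(self, child_iters, split_fn, num_consumers):
        self.queues = [queue.Queue(maxsize=64) for _ in range(num_consumers)]
        self.split_fn = split_fn
        self._lock = threading.Lock()
        self._live_producers = len(child_iters)
        for it in child_iters:
            threading.Thread(target=self._produce, args=(it,), daemon=True).start()

    def _produce(self, it):
        for tup in it:
            self.queues[self.split_fn(tup)].put(tup)  # blocks when full: flow control
        with self._lock:
            self._live_producers -= 1
            if self._live_producers == 0:
                for q in self.queues:
                    q.put(None)  # EOF marker for every consumer

    def get_next(self, consumer_id):
        # Pull interface for the consumer side; None signals end of stream.
        return self.queues[consumer_id].get()

# Hypothetical usage: two producer partitions, two consumers,
# hash-split on the tuple's first field.
ex = Exchange([iter([(1, "a"), (2, "b")]), iter([(3, "c")])],
              split_fn=lambda t: t[0] % 2, num_consumers=2)
```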
River
Background: NOW-Sort was a CS286 project that ended up being the world's fastest sorting machine for 2 years, generating a number of papers. It was an amazing feat, but they could only get it to run at record speed late at night, with much hand-holding. The problems lay in the (sometimes transient) performance heterogeneity of the machines:
- Some computers in a cluster are faster than others.
- Some disks in a cluster are faster than others.
- Even if you declare that your cluster will be homogeneous, it won't be!
  - Example 1: some machine may have a stray job running on it, which slows it down.
  - Example 2: some disk may have a scratch file on its outer tracks, which can leave the throughput available to other apps as much as 2x slower.
So a cluster-based, I/O-intensive system should be tolerant of performance heterogeneity. River was an attempt to do that. It used two mechanisms:
- For balancing different rates of data consumption, a distributed queue (DQ) allows workers to consume data at varying rates (like two people on a date "slurping a single milkshake from two straws"). This comes from shared-memory work.
- For balancing different rates of data production, the graduated declustering (GD) scheme generalized Gamma's "chained declustering" to handle slowdowns, not just failures, and ensured that each disk feeds its natural share of the data.
The DQ mechanism is based on a very simple randomized "push" scheme, in which each producer randomly picks a destination for the next datum, subject to a constraint on the number of outstanding unconsumed items at each destination. Note that DQs are encapsulated into the equivalent of local "exchange" operators (DQsend modules and DQrecv modules).
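A sketch of that randomized push discipline (the bound, queue type, and retry policy are assumptions layered on the description above):

```python
import queue, random

MAX_OUTSTANDING = 16  # assumed cap on unconsumed items per destination

class DQSend:
    # Producer side of a distributed queue: pick a random consumer,
    # but never exceed the outstanding-item bound at any destination.
    # Bounded queues mean a slow consumer naturally receives less data.
    def __init__(self, consumer_queues):
        self.queues = consumer_queues

    def push(self, datum):
        while True:
            q = random.choice(self.queues)
            try:
                q.put_nowait(datum)   # succeeds only if under the bound
                return
            except queue.Full:
                pass                  # that consumer is backed up; retry elsewhere

consumers = [queue.Queue(maxsize=MAX_OUTSTANDING) for _ in range(4)]
sender = DQSend(consumers)
sender.push(("tuple", 42))
```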
The GD mechanism uses a more sophisticated feedback loop, by which each producer generates data at the appropriate relative rate.
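The rough intuition, as a sketch: each fragment is mirrored on two disks (chained, one disk off), and a consumer splits its remaining reads across the two replicas in proportion to the bandwidth it actually observes from each, so a slowed disk automatically sheds load to its partner. The proportional-split policy below illustrates the idea and is not the paper's exact protocol:

```python
def split_read(bytes_remaining, bw_primary, bw_backup):
    # Feedback step: weight the next round of reads for one fragment
    # by the observed bandwidth of its two replicas.
    total = bw_primary + bw_backup
    if total == 0:
        half = bytes_remaining // 2
        return half, bytes_remaining - half
    from_primary = int(bytes_remaining * bw_primary / total)
    return from_primary, bytes_remaining - from_primary

# A primary running at half the backup's speed is asked for only
# a third of the remaining bytes on the next round:
print(split_read(90_000, bw_primary=10.0, bw_backup=20.0))  # (30000, 60000)
```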