Parallelism & Gamma
Background
Parallelism research happened along multiple tracks. The OS/compilers/scientific-computing
community was one track in the '80s; parallel DBMSs were another.
It's worth noting that mainframes were parallel computers long before it
was sexy...though the parallelism was used mostly just for multitasking.
A pattern in parallel systems research?
- Explore specialized hardware to give better performance through parallelism.
- Move parallelism ideas into software running over commodity hardware.
- Along the way, better understanding of algorithms, pounding down of architectural bottlenecks.
- Drive for performance and scale leads to reliability/maintainability research.
The database machine
(Boral/DeWitt '83, "Database Machines: An Idea Whose Time Has Passed?")
All mixed up: trying to make storage devices faster, and do "more of
the work". Something of a hodgepodge of extra processors and novel
storage devices, and combinations of the two.
- Processor Per Track (PPT). Examples: CASSM, RAP, RARES. Goal: each processor sees random-access storage, no seeks or indexes (super-parallelism). Too expensive. Wait for bubble memory, charge-coupled devices (CCDs)? Still waiting...
- Processor Per Head (PPH). Examples: DBC, SURE. Helps with selection queries; avoids transferring data across the I/O channel. Also cylinder-per-revolution for maximum parallelism -- parallel-readout disks (read all heads at once). This has become increasingly difficult over time ("settle" is increasingly tricky as disks get faster and smaller).
- Off-the-disk DB machines. Examples: DIRECT, RAP.2, RDBM, DBMAC, INFOPLEX. Precursors of today's parallel DBs. DIRECT: a controller CPU, plus special-purpose query-processing CPUs with shared disks and memory (shared everything!).
- 1981: Britton-Lee Database Machine. A Sun box with a special processor. They had their own OS and compiler for that processor.
  "There was 1 6MHz Z8000 database processor, up to 4 Z8000 communications processors, and up to 4 'intelligent' disk controllers (they could optimize their own sector schedule, I think). I think you could put up to 16 .5-gig drives on the thing and probably 4-6 megs of memory. About a year later we upped the DBP to 10MHz (yielding a mighty 1 MIPS)." -- Mike Ubell
All failed. Why?
- These don't help much with sort, join, etc. -- only with the easy stuff. Not clear whether they addressed the performance bottlenecks that existed in the hardware of the time; even less clear today.
- Important lesson: special-purpose hardware is a losing proposition:
  - prohibitively expensive (no economy of scale)
  - slow to evolve
  - requires its own tool set
Is it time to revisit all this??
- IDISK @ Berkeley tried recently: a processor + big cache per disk. Similarly, Jim Gray talks about "cyberbricks". But isn't it all just "shared-nothing"? Wait and see below...
Parallel DB 101
Performance metrics (a worked example follows this list):
- Speedup = old_time / new_time: a bigger system, same problem.
- Scaleup = small_system_elapsed_time(small_problem) / big_system_elapsed_time(big_problem).
- Transaction scaleup: N times as many TPC-C transactions for N machines.
- Batch scaleup: an N-times-bigger query for N machines.
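A quick worked example of the two basic metrics (all numbers are made up, purely for illustration):

```python
# Speedup: same problem, bigger system.
old_time = 100.0   # elapsed seconds on 1 node
new_time = 12.5    # elapsed seconds on 8 nodes
speedup = old_time / new_time      # 8.0 -> linear speedup on 8 nodes

# Scaleup: N-times-bigger problem on an N-times-bigger system.
small_sys_small_prob = 100.0       # 1 node, problem of size S
big_sys_big_prob = 110.0           # 8 nodes, problem of size 8*S
scaleup = small_sys_small_prob / big_sys_big_prob   # ~0.91; 1.0 is ideal

print(f"speedup={speedup:.2f}, scaleup={scaleup:.2f}")
```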
2 kinds of data parallelism:
- pipelined (most pipelines are short in traditional query processing)
- partitioned
3 barriers to linearity:
- startup overheads
- interference: usually the result of unpredictable communication delays (comm cost, empty pipelines)
- skew
3 basic architectures:
- shared-memory
- shared-disk
- shared-nothing
Ask yourself about:
- ease of programming
- cost of equipment (and size of its user base)
- reliability/availability (both MTTF and MTTR)
- who controls resources, and how
- performance goals, esp. latency vs. bandwidth. Where's your system bottleneck?
- maintenance: utilities, DB design, admin wizardry
For an entertaining early take on all this, see Stonebraker's quickie overview "The Case for Shared Nothing" (HPTS '85). Interestingly, it doesn't cover the "batch scaleup" problem (probably because of the charter of the HPTS workshop).
DeWitt et al., GAMMA DB Machine
- Gamma: a "shared-nothing" multiprocessor system.
- Other shared-nothing systems: Bubba (MCC), Volcano (Colorado), Teradata (now owned by NCR), Tandem, Informix, IBM DB2 Parallel Edition.
- Gamma Version 1.0 (1985):
  - shared nothing: a token ring connecting 20 VAX 11/750s
  - eight of the processors had a disk each
  - architectural tuning issues:
    - the token ring packet size was 2K, so they used 2K disk blocks. Mistake.
    - bottleneck at the Unibus (1/20 the bandwidth of the network itself -- slower than disk!)
    - the network interface was also a bottleneck: it only buffered 2 incoming packets
    - only 2M of RAM per processor, and no virtual memory
  - This becomes a pattern in the literature: architectural system balance.
- Version 2.0 (1989):
  - Intel iPSC/2 hypercube (32 386s, 8M RAM each, one 333M disk each).
  - Networking provides 8 full-duplex reliable channels at a time.
  - Small messages are datagrams; larger ones form a virtual circuit which goes away at EOT.
  - Usual multiprocessor story: tuned for scientific apps.
    - The OS supports a few heavyweight processes.
    - Solution: write a new OS! (NOSE) -- more like a thread package.
  - The SCSI controller transferred only 1K blocks.
- Other complaints:
  - They want disk-to-memory DMA. As it stands, 10% of cycles are wasted copying from the I/O buffer, and the CPU is constantly interrupted to do so (13 times per 8K block!).
- "The Future", circa 1990: CM-5, Intel Touchstone Sigma Machine.
- "The Future", circa 1995: a cluster with a "SAN" (Myrinet, Gigabit Ethernet, etc.).
- "The Future", circa 2000: ??
- How does this differ from a distributed DBMS?
  - no notion of site autonomy
  - centralized schema
  - all queries start at a "host"
  - assumption of very high bandwidth
- Storage organization: all relations are "horizontally partitioned" across all disk drives, in one of four ways (see the sketch after this list):
  - round robin: the default; used for all query outputs
  - hashed: randomize on key attributes
  - range partitioned (specified distribution), stored in a range table for the relation
  - range partitioned (uniform distribution): sort, then cut into equal pieces
- Within a site, store however you wish.
- Indices are created at all sites.
- A better idea: heat (Bubba?)
  - hotter relations get partitioned across more sites
  - why is this better?
- Note: all this adds to the administration of a DBMS. Yuck.
- A "primary multiprocessor index" identifies where a tuple resides.
Query Execution
- An operator tree is constructed, and each node is assigned one or more operator processes at each site.
- Thread-pool architecture: operator processes never die.
- Hash-based algorithms only, for join & grouping.
- "Left-deep" trees (I call this "right-deep"; think of it as "building-deep"):
  - pipelines at most 2 joins deep (probe lower, build higher)
  - this was done because of lack of RAM
  - "right-deep" (my "left-deep", or "probing-deep") is better -- with sufficient spindles and k*sqrt(N) RAM, you get full parallelism in the build, and a one-pipeline probe during hybrid hash bucket 0 above
- Doubly-pipelined hash join was invented for pipelined parallelism in main memory, and recently extended to the out-of-core case.
- Life of a query:
  - parsed and optimized at the host (Query Manager)
  - if it's a single-site query, sent to that site for execution
  - otherwise sent to a dispatcher process (admission control)
  - the dispatcher process gives the query to a scheduler process
  - the scheduler process passes pieces to operator processes at the sites
  - results are sent back to the scheduler, who passes them to the Query Manager for display
- More detail:
  - the example query in the paper
  - a split table is used to route result tuples to the appropriate processors. Three types: Hashed, Range-Partitioned, and Round-Robin -- the same rules as the storage partitioning above.
  - selection: uses indices, compiled predicates, prefetching
  - join: the basic idea is to use hybrid hash, with one bucket per processor
    - tuples corresponding to each logical bucket should fit in the aggregate memory of the participating processors
    - each logical bucket after the first is split across all disks (there may be diskless processors)
  - aggregation: done piecewise; groups are accumulated at individual nodes for the final result
    - Question: can all aggs be done piecewise? (See the sketch after this list.)
  - updates: as usual, unless the update requires moving a tuple (when?)
  - control messages: 3x as many as operators in the plan tree
    - scheduler: Initiate
    - operator: ID of the port to talk to
    - operator: Done, can reuse the thread
  - That's all the coordination needed!!
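On the "can all aggs be done piecewise?" question: the standard answer is that an aggregate parallelizes this way only if it decomposes into small local state plus a global combine step. SUM, COUNT, MIN, and MAX do; AVG does via (SUM, COUNT); holistic aggregates like MEDIAN do not. A sketch with made-up per-node data:

```python
# Each node computes a local (sum, count) over its partition;
# a final site combines the partial states.
partitions = [[3, 5, 8], [1, 9], [4, 4, 4, 4]]   # hypothetical per-node data

local_states = [(sum(p), len(p)) for p in partitions]   # at each node
total_sum = sum(s for s, _ in local_states)             # at the combiner
total_count = sum(c for _, c in local_states)
avg = total_sum / total_count                           # 4.666...

# MEDIAN has no constant-size local state: per-node medians
# cannot be combined into the global median, so it is not piecewise.
```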
Concurrency and Recovery
- CC: 2PL with bi-granularity locking (file & page); centralized deadlock detection. (Another distinction from distributed DBMSs.) A sketch of bi-granularity locking follows below.
- ARIES-based recovery, with static assignment of processors to log sites.
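For concreteness, a sketch of bi-granularity locking using the textbook intention-lock compatibility matrix (this is the standard multi-granularity scheme restricted to two levels, not Gamma's actual lock manager):

```python
# To S-lock a page, first take IS on its file; to X-lock a page,
# first take IX on its file. File-level S/X locks cover all pages.
COMPAT = {
    "IS": {"IS": True,  "IX": True,  "S": True,  "X": False},
    "IX": {"IS": True,  "IX": True,  "S": False, "X": False},
    "S":  {"IS": True,  "IX": False, "S": True,  "X": False},
    "X":  {"IS": False, "IX": False, "S": False, "X": False},
}

def can_grant(requested, held_modes):
    # Grant a file-level lock only if it is compatible with every
    # mode already held on that file by other transactions.
    return all(COMPAT[requested][h] for h in held_modes)

assert can_grant("IX", ["IS", "IX"])   # two writers touching different pages
assert not can_grant("X", ["IS"])      # a file-level X excludes readers
```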
Availability, Fault Tolerance: Chained Declustering
- nodes belong to relation clusters
- relations are declustered within a relation cluster
- backups are declustered "one disk off" (see the sketch after this list)
- tolerates the failure of a single disk or processor
- discussion in the paper vs. interleaved declustering
- these kinds of coding issues have been beaten to death since
- note that RAID does this for files (sequences); declustering does it for relations (sets). Sets are much nicer to work with.
- glosses over how indexes are handled in case of failure (another paper)
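A sketch of the "one disk off" placement and the failover read choice (the cluster size and policy details are illustrative; the paper's scheme also rebalances the survivors' loads down the chain, which this sketch omits):

```python
N = 4  # hypothetical number of nodes in the relation cluster

def placement(fragment):
    # Chained declustering: the backup copy of each fragment lives
    # one disk off from its primary.
    primary = fragment % N
    backup = (fragment + 1) % N
    return primary, backup

def read_site(fragment, failed=None):
    # With all nodes up, read the primary. If the primary's node is
    # down, the backup (one disk off) serves the fragment instead.
    primary, backup = placement(fragment)
    return backup if primary == failed else primary

# Any single failure leaves every fragment readable:
assert all(read_site(f, failed=2) != 2 for f in range(2 * N))
```

Because the full scheme shifts fractions of each survivor's primary work to its neighbor, each of the N-1 survivors ends up with roughly N/(N-1) of a normal load, rather than one node absorbing a 2x hit.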
Performance Results
- Simplified picture: Gamma gets performance by...
  - running multiple small queries in parallel at (hopefully) disjoint sites, and
  - running big queries over multiple sites -- the idea here is:
    - logically partition the problem so that each subproblem is independent, and
    - run the partitions in parallel, one per processor.
    - Examples: hash joins, sorting, ...
- Performance results:
  - a side benefit of declustering is that small relations mean fewer & smaller seeks (?? WHAT??)
  - big queries get linear speedup
    - not perfect, but amazingly close
  - pretty constant scaleup
  - some other observations:
    - hash join goes a bit faster if the data is already partitioned on the join attribute
    - but not much! Redistribution of tuples isn't too expensive -- the bottleneck is not in communication.
    - as you add processors, you lose the benefits of short-circuited messages, in addition to incurring a slight overhead for the additional processes
Missing Research Issues (biggies!):
- query optimization (scheduling, query rewriting for subqueries)
- load balancing: inter-query parallelism combined with intra-query parallelism
- disk striping, reliability, etc.
- online admin utilities
- database design
- skew handling for non-standard data types
Some Themes in Parallel DBs
...that distinguish them from other parallel programming tasks:
- Hooray for the relational model:
  - apps don't change when you parallelize the system (physical data independence!); you can tune and scale the system without informing apps
  - ability to partition records arbitrarily, without synchronization
  - lack of pointers means no need for low-latency transfer of data
    - instead of pointer-chasing, batch partitioning + joins... THIS IS GENERALIZABLE!
  - essentially no synchronization except setup & teardown
    - no barriers, cache coherence, etc.
- DB transactions work fine in parallel:
  - data is updated in place, with 2-phase-locking transactions
  - replicas are managed only at EOT via 2-phase commit
  - coarser grain, higher overhead than the cache-coherency stuff
- Bandwidth is much more important than latency:
  - you often pump (1 - 1/n) of a table through the network
  - aggregate network BW should match aggregate disk BW
  - bus BW should match about 3x disk BW (NW send, NW receive, disk)
  - Latency, schmatency. Its insignificance makes a BIG difference in what architectures are needed.
- Shared memory helps with skew
  - but distributed work queues can solve this (?) (River)
Exchange: Encapsulation of Parallelism
Exchange also encapsulates communication, connecting push-based producers to pull-based consumers.
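A toy sketch of the exchange idea: the operators on either side keep their ordinary single-site iterator interface, and exchange alone hides partitioning, the process boundary, and the push/pull seam. The threading model and names below are illustrative assumptions, not Volcano's actual code:

```python
import queue, threading

class Exchange:
    """Volcano-style exchange: producer threads push tuples into
    per-consumer queues, routed by a split function (hash, range,
    or round-robin); consumers keep an ordinary pull interface."""
    def __init__(self, child_iters, split_fn, num_consumers):
        self.queues = [queue.Queue(maxsize=64) for _ in range(num_consumers)]
        self.split_fn = split_fn
        self._lock = threading.Lock()
        self._live_producers = len(child_iters)
        for it in child_iters:
            threading.Thread(target=self._produce, args=(it,), daemon=True).start()

    def _produce(self, it):
        for tup in it:
            self.queues[self.split_fn(tup)].put(tup)  # blocks when full: flow control
        with self._lock:
            self._live_producers -= 1
            if self._live_producers == 0:
                for q in self.queues:
                    q.put(None)  # EOF marker for every consumer

    def get_next(self, consumer_id):
        # Pull interface for the consumer side; None signals end of stream.
        return self.queues[consumer_id].get()

# Hypothetical usage: two producer partitions, two consumers,
# hash-split on the tuple's first field.
ex = Exchange([iter([(1, "a"), (2, "b")]), iter([(3, "c")])],
              split_fn=lambda t: t[0] % 2, num_consumers=2)
```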
River
Background: NOW-Sort was a CS286 project that ended up being the world's fastest sorting machine for 2 years, generating a number of papers. It was an amazing feat, but they could only get it to run at record speed late at night, with much hand-holding. The problems lay in the (sometimes transient) performance heterogeneity of the machines:
- Some computers in a cluster are faster than others.
- Some disks in a cluster are faster than others.
- Even if you declare that your cluster will be homogeneous, it won't be!
  - Example 1: some machine may have a stray job running on it, which slows it down.
  - Example 2: some disk may have a scratch file on its outer tracks, which can leave the throughput available to other apps as much as 2x slower.
So a cluster-based, I/O-intensive system should be tolerant of performance heterogeneity. River was an attempt to do that. It used two mechanisms:
- For balancing different rates of data consumption, a distributed queue (DQ) allows workers to consume data at varying rates (like two people on a date "slurping a single milkshake from two straws"). This comes from shared-memory work.
- For balancing different rates of data production, the graduated declustering (GD) scheme generalized Gamma's "chained declustering" to handle slowdowns, not just failures, and ensured that each disk feeds its natural share of the data.
The DQ mechanism is based on a very simple randomized "push" scheme, in which each producer randomly picks a destination for the next datum, subject to a constraint on the number of outstanding unconsumed items at each destination. Note that DQs are encapsulated into the equivalent of local "exchange" operators (DQsend modules and DQrecv modules).
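A sketch of that randomized push discipline (the bound, queue type, and retry policy are assumptions layered on the description above):

```python
import queue, random

MAX_OUTSTANDING = 16  # assumed cap on unconsumed items per destination

class DQSend:
    # Producer side of a distributed queue: pick a random consumer,
    # but never exceed the outstanding-item bound at any destination.
    # Bounded queues mean a slow consumer naturally receives less data.
    def __init__(self, consumer_queues):
        self.queues = consumer_queues

    def push(self, datum):
        while True:
            q = random.choice(self.queues)
            try:
                q.put_nowait(datum)   # succeeds only if under the bound
                return
            except queue.Full:
                pass                  # that consumer is backed up; retry elsewhere

consumers = [queue.Queue(maxsize=MAX_OUTSTANDING) for _ in range(4)]
sender = DQSend(consumers)
sender.push(("tuple", 42))
```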
The GD mechanism uses a more sophisticated feedback loop, by which each producer generates data at the appropriate relative rate.
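The rough intuition, as a sketch: each fragment is mirrored on two disks (chained, one disk off), and a consumer splits its remaining reads across the two replicas in proportion to the bandwidth it actually observes from each, so a slowed disk automatically sheds load to its partner. The proportional-split policy below illustrates the idea and is not the paper's exact protocol:

```python
def split_read(bytes_remaining, bw_primary, bw_backup):
    # Feedback step: weight the next round of reads for one fragment
    # by the observed bandwidth of its two replicas.
    total = bw_primary + bw_backup
    if total == 0:
        half = bytes_remaining // 2
        return half, bytes_remaining - half
    from_primary = int(bytes_remaining * bw_primary / total)
    return from_primary, bytes_remaining - from_primary

# A primary running at half the backup's speed is asked for only
# a third of the remaining bytes on the next round:
print(split_read(90_000, bw_primary=10.0, bw_backup=20.0))  # (30000, 60000)
```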