CS262A: Advanced Topics in Computer Systems
Eric Brewer, based on notes by Joe Hellerstein
Volcano & the Exchange Operator
There are a host of techniques for parallelizing particular query
operators (e.g. hash join, sorting, etc.), but what you really need is
to parallelize your query engine in a clean, uniform way.
Volcano's Solution: encapsulate the parallelism in a query operator
of its own, not in the QP infrastructure.
Overview: kinds of intra-query parallelism available:
- pipeline
- partition, with two subcases:
  - intra-operator parallelism (e.g. parallel hash join, or parallel sort)
  - inter-operator parallelism -- bushy trees
We want to enable all of these -- including setup, teardown, and runtime
logic -- in a clean encapsulated way.
The exchange operator: an operator you pop into any single-site
dataflow graph as desired -- anonymous to the other operators.
Implementation:
- Note: Volcano was done with processes, but today you'd use threads.
- Exchange splits the graph into two threads. The lower thread has an X-OUT iterator at the top; the upper thread has an X-IN iterator at the bottom.
- X-OUT is a driver for the lower iterator. It calls next() a number of times, constructs a packet, and pushes that packet via IPC or network communication onto a queue in X-IN's "port". X-IN responds to next() when it has tuples in its queue.
- Flow control: a semaphore on the port dictates the maximum degree to which the producer can get ahead of the consumer. This is akin to a bounded queue.
- Note that introducing a queue allows a push producer to work with a pull consumer. The queue allows a bounded drift in their rates of production, beyond which one side is blocking/polling.
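The port-and-packets mechanism can be sketched in Python, with a bounded queue standing in for the semaphore-guarded port. This is a minimal illustration, not Volcano's actual code; the names `x_out`, `x_in`, and the packet size are made up.

```python
import threading
from queue import Queue

SENTINEL = object()  # end-of-stream marker

def x_out(child_iter, port, packet_size=4):
    """Lower-thread driver: pull from the child iterator via next(),
    batch tuples into packets, and push them onto the port."""
    packet = []
    for tup in child_iter:
        packet.append(tup)
        if len(packet) == packet_size:
            port.put(packet)   # blocks when the port is full: flow control
            packet = []
    if packet:
        port.put(packet)
    port.put(SENTINEL)

def x_in(port):
    """Upper-thread iterator: respond to next() from queued packets."""
    while True:
        packet = port.get()
        if packet is SENTINEL:
            return
        yield from packet

# Splice an exchange between a producer scan and a pull-based consumer.
port = Queue(maxsize=2)  # bounded queue = semaphore on the port
producer = threading.Thread(target=x_out, args=(iter(range(10)), port))
producer.start()
result = [t * 2 for t in x_in(port)]  # consumer pulls as usual
producer.join()
```

The bounded `Queue` is what lets the push-style producer and the pull-style consumer drift apart by at most `maxsize` packets before one of them blocks.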
Benefits of exchange:
- opaquely handles setup and teardown of clones (in an SMP; for shared-nothing, you would need daemons at each site, and a protocol to request clone spawning)
- at the top of a local subplan, allows pipeline parallelism: turns iterator-based, unithreaded "pull" into network-based, cross-thread "push".
  - Why is push beneficial?
- at the top of a local subplan, allows decoupling of children's scheduling.
- inside a subplan, can mix pull and push to get the best of both.
"Extensibility" features of Volcano and exchange:
- operators don't interpret records; support functions do. The same goes for partitioning.
There were a couple of subsequent extensions to Exchange:
- Graefe has another paper on exchange in Transactions on Software Engineering which provides more gory details on startup and teardown of clones. It's not all that pretty, unfortunately.
- The River project at Berkeley revisited partitioned parallelism with an eye toward adaptive load balancing for state-agnostic operators (each data item can go into any consumer partition).
- FLuX (Fault-tolerant, Load-Balancing eXchange) was an effort at Berkeley to extend Exchange to add what the name says.
- Google MapReduce basically applies partition parallelism to a simple dataflow pipeline, and demonstrates broad applicability in their workloads. It also includes simple fault-tolerance and load-balancing techniques -- simpler than River or FLuX, yet effective enough for their workloads. A related paper on Sawzall proposes a "little language" for this environment, which should be contrasted with a dataflow or query language.
Food for thought:
- As we'll see again, encapsulating communication/flow details is a good programming paradigm. Is there a broader lesson here for asynchronous programming models, and particularly distributed or parallel programming models? Volcano was in fact supposed to be targeted at more general parallel programming, though they only made a few steps in that direction.
- Note that an optimizer chooses where to pop exchange ops into a plan. What does this suggest in concert with the above point? If we program right, can this kind of decision be made by an optimizer in more general programming tasks? Can query optimization be brought to bear in more generic contexts?
- What about eddies and exchange -- can we make the use of exchange operators dynamic, and dynamically control the points and degrees of parallelism?
Eddies
Starting point: observe that a query optimizer is an adaptive system with a very slow feedback loop:
- Observe environment: daily/weekly (runstats)
- Use observations to choose behavior: query
optimization
- Take action: query execution
There are reasons to believe this is way too slow. People have looked
at more intelligent things
(see survey
article for more detail):
- Per-query adaptivity:
piggyback statistics-gathering on query execution. [Chen/Roussopoulos 94].
- Runtime sampling: Take
samples of the database right at runtime to estimate costs. (many papers
starting in the late 80's)
- Runtime "competition":
for initial phase of a query, try multiple plans, and then choose the best
alternative. [Antoshenkov, DEC RDB, 96]. Only used for base-table access
method selection.
- Inter-operator adaptivity: Place a materialization operator in the plan. Re-optimize
after the materialization operator runs. e.g. [Kabra/DeWitt98]
- Adaptive operators: Some
good work was done on making big operators (e.g. hashjoin, sort) adaptive to
changing memory availability. e.g. [Pang/Carey/Livny 93]. Also some work on
making join algorithms for interactive query systems that favor one input or
the other based on user feedback [Haas/Hellerstein 97]
- Adaptive partitioning: River [Arpaci-Dusseau et al. 99] adaptively decided how to partition a dataflow a la Exchange.
Eddies were an effort to subsume a bunch of this stuff using the
design spirit of Exchange: encapsulate the decisions in a dataflow
operator. An eddy allows for adaptive reordering of a subtree of
dataflow operators on a tuple by tuple basis (or slower, of
course). Here's the idea:
- The eddy is "wired up" so that its inputs are the inputs to the
subtree, its output is the output of the subtree, and all the
operators in the subtree are both inputs and outputs of the eddy.
Note that an eddy can service all the operators in a plan, or
it can just provide flexibility for a subset of operators.
- The eddy is parameterized by a partial ordering on the operators, which tells it
which operators must precede which in the dataflow, and which ones can be
mutually reordered.
- If all the operators are pipelining (in the "Joe Hellerstein rule"
sense of producing tuples while consuming), then the eddy gets to:
- Observe the rates of production/consumption for
operators
- Choose the order of operators that each tuple visits.
- In essence, the choice of the dataflow graph edges has been replaced by
a "routing policy".
- The eddy operator can therefore serve as a single encapsulated place for control logic (in the sense of control theory).
- To ensure that each tuple visits each operator at most once, and in an order consistent with the partial order given, a "steering vector" of ready/done bits is attached to each tuple to guide its routing.
Note a vague similarity to INGRES' optimization scheme, which also
could change join orders "per tuple" in some sense.
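The steering-vector idea can be sketched as follows. Everything here is an invented example, not code from the paper: the filters, the partial order, and the trivial routing policy are all placeholders, and only unary filter operators are shown.

```python
# Hypothetical unary filter operators; each returns True to keep the tuple.
ops = [lambda t: t % 2 == 0, lambda t: t > 3, lambda t: t < 40]
# Partial order: precedes[i] = set of ops that must run before op i.
precedes = [set(), set(), {1}]  # op 2 may only run after op 1

def eddy(tuples, routing_policy):
    """Route each tuple through every op at most once, respecting the
    partial order, using per-tuple done bits as the steering vector."""
    for t in tuples:
        done = [False] * len(ops)
        alive = True
        while alive and not all(done):
            # "ready" = not yet visited, and all predecessors are done
            ready = [i for i in range(len(ops))
                     if not done[i] and all(done[j] for j in precedes[i])]
            i = routing_policy(ready)  # the eddy's per-tuple choice
            done[i] = True
            alive = ops[i](t)  # drop the tuple if the filter rejects it
        if alive:
            yield t

# A trivial routing policy: always pick the lowest-numbered ready op.
out = list(eddy(range(50), routing_policy=min))
```

The point of the sketch is that the dataflow edges are gone: `routing_policy` is consulted per tuple, and the done bits guarantee each operator is visited at most once, in an order consistent with the partial order.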
At the architectural level, that's all fine and dandy. But many
questions remain.
Some basic ones:
- Many people with competing schemes accused eddies of overkill and inefficiency: isn't the overhead of all this tuple massaging too high? Surely you don't really want or need to adapt on a per-tuple basis? This was addressed with a simple batching scheme described in a short paper with a performance study in Postgres.
- What is an optimal routing policy? How do you even define the problem? This is tricky. It's useful to start with the simpler problem of eddies over unary filters -- i.e. selections or key-based index joins (e.g. web lookups). Even here the problem is tricky, and depends on how you define it. For a stable data distribution, an approximation algorithm was developed. There's a natural though imperfect analogy to "n-arm bandit" problems. There are also some complexity results and worst-case bounds for different models of correlation among predicates. An interesting heuristic appeared in VLDB this year [Babu 2005].
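One simple routing policy of this flavor can be sketched for the unary-filter case: track each filter's observed pass rate and run the most selective filters first. This is a hand-rolled illustration assuming commutative filters of equal cost, not the approximation algorithm or the [Babu 2005] heuristic; the class and filter names are made up.

```python
class AdaptiveRouter:
    """Routing policy for commutative unary filters: track each filter's
    observed pass rate and try the most selective (lowest rate) first."""
    def __init__(self, filters):
        self.filters = filters
        self.seen = [1] * len(filters)    # smoothed counters to avoid
        self.passed = [1] * len(filters)  # division by zero at startup

    def process(self, t):
        # Order by estimated pass rate, ascending: drop tuples early.
        order = sorted(range(len(self.filters)),
                       key=lambda i: self.passed[i] / self.seen[i])
        for i in order:
            self.seen[i] += 1
            if self.filters[i](t):
                self.passed[i] += 1
            else:
                return None  # rejected: later filters never run
        return t

router = AdaptiveRouter([lambda t: t % 2 == 0, lambda t: t % 10 == 0])
out = [t for t in range(100) if router.process(t) is not None]
```

Because the estimates are updated on every tuple, the ordering tracks a drifting data distribution; the hard part, as noted above, is that correlated predicates can make any such greedy ordering arbitrarily suboptimal.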
More complicated questions revolving around joins remain:
- Is it always OK to mess with tuple routing among joins, or can that give you wrong answers? "Moments of symmetry" were an intuitive, informal handle on that.
- Join output requires feeds on both inputs; see the initial delay problem in the paper.
- Can we enable the join algorithm to change, or are we limited to the join
ordering alone?
- Joins carry a "burden of history": once potentially joinable
tuples have been sent to separate join operators, there is no way to
create any output product using those tuples with a join
order that combines them first. E.g. the initial delay problem is
like this: S tuples were sent to the join of R and S, T tuples were
sent to the join of S and T. If S tuples could be easily filtered by
T and are expensive to join with R, once the R's come in it's too
late to change your mind.
Many of these problems were subsequently addressed by choosing a
different granularity of dataflow operator. Instead of using eddies
and joins, you expose the "state modules" ("STeMs") (hashtables,
b-trees) from the join directly to the eddy -- in essence you expose
the join algorithm's internals to the eddy. This idea led to:
- competition and hybridization of multiple join algorithms (hash
join and index join) at runtime [Raman, 2003]
- user-controllable partial results from queries [Raman, 2002]
- solutions to the initial delay and "burden of history" problems
(via STAIRS)
[Deshpande, 2004]
The end result of this was that we tore apart traditional relational
query processing and optimization and reexamined it. However, we
certainly did not put it back together (yet)! The set of new
variables exposed introduces a bunch of complexity, and naturally
reopens buried chestnuts like dealing with dependencies in data and
predicates. Much remains to be done here! The question is
relevance: one can come up with many scenarios where adaptivity helps
a lot, but are any of them enough to rearchitect a DBMS?
My take: maybe not in the traditional DBMS market. Maybe in the brave
new world of software dataflow for other tasks, e.g. network routing!