CS262A: Advanced Topics in Computer Systems
Eric Brewer, based on notes by Joe Hellerstein
Volcano & the Exchange Operator
There are a host of techniques for parallelizing particular query
operators (e.g. hash join, sorting, etc.), but what you really need is
to parallelize your query engine in a clean, uniform way.
Volcano's Solution: encapsulate the parallelism in a query operator
of its own, not in the QP infrastructure.
Overview: kinds of intra-query parallelism available:
- pipeline
- partition, with two subcases:
  - intra-operator parallelism (e.g. parallel hash join, or parallel sort)
  - inter-operator parallelism -- bushy trees
We want to enable all of these -- including setup, teardown, and runtime
logic -- in a clean encapsulated way.
The exchange operator: an operator you pop into any single-site
dataflow graph as desired -- anonymous to the other operators.
Implementation:
- Note: Volcano was done with processes, but today you'd use threads.
- Exchange splits the graph into two threads. The lower thread has an X-OUT iterator at the top; the upper thread has an X-IN iterator at the bottom.
- X-OUT is a driver for the lower iterator. It calls next() a number of times, constructs a packet, and pushes that packet via IPC or network communication onto a queue in X-IN's "port". X-IN responds to next() when it has tuples in its queue.
- Flow control: a semaphore on the port dictates the maximum degree to which the producer can get ahead of the consumer. This is akin to a bounded queue.
- Note that introducing a queue allows a push producer to work with a pull consumer. The queue allows a bounded drift in their rates of production, beyond which one side is blocking/polling.
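The port-and-packets mechanism can be sketched in Python, with a bounded queue standing in for the semaphore-guarded port. This is a minimal illustration, not Volcano's actual code; the names `x_out`, `x_in`, and the packet size are made up.

```python
import threading
from queue import Queue

SENTINEL = object()  # end-of-stream marker

def x_out(child_iter, port, packet_size=4):
    """Lower-thread driver: pull from the child iterator via next(),
    batch tuples into packets, and push them onto the port."""
    packet = []
    for tup in child_iter:
        packet.append(tup)
        if len(packet) == packet_size:
            port.put(packet)   # blocks when the port is full: flow control
            packet = []
    if packet:
        port.put(packet)
    port.put(SENTINEL)

def x_in(port):
    """Upper-thread iterator: respond to next() from queued packets."""
    while True:
        packet = port.get()
        if packet is SENTINEL:
            return
        yield from packet

# Splice an exchange between a producer scan and a pull-based consumer.
port = Queue(maxsize=2)  # bounded queue = semaphore on the port
producer = threading.Thread(target=x_out, args=(iter(range(10)), port))
producer.start()
result = [t * 2 for t in x_in(port)]  # consumer pulls as usual
producer.join()
```

The bounded `Queue` is what lets the push-style producer and the pull-style consumer drift apart by at most `maxsize` packets before one of them blocks.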
Benefits of exchange:
- opaquely handles setup and teardown of clones (in an SMP; for shared-nothing, you would need daemons at each site, and a protocol to request clone spawning)
- at the top of a local subplan, allows pipeline parallelism: turns iterator-based, unithreaded "pull" into network-based, cross-thread "push".
  - Why is push beneficial?
- at the top of a local subplan, allows decoupling of children's scheduling.
- inside a subplan, can mix pull and push to get the best of both.
"Extensibility" features of Volcano and exchange:
- operators don't interpret records; support functions do. The same goes for partitioning.
There were a couple of subsequent extensions to Exchange:
- Graefe has another paper on exchange in Transactions on Software Engineering which provides more gory details on startup and teardown of clones. It's not all that pretty, unfortunately.
- The River project at Berkeley revisited partitioned parallelism with an eye toward adaptive load balancing for state-agnostic operators (each data item can go into any consumer partition).
- FLuX (Fault-tolerant, Load-Balancing eXchange) was an effort at Berkeley to extend Exchange to add what the name says.
- Google MapReduce basically applies partition parallelism to a simple dataflow pipeline, and demonstrates broad applicability in their workloads. It also includes simple fault-tolerance and load-balancing techniques -- simpler than River or FLuX, yet effective enough for their workloads. A related paper on Sawzall proposes a "little language" for this environment, which should be contrasted with a dataflow or query language.
Food for thought:
- As we'll see again, encapsulating communication/flow details is a good programming paradigm. Is there a broader lesson here for asynchronous programming models, and particularly distributed or parallel programming models? Volcano was in fact supposed to be targeted at more general parallel programming, though they only made a few steps in that direction.
- Note that an optimizer chooses where to pop exchange ops into a plan. What does this suggest in concert with the above point? If we program right, can this kind of decision be made by an optimizer in more general programming tasks? Can query optimization be brought to bear in more generic contexts?
- What about eddies and exchange -- can we make the use of exchange operators dynamic, and dynamically control the points and degrees of parallelism?
Eddies
Starting point: observe that a query optimizer is an adaptive system with a very slow feedback loop:
- Observe environment: daily/weekly (runstats)
- Use observations to choose behavior: query
optimization
- Take action: query execution
There are reasons to believe this is way too slow. People have looked
at more intelligent things
(see survey
article for more detail):
- Per-query adaptivity:
piggyback statistics-gathering on query execution. [Chen/Roussopoulos 94].
- Runtime sampling: Take
samples of the database right at runtime to estimate costs. (many papers
starting in the late 80's)
- Runtime "competition":
for initial phase of a query, try multiple plans, and then choose the best
alternative. [Antoshenkov, DEC RDB, 96]. Only used for base-table access
method selection.
- Inter-operator adaptivity: Place a materialization operator in the plan. Re-optimize
after the materialization operator runs. e.g. [Kabra/DeWitt98]
- Adaptive operators: Some
good work was done on making big operators (e.g. hashjoin, sort) adaptive to
changing memory availability. e.g. [Pang/Carey/Livny 93]. Also some work on
making join algorithms for interactive query systems that favor one input or
the other based on user feedback [Haas/Hellerstein 97]
- Adaptive partitioning: River [Arpaci-Dusseau et al. 99] adaptively decided how to partition a dataflow a la Exchange.
Eddies were an effort to subsume a bunch of this stuff using the
design spirit of Exchange: encapsulate the decisions in a dataflow
operator. An eddy allows for adaptive reordering of a subtree of
dataflow operators on a tuple by tuple basis (or slower, of
course). Here's the idea:
- The eddy is "wired up" so that its inputs are the inputs to the
subtree, its output is the output of the subtree, and all the
operators in the subtree are both inputs and outputs of the eddy.
Note that an eddy can service all the operators in a plan, or
it can just provide flexibility for a subset of operators.
- The eddy is parameterized by a partial ordering on the operators, which tells it
which operators must precede which in the dataflow, and which ones can be
mutually reordered.
- If all the operators are pipelining (in the "Joe Hellerstein rule"
sense of producing tuples while consuming), then the eddy gets to:
- Observe the rates of production/consumption for
operators
- Choose the order of operators that each tuple visits.
- In essence, the choice of the dataflow graph edges has been replaced by
a "routing policy".
- The eddy operator can therefore serve as a single encapsulated place for control logic (in the sense of control theory).
- To ensure that each tuple visits each operator at most once, and in an order consistent with the partial order given, a "steering vector" of ready/done bits is attached to each tuple to guide its routing.
Note a vague similarity to INGRES' optimization scheme, which also
could change join orders "per tuple" in some sense.
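The steering-vector idea can be sketched as follows. Everything here is an invented example, not code from the paper: the filters, the partial order, and the trivial routing policy are all placeholders, and only unary filter operators are shown.

```python
# Hypothetical unary filter operators; each returns True to keep the tuple.
ops = [lambda t: t % 2 == 0, lambda t: t > 3, lambda t: t < 40]
# Partial order: precedes[i] = set of ops that must run before op i.
precedes = [set(), set(), {1}]  # op 2 may only run after op 1

def eddy(tuples, routing_policy):
    """Route each tuple through every op at most once, respecting the
    partial order, using per-tuple done bits as the steering vector."""
    for t in tuples:
        done = [False] * len(ops)
        alive = True
        while alive and not all(done):
            # "ready" = not yet visited, and all predecessors are done
            ready = [i for i in range(len(ops))
                     if not done[i] and all(done[j] for j in precedes[i])]
            i = routing_policy(ready)  # the eddy's per-tuple choice
            done[i] = True
            alive = ops[i](t)  # drop the tuple if the filter rejects it
        if alive:
            yield t

# A trivial routing policy: always pick the lowest-numbered ready op.
out = list(eddy(range(50), routing_policy=min))
```

The point of the sketch is that the dataflow edges are gone: `routing_policy` is consulted per tuple, and the done bits guarantee each operator is visited at most once, in an order consistent with the partial order.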
At the architectural level, that's all fine and dandy. But many
questions remain.
Some basic ones:
- Many people with competing schemes accused eddies of overkill and inefficiency: isn't the overhead of all this tuple massaging too high? Surely you don't really want or need to adapt on a per-tuple basis? This was addressed with a simple batching scheme described in a short paper with a performance study in Postgres.
- What is an optimal routing policy? How do you even define the problem? This is tricky. It's useful to start with the simpler problem of eddies over unary filters -- i.e. selections or key-based index joins (e.g. web lookups). Even here the problem is tricky, and depends on how you define it. For a stable data distribution, an approximation algorithm was developed. There's a natural though imperfect analogy to "n-arm bandit" problems. There are also some complexity results and worst-case bounds for different models of correlation among predicates. An interesting heuristic appeared in VLDB this year [Babu 2005].
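One simple routing policy of this flavor can be sketched for the unary-filter case: track each filter's observed pass rate and run the most selective filters first. This is a hand-rolled illustration assuming commutative filters of equal cost, not the approximation algorithm or the [Babu 2005] heuristic; the class and filter names are made up.

```python
class AdaptiveRouter:
    """Routing policy for commutative unary filters: track each filter's
    observed pass rate and try the most selective (lowest rate) first."""
    def __init__(self, filters):
        self.filters = filters
        self.seen = [1] * len(filters)    # smoothed counters to avoid
        self.passed = [1] * len(filters)  # division by zero at startup

    def process(self, t):
        # Order by estimated pass rate, ascending: drop tuples early.
        order = sorted(range(len(self.filters)),
                       key=lambda i: self.passed[i] / self.seen[i])
        for i in order:
            self.seen[i] += 1
            if self.filters[i](t):
                self.passed[i] += 1
            else:
                return None  # rejected: later filters never run
        return t

router = AdaptiveRouter([lambda t: t % 2 == 0, lambda t: t % 10 == 0])
out = [t for t in range(100) if router.process(t) is not None]
```

Because the estimates are updated on every tuple, the ordering tracks a drifting data distribution; the hard part, as noted above, is that correlated predicates can make any such greedy ordering arbitrarily suboptimal.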
More complicated questions revolving around joins remain:
- Is it always OK to mess with tuple routing among joins, or can that give you wrong answers? "Moments of symmetry" were an intuitive, informal handle on that.
- Join output requires feeds on both inputs; see the initial delay problem in the paper.
- Can we enable the join algorithm to change, or are we limited to the join
ordering alone?
- Joins carry a "burden of history": once potentially joinable
tuples have been sent to separate join operators, there is no way to
create any output product using those tuples with a join
order that combines them first. E.g. the initial delay problem is
like this: S tuples were sent to the join of R and S, T tuples were
sent to the join of S and T. If S tuples could be easily filtered by
T and are expensive to join with R, once the R's come in it's too
late to change your mind.
Many of these problems were subsequently addressed by choosing a
different granularity of dataflow operator. Instead of using eddies
and joins, you expose the "state modules" ("STeMs") (hashtables,
b-trees) from the join directly to the eddy -- in essence you expose
the join algorithm's internals to the eddy. This idea led to:
- competition and hybridization of multiple join algorithms (hash
join and index join) at runtime [Raman, 2003]
- user-controllable partial results from queries [Raman, 2002]
- solutions to the initial delay and "burden of history" problems
(via STAIRS)
[Deshpande, 2004]
The end result of this was that we tore apart traditional relational
query processing and optimization and reexamined it. However, we
certainly did not put it back together (yet)! The set of new
variables exposed introduces a bunch of complexity, and naturally
reopens buried chestnuts like dealing with dependencies in data and
predicates. Much remains to be done here! The question is
relevance: one can come up with many scenarios where adaptivity helps
a lot, but are any of them enough to rearchitect a DBMS?
My take: maybe not in the traditional DBMS market. Maybe in the brave
new world of software dataflow for other tasks, e.g. network routing!