Advanced Topics in Computer Systems
Joe Hellerstein & Eric Brewer
Mariposa: An Agoric Distributed DBMS for WANs
Theme: to scale a DDBMS to 1000's of sites, you need loose coupling. This
can be done with an economic paradigm for resource allocation: query optimization
(scheduling) and data placement.
Bad things about old DDBMSs:
Static data allocation
Single administrative structure
Homogeneity of hardware, software, networks
Goals of Mariposa:
Scale to 1000's of sites
Data Mobility
Local autonomy
Easily and autonomously configurable policies
Mechanism:
Economic (Agoric) paradigm (agora = marketplace)
Allows point-to-point decision-making to work in a more global context
of resource utilization
removes need for centralized decision-making (e.g. global optimizer)
in essence, implicit aggregate information (i.e. feedback) is passed
around the network, resulting in rational behavior
Some important cautionary notes:
Economics is just a useful metaphor for doing decentralized resource allocation.
This decentralized scheme will not be optimal. The goal is to scale, at
the expense of guaranteed optimality.
One can spend years arguing about policy, and trying to tune the system
by changing policy.
Mariposa implemented mechanisms, but did very little research on policy.
Premature discussions of policy can be VERY distracting -- first put mechanism
in place, THEN play!
In short, it's instructive to study Mariposa without any economics
... there are lessons there. Then think about economics; there
are fewer lessons there from Mariposa, but of course many interesting questions.
Architecture and Life of a Query
Data layout: horizontally partitioned table fragments (logical or not),
and replicas (of varying freshness)
3-tier architecture: clients, middleware, local site manager. Mariposa
requires local sites to run Postgres.
Query is generated by the application, with an accompanying "bid curve" (budget $ as a function of Delay).
Query planning is a two-phase optimizer a la XPRS, but with decoupled
costing. It works more or less as follows:
optimizer (middleware) runs Selinger as if it were a local single-site
query (this is phase 1; the rest is phase 2)
fragmenter (middleware) breaks resulting query plan into pieces,
arranges pieces in trivially parallelizable "strides" (which are NOT pipelined
together, for no apparent reason)
broker (middleware) sends out Requests For Bid (RFBs) to sites that
might be interested (more on this later)
bidder (local site) returns a "bid" for a piece of work, consisting
of triple (Cost,Delay,Expiration)
coordinator (middleware) accepts bids, constructs a final plan,
and informs local sites of their jobs
Query processing is controlled by the coordinator and the local executors
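To make the protocol concrete, here is a minimal sketch (invented names and structures, not Mariposa's actual code) of the broker/bidder exchange, with bids as the paper's (Cost, Delay, Expiration) triples:

    from dataclasses import dataclass

    @dataclass
    class Bid:
        site: str           # bidder identity
        cost: float         # dollars charged for the piece of work
        delay: float        # promised completion time
        expiration: float   # bid is void after this time

    def broker_round(pieces, sites, now):
        """Long protocol: one Request For Bid (RFB) per plan piece;
        collect (Cost, Delay, Expiration) bids and drop expired ones."""
        bids = {}
        for piece in pieces:
            offers = [s.bid_on(piece) for s in sites]   # bid_on is a hypothetical site method
            bids[piece] = [b for b in offers if b.expiration > now]
        return bids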
Question: can't we have decoupled costing, but do better than two-phase?
Amol Deshpande's MS studied this problem in Cohera
Design space: separate the cost model into full-power (known vs. unknown) and
runtime-power/willingness (static vs. dynamic) parts, and cross that with the
plan-enumeration strategy:

                       Exhaustive      Heuristic Pruning  Two-Phase  Randomized
Dynamic/Unknown Costs  (open)          (open)             (open)     (open)
Static/Unknown Costs   Garlic
Dynamic/Known Costs    Parametric Opt                     Mariposa
Static/Known Costs     R*              IDP                           Simulated Annealing, etc.

Want to be in the uppermost row (Dynamic/Unknown Costs).
Idea 1 -- upper left box: do Exhaustive (Selinger with RPCs for cost estimation).
Too many rounds of messages (exponential!). I.e. you have to worry
about the cost of costing.
If you're going to allow local executors to reoptimize their own work,
then you don't prune anything per subquery (subset of relations).
So, can generate all the needed cost-estimation requests as a batch, based
on query graph (relationships between tables)
One round of messages and you get Selinger (i.e. can achieve upper left
box!)
Still, the added complexity of site placement can be painful
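A sketch of the batching idea under stated assumptions (connected() and estimate_batch() are hypothetical helpers): enumerate every connected subset of relations from the query graph up front, send each site one batch of cost requests, and feed the answers to ordinary Selinger DP.

    from itertools import combinations

    def batched_costs(relations, sites, query_graph):
        """One round of messages: ask each site to cost every candidate
        subplan (connected subset of relations) in a single batch."""
        subsets = [frozenset(c)
                   for k in range(1, len(relations) + 1)
                   for c in combinations(relations, k)
                   if connected(c, query_graph)]        # connected() is an assumed helper
        cost_table = {}
        for site in sites:
            # one message per site, covering all subsets at once
            for subset, cost in site.estimate_batch(subsets).items():   # hypothetical API
                cost_table[(site, subset)] = cost
        return cost_table   # feed this table to a standard Selinger DP

Note the batch is exponential in the number of relations -- consistent with the result below that message costs stay high because messages get big.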
Idea 2: Upper second-to-left box
"Iterative Dynamic Programming" (IDP, Kossman, et al.)
Do part of the DP table (say, up to k-way joins) with batched, distributed
costing.
Then bid that out, prune to the single best k-way subplan, and start over
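A rough sketch of the IDP loop, reusing batched_costs from above and assuming a hypothetical dp_enumerate helper that runs Selinger-style DP up to a given plan size:

    def idp(relations, k, sites, query_graph):
        """Iterative Dynamic Programming: DP up to k-way joins using the
        batched cost table, commit to the best k-way subplan, collapse it
        into a single base relation, and repeat."""
        relations = set(relations)
        while len(relations) > 1:
            costs = batched_costs(relations, sites, query_graph)
            best = dp_enumerate(relations, costs, max_size=k)   # assumed Selinger-style helper
            # treat the chosen k-way subplan as a single base "relation"
            relations = (relations - best.inputs) | {best.as_relation()}
        return relations.pop()   # the fully collapsed plan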
Idea 3: Upper second-to-right box
Mariposa + Garlic. Two-phaseness still inherent.
Idea 4: Upper rightmost box
Unfortunately, no way to batch costing in randomized, since you "move"
randomly in the space
Results:
Exhaustive works great if you can afford it. Message costs still
high since messages become big.
IDP is very sensitive to the parameter k. But k = 3 or 4 works well
for reasonable-size plans, so can run with both and keep the better result.
Mariposa actually works quite well except when its assumptions about what's
"known" are really wrong (e.g. presence of materialized views)
Details of Bidding
Budget B(t) (bid curve) is a non-increasing function of time
Broker "bids out" single-table subplans or joins of two fragments
Two protocols: the Long Protocol ("expensive bid") and the Short Protocol
("purchase order")
Long protocol is as described above
In short protocol, heuristics determine where to send work orders (e.g.
scans at storage sites)
Choosing among bids:
they do this stride-by-stride; total delay D is the sum of the strides'
delays, and the total cost C must be <= B(D)
note that there's parallelism within a stride, so delay for a stride's
bid collection is MAX of delays of the bids in the collection (whereas
cost is SUM of costs)
ideally, want to pick the "lowest point below the bid curve" -- i.e. maximize
the difference B(D) - C
greedy heuristic used in Mariposa (a toy version appears after this list):
start with the minimal-delay bid collection for each stride.
for each longer-delay collection, compute its cost gradient: (decrease
in cost)/(increase in delay) if we switch to this collection
swap in the maximum cost-gradient alternative if the difference B(D) - C
increases. Recompute cost gradients.
they can approximate in poly time the entire Pareto boundary (all points
that are not dominated in both dimensions)
i.e. they can draw the bid curve for you, let you choose!
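A toy version of the greedy heuristic above (data structures invented for illustration; each stride has a list of candidate (cost, delay) bid collections, where a collection's delay is the MAX and its cost the SUM of its bids):

    def greedy_choose(strides, B):
        """strides: one list of candidate (cost, delay) bid collections
        per stride. B: the bid curve, a non-increasing function of delay."""
        # start from the minimal-delay collection in each stride
        chosen = [min(options, key=lambda o: o[1]) for options in strides]

        def surplus(sel):
            C = sum(c for c, d in sel)
            D = sum(d for c, d in sel)     # strides run sequentially
            return B(D) - C

        while True:
            best, best_grad = None, 0.0
            for i, options in enumerate(strides):
                c0, d0 = chosen[i]
                for c, d in options:
                    if d > d0 and c < c0:
                        grad = (c0 - c) / (d - d0)   # cost saved per unit of added delay
                        if grad > best_grad:
                            best, best_grad = (i, (c, d)), grad
            if best is None:
                return chosen
            i, alt = best
            trial = chosen[:i] + [alt] + chosen[i + 1:]
            if surplus(trial) <= surplus(chosen):    # only swap if B(D) - C improves
                return chosen
            chosen = trial

Recording each (D, C) point the loop visits gives the kind of polynomial-time Pareto-boundary approximation mentioned above.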
This is described as a "bottom-up" strategy. An alternative "top-down"
strategy bids out large sub-pieces, which can in turn be busted up by the
bidders and "subcontracted" into separate pieces
Discussion: backing off from the details, what is the plan space,
what is "bidding" for, and what about other approaches?
cannot expect to centralize cost estimation in the traditional static way
how do you measure performance in an adaptive system?!?
economics will have somewhat coarse-grained adaptivity (no more than
once per query)
vs. other schemes? eddies for federation?
Metadata
How does the broker find sites to get bids? Yellow Pages and
advertisements.
Servers can register services (with prices) at a variety of yellow page
servers. These are timestamped so that freshness can be considered
in bidding (see the registry sketch at the end of this section).
Discussion of various costing policy issues: sale prices, coupons, bulk
purchase contracts, etc.
Don't think any fancy economics got implemented in Mariposa
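A minimal sketch of the timestamped yellow-pages registry described above (invented structure; Mariposa's actual interface may differ):

    import time

    class YellowPages:
        """Minimal timestamped advertisement registry (invented structure)."""
        def __init__(self):
            self.ads = {}   # (server, service) -> (price, timestamp)

        def register(self, server, service, price):
            self.ads[(server, service)] = (price, time.time())

        def lookup(self, service, max_age):
            """Return (server, price) pairs fresh enough to send RFBs to."""
            now = time.time()
            return [(srv, price)
                    for (srv, svc), (price, ts) in self.ads.items()
                    if svc == service and now - ts <= max_age]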
Costing
Simple: charge for CPU cycles and I/O bandwidth, and translate to a delay via
a site-specific multiplicative factor (a toy version appears at the end of this section)
Scale delay or cost by current machine load: gives a simple (though crude)
system-wide feedback for load-balancing. An example of economics
taking the place of a centralized load leveler
Can set pricing per fragment (why?)
They talk about network bandwidth reservation, though this was never implemented
(and is not clearly a win in the network world)
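A toy pricing function in the spirit of this section (all names and constants are invented):

    def make_bid(cpu_cycles, ios, load,
                 price_per_cycle=1e-9, price_per_io=1e-4,
                 cycles_per_sec=1e9, ios_per_sec=100.0):
        """Translate resource estimates into a (cost, delay) bid, scaling
        both by current machine load -- the crude system-wide feedback
        for load-balancing described above. Constants are invented."""
        cost = cpu_cycles * price_per_cycle + ios * price_per_io
        delay = cpu_cycles / cycles_per_sec + ios / ios_per_sec
        factor = 1.0 + load    # busier site -> pricier, slower bids
        return cost * factor, delay * factor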
Storage
Sites can buy and sell fragments; access history required to judge value
of fragments
To run a subquery, Mariposa requires the site to buy any fragments that
aren't resident; this may require evicting (selling) other fragments
(a sketch appears at the end of this section).
Note that the lack of pipelining causes a problem here.
No discussion here of copies; there was a great deal of thought
about how sites could buy copies and contract for updates at varying rates.
In turn, queries could reason about the age of copies. Since they
built on Postgres, time-travel could allow for consistent (though perhaps
dated) query results regardless of the copies chosen.
Fragments could get too big or too small, and splitting/coalescing fragments
was seen as an important optimization issue. Either this could be
determined by economics, or by a more direct mechanism.
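A minimal sketch of the buy-on-demand/evict logic (hypothetical site and fragment attributes; fragment value is stubbed to recent revenue, per the access-history point above):

    def ensure_resident(site, needed):
        """Buy any fragments not already at the site, evicting (selling)
        the least valuable resident fragments to make room."""
        for frag in needed:
            if frag in site.resident:
                continue
            while site.free_space < frag.size:
                victims = site.resident - set(needed)
                # value judged from access history (revenue earned recently)
                victim = min(victims, key=lambda f: f.recent_revenue)
                site.sell(victim)   # hypothetical method; frees victim.size
            site.buy(frag)          # hypothetical method; consumes frag.size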
Names & Nameservice
Mariposa used a different naming scheme than R*.
Really no reason to think it's better than R*'s
Status
Mariposa worked well enough for Jeff Sidell to run some TPC-D queries on
half a dozen machines. The simple load-balancing pricing strategy
easily adapted to changing workloads.
Running queries across the internet resulted in very unpredictable delays.
This undercuts the economic model? See Franklin's "Query Scrambling"
work tomorrow.
Cohera Corp
remove Postgres, map to arbitrary SQL systems
add an SQL interface to a web screen scraper and XML parser (Select * from
inktomi where keywords = "free text")
add support for IR within SQL: keyword indexing (Alta Vista), synonyms,
fuzzy matching (n-grams)
add user-centric tools for managing heterogeneity: mapping and integration
(see Potter's Wheel, http://control.cs.berkeley.edu)
add materialized views
bid curves are not interesting to most users -- small collection of good-behavior,
load-balancing bidding policies for sites, no budgets for queries
now what's the role of economics?
speed up the fast path
pick a first app! e-catalogs. branch out from there.
interesting discussion: in the B2B space, transactions (EAI, workflow,
messaging) have been more successful than queries. Why?