Advanced Topics in Computer Systems
Joe Hellerstein & Eric Brewer
Mariposa: An Agoric Distributed DBMS for WANs
Theme: to scale a DDBMS to 1000's of sites, you need loose coupling. This
can be done with an economic paradigm for resource allocation: query optimization
(scheduling) and data placement.
Bad things about old DDBMSs:
Static data allocation
Single administrative structure
Homogeneity of hardware, software, networks
Goals of Mariposa:
Scale to 1000's of sites
Data Mobility
Local autonomy
Easily and autonomously configurable policies
Mechanism:
Economic (Agoric) paradigm (agora = marketplace)
Allows point-to-point decision-making to work in a more global context
of resource utilization
removes need for centralized decision-making (e.g. global optimizer)
in essence, implicit aggregate information (i.e. feedback) is passed
around the network, resulting in rational behavior
Some important cautionary notes:
Economics is just a useful metaphor for doing decentralized resource allocation.
This decentralized scheme will not be optimal. The goal is to scale, at
the expense of guaranteed optimality.
One can spend years arguing about policy, and trying to tune the system
by changing policy.
Mariposa implemented mechanisms, but did very little research on policy.
Premature discussions of policy can be VERY distracting -- first put mechanism
in place, THEN play!
In short, it's instructive to study Mariposa without any economics
... there are lessons there. Then think about economics; there
are fewer lessons there from Mariposa, but of course many interesting questions.
Architecture and Life of a Query
Data layout: horizontally partitioned table fragments (logical or not),
and replicas (of varying freshness)
3-tier architecture: clients, middleware, local site manager. Mariposa
requires local sites to run Postgres.
Query is generated by the application, with an accompanying "bid curve" (budget $ as a function of Delay).
Query planning is a two-phase optimizer a la XPRS, but with decoupled
costing. It works more or less as follows:
optimizer (middleware) runs Selinger as if it were a local single-site
query (this is phase 1; the rest is phase 2)
fragmenter (middleware) breaks resulting query plan into pieces,
arranges pieces in trivially parallelizable "strides" (which are NOT pipelined
together, for no apparent reason)
broker (middleware) sends out Requests For Bid (RFBs) to sites that
might be interested (more on this later)
bidder (local site) returns a "bid" for a piece of work, consisting
of triple (Cost,Delay,Expiration)
coordinator (middleware) accepts bids, constructs a final plan,
and informs local sites of their jobs
Query processing is controlled by the coordinator and the local executors
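To make the protocol concrete, here is a minimal sketch (invented names and structures, not Mariposa's actual code) of the broker/bidder exchange, with bids as the paper's (Cost, Delay, Expiration) triples:

    from dataclasses import dataclass

    @dataclass
    class Bid:
        site: str           # bidder identity
        cost: float         # dollars charged for the piece of work
        delay: float        # promised completion time
        expiration: float   # bid is void after this time

    def broker_round(pieces, sites, now):
        """Long protocol: one Request For Bid (RFB) per plan piece;
        collect (Cost, Delay, Expiration) bids and drop expired ones."""
        bids = {}
        for piece in pieces:
            offers = [s.bid_on(piece) for s in sites]   # bid_on is a hypothetical site method
            bids[piece] = [b for b in offers if b.expiration > now]
        return bids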
Question: can't we have decoupled costing, but do better than two-phase?
Amol Deshpande's MS studied this problem in Cohera
Design space: separate the cost model into full-power (known vs. unknown) and
runtime-power/willingness (static vs. dynamic) parts, and cross that with the
plan-enumeration strategy:

                       Exhaustive      Heuristic Pruning  Two-Phase  Randomized
Dynamic/Unknown Costs  (open)          (open)             (open)     (open)
Static/Unknown Costs   Garlic
Dynamic/Known Costs    Parametric Opt                     Mariposa
Static/Known Costs     R*              IDP                           Simulated Annealing, etc.

Want to be in the uppermost row (Dynamic/Unknown Costs).
Idea 1 -- upper left box: do Exhaustive (Selinger with RPCs for cost estimation).
Too many rounds of messages (exponential!). I.e. you have to worry
about the cost of costing.
If you're going to allow local executors to reoptimize their own work,
then you don't prune anything per subquery (subset of relations).
So, can generate all the needed cost-estimation requests as a batch, based
on query graph (relationships between tables)
One round of messages and you get Selinger (i.e. can achieve upper left
box!)
Still, the added complexity of site placement can be painful
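A sketch of the batching idea under stated assumptions (connected() and estimate_batch() are hypothetical helpers): enumerate every connected subset of relations from the query graph up front, send each site one batch of cost requests, and feed the answers to ordinary Selinger DP.

    from itertools import combinations

    def batched_costs(relations, sites, query_graph):
        """One round of messages: ask each site to cost every candidate
        subplan (connected subset of relations) in a single batch."""
        subsets = [frozenset(c)
                   for k in range(1, len(relations) + 1)
                   for c in combinations(relations, k)
                   if connected(c, query_graph)]        # connected() is an assumed helper
        cost_table = {}
        for site in sites:
            # one message per site, covering all subsets at once
            for subset, cost in site.estimate_batch(subsets).items():   # hypothetical API
                cost_table[(site, subset)] = cost
        return cost_table   # feed this table to a standard Selinger DP

Note the batch is exponential in the number of relations -- consistent with the result below that message costs stay high because messages get big.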
Idea 2: Upper second-to-left box
"Iterative Dynamic Programming" (IDP, Kossman, et al.)
Do part of the DP table (say, up to k-way joins) with batched, distributed
costing.
Then bid that out, prune to the single best k-way subplan, and start over
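A rough sketch of the IDP loop, reusing batched_costs from above and assuming a hypothetical dp_enumerate helper that runs Selinger-style DP up to a given plan size:

    def idp(relations, k, sites, query_graph):
        """Iterative Dynamic Programming: DP up to k-way joins using the
        batched cost table, commit to the best k-way subplan, collapse it
        into a single base relation, and repeat."""
        relations = set(relations)
        while len(relations) > 1:
            costs = batched_costs(relations, sites, query_graph)
            best = dp_enumerate(relations, costs, max_size=k)   # assumed Selinger-style helper
            # treat the chosen k-way subplan as a single base "relation"
            relations = (relations - best.inputs) | {best.as_relation()}
        return relations.pop()   # the fully collapsed plan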
Idea 3: Upper second-to-right box
Mariposa + Garlic. Two-phaseness still inherent.
Idea 4: Upper rightmost box
Unfortunately, no way to batch costing in randomized, since you "move"
randomly in the space
Results:
Exhaustive works great if you can afford it. Message costs still
high since messages become big.
IDP is very sensitive to the parameter k. But k = 3 or 4 works well
for reasonable-size plans, so can run with both and keep the better result.
Mariposa actually works quite well except when its assumptions about what's
"known" are really wrong (e.g. presence of materialized views)
Details of Bidding
Budget B(t) (bid curve) is a non-increasing function of time
Broker "bids out" single-table subplans or joins of two fragments
Two protocols: the Long Protocol ("expensive bid") and the Short Protocol
("purchase order")
Long protocol is as described above
In short protocol, heuristics determine where to send work orders (e.g.
scans at storage sites)
Choosing among bids:
they do this stride-by-stride; total delay D is the sum of the strides'
delays, and the total cost C must be <= B(D)
note that there's parallelism within a stride, so delay for a stride's
bid collection is MAX of delays of the bids in the collection (whereas
cost is SUM of costs)
ideally, want to pick the "lowest point below the bid curve" -- i.e. maximize
the difference B(D) - C
greedy heuristic used in Mariposa (a toy version appears after this list):
start with the minimal-delay bid collection for each stride.
for each longer-delay collection, compute its cost gradient: (decrease
in cost)/(increase in delay) if we switch to this collection
swap in the maximum cost-gradient alternative if the difference B(D) - C
increases. Recompute cost gradients.
they can approximate in poly time the entire Pareto boundary (all points
that are not dominated in both dimensions)
i.e. they can draw the bid curve for you, let you choose!
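A toy version of the greedy heuristic above (data structures invented for illustration; each stride has a list of candidate (cost, delay) bid collections, where a collection's delay is the MAX and its cost the SUM of its bids):

    def greedy_choose(strides, B):
        """strides: one list of candidate (cost, delay) bid collections
        per stride. B: the bid curve, a non-increasing function of delay."""
        # start from the minimal-delay collection in each stride
        chosen = [min(options, key=lambda o: o[1]) for options in strides]

        def surplus(sel):
            C = sum(c for c, d in sel)
            D = sum(d for c, d in sel)     # strides run sequentially
            return B(D) - C

        while True:
            best, best_grad = None, 0.0
            for i, options in enumerate(strides):
                c0, d0 = chosen[i]
                for c, d in options:
                    if d > d0 and c < c0:
                        grad = (c0 - c) / (d - d0)   # cost saved per unit of added delay
                        if grad > best_grad:
                            best, best_grad = (i, (c, d)), grad
            if best is None:
                return chosen
            i, alt = best
            trial = chosen[:i] + [alt] + chosen[i + 1:]
            if surplus(trial) <= surplus(chosen):    # only swap if B(D) - C improves
                return chosen
            chosen = trial

Recording each (D, C) point the loop visits gives the kind of polynomial-time Pareto-boundary approximation mentioned above.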
This is described as a "bottom-up" strategy. An alternative "top-down"
strategy bids out large sub-pieces, which can in turn be busted up by the
bidders and "subcontracted" into separate pieces
Discussion: backing off from the details, what is the plan space,
what is "bidding" for, and what about other approaches?
cannot expect to centralize cost estimation in the traditional static way
how do you measure performance in an adaptive system?!?
economics will have somewhat coarse-grained adaptivity (no more than
once per query)
vs. other schemes? eddies for federation?
Metadata
How does the broker find sites to get bids? Yellow Pages and
advertisements.
Servers can register services (with prices) at a variety of yellow page
servers. These are timestamped so that freshness can be considered
in bidding (see the registry sketch at the end of this section).
Discussion of various costing policy issues: sale prices, coupons, bulk
purchase contracts, etc.
Don't think any fancy economics got implemented in Mariposa
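A minimal sketch of the timestamped yellow-pages registry described above (invented structure; Mariposa's actual interface may differ):

    import time

    class YellowPages:
        """Minimal timestamped advertisement registry (invented structure)."""
        def __init__(self):
            self.ads = {}   # (server, service) -> (price, timestamp)

        def register(self, server, service, price):
            self.ads[(server, service)] = (price, time.time())

        def lookup(self, service, max_age):
            """Return (server, price) pairs fresh enough to send RFBs to."""
            now = time.time()
            return [(srv, price)
                    for (srv, svc), (price, ts) in self.ads.items()
                    if svc == service and now - ts <= max_age]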
Costing
Simple: charge for CPU cycles and I/O bandwidth, and translate to a delay via
a site-specific multiplicative factor (a toy version appears at the end of this section)
Scale delay or cost by current machine load: gives a simple (though crude)
system-wide feedback for load-balancing. An example of economics
taking the place of a centralized load leveler
Can set pricing per fragment (why?)
They talk about network bandwidth reservation, though this was never implemented
(and is not clearly a win in the network world)
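A toy pricing function in the spirit of this section (all names and constants are invented):

    def make_bid(cpu_cycles, ios, load,
                 price_per_cycle=1e-9, price_per_io=1e-4,
                 cycles_per_sec=1e9, ios_per_sec=100.0):
        """Translate resource estimates into a (cost, delay) bid, scaling
        both by current machine load -- the crude system-wide feedback
        for load-balancing described above. Constants are invented."""
        cost = cpu_cycles * price_per_cycle + ios * price_per_io
        delay = cpu_cycles / cycles_per_sec + ios / ios_per_sec
        factor = 1.0 + load    # busier site -> pricier, slower bids
        return cost * factor, delay * factor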
Storage
Sites can buy and sell fragments; access history required to judge value
of fragments
To run a subquery, Mariposa requires the site to buy any fragments that
aren't resident; this may require evicting (selling) other fragments
(a sketch appears at the end of this section).
Note that the lack of pipelining causes a problem here.
No discussion here of copies; there was a great deal of thought
about how sites could buy copies and contract for updates at varying rates.
In turn, queries could reason about the age of copies. Since they
built on Postgres, time-travel could allow for consistent (though perhaps
dated) query results regardless of the copies chosen.
Fragments could get too big or too small, and splitting/coalescing fragments
was seen as an important optimization issue. Either this could be
determined by economics, or by a more direct mechanism.
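A minimal sketch of the buy-on-demand/evict logic (hypothetical site and fragment attributes; fragment value is stubbed to recent revenue, per the access-history point above):

    def ensure_resident(site, needed):
        """Buy any fragments not already at the site, evicting (selling)
        the least valuable resident fragments to make room."""
        for frag in needed:
            if frag in site.resident:
                continue
            while site.free_space < frag.size:
                victims = site.resident - set(needed)
                # value judged from access history (revenue earned recently)
                victim = min(victims, key=lambda f: f.recent_revenue)
                site.sell(victim)   # hypothetical method; frees victim.size
            site.buy(frag)          # hypothetical method; consumes frag.size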
Names & Nameservice
Mariposa used a different naming scheme than R*.
Really no reason to think it's better than R*'s
Status
Mariposa worked well enough for Jeff Sidell to run some TPC-D queries on
half a dozen machines. The simple load-balancing pricing strategy
easily adapted to changing workloads.
Running queries across the internet resulted in very unpredictable delays.
This undercuts the economic model? See Franklin's "Query Scrambling"
work tomorrow.
Cohera Corp
remove Postgres, map to arbitrary SQL systems
add an SQL interface to a web screen scraper and XML parser (Select * from
inktomi where keywords = "free text")
add support for IR within SQL: keyword indexing (Alta Vista), synonyms,
fuzzy matching (n-grams)
add user-centric tools for managing heterogeneity: mapping and integration
(see Potter's Wheel, http://control.cs.berkeley.edu)
add materialized views
bid curves are not interesting to most users -- small collection of good-behavior,
load-balancing bidding policies for sites, no budgets for queries
now what's the role of economics?
speed up the fast path
pick a first app! e-catalogs. branch out from there.
interesting discussion: in the B2B space, transactions (EAI, workflow,
messaging) have been more successful than queries. Why?