Goal: a small-scale (dozens of sites) distributed DBMS
1. Location transparency: users don’t know where the data is
2. Performance transparency: performance independent of submission site
3. Copy transparency: objects can be copied, and copies are maintained automatically. Critical for availability
4. Transaction transparency: looks like single-site xacts
5. Fragment “transparency”: tables can be fragmented to different sites
6. Schema change transparency: schema updates at a single site affect global schema
7. Local DBMS transparency: shouldn’t matter what DBMS is running at each site (ha!)
Nobody provides all of this (#7 & #3 are still major research field)
IBM Almaden’s R* was the most “real” DDBMS prototype.
· VTAM/CICS provided communication
· Good optimizer, concurrency control
Others:
· SDD-1: done at CCA, never really ran. PDP-10’s on Arpanet. Strange design decisions based on unreasonably slow network.
· Distributed INGRES: UCB, also never ran. Lack of UNIX networking software (pre TCP/IP!).
· Commercial vendors beginning to provide this functionality
Goals
· location transparency
· local privacy & control
· local performance on local queries
· no central dependencies: site autonomy
· no central catalogs
· no central scheduler
· no central deadlock detector
Tables can be
· dispersed
· replicated
· partitioned:
· horizontal partitioning (by predicate)
· vertical partitioning (but a key at each partition)
· snapshots
Table names: user_name@user_site.object_name@birth_site
Catalogs store info on local objects, and objects born locally (along with pointers to current sites)
Remote catalog info is cached as hints:
· invalid cache entries are detected easily and updated
· catalog entries have version numbers
a catalog entry for a table:
· System-Wide Name
· type info
· access paths
· view definition (if it’s a view)
· optimizer stats
To find a catalog entry:
1. check local catalog & cache
2. check birth site
3. check the place the birth site points to
An overview of Distributed CC:
Every transaction gets an xactno: site_name.time (or SN)
deadlocks: shoot the youngest, largest-numbered xact
Commit Protocol – you can’t just decide to commit:
· some site may not be able to
· you can’t distingush inability from slowness!
Two Phase Commit (2PC):
1. coordinator says to participants: want to commit? (N-1 messages)
2. Participants all vote yea/nay (N-1 messages)
3. coordinator tells everyone to commit/abort (N-1 messages)
4. participants acknowledge message (N-1 messages)
This does the right thing, but it’s pretty costly. We will read about improving this next time.
Distributed Deadlock Detection:
· find and resolve local deadlocks
· pass your waits-for graph around to detect global deadlocks
· we will read more about how to do this efficiently next time
· Motivation for reading this paper: understand issues/alternatives/tradeoffs in distributed query processing.
· Goal is not really to study the validation of the R* optimizer.
· Basic idea in cost evaluation: cost is IOCost + CPUCost + MsgCost + Bit-ShipCost
· New costs for sending messages. Note breakdown into msg overhead & data
· Simple case: non-replicated, horizontally partitioned single table:
Madison: EMP where
employer = ‘UW’
Berkeley: EMP where
employer = ‘UC’
NYC: EMP where employer = ‘NYU’
Optimal plan is Q(T) = where n is # partitions and Q(Ti) is best local plan.
(Can often “semantically”' detect that some partitions will yield null result.)
· Replicated data: just choose best of all plans: Q(T) = Choose-best(Q(T1), … ,Q(Tn)).
· More complicated: n-way join queries. Options include:
1. “Ship Outer”
· Retrieve qualified tuples at outer site.
· ship to inner site.
· join arriving tuples with inner relation (packet by packet.)
· Note: this is pipelined.
2. “Ship Inner”
· Retrieve qualifying (inner) tuples at inner site.
· Send them to outer site.
· Accumulate temp relation at outer site.
· join outer with temp at outer site.
· Note: this is sequential.
3. “Fetch Matches”
· retrieve qualified outer tuples.
· send their join column values to inner site.
· retrieve matching inner tuples.
· return those inner tuples to outer site.
· join them on arrival w/waiting outer tuples.
· repeat until done.
· Mackert & Lohman argue that this never wins
4.“Ship Both”
· Retrieve qualified inner tuples.
· Send to join site.
· accumulate in temp realation.
· retrieve qualified outer tuples.
· send to join site.
· as outer tuples arrive, join with temp.
· (1/2 pipelined, 1/2 sequential.)
· Might actually make sense if next join is at the join site!
5. “Semijoin strategy''
· (Outer) sent to inner site.
· Join inner with this projection, send result to outer site.
· At outer’s site, join outer with result of semi-join.
· SDD-1 used semi-joins exclusively (slow network)
6. “Bloom join”
· Scan outer, producing bit vector.
· send bit vector to inner site.
· use bit vector filter to reduce inner.
· send reduced inner to outer, do join.
· Works pretty well!
Miscellaneous Observations
· They exploit parallelism in various ways
· queries pipeline some of the time
· distributing the work reduces contention at buffers, making optimizer “more accurate”
· Optimizer must incorporate all costs – SDD-1-style assumptions lead to short-sighted design
· No work on heterogeneity, not even different machine speeds or network links!
· No fragmentation or replication