Distributed DBMS: Overview and Query Processing

Goal: a small-scale (dozens of sites) distributed DBMS

1.     Location transparency: users don’t know where the data is

2.     Performance transparency: performance independent of submission site

3.     Copy transparency: objects can be copied, and copies are maintained automatically.  Critical for availability

4.     Transaction transparency: looks like single-site xacts

5.     Fragment “transparency”: tables can be fragmented across different sites

6.     Schema change transparency: a schema update submitted at any single site is reflected in the global schema

7.     Local DBMS transparency: shouldn’t matter what DBMS is running at each site (ha!)

 

Nobody provides all of this (#7 & #3 are still major research areas)

IBM Almaden’s R* was the most “real” DDBMS prototype.

·       VTAM/CICS provided communication

·       Good optimizer, concurrency control

Others:

·       SDD-1: done at CCA, never really ran.  PDP-10s on the Arpanet.  Strange design decisions based on an unreasonably slow network.

·       Distributed INGRES: UCB, also never ran.  Lacked UNIX networking software (pre-TCP/IP!).

·       Commercial vendors beginning to provide this functionality

 

The R* Distributed DBMS

Goals

·       location transparency

·       local privacy & control

·       local performance on local queries

·       no central dependencies: site autonomy

·       no central catalogs

·       no central scheduler

·       no central deadlock detector

 

Tables can be

·       dispersed

·       replicated

·       partitioned:

·       horizontal partitioning (by predicate)

·       vertical partitioning (with the key repeated at each partition)

·       snapshots

 

Table names: user_name@user_site.object_name@birth_site
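
(Hypothetical example: claire@berkeley.emp@madison names the table emp created by user claire, who is registered at Berkeley, with Madison as its birth site; the name keeps working even if the table later migrates to another site.)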

 

Catalogs store info on objects stored locally and on objects born locally (along with pointers to their current sites)

Remote catalog info is cached as hints:

·       invalid cache entries are detected easily and updated

·       catalog entries have version numbers

A catalog entry for a table contains:

·       System-Wide Name

·       type info

·       access paths

·       view definition (if it’s a view)

·       optimizer stats

 

To find a catalog entry:

1.     check local catalog & cache

2.     check birth site

3.     check the place the birth site points to
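
A minimal sketch of this lookup order in Python (the dict-based sites, the table name, and the "pointer" entries are made-up illustrations, not R*'s actual catalog structures):

```python
# Each site has a catalog plus a cache of remote entries (used only as hints).
sites = {
    "madison":  {"catalog": {}, "cache": {}},
    "berkeley": {"catalog": {}, "cache": {}},
    "nyc":      {"catalog": {}, "cache": {}},
}
# emp was born at Madison but has since moved to Berkeley:
sites["berkeley"]["catalog"]["claire@berkeley.emp@madison"] = {"site": "berkeley", "stats": "..."}
sites["madison"]["catalog"]["claire@berkeley.emp@madison"] = {"pointer": "berkeley"}

def lookup(swn, local):
    birth_site = swn.split("@")[-1]          # the SWN always names the birth site
    # 1. check the local catalog, then the local cache of remote entries
    #    (cached entries are hints; version numbers expose stale ones when used)
    entry = sites[local]["catalog"].get(swn) or sites[local]["cache"].get(swn)
    # 2. on a miss, ask the birth site: it has either the entry or a pointer
    if entry is None:
        entry = sites[birth_site]["catalog"][swn]
    # 3. if all we got is a pointer, follow it to the object's current site
    if "pointer" in entry:
        entry = sites[entry["pointer"]]["catalog"][swn]
    sites[local]["cache"][swn] = entry       # remember the answer as a hint
    return entry

print(lookup("claire@berkeley.emp@madison", "nyc"))   # finds the Berkeley entry
```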

 

An overview of Distributed CC:

Every transaction gets an xactno: site_name.time (or SN)

deadlocks: shoot the youngest, largest-numbered xact

 

Commit Protocol – you can’t just decide to commit:

·       some site may not be able to

·       you can’t distinguish inability from slowness!

Two Phase Commit (2PC):

1.     coordinator says to participants: want to commit? (N-1 messages)

2.     Participants all vote yea/nay (N-1 messages)

3.     coordinator tells everyone to commit/abort (N-1 messages)

4.     participants acknowledge message (N-1 messages)

            This does the right thing, but it’s pretty costly.  We will  read about improving this next time.
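
A bare-bones sketch of those four message rounds (Python; no logging, timeouts, or failure recovery, all of which a real 2PC needs, and the Participant class is a made-up stand-in for remote sites):

```python
class Participant:
    """Hypothetical stand-in for a remote site holding part of the transaction."""
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit = name, can_commit
    def prepare(self):            # rounds 1-2: receive "want to commit?", vote yea/nay
        return self.can_commit
    def finish(self, decision):   # rounds 3-4: apply the decision, then acknowledge
        return "ack"

def two_phase_commit(participants):
    # Rounds 1-2: N-1 prepare messages out, N-1 votes back.
    votes = [p.prepare() for p in participants]
    decision = "commit" if all(votes) else "abort"
    # Rounds 3-4: N-1 decision messages out, N-1 acknowledgements back.
    acks = [p.finish(decision) for p in participants]
    assert all(a == "ack" for a in acks)
    return decision

print(two_phase_commit([Participant("B"), Participant("C", can_commit=False)]))  # -> abort
```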

 

Distributed Deadlock Detection:

·       find and resolve local deadlocks

·       pass your waits-for graph around to detect global deadlocks

·       we will read more about how to do this efficiently next time
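
A sketch of the global step, assuming the per-site waits-for edges have already been shipped to one place (the edge data and the "site.timestamp" transaction numbers are invented for illustration):

```python
# Union the per-site waits-for graphs, look for a cycle, and shoot the
# youngest (largest-numbered) transaction in it.
def find_cycle(edges):
    graph = {}
    for waiter, holder in edges:
        graph.setdefault(waiter, []).append(holder)
    def dfs(node, path, seen):
        if node in path:
            return path[path.index(node):]           # found a cycle
        if node in seen or node not in graph:
            return None
        seen.add(node)
        for nxt in graph[node]:
            cycle = dfs(nxt, path + [node], seen)
            if cycle:
                return cycle
        return None
    for start in graph:
        cycle = dfs(start, [], set())
        if cycle:
            return cycle
    return None

local_graphs = {                                      # waits-for edges gathered at each site
    "madison":  [("berkeley.17", "madison.09")],
    "berkeley": [("madison.09", "berkeley.17")],
}
cycle = find_cycle([e for edges in local_graphs.values() for e in edges])
if cycle:
    victim = max(cycle, key=lambda x: int(x.split(".")[1]))   # youngest = largest number
    print("abort", victim)                                    # -> abort berkeley.17
```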

 

Mackert & Lohman: Distributed Query Optimization & Processing

·       Motivation for reading this paper: understand issues/alternatives/tradeoffs in distributed query processing.

·       Goal is not really to study the validation of the R* optimizer.

·       Basic idea in cost evaluation: cost is IOCost + CPUCost + MsgCost + Bit-ShipCost

·       New costs for sending messages.  Note the breakdown into per-message overhead (MsgCost) and shipped data volume (Bit-ShipCost)

·       Simple case: non-replicated, horizontally partitioned single table:

 

Madison: EMP where employer = ‘UW’

Berkeley: EMP where employer = ‘UC’

NYC: EMP where employer = ‘NYU’

Optimal plan is Q(T) = Q(T1) ∪ Q(T2) ∪ … ∪ Q(Tn), where n is the # of partitions and Q(Ti) is the best local plan for partition Ti.

(Can often “semantically” detect that some partitions will yield a null result.)
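
A toy illustration of that plan shape, with semantic pruning of partitions whose defining predicate contradicts the query (the EMP data and predicates are invented):

```python
# Q(T) = Q(T1) U ... U Q(Tn): run the best local plan at each partition site
# and union the results; partitions that provably yield nothing are skipped,
# saving their message and shipping costs entirely.
partitions = {                      # site -> (defining predicate, local tuples)
    "madison":  ("UW",  [("alice", "UW")]),
    "berkeley": ("UC",  [("bob", "UC"), ("carol", "UC")]),
    "nyc":      ("NYU", [("dave", "NYU")]),
}

def query(employer):
    result = []
    for site, (pred, tuples) in partitions.items():
        if pred != employer:        # "semantic" pruning: partition can't qualify
            continue
        # Q(Ti): best local plan at the partition's site; ship the result
        result.extend(t for t in tuples if t[1] == employer)
    return result

print(query("UC"))                  # only Berkeley's partition does any work
```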

 

·       Replicated data: just choose best of all plans: Q(T) = Choose-best(Q(T1), … ,Q(Tn)).

·       More complicated: n-way join queries.  Options include:

1. “Ship Outer”

·       Retrieve qualified tuples at outer site.

·       ship to inner site.

·       join arriving tuples with inner relation (packet by packet.)

·       Note: this is pipelined.
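
A sketch of that pipelined flow, with a Python generator standing in for packets crossing the network (the relations and the inner site's index are invented):

```python
# "Ship outer": qualified outer tuples stream to the inner site in packets and
# are joined as they arrive; no temp relation is materialized anywhere.
outer = [(1, "alice"), (2, "bob"), (3, "carol")]   # at the outer site
inner_index = {1: "sales", 3: "research"}          # at the inner site

def ship_outer(packet_size=2):
    packet = []
    for t in outer:                                # retrieve qualified outer tuples
        packet.append(t)
        if len(packet) == packet_size:
            yield packet                           # "send" a packet to the inner site
            packet = []
    if packet:
        yield packet

for packet in ship_outer():                        # at the inner site: join on arrival
    for key, name in packet:
        if key in inner_index:
            print(name, inner_index[key])
```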

2. “Ship Inner”

·       Retrieve qualifying (inner) tuples at inner site.

·       Send them to outer site.

·       Accumulate temp relation at outer site.

·       join outer with temp at outer site.

·       Note: this is sequential.

3. “Fetch Matches”

·       retrieve qualified outer tuples.

·       send their join column values to inner site.

·       retrieve matching inner tuples.

·       return those inner tuples to outer site.

·       join them on arrival w/waiting outer tuples.

·       repeat until done.

·       Mackert & Lohman argue that this never wins

4. “Ship Both”

·       Retrieve qualified inner tuples.

·       Send to join site.

·       accumulate in a temp relation.

·       retrieve qualified outer tuples.

·       send to join site.

·       as outer tuples arrive, join with temp.

·       (1/2 pipelined, 1/2 sequential.)

·       Might actually make sense if next join is at the join site!

5. “Semijoin strategy”

·       Project outer on the join column(s); send the projection to the inner site.

·       Join inner with this projection, send result to outer site.

·       At outer’s site, join outer with result of semi-join.

·       SDD-1 used semi-joins exclusively (slow network)
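
A toy version of the semijoin flow (invented relations; only the join-column projection and the reduced inner cross the network):

```python
outer = [(1, "alice"), (3, "carol"), (7, "erin")]        # at the outer site
inner = [(1, "sales"), (2, "legal"), (3, "research")]    # at the inner site

join_vals = {key for key, _ in outer}                    # project outer on join column; send to inner site
reduced = [t for t in inner if t[0] in join_vals]        # semijoin at inner site; ship result to outer site
by_key = dict(reduced)                                   # at the outer site, join outer with the shipped result
print([(name, by_key[k]) for k, name in outer if k in by_key])
```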

6. “Bloom join”

·       Scan outer, producing bit vector.

·       send bit vector to inner site.

·       use bit vector filter to reduce inner.

·       send reduced inner to outer, do join.

·       Works pretty well!
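
A toy Bloom join (one hash function and a 64-bit vector, both invented parameters; false positives just mean a few extra inner tuples get shipped before the real join weeds them out):

```python
NBITS = 64
def bloom_bit(value):                                    # single hash function, for brevity
    return hash(value) % NBITS

outer = [(1, "alice"), (3, "carol"), (7, "erin")]        # at the outer site
inner = [(1, "sales"), (2, "legal"), (3, "research")]    # at the inner site

bits = 0
for key, _ in outer:                                     # scan outer, set bits; send the vector to the inner site
    bits |= 1 << bloom_bit(key)

reduced = [t for t in inner if bits & (1 << bloom_bit(t[0]))]   # filter inner with the vector
by_key = dict(reduced)                                   # ship the reduced inner back; join at the outer site
print([(name, by_key[k]) for k, name in outer if k in by_key])
```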

 

Miscellaneous Observations

·       They exploit parallelism in various ways

·       queries pipeline some of the time

·       distributing the work reduces buffer contention, making the optimizer’s estimates “more accurate”

·       Optimizer must incorporate all costs – SDD-1-style assumptions lead to short-sighted design

·       No work on heterogeneity, not even different machine speeds or network links!

·       No fragmentation or replication