Extensibility & Object-Relational Systems

Extensibility: The Ultimate API Design Challenge

The elegance of an extensibility API drives both performance and functionality

Q: What's the difference between an API and a language?

Aside: Program as Proof

A good API is a minimal, sound, complete set of axioms
Minimality must include performance goals! i.e. functionally specify your performance needs.
Example of non-minimal specification: Nemesis.
- Nemesis folks said "servers cause QoS crosstalk, so we won't have servers"
- Real minimal goal was to have accurate accounting to do proportional-share allocation
- They assumed that per-process accounting was a given, rather than designing an extensible accounting API.
Getting the right API is perhaps the most tricky part of systems design.

Note the top-down design philosophy!

Not: how do I wedge in both features X and X' (too specific)
Yet: such thinking always seems to come up (or you end up over-generalizing)
Basic tension in extensible systems: features vs. elegance!
- This becomes a matter of balancing taste/art/maintainability against the desire to satisfy a short-term need (prove you can do X for marketing reasons -- happens in research a lot!) or to satisfy a specific customer (who will pay $$ tomorrow for X).

Related note: the typical SW engineering cycle is top-down:

Brainstorm
Approval for business reasons
Functional spec followed by review
Design spec followed by review (and perhaps redo functional spec)
Code
Test (...code...test...code...test...etc.)

Extensible DBMS Motivation & Politics

In the early 80’s, it became clear that Relational systems were not powerful enough for non-administrative data-intensive applications of the day:

CAD/CAM
CASE
GIS
etc.

Can roll the logic in the app, but two key problems:

would like function-shipping rather than data-shipping
impedance mismatch

Two buzz-phrases began to emerge: "Object-Oriented" and "Extensible" Databases

Much vision & politics ensued:

Various data models (NF2, ER, Functional, Semantic)
Object-Oriented DB System Manifesto (OO-ness).
Third-Generation DB System Manifesto (Extensibility)
Many query languages proposed

Systems were built, companies started, etc.

By now, the dust of battle has cleared:

Small market in "Object Oriented DBMSs" (e.g., persistent programming languages -- EXODUS, ObjectStore, Objectivity, Versant, etc.)
Big market in "Object-Relational DBMSs" -- everybody's latest RDBMS version has some of these features (except MS). Query-based systems with OO features (e.g., Starburst, Postgres, Illustra, Informix & Oracle "Universal Servers", DB2 UDB)
Essentially nobody does both well (though O2 got closest).

Object-relational won in the marketplace, and the remaining OODB companies are struggling to redefine themselves (ObjectStore is now trying to sell an XML database engine!)

Systems History

Influential research systems:

Object-Relational: Starburst (IBM Almaden, now in DB2/UDB) and POSTGRES (Berkeley, now in Informix UDO)
Object-Oriented: Gemstone (Maier at OGI), EXODUS (Wisconsin), Genesis (Texas), Thor (MIT), SHORE (Wisconsin)
Others include O2 (Altair), ORION (MCC), Iris (HP)

Today: focus on POSTGRES, discuss Starburst briefly.

Object-Relational Systems

Informix's buzzphrase for extensible relational systems (courtesy Roger Sippl). Didn't patent the name!

Query-based, extensible systems with some OO features like inheritance and OIDs.

SQL99 is very much like this.

Stonebraker’s application matrix:

            Simple Data    Complex Data
Query       RDBMS          ORDBMS
No Query    File Sys.      OODBMS

Argues that the upper right is growing, and will engulf upper left and lower right.

Hot Applications:

Time-series data
"Asset Management": i.e. multimedia data
GIS

Players:

Informix Universal DB (head of Illustra, body of Informix). First to market, but Informix stumbled. Via multiple M&As, Informix now owns O2 too (via Ardent).
IBM DB2 UDB (head of Starburst, body of DB2). Slower, very strong.
Oracle 8i. Shipping now. Low-tech approach.
NCR (Teradata) bought Wisconsin's Paradise ORDBMS (and DeWitt/Naughton/students)
Other big R vendors still late to the game (Sybase, Tandem, MS, etc.)
O vendors still not running queries

Overview: Things needed in an Object-Relational DBMS

(From the "Third-Generation Database System Manifesto")

The relational data model (as implemented!) is "semantically impoverished":

fixed set of base types (integer, float, etc)
only structuring allowed is normal-form relations
only operations are relational algebra, using comparators from base types

Instead, people want a "semantically rich" data model

extensible ADTs
complex types based on type constructors, and methods for those types
inheritance
"Object identity"

Stonebraker: Extensible ADTs

Seminal paper.

Idea: you should be able to add new atomic types to the system, along with methods for the types, and new access methods.

Type is defined by:

storage size (can be variable)
input method
output method
any other methods a user wishes to provide

Then you could do standard relational processing over those types.

Example: add 2d spatial operators to RDBMS
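
A minimal sketch of the idea, assuming a toy in-memory catalog. The names Box, TypeEntry, CATALOG, register_type and select are hypothetical and are not taken from the paper or from any real system. It registers a 2d box type by its storage size, input method, output method and extra methods, then runs an ordinary selection whose predicate is looked up in the catalog (table-driven):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Box:
    """Axis-aligned 2d rectangle; assumes x1 <= x2 and y1 <= y2."""
    x1: float
    y1: float
    x2: float
    y2: float


@dataclass
class TypeEntry:
    """Catalog row holding the four things that define a type."""
    name: str
    size: Optional[int]                 # storage size in bytes; None = variable
    input_fn: Callable[[str], object]   # external text -> internal value
    output_fn: Callable[[object], str]  # internal value -> external text
    methods: Dict[str, Callable]        # any other user-provided methods


def box_in(text: str) -> Box:
    """Parse '(x1,y1,x2,y2)' into a Box."""
    x1, y1, x2, y2 = (float(p) for p in text.strip("() ").split(","))
    return Box(x1, y1, x2, y2)


def box_out(b: Box) -> str:
    return f"({b.x1:g},{b.y1:g},{b.x2:g},{b.y2:g})"


def box_overlaps(a: Box, b: Box) -> bool:
    """True when the two rectangles share at least one point."""
    return a.x1 <= b.x2 and b.x1 <= a.x2 and a.y1 <= b.y2 and b.y1 <= a.y2


def box_area(b: Box) -> float:
    return (b.x2 - b.x1) * (b.y2 - b.y1)


CATALOG: Dict[str, TypeEntry] = {}


def register_type(entry: TypeEntry) -> None:
    CATALOG[entry.name] = entry


# Four 8-byte floats, hence a fixed size of 32 bytes in this sketch.
register_type(TypeEntry("box", 32, box_in, box_out,
                        {"overlaps": box_overlaps, "area": box_area}))


def select(rows: List[dict], column: str, type_name: str,
           method: str, constant_text: str) -> List[dict]:
    """Plain relational selection; the predicate comes from the catalog."""
    entry = CATALOG[type_name]
    constant = entry.input_fn(constant_text)
    predicate = entry.methods[method]
    return [r for r in rows if predicate(r[column], constant)]
```

Nothing in select knows about boxes: the executor only follows catalog entries, which is the sense in which the new type gets standard relational processing.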

Engineering issues (some from the paper, some not):

Parsing

must know about user-defined types and methods (table-driven)

Optimization

must be able to compute selectivity for user-defined predicates
must know about cost of user-defined methods, and consider predicate pullup
must know how to match user-defined predicates to fancy new access methods
must know whether user-defined join predicates can be evaluated by hash or merge
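
To see why the optimizer needs both selectivity and cost for user-defined predicates, here is a small sketch. It orders selection predicates by the rank metric (selectivity - 1) / cost from the expensive-predicate literature; that metric is brought in for illustration and is not spelled out in these notes. Predicate, rank, order_predicates and expected_cost are hypothetical names, and the numbers are made up:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Predicate:
    name: str
    selectivity: float  # fraction of input tuples that pass, in [0, 1]
    cost: float         # per-tuple evaluation cost, must be > 0


def rank(p: Predicate) -> float:
    """More negative rank = more filtering per unit of cost."""
    return (p.selectivity - 1.0) / p.cost


def order_predicates(preds: List[Predicate]) -> List[Predicate]:
    """Apply predicates in ascending rank order."""
    return sorted(preds, key=rank)


def expected_cost(preds: List[Predicate]) -> float:
    """Expected per-input-tuple cost of applying preds in the given order."""
    total, surviving = 0.0, 1.0
    for p in preds:
        total += surviving * p.cost   # only surviving tuples pay for p
        surviving *= p.selectivity
    return total
```

With a cheap comparison (selectivity 0.1, cost 1) and an expensive user-defined method (selectivity 0.5, cost 100), running the cheap one first costs about 11 per tuple instead of 100.5. Without per-method cost and selectivity in the catalog, the optimizer cannot tell these plans apart.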

Execution:

must have dynamic linking (24x7 operation)
methods called via "function pointers", "functors" or some similar construct
support for "untrusted" functions (hot topic these days)
support for "large objects"
caching for expensive methods (a la subquery caching)
user-definable aggregations

3 functions: init, iter, end
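
A sketch of how the three functions fit together, assuming the executor threads an opaque state value through them. Aggregate, run_aggregate and average are illustrative names, not any system's actual API:

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable


@dataclass
class Aggregate:
    init: Callable[[], Any]          # produce the initial state
    iter: Callable[[Any, Any], Any]  # fold one input value into the state
    end: Callable[[Any], Any]        # turn the final state into the result


def run_aggregate(agg: Aggregate, values: Iterable[Any]) -> Any:
    """What the executor does for one group of tuples."""
    state = agg.init()
    for v in values:
        state = agg.iter(state, v)
    return agg.end(state)


# User-defined average: the state is a (running sum, count) pair.
average = Aggregate(
    init=lambda: (0.0, 0),
    iter=lambda s, v: (s[0] + v, s[1] + 1),
    end=lambda s: s[0] / s[1] if s[1] else None,
)
```

The executor never looks inside the state, so the same loop serves any aggregate the user registers.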

Access Methods

Stonebraker: An access method is a generic object that provides

open, get-first, get-next, close, insert, delete
if it’s fancy, it takes SARGs and evaluates them quickly
needs to provide cost estimates to optimizer
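
The generic interface can be sketched as an abstract class. Everything here is an assumption made for illustration: the method names follow the list above, a SARG is reduced to an inclusive key range, and SortedListAM is a toy stand-in for a real index, with a cost estimate that simply counts qualifying entries:

```python
import bisect
from abc import ABC, abstractmethod
from typing import Any, List, Optional, Tuple

# A SARG in this sketch: inclusive (low, high) key range; None = unbounded.
Sarg = Tuple[Optional[Any], Optional[Any]]


class AccessMethod(ABC):
    @abstractmethod
    def open(self, sarg: Sarg = (None, None)) -> None: ...
    @abstractmethod
    def get_first(self) -> Optional[tuple]: ...
    @abstractmethod
    def get_next(self) -> Optional[tuple]: ...
    @abstractmethod
    def close(self) -> None: ...
    @abstractmethod
    def insert(self, key: Any, rid: Any) -> None: ...
    @abstractmethod
    def delete(self, key: Any, rid: Any) -> None: ...
    @abstractmethod
    def estimate_cost(self, sarg: Sarg) -> float: ...


class SortedListAM(AccessMethod):
    """Toy ordered index over (key, rid) pairs kept in a sorted list."""

    def __init__(self) -> None:
        self._entries: List[Tuple[Any, Any]] = []
        self._start = self._pos = self._end = 0

    def _bounds(self, sarg: Sarg) -> Tuple[int, int]:
        low, high = sarg
        keys = [k for k, _ in self._entries]
        start = 0 if low is None else bisect.bisect_left(keys, low)
        end = len(keys) if high is None else bisect.bisect_right(keys, high)
        return start, end

    def open(self, sarg: Sarg = (None, None)) -> None:
        # The SARG is evaluated once, by binary search, not per tuple.
        self._start, self._end = self._bounds(sarg)
        self._pos = self._start

    def get_first(self) -> Optional[tuple]:
        self._pos = self._start
        return self.get_next()

    def get_next(self) -> Optional[tuple]:
        if self._pos >= self._end:
            return None          # scan exhausted
        entry = self._entries[self._pos]
        self._pos += 1
        return entry

    def close(self) -> None:
        self._start = self._pos = self._end = 0

    def insert(self, key: Any, rid: Any) -> None:
        bisect.insort(self._entries, (key, rid))

    def delete(self, key: Any, rid: Any) -> None:
        self._entries.remove((key, rid))  # raises ValueError if absent

    def estimate_cost(self, sarg: Sarg) -> float:
        start, end = self._bounds(sarg)
        return float(max(end - start, 0))  # entries the scan would touch
```

Note what the sketch leaves out: locking, logging and buffer management, which are exactly the integration problems listed next.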

Problems

integration with CC

can be solved by physical logging
can open up logging interface for AM-specific logging

integration with Buffer Manager

In practice, almost nobody used the Postgres access method extensibility. There was a similar lack of buy-in for Informix's, DB2's and Oracle's extension mechanisms. Wrong API (5-star wizards required)!
GiST!?

Buffer Management

Large objects require new schemes

Transactions

integration with access methods

Note: Extensible ADTs do not fundamentally change a relational system

they fit naturally with the relational model
as Stonebraker shows, they fit naturally into RDBMS query processing
for relational "believers" this is all you should need to solve all problems

POSTGRES

Stonebraker, Rowe, a few staff and many students, 1986-1994. Post-INGRES.

The Postgres Data Model

Co-opt the OO terminology

class = relation
instance = tuple
object-id = tuple-id
method = attribute or function of attributes

Support extensible ADTs

extensible procedures using C functions
binary operators, which interface to extensible AM

Support type constructors

trick: use queries

columns can be parameterized Postquel functions (returns setof, or tuple)
queries can live in fields of a tuple (returns setof or tuple)
another exploitation of the view paradigm!
these derived objects can optionally be cached (never implemented)
nested-dots used to traverse complex object structures
leverages EXISTING techniques for relational processing. Cute!

added array support directly

added class inheritance (gives method inheritance and collection hierarchies)

Implementation Details

originally written in LISP, then ran Lisp2C, resulting in a horrible built-in inheritance mechanism over C
only access methods added to Postgres were done "in house"

B-tree and R-tree early
linear hashing late
GiST added in the last couple years

"Fast Path" to AMs, as an alternative to "Persistent X"

never well-documented or used outside Berkeley -- difference between an extensibility API and a hack.

ADTs as described above
No Overwrite Storage, time travel, etc
Research project was "shut down" in 1994. Two Berkeley students did a major cleanup (removing Lispisms and a number of theses), migrated to SQL, and released Postgres95. This was picked up by freeware hackers on the net, and now PostgreSQL seems to be the serious freeware DB of choice (www.postgresql.org).

Postgres Rule System: Active Database support.

One of the recurring themes of extensible DBMS work. (And shows up in limited ways in OS/Nets and they don't know it -- keep your eyes peeled!)
Quick primer on rule systems:

An old AI idea (Production Rule Systems like OPS5)
Active DB version: Event-Condition-Action (ECA) rules
Events: read, insert, update
Conditions: arbitrary queries that evaluate to true.
Actions: more queries.
Many semantic issues: what to do when 2 rules are triggered (conflict resolution), what are the transaction semantics (immediate, nested, deferred)
Many implementation alternatives: locks, query rewrite, execution
Current DBMSs all support a limited version of this (via triggers), customers want more, more, more.
But AI people know that rule systems were a failure.
A few rules go a long way?
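
A toy ECA engine makes the semantic choices concrete. This sketch picks one point in the design space: immediate execution, conflict resolution by priority, and a hard limit on cascade depth. It is an assumption-laden illustration rather than the Postgres rule system, and Rule, ActiveDB and MAX_DEPTH are hypothetical names:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

MAX_DEPTH = 8  # guard against rules that trigger each other forever


@dataclass
class Rule:
    name: str
    event: str                     # e.g. "insert:orders"
    condition: Callable[..., bool]  # called as condition(db, row)
    action: Callable[..., None]     # called as action(db, row)
    priority: int = 0


class ActiveDB:
    def __init__(self) -> None:
        self.tables: Dict[str, List[dict]] = {}
        self.rules: List[Rule] = []
        self.fired: List[str] = []   # trace of rule firings, in order
        self._depth = 0

    def add_rule(self, rule: Rule) -> None:
        self.rules.append(rule)

    def insert(self, table: str, row: dict) -> None:
        self.tables.setdefault(table, []).append(row)
        self._signal(f"insert:{table}", row)

    def _signal(self, event: str, row: dict) -> None:
        if self._depth >= MAX_DEPTH:
            raise RuntimeError("rule cascade too deep")
        # Conflict resolution: highest priority first; ties keep
        # definition order because sorted() is stable.
        matching = sorted((r for r in self.rules if r.event == event),
                          key=lambda r: -r.priority)
        self._depth += 1
        try:
            for rule in matching:
                if rule.condition(self, row):
                    self.fired.append(rule.name)
                    rule.action(self, row)   # immediate semantics
        finally:
            self._depth -= 1
```

An action that itself inserts can trigger further rules, which is where the questions about ordering, termination and transaction boundaries come from.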

Starburst

Original goal: build a nice playpen for whatever comes next.

Extensible "in-house". Not by users!

No one survey paper seems to capture the work they did. Best bet: "Starburst Mid-Flight: As The Dust Clears", Haas, et al., TKDE 1990

Extensibility features:

User-defined functions:

table expressions: queries or C functions
scalar functions
no dynamic linking

Rule-based Query rewrite engine

a little internal rule system
conditions and actions are C functions that check and change QGM
some nifty rule control mechanisms (rule classes, rule budgets, multiple conflict res.)
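
A rough sketch of a rewrite engine in this style, with conditions and actions as plain functions over a query tree and a budget that caps the number of rule firings. The Node structure is a drastically simplified stand-in for QGM, rule classes are not modeled, and all names here are hypothetical:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    op: str                        # "scan" or "select"
    arg: object                    # table name, or list of predicate strings
    child: Optional["Node"] = None


@dataclass
class RewriteRule:
    name: str
    condition: Callable[[Node], bool]  # checks the query tree
    action: Callable[[Node], Node]     # returns the changed tree


def is_stacked_select(n: Node) -> bool:
    return n.op == "select" and n.child is not None and n.child.op == "select"


def merge_selects(n: Node) -> Node:
    """select(p1, select(p2, X)) -> select(p1 + p2, X)."""
    return Node("select", n.arg + n.child.arg, n.child.child)


def is_empty_select(n: Node) -> bool:
    return n.op == "select" and not n.arg and n.child is not None


def drop_select(n: Node) -> Node:
    return n.child


RULES = [RewriteRule("merge_selects", is_stacked_select, merge_selects),
         RewriteRule("drop_empty_select", is_empty_select, drop_select)]


def apply_once(node: Optional[Node],
               rules: List[RewriteRule]) -> Tuple[Optional[Node], bool]:
    """Fire the first applicable rule, searching top-down."""
    if node is None:
        return None, False
    for rule in rules:
        if rule.condition(node):
            return rule.action(node), True
    new_child, fired = apply_once(node.child, rules)
    if fired:
        return Node(node.op, node.arg, new_child), True
    return node, False


def rewrite(root: Node, rules: List[RewriteRule],
            budget: int) -> Tuple[Node, int]:
    """Rewrite until no rule fires or the budget is spent."""
    used = 0
    while used < budget:
        root, fired = apply_once(root, rules)
        if not fired:
            break
        used += 1
    return root, used
```

The budget gives the engine a guaranteed stopping point even when a rule set does not converge, at the price of possibly returning a partially rewritten query.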

Rule-based query optimizer

specify bottom-up combinations as a grammar ("backward-chaining rules")

Extensible access methods (as in POSTGRES)
"Attachments": routines to be automatically called before/after dealing with an access method

used by Starburst Rule System to generate transition logs
used to implement pre-computed joins (see below)
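
The attachment idea can be sketched as hooks around storage operations. This is a simplified illustration under stated assumptions, not Starburst's actual interface: Table, attach and transition_logger are hypothetical names, and the example attachment records a transition log of the kind a rule system could consume:

```python
from typing import Callable, Dict, List, Tuple

Hook = Callable[["Table", dict], None]


class Table:
    def __init__(self, name: str) -> None:
        self.name = name
        self.rows: List[dict] = []
        self._hooks: Dict[Tuple[str, str], List[Hook]] = {}

    def attach(self, when: str, op: str, routine: Hook) -> None:
        """Register a routine to run before or after insert or delete."""
        if when not in ("before", "after") or op not in ("insert", "delete"):
            raise ValueError(f"unsupported attachment point: {when} {op}")
        self._hooks.setdefault((when, op), []).append(routine)

    def _run(self, when: str, op: str, row: dict) -> None:
        for routine in self._hooks.get((when, op), []):
            routine(self, row)

    def insert(self, row: dict) -> None:
        self._run("before", "insert", row)   # may raise to veto the insert
        self.rows.append(row)
        self._run("after", "insert", row)

    def delete(self, row: dict) -> None:
        self._run("before", "delete", row)
        self.rows.remove(row)                # raises ValueError if absent
        self._run("after", "delete", row)


def transition_logger(op: str, log: List[tuple]) -> Hook:
    """Build an attachment that appends (op, table, row) to log."""
    def routine(table: Table, row: dict) -> None:
        log.append((op, table.name, dict(row)))
    return routine
```

The caller of insert and delete never mentions the log; attaching the routine once is enough, which is what makes the hook-in step cheap compared with writing the attachment itself.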

"Proofs" of Extensibility:

added "signature" attachment to automatically tag tuples with some derived values

though writing the attachment was hard, hooking it in took only 1 day

added Outer Join

not so simple – required adding things in QGM, optimizer, and executor

IMS attachment

pre-computed joins using pointers, which are maintained
written by "outsiders" – Mike Carey & Beau Shekita from Wisconsin – in a summer visit to Almaden

2nd Rule System called "Alert", based on infinitely-running queries. Lightweight and pretty effective.
Recursive query processing!

Many of the Starburst folks took a "vacation" from research, and merged Starburst technology into DB2 UDB.