Extensibility & Object-Relational Systems

DBMS vs. OS Extensibility

OS extensibility raises the frightening question "what is an operating system?"

is it UNIX with hooks to allow control over certain policies?
is it a hardware abstraction with the minimum functionality to do protection and accounting? Why is that "extensibility", and not "minimality"?
David Boies and Bill Gates did little to clarify the question.
"If you have to ask, you'll never know." -- Louis Armstrong, when asked "What is Jazz?"

In the end, OS extensibility is hard to pin down because OSes have no functional spec.

By contrast, relational databases had a clear functional spec, SQL. After its initial success, there was interest in bringing the functionality to new apps whose needs the current spec did not meet. This all got tied up in a war about object-orientedness (the DB community's history of models and languages became necessarily intertwined with the system issues; OOness was hot back then). This resulted in two research thrusts:

Object-Relational DBMSs: Inject more functionality "through" the relational spec, while keeping things declarative and high-performance.
Object-Oriented DBMSs: Take an OO programming language, add persistence and transactions, and some declarative features. ("Making Smalltalk a Database System", Copeland & Maier, SIGMOD 1984.)

Extensibility: The Ultimate API Design Challenge

Elegance of extensibility API drives both Performance and Functionality

Q: What's the difference between an API and a language?

Program as Proof, API as axioms

A good API is a minimal, sound, complete set of axioms
Minimality must include performance goals! i.e. functionally specify your performance needs.
Example of non-minimal specification: Nemesis.

Nemesis folks said "servers cause QoS crosstalk, so we won't have servers"
Real minimal goal was to have accurate accounting to do proportional-share allocation
They assumed that per-process accounting was a given, rather than designing an extensible accounting API.

Getting the right API is perhaps the most tricky part of systems design.

Note the top-down design philosophy!

What are the basic hooks needed to do all variants of feature X that one could imagine?

or more fundamentally, what is feature X really about, in a fundamental, variant-independent way?
a wonderful question to ask -- the best part of doing extensibility research is that you clean up ideas.

Not: how do I wedge in variants X1 and X2? (too specific, bottom-up)
Yet: such thinking always seems to come up

Basic engineering tension in extensible systems: flexibility vs. elegance

Overgeneralizing drives you to a meaningless lowest common denominator (LCD). The Turing machine is NOT a nice extensibility API.
"Things should be made as simple as possible -- but no simpler" -- Einstein

Why isn't extensibility more popular in industry?

Balancing taste/evolvability/manageability against the desire to satisfy a short-term need (prove you can do X for marketing reasons -- happens in research a lot!) or to satisfy a specific customer (who will pay $$ tomorrow for X).
Big systems have very few "users" who are knowledgeable enough to deal with extensibility features. As a result, each such user is irrelevant unless they're huge. And if they're huge, they merit a custom hack.
Extensibility is therefore hard to market and hard to monetize: a long-term investment.

What's Wrong with Relational?

In the early 80’s, it became clear that Relational systems were not powerful enough for non-administrative data-intensive applications of the day:

CAD/CAM
CASE
GIS
etc.

Can roll the logic in the app, but two key problems:

would like function-shipping rather than data-shipping
impedance mismatch

The relational data model (as implemented!) is "semantically impoverished":

fixed set of base types (integer, float, etc)
only structuring allowed is normal-form relations
only operations are relational algebra, using comparators from base types

Instead, people want a "semantically rich" data model

extensible ADTs
complex types based on type constructors, and methods for those types
inheritance
"Object identity"

Much vision & politics ensued:

Various data models (NF2, ER, Functional, Semantic)
Object-Oriented DB System Manifesto (OO-ness).
Third-Generation DB System Manifesto (Extensibility)
Many query languages proposed

Systems were built, companies started, etc.

By now, the dust of battle has cleared:

Small market in "Object Oriented DBMSs" (e.g., Persistent program languages -- EXODUS, ObjectStore, Objectivity, Versant, etc.)
Big market in "Object-Relational DBMSs" -- everybody's latest RDBMS version has some of these features (except MS). Query-based systems with OO features (e.g. Starburst->DB2, Postgres->Illustra->Informix, Oracle)
Essentially nobody does both well (though O2 got closest).

Or has it? XML looks kind of like a (sloppy, embarrassing) nested object representation, and XPath/XSL looks like the declarative features in some of the OODBs. ObjectStore (Excelon) and Poet now trying to sell XML engines!

Systems History

Influential research systems:

Object-Relational: Starburst (IBM Almaden, now in DB2) and POSTGRES (Berkeley, now in Informix)
Object-Oriented: Gemstone (Maier at OGI), EXODUS (Wisconsin), Genesis (Texas), Thor (MIT), SHORE (Wisconsin)
Others include ObjectStore (Object Design), O2 (Altair), ORION (MCC), Iris (HP)

Today: focus on POSTGRES, discuss Starburst briefly.

Object-Relational Systems

Informix's buzzphrase for extensible relational systems (courtesy Roger Sippl). Didn't patent the name!

Query-based, extensible systems with some OO features like ADTs, inheritance, complex types, OIDs.

SQL99 is very much like this.

Stonebraker’s application matrix:

Query	RDBMS	ORDBMS
No Query	File Sys.	OODBMS
	Simple Data	Complex Data

Argues that the upper right is growing, and will engulf upper left and lower right.

Some Major Research Challenges in Obj-Rel Extensibility

ADTs and efficient declarative queries -- suddenly the semantics are encapsulated even to the DBMS! Major theme of Postgres. Many characteristic extensible API issues -- what MUST you know, what can you generalize without knowledge of specifics?

How to efficiently index opaque types (recall that indexing = storage = CC&R!)
How to optimize queries with opaque methods

expensive methods
new algebraic operators and transformations (extensible optimizers)

Complex Objects and efficiency. A mix of extensibility features with "more hard problems in query optimization and indexing".

How to teach optimizer about new operators and their legal transformations. (Exodus/Volcano, Starburst).

How to efficiently process queries on nested sets, arrays, etc.

How to do data modeling in this world.

Applications:

Time-series data
Non-real-time multimedia data (digital assets, fingerprints, signatures, check images, etc.)
GIS
Complex web stuff (though got to market too slowly!)
Scientific apps

Players:

Informix (merged Illustra's extensibility features, + GiST??). First to market, but Informix stumbled. Via multiple M&As, Informix now owns O2 too (via Ardent).
IBM DB2. Slower to ship, very strong.
Oracle 8i & 9. Shipping now. Low-tech loose integration approach solves the marketing problem, works for the low end.
NCR (Teradata) bought Wisconsin's Paradise ORDBMS (and DeWitt/Naughton/students). Seem to have killed it.
Other big R vendors still late to the game (Sybase, Tandem, MS, etc.)
O vendors still not running queries, fleeing the field of battle. Can they fight Oracle et al. on an XML front? (Doubt it!)

Stonebraker: Extensible ADTs

Seminal paper.

Idea: you should be able to add new atomic types to the system, along with methods for the types, and new access methods.

Type is defined by:

storage size (can be variable)
input method
output method
any other methods a user wishes to provide

Then you could do standard relational processing over those types.

Example: add 2d spatial operators to RDBMS
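A minimal sketch of the idea, in Python rather than the C a real system would use. The names `TypeDef`, `CATALOG`, `register_type` and `select` are hypothetical, invented for illustration; they are not the POSTGRES catalog or API. It shows a 2-d point type registered with a storage size, an input method, an output method and one extra user method, and then used in an ordinary relational selection.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class TypeDef:
    name: str
    size: Optional[int]                 # storage size in bytes; None = variable
    input_fn: Callable[[str], Any]      # external text -> internal value
    output_fn: Callable[[Any], str]     # internal value -> external text
    methods: Dict[str, Callable[..., Any]] = field(default_factory=dict)


# Hypothetical catalog: the table a table-driven parser would consult.
CATALOG: Dict[str, TypeDef] = {}


def register_type(t: TypeDef) -> None:
    CATALOG[t.name] = t


def point_in(text: str):
    # Parse "(x,y)" into a pair of floats.
    x, y = text.strip("() ").split(",")
    return (float(x), float(y))


def point_out(p) -> str:
    return f"({p[0]:g},{p[1]:g})"


def point_in_box(p, lo, hi) -> bool:
    # User-defined spatial predicate: is p inside the box [lo, hi]?
    return lo[0] <= p[0] <= hi[0] and lo[1] <= p[1] <= hi[1]


register_type(TypeDef("point", 16, point_in, point_out,
                      {"in_box": point_in_box}))


def select(rows, column, type_name, method, *args):
    # Standard selection; the engine only looks the method up and calls it.
    fn = CATALOG[type_name].methods[method]
    return [r for r in rows if fn(r[column], *args)]


cities = [
    {"name": "a", "loc": point_in("(1,1)")},
    {"name": "b", "loc": point_in("(5,5)")},
]
hits = select(cities, "loc", "point", "in_box", (0.0, 0.0), (2.0, 2.0))
```

Note that `select` never inspects the point itself: the semantics stay encapsulated in the registered methods, which is exactly what makes optimization and indexing hard.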

Engineering issues (some from the paper, some not):

Parsing

must know about user-defined types and methods (table-driven)

Optimization

must be able to compute selectivity for user-defined predicates
must know about cost of user-defined methods, and consider predicate pullup
must know how to match user-defined predicates to fancy new access methods
must know whether user-defined join predicates can be evaluated by hash or merge
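Why the optimizer needs both selectivity and cost for user-defined predicates can be seen in a small sketch. It covers only the ordering of selections on a single input, not pullup past joins. The rank metric `(selectivity - 1) / cost` is the usual one from the literature on expensive predicates and is an assumption here; these notes do not state it. `Predicate`, `expected_cost` and `order_by_rank` are hypothetical names.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Predicate:
    name: str
    selectivity: float   # fraction of tuples that pass, in (0, 1]
    cost: float          # per-tuple evaluation cost, > 0


def expected_cost(order):
    # Per-tuple cost of applying predicates in sequence: a predicate is
    # only evaluated on tuples that survived all earlier ones.
    total, surviving = 0.0, 1.0
    for p in order:
        total += surviving * p.cost
        surviving *= p.selectivity
    return total


def rank(p):
    return (p.selectivity - 1.0) / p.cost


def order_by_rank(preds):
    # Ascending rank: cheap and selective predicates go first.
    return sorted(preds, key=rank)


preds = [
    Predicate("cheap_unselective", 0.9, 1.0),
    Predicate("expensive_selective", 0.1, 100.0),
    Predicate("cheap_selective", 0.2, 2.0),
]
best = order_by_rank(preds)
brute_force = min(expected_cost(o) for o in permutations(preds))
```

On this input the rank ordering matches the brute-force minimum over all permutations, which is the point: neither "most selective first" nor "cheapest first" alone gets it right.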

Execution:

must have dynamic linking (24x7 operation)
methods called via "function pointers", "functors" or some similar construct
support for "untrusted" functions (hot topic that DB people ignored to their loss)
support for "large objects"
caching for expensive methods (a la subquery caching)
user-definable aggregations

3 functions: init, iter, end
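A sketch of the three-function aggregate interface, assuming the obvious reading of init / iter / end (create state, fold in one value, produce the result). The `Aggregate` class and `run_aggregate` driver are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable


@dataclass
class Aggregate:
    init: Callable[[], Any]           # create the initial state
    iter: Callable[[Any, Any], Any]   # fold one value into the state
    end: Callable[[Any], Any]         # turn the final state into a result


def run_aggregate(agg: Aggregate, values: Iterable[Any]) -> Any:
    # The executor drives the loop; the user supplies only the 3 functions.
    state = agg.init()
    for v in values:
        state = agg.iter(state, v)
    return agg.end(state)


# User-defined average: state is (running sum, count).
average = Aggregate(
    init=lambda: (0.0, 0),
    iter=lambda s, v: (s[0] + v, s[1] + 1),
    end=lambda s: s[0] / s[1] if s[1] else None,
)

result = run_aggregate(average, [1, 2, 3])
```

Keeping the state explicit is what lets the engine own the iteration (and, in principle, grouping), while the extension writer never sees more than one value at a time.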

Access Methods

Stonebraker: An access method is a generic object that provides

open, get-first, get-next, close, insert, delete
if it’s fancy, it takes SARGs and evaluates them quickly
needs to provide cost estimates to optimizer
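The interface above can be sketched as an abstract class plus a toy implementation. Everything here is an assumption for illustration: the method names are Pythonized from the list above, a SARG is modeled as an inclusive key range, and the "index" is just a sorted in-memory list. It ignores concurrency control, logging and the buffer manager, which are the hard parts.

```python
import bisect
from abc import ABC, abstractmethod
from typing import Any, List, Optional, Tuple

Sarg = Tuple[Any, Any]   # hypothetical SARG form: inclusive range (lo, hi)


class AccessMethod(ABC):
    @abstractmethod
    def open(self, sarg: Optional[Sarg] = None) -> None: ...
    @abstractmethod
    def get_first(self) -> Optional[Any]: ...
    @abstractmethod
    def get_next(self) -> Optional[Any]: ...
    @abstractmethod
    def close(self) -> None: ...
    @abstractmethod
    def insert(self, key: Any) -> None: ...
    @abstractmethod
    def delete(self, key: Any) -> None: ...
    @abstractmethod
    def estimate_cost(self, sarg: Optional[Sarg]) -> float: ...


class SortedListAM(AccessMethod):
    """Toy access method; keys must be comparable and not None."""

    def __init__(self) -> None:
        self._keys: List[Any] = []
        self._start = self._pos = self._end = 0

    def _bounds(self, sarg):
        if sarg is None:
            return 0, len(self._keys)
        lo, hi = sarg
        return (bisect.bisect_left(self._keys, lo),
                bisect.bisect_right(self._keys, hi))

    def open(self, sarg=None):
        # The SARG is evaluated once, by binary search, at open time.
        self._start, self._end = self._bounds(sarg)
        self._pos = self._start

    def get_first(self):
        self._pos = self._start
        return self.get_next()

    def get_next(self):
        if self._pos >= self._end:
            return None          # end of scan
        key = self._keys[self._pos]
        self._pos += 1
        return key

    def close(self):
        self._start = self._pos = self._end = 0

    def insert(self, key):
        bisect.insort(self._keys, key)

    def delete(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            del self._keys[i]

    def estimate_cost(self, sarg):
        # Cost estimate for the optimizer: entries the scan would touch.
        start, end = self._bounds(sarg)
        return float(end - start)


def scan(am: AccessMethod, sarg: Optional[Sarg] = None) -> list:
    am.open(sarg)
    out, key = [], am.get_first()
    while key is not None:
        out.append(key)
        key = am.get_next()
    am.close()
    return out
```

The executor code in `scan` depends only on the abstract interface, so a new access method can be dropped in without touching it.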

Problems

integration with CC

can be solved by physical logging
can open up logging interface for AM-specific logging

integration with Buffer Manager

In practice, almost nobody used the Postgres access method extensibility. Similar lack of buy-in for Informix's, DB2's and Oracle's extension mechanisms. Wrong API (5-star wizards required)!
GiST!?

Buffer Management

Large objects require new schemes

Transactions

integration with access methods

Note: Extensible ADTs do not fundamentally change a relational system

they fit naturally with the relational model
as Stonebraker shows, they fit naturally into RDBMS query processing
for relational "believers" this is all you should need to solve all problems

POSTGRES

Stonebraker, Rowe, a few staff and many students, 1986-1994. Post-INGRES.

The Postgres Data Model

Co-opt the OO terminology

class = relation
instance = tuple
object-id = tuple-id
method = attribute or function of attributes

Support extensible ADTs

extensible procedures using C functions
binary "operators", which interface to extensible AM

Support type constructors

trick: use queries

columns can be parameterized Postquel functions (returns setof, or tuple)
queries can live in fields of a tuple (returns setof or tuple)
another exploitation of the view paradigm!
these derived objects can optionally be cached (never implemented)
nested-dots used to traverse complex object structures
leverages EXISTING techniques for relational processing. Cute!

added array support directly

added class inheritance (gives method inheritance and collection hierarchies)

Implementation Details

originally written in LISP, then ran Lisp2C, resulting in a horrible built-in inheritance mechanism over C
only access methods added to Postgres were done "in house"

B-tree and R-tree early
linear hashing late
GiST added in the last couple years

"Fast Path" to AMs, as an alternative to "Persistent X"

never well-documented or used outside Berkeley -- difference between an extensibility API and a hack.

ADTs as described above
No Overwrite Storage, time travel, etc
Research project was "shut down" in '94. Two Berkeley students did a major cleanup (remove lispisms, remove a number of theses), migrated to SQL, and released Postgres95. This was picked up by freeware hackers on the net, and now PostgreSQL seems to be the serious freeware db of choice (www.postgresql.org).

Postgres Rule System: Active Database support.

One of the recurring themes of extensible DBMS work. (And shows up in limited ways in OS/Nets and they often don't know it -- keep your eyes peeled! Rule systems are easy to specify and hard to reason about -- dangerous!)
Quick primer on rule systems:

An old AI idea (Production Rule Systems like OPS5 from 70's)
Active DB version: Event-Condition-Action (ECA) rules
Events: read, insert, update
Conditions: arbitrary queries return true.
Actions: more queries.
Many semantic issues: what to do when 2 rules are triggered (conflict resolution), what are the transaction semantics (immediate, nested, deferred)
Many implementation alternatives: locks, query rewrite, execution
Current DBMSs all support a limited version of this (via triggers), customers want more, more, more.
But AI people know that rule systems were a failure.
A few rules go a long way?
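A minimal ECA sketch, to make the semantic issues concrete. It assumes immediate execution, conflict resolution by a numeric priority, and a depth limit to stop runaway cascades; all three are choices made here for illustration, not the Postgres rule system's design. `Rule` and `Database` are hypothetical names, and only insert events are modeled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class Rule:
    event: str                                   # e.g. "insert:orders"
    condition: Callable[["Database", Row], bool]
    action: Callable[["Database", Row], None]
    priority: int = 0                            # higher fires first


class Database:
    def __init__(self, max_depth: int = 10) -> None:
        self.tables: Dict[str, List[Row]] = {}
        self.rules: List[Rule] = []
        self.max_depth = max_depth
        self._depth = 0

    def insert(self, table: str, row: Row) -> None:
        self.tables.setdefault(table, []).append(row)
        self._fire(f"insert:{table}", row)       # immediate semantics

    def _fire(self, event: str, row: Row) -> None:
        # Actions may trigger further events; bound the cascade.
        if self._depth >= self.max_depth:
            raise RuntimeError("rule cascade too deep")
        self._depth += 1
        try:
            triggered = [r for r in self.rules if r.event == event]
            # Conflict resolution: highest priority first.
            for rule in sorted(triggered, key=lambda r: -r.priority):
                if rule.condition(self, row):
                    rule.action(self, row)
        finally:
            self._depth -= 1


db = Database()
db.rules.append(Rule(
    event="insert:orders",
    condition=lambda d, r: r["qty"] > 100,
    action=lambda d, r: d.insert("audit", {"order": r["id"]}),
))
db.insert("orders", {"id": 1, "qty": 5})
db.insert("orders", {"id": 2, "qty": 500})
```

Even this toy shows the danger: a single rule whose action re-raises its own event loops forever unless something like the depth limit intervenes, and nothing in the rule's text makes that obvious.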

Starburst

Original goal: build a nice playpen for whatever comes next.

Extensible "in-house". Not by users!

No one survey paper seems to capture the work they did. Best bet: "Starburst Mid-Flight: As The Dust Clears", Haas, et al., TKDE 1990

Extensibility features:

User-defined functions:

table expressions: queries or C functions
scalar functions
no dynamic linking

Rule-based Query rewrite engine

a little internal rule system
conditions and actions are C functions that check and change QGM
some nifty rule control mechanisms (rule classes, rule budgets, multiple conflict res.)
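A toy version of such a rewrite engine, to show the shape of condition/action rules with a budget. The "QGM" here is a plain dictionary, the three rules are made up, and conflict resolution is simply list order; none of this reflects Starburst's actual data structures or rule set.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RewriteRule:
    name: str
    condition: Callable[[dict], bool]   # checks the query representation
    action: Callable[[dict], None]      # changes it in place


def rewrite(qgm: dict, rules: List[RewriteRule], budget: int) -> List[str]:
    # Fire applicable rules until nothing changes or the budget runs out.
    fired: List[str] = []
    progress = True
    while progress and budget > 0:
        progress = False
        for rule in rules:               # conflict resolution: list order
            if budget > 0 and rule.condition(qgm):
                rule.action(qgm)
                fired.append(rule.name)
                budget -= 1
                progress = True
    return fired


def dedupe(q: dict) -> None:
    q["predicates"] = list(dict.fromkeys(q["predicates"]))


def drop_true(q: dict) -> None:
    q["predicates"] = [p for p in q["predicates"] if p != "true"]


def drop_distinct(q: dict) -> None:
    q["distinct"] = False


RULES = [
    RewriteRule("dedupe_predicates",
                lambda q: len(set(q["predicates"])) < len(q["predicates"]),
                dedupe),
    RewriteRule("drop_true", lambda q: "true" in q["predicates"], drop_true),
    RewriteRule("drop_distinct",
                lambda q: q["distinct"] and q["output_has_key"],
                drop_distinct),
]


def make_query() -> dict:
    return {"predicates": ["a > 1", "true", "a > 1"],
            "distinct": True, "output_has_key": True}


full = make_query()
fired_full = rewrite(full, RULES, budget=10)

limited = make_query()
fired_limited = rewrite(limited, RULES, budget=1)
```

The budget turns rewriting into a best-effort step: with `budget=1` the query is only partly simplified, but it is still a valid query, which is what makes a budget safe to impose.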

Rule-based query optimizer

specify bottom-up combinations as a grammar ("backward-chaining rules")

Extensible access methods (as in POSTGRES)
"Attachments": routines to be automatically called before/after dealing with an access method

used by Starburst Rule System to generate transition logs
used to implement pre-computed joins (see below)
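A sketch of the attachment idea: hooks that the storage layer calls automatically around each modification. The `Attachment`, `TransitionLog` and `Table` classes are hypothetical and only model insert and delete on an in-memory list; the transition log mirrors the rule-system use mentioned above.

```python
from typing import List, Tuple


class Attachment:
    # Default hooks do nothing; subclasses override what they need.
    def before_insert(self, table, row): pass
    def after_insert(self, table, row): pass
    def before_delete(self, table, row): pass
    def after_delete(self, table, row): pass


class TransitionLog(Attachment):
    """Records every change, as a rule system might want."""

    def __init__(self) -> None:
        self.entries: List[Tuple[str, dict]] = []

    def after_insert(self, table, row):
        self.entries.append(("insert", dict(row)))

    def after_delete(self, table, row):
        self.entries.append(("delete", dict(row)))


class Table:
    def __init__(self) -> None:
        self.rows: List[dict] = []
        self.attachments: List[Attachment] = []

    def attach(self, a: Attachment) -> None:
        self.attachments.append(a)

    def insert(self, row: dict) -> None:
        for a in self.attachments:
            a.before_insert(self, row)
        self.rows.append(row)
        for a in self.attachments:
            a.after_insert(self, row)

    def delete(self, row: dict) -> None:
        if row not in self.rows:
            return
        for a in self.attachments:
            a.before_delete(self, row)
        self.rows.remove(row)
        for a in self.attachments:
            a.after_delete(self, row)


t = Table()
log = TransitionLog()
t.attach(log)
t.insert({"id": 1})
t.insert({"id": 2})
t.delete({"id": 1})
```

The caller of `insert` and `delete` never mentions the log; once attached, it is maintained on every path that modifies the table.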

"Proofs" of Extensibility:

added "signature" attachment to automatically tag tuples with some derived values

though writing the attachment was hard, hooking it in took only 1 day

added Outer Join

not so simple: required adding things in QGM, optimizer, and executor

IMS attachment

pre-computed joins using pointers, which are maintained
written by "outsiders" (Mike Carey & Beau Shekita from Wisconsin) in a summer visit to Almaden

2nd Rule System called "Alert", based on infinitely-running queries. Lightweight and pretty effective.
Recursive query processing!

Many of the Starburst folks took a "vacation" from research, and merged Starburst technology into DB2 UDB.