Extensibility & Object-Relational Systems
Extensibility: The Ultimate API Design Challenge
Elegance of extensibility API drives both Performance and Functionality
Q: What's the difference between an API and a language?
Aside: Program as Proof
- A good API is a minimal, sound, complete set of axioms
- Minimality must include performance goals! i.e. functionally specify
your performance needs.
- Example of non-minimal specification: Nemesis.
- Nemesis folks said "servers cause QoS crosstalk, so we won't have servers"
- Real minimal goal was to have accurate accounting to do proportional-share allocation
- They assumed that per-process accounting was a given, rather than designing
an extensible accounting API.
- Getting the right API is perhaps the most tricky part of systems design.
Note the top-down design philosophy!
- Not: how do I wedge in both features X and X' (too specific)
- Yet: such thinking always seems to come up (or you end up over-generalizing)
- Basic tension in extensible systems: features vs. elegance!
- This becomes a matter of balancing taste/art/maintainability against the desire
to satisfy a short-term need (prove you can do X for marketing reasons -- happens
in research a lot!) or to satisfy a specific customer (who will pay $$ tomorrow
for X).
Related note: the typical SW engineering cycle is top-down:
- Brainstorm
- Approval for business reasons
- Functional spec followed by review
- Design spec followed by review (and perhaps redo functional spec)
- Code
- Test (...code...test...code...test...etc.)
Extensible DBMS Motivation & Politics
In the early 80’s, it became clear that Relational systems were not
powerful enough for non-administrative data-intensive applications of the
day:
Can roll the logic into the app, but there are two key problems:
- would like function-shipping rather than data-shipping
- impedance mismatch
Two buzz-phrases began to emerge: "Object-Oriented" and "Extensible"
Databases
Much vision & politics ensued:
- Various data models (NF2, ER, Functional, Semantic)
- Object-Oriented DB System Manifesto (OO-ness).
- Third-Generation DB System Manifesto (Extensibility)
- Many query languages proposed
Systems were built, companies started, etc.
By now, the dust of battle has cleared:
- Small market in "Object Oriented DBMSs" (e.g., persistent programming languages
  -- EXODUS, ObjectStore, Objectivity, Versant, etc.)
- Big market in "Object-Relational DBMSs" -- everybody's latest RDBMS version
  has some of these features (except MS). Query-based systems with
  OO features (e.g. Starburst, Postgres, Illustra, Informix & Oracle
  "Universal Servers", DB/2 UDB)
- Essentially nobody does both well (though O2 got closest).
Object-relational won in the marketplace, and the remaining OODB companies
are struggling to redefine themselves (ObjectStore is now trying to sell an XML
database engine!)
Systems History
Influential research systems:
- Object-Relational: Starburst (IBM Almaden, now in DB2/UDB) and POSTGRES
  (Berkeley, now in Informix UDO)
- Object-Oriented: Gemstone (Maier at OGI), EXODUS (Wisconsin), Genesis (Texas),
  Thor (MIT), SHORE (Wisconsin)
- Others include O2 (Altair), ORION (MCC), Iris (HP)
Today: focus on POSTGRES, discuss Starburst briefly.
Object-Relational Systems
Informix's buzzphrase for extensible relational systems (courtesy Roger
Sippl). Didn't patent the name!
Query-based, extensible systems with some OO features like inheritance
and OIDs.
SQL99 is very much like this.
Stonebraker’s application matrix:
         | Simple Data | Complex Data
Query    | RDBMS       | ORDBMS
No Query | File Sys.   | OODBMS
Argues that the upper right is growing, and will engulf upper left and
lower right.
Hot Applications:
- Time-series data
- "Asset Management": i.e. multimedia data
- GIS
Players:
- Informix Universal DB (head of Illustra, body of Informix). First to market,
  but Informix stumbled. Via multiple M&As, Informix now owns O2
  too (via Ardent).
- IBM DB2 UDB (head of Starburst, body of DB2). Slower, very strong.
- Oracle 8i. Shipping now. Low-tech approach.
- NCR (Teradata) bought Wisconsin's Paradise ORDBMS (and DeWitt/Naughton/students)
- Other big R vendors still late to the game (Sybase, Tandem, MS, etc.)
- O vendors still not running queries
Overview: Things needed in an Object-Relational DBMS
(From the "Third-Generation Database System Manifesto")
The relational data model (as implemented!) is "semantically impoverished":
- fixed set of base types (integer, float, etc)
- only structuring allowed is normal-form relations
- only operations are relational algebra, using comparators from base types
Instead, people want a "semantically rich" data model:
- extensible ADTs
- complex types based on type constructors, and methods for those types
- inheritance
- "Object identity"
Stonebraker: Extensible ADTs
Seminal paper.
Idea: you should be able to add new atomic types to the system,
along with methods for the types, and new access methods.
Type is defined by:
- storage size (can be variable)
- input method
- output method
- any other methods a user wishes to provide
Then you could do standard relational processing over those types.
Example: add 2d spatial operators to RDBMS
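A minimal sketch of the type definition above, assuming a toy in-memory catalog rather than any real DBMS interface. The names `TypeDef`, `CATALOG`, `register_type` and `select_within` are hypothetical, chosen only for illustration. It registers a fixed-size 2D point type with input, output and one extra method, then runs an ordinary selection over it.

```python
import math
import struct
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class TypeDef:
    """Catalog entry for one user-defined atomic type (hypothetical layout)."""
    name: str
    storage_size: Optional[int]           # bytes; None would mean variable-length
    input_fn: Callable[[str], bytes]      # external text -> stored bytes
    output_fn: Callable[[bytes], str]     # stored bytes -> external text
    methods: Dict[str, Callable[..., Any]]


CATALOG: Dict[str, TypeDef] = {}


def register_type(typedef: TypeDef) -> None:
    """Add a type to the (toy) system catalog."""
    CATALOG[typedef.name] = typedef


# --- user-supplied code for a 2D point type ---
def point_in(text: str) -> bytes:
    x, y = (float(part) for part in text.strip("() ").split(","))
    return struct.pack("<dd", x, y)       # two little-endian doubles = 16 bytes


def point_out(raw: bytes) -> str:
    x, y = struct.unpack("<dd", raw)
    return f"({x:g},{y:g})"


def point_distance(a: bytes, b: bytes) -> float:
    ax, ay = struct.unpack("<dd", a)
    bx, by = struct.unpack("<dd", b)
    return math.hypot(ax - bx, ay - by)


register_type(TypeDef("point", 16, point_in, point_out,
                      {"distance": point_distance}))


def select_within(rows: List[dict], column: str, center: str, radius: float) -> List[dict]:
    """Plain relational selection whose predicate calls a user-defined method."""
    point = CATALOG["point"]
    center_raw = point.input_fn(center)
    distance = point.methods["distance"]
    return [row for row in rows if distance(row[column], center_raw) <= radius]


point = CATALOG["point"]
cities = [
    {"name": "a", "loc": point.input_fn("(0,0)")},
    {"name": "b", "loc": point.input_fn("(3,4)")},
    {"name": "c", "loc": point.input_fn("(10,10)")},
]
near = select_within(cities, "loc", "(0,0)", 5.0)
```

The selection code knows nothing about points beyond what the catalog tells it, which is the sense in which the new type rides on standard relational processing.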
Engineering issues (some from the paper, some not):
- Parsing
  - must know about user-defined types and methods (table-driven)
- Optimization
  - must be able to compute selectivity for user-defined predicates
  - must know about cost of user-defined methods, and consider predicate pullup
  - must know how to match user-defined predicates to fancy new access methods
  - must know whether user-defined join predicates can be evaluated by hash
    or merge
- Execution:
  - must have dynamic linking (24x7 operation)
  - methods called via "function pointers", "functors" or some similar construct
  - support for "untrusted" functions (hot topic these days)
  - support for "large objects"
  - caching for expensive methods (a la subquery caching)
  - user-definable aggregations
    - 3 functions: init, iter, end
- Access Methods
  - Stonebraker: An access method is a generic object that provides
    - open, get-first, get-next, close, insert, delete
    - if it’s fancy, it takes SARGs and evaluates them quickly
    - needs to provide cost estimates to optimizer
  - Problems
    - integration with CC
      - can be solved by physical logging
      - can open up logging interface for AM-specific logging
    - integration with Buffer Manager
  - In practice, almost nobody used the Postgres Access method extensibility.
    Similar lack of buy-in to Informix, DB2's and Oracle's extension mechanisms.
    Wrong API (5-star wizards required)!
  - GiST!?
- Buffer Management
  - Large objects require new schemes
- Transactions
  - integration with access methods
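The generic access method interface listed above (open, get-first, get-next, close, insert, delete, plus SARGs and a cost estimate) can be sketched as an abstract class. This is an illustration only, not the Postgres API: `AccessMethod`, `SortedListAM` and the (low, high) range form of a SARG are assumptions, and concurrency control, logging and buffer management are deliberately left out.

```python
import bisect
from abc import ABC, abstractmethod
from typing import Any, List, Optional, Tuple

Entry = Tuple[Any, Any]   # (key, record id)
Sarg = Tuple[Any, Any]    # assumed SARG form: inclusive (low, high) key range


class AccessMethod(ABC):
    """The generic operations from the notes, plus a cost hook for the optimizer."""

    @abstractmethod
    def open(self, sarg: Optional[Sarg] = None) -> None: ...

    @abstractmethod
    def get_first(self) -> Optional[Entry]: ...

    @abstractmethod
    def get_next(self) -> Optional[Entry]: ...

    @abstractmethod
    def close(self) -> None: ...

    @abstractmethod
    def insert(self, key: Any, rid: Any) -> None: ...

    @abstractmethod
    def delete(self, key: Any, rid: Any) -> None: ...

    @abstractmethod
    def estimate_cost(self, sarg: Optional[Sarg] = None) -> float: ...


class SortedListAM(AccessMethod):
    """Toy ordered index: a sorted list standing in for a B-tree."""

    def __init__(self) -> None:
        self._entries: List[Entry] = []
        self._start = self._pos = self._end = 0

    def _bounds(self, sarg: Optional[Sarg]) -> Tuple[int, int]:
        # Evaluate the SARG by binary search instead of filtering every entry.
        if sarg is None:
            return 0, len(self._entries)
        keys = [key for key, _ in self._entries]
        return bisect.bisect_left(keys, sarg[0]), bisect.bisect_right(keys, sarg[1])

    def open(self, sarg: Optional[Sarg] = None) -> None:
        self._start, self._end = self._bounds(sarg)
        self._pos = self._start

    def get_first(self) -> Optional[Entry]:
        self._pos = self._start
        return self.get_next()

    def get_next(self) -> Optional[Entry]:
        if self._pos >= self._end:
            return None
        entry = self._entries[self._pos]
        self._pos += 1
        return entry

    def close(self) -> None:
        self._start = self._pos = self._end = 0

    def insert(self, key: Any, rid: Any) -> None:
        bisect.insort(self._entries, (key, rid))

    def delete(self, key: Any, rid: Any) -> None:
        self._entries.remove((key, rid))   # raises ValueError if absent

    def estimate_cost(self, sarg: Optional[Sarg] = None) -> float:
        # Stand-in cost: number of entries the scan would return.
        low, high = self._bounds(sarg)
        return float(high - low)


am = SortedListAM()
for rid, key in enumerate([30, 10, 20, 40]):
    am.insert(key, rid)

am.open(sarg=(15, 35))
matches = []
entry = am.get_first()
while entry is not None:
    matches.append(entry)
    entry = am.get_next()
am.close()
```

Even this toy shows where the hard parts hide: nothing here says what happens if an insert arrives during an open scan, which is exactly the CC and logging integration listed under Problems.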
Note: Extensible ADTs do not fundamentally change a relational system
- they fit naturally with the relational model
- as Stonebraker shows, they fit naturally into RDBMS query processing
- for relational "believers" this is all you should need to solve all problems
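One concrete case of that natural fit is the three-function aggregate interface (init, iter, end) mentioned under Execution. The sketch below assumes a toy registry; `Aggregate`, `define_aggregate` and `run_aggregate` are hypothetical names, not any system's API. The executor loop is written once and never changes when a new aggregate is added.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable


@dataclass
class Aggregate:
    init_fn: Callable[[], Any]            # produce the initial state
    iter_fn: Callable[[Any, Any], Any]    # fold one value into the state
    end_fn: Callable[[Any], Any]          # turn the final state into a result


AGGREGATES: Dict[str, Aggregate] = {}


def define_aggregate(name: str, init_fn, iter_fn, end_fn) -> None:
    AGGREGATES[name] = Aggregate(init_fn, iter_fn, end_fn)


def run_aggregate(name: str, values: Iterable[Any]) -> Any:
    """Generic executor loop: identical for built-in and user-defined aggregates."""
    agg = AGGREGATES[name]
    state = agg.init_fn()
    for value in values:
        state = agg.iter_fn(state, value)
    return agg.end_fn(state)


# A familiar aggregate expressed through the three hooks.
define_aggregate(
    "avg",
    init_fn=lambda: (0.0, 0),
    iter_fn=lambda state, v: (state[0] + v, state[1] + 1),
    end_fn=lambda state: state[0] / state[1] if state[1] else None,
)


# A user-defined aggregate that SQL does not offer out of the box.
def top2_iter(state, v):
    best, second = state
    if best is None or v > best:
        return (v, best)
    if second is None or v > second:
        return (best, v)
    return state


define_aggregate(
    "second_largest",
    init_fn=lambda: (None, None),
    iter_fn=top2_iter,
    end_fn=lambda state: state[1],
)

values = [3, 9, 4, 7]
average = run_aggregate("avg", values)
runner_up = run_aggregate("second_largest", values)
```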
POSTGRES
Stonebraker, Rowe, a few staff and many students, 1986-1994. Post-INGRES.
The Postgres Data Model
- Co-opt the OO terminology
  - class = relation
  - instance = tuple
  - object-id = tuple-id
  - method = attribute or function of attributes
- Support extensible ADTs
  - extensible procedures using C functions
  - binary operators, which interface to extensible AM
- Support type constructors
  - trick: use queries
  - columns can be parameterized Postquel functions (returns setof, or tuple)
  - queries can live in fields of a tuple (returns setof or tuple)
  - another exploitation of the view paradigm!
  - these derived objects can optionally be cached (never implemented)
  - nested-dots used to traverse complex object structures
  - leverages EXISTING techniques for relational processing. Cute!
  - added array support directly
- added class inheritance (gives method inheritance and collection hierarchies)
Implementation Details
- originally written in LISP, then ran Lisp2C, resulting in a horrible built-in
  inheritance mechanism over C
- only access methods added to Postgres were done "in house"
  - B-tree and R-tree early
  - linear hashing late
  - GiST added in the last couple of years
- "Fast Path" to AMs, as an alternative to "Persistent X"
  - never well-documented or used outside Berkeley -- difference between an
    extensibility API and a hack.
- ADTs as described above
- No Overwrite Storage, time travel, etc
- Research project was "shut down" in 1994. 2 Berkeley students did a
  major cleanup (remove lispisms, remove a number of theses), migrated to
  SQL, and released Postgres95. This was picked up by freeware hackers
  on the net, and now PostgreSQL seems to be the serious freeware db of choice
  (www.postgresql.org).
Postgres Rule System: Active Database support.
- One of the recurring themes of extensible DBMS work. (And shows up
  in limited ways in OS/Nets and they don't know it -- keep your eyes peeled!)
- Quick primer on rule systems:
  - An old AI idea (Production Rule Systems like OPS5)
  - Active DB version: Event-Condition-Action (ECA) rules
    - Events: read, insert, update
    - Conditions: arbitrary queries that return true.
    - Actions: more queries.
  - Many semantic issues: what to do when 2 rules are triggered (conflict resolution),
    what are the transaction semantics (immediate, nested, deferred)
  - Many implementation alternatives: locks, query rewrite, execution
- Current DBMSs all support a limited version of this (via triggers),
  customers want more, more, more.
- But AI people know that rule systems were a failure.
- A few rules go a long way?
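A toy ECA engine makes the semantic issues in the primer concrete. This is a sketch under stated assumptions, not how Postgres implemented rules: the `Rule` and `Database` classes are hypothetical, only insert events are modeled, conflict resolution is by a numeric priority, actions run immediately, and a depth limit stands in for a real policy on cascading rules.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class Rule:
    name: str
    event: str                                     # e.g. "insert:orders"
    condition: Callable[["Database", Row], bool]   # a query that returns true/false
    action: Callable[["Database", Row], None]      # more queries
    priority: int = 0                              # assumed conflict-resolution key


class Database:
    def __init__(self, max_depth: int = 8) -> None:
        self.tables: Dict[str, List[Row]] = {}
        self.rules: List[Rule] = []
        self.max_depth = max_depth
        self._depth = 0

    def add_rule(self, rule: Rule) -> None:
        self.rules.append(rule)

    def insert(self, table: str, row: Row) -> None:
        self.tables.setdefault(table, []).append(row)
        self._fire(f"insert:{table}", row)         # immediate semantics

    def _fire(self, event: str, row: Row) -> None:
        if self._depth >= self.max_depth:
            raise RuntimeError("rule cascade too deep")
        self._depth += 1
        try:
            # All conditions are checked first, then actions run by priority.
            triggered = [r for r in self.rules
                         if r.event == event and r.condition(self, row)]
            for rule in sorted(triggered, key=lambda r: -r.priority):
                rule.action(self, row)
        finally:
            self._depth -= 1


db = Database()
db.add_rule(Rule(
    name="audit_big_orders",
    event="insert:orders",
    condition=lambda d, r: r["qty"] > 100,
    action=lambda d, r: d.insert("audit", {"order": r["id"]}),
    priority=10,
))
db.add_rule(Rule(
    name="log_every_order",
    event="insert:orders",
    condition=lambda d, r: True,
    action=lambda d, r: d.insert("log", {"seen": r["id"]}),
    priority=1,
))
db.insert("orders", {"id": 1, "qty": 5})
db.insert("orders", {"id": 2, "qty": 500})
```

Because actions are themselves inserts, one rule can trigger another. A rule that inserts into the table it watches would loop forever without the depth limit, a small taste of why rule semantics get hard.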
Other Postgres stuff
- Shared-mem parallel version, with new optimization techniques
- "Inversion" file system
- PICASSO UI (Rowe & students)
- Support for tertiary memory
- Method Indexing
- Partial Indexes
- Expensive predicate optimization
- Commercialized as Illustra
- Concepts ported into Informix
Editorial Comments
- Postgres code was a mess that got partially cleaned up over time.
- Data model was sloppy but clever. Burst some OO bubbles.
  - Roger King: "My cat is object-oriented"
  - Missed an important distinction between Class and Collection.
- No-overwrite storage was not as bad as you think. Expect the concept to
  resurface.
- Successful research project
  - written off as goofy research, then drove the whole DB industry forward
    (argument for tech transfer via startup)
  - ADT extensibility & dynamic linking very useful. Nested object
    stuff less useful (though XML may drive it finally)
  - Many radical ideas in one system! Not all worked, but ambitious.
Starburst
Original goal: build a nice playpen for whatever comes next.
Extensible "in-house". Not by users!
No one survey paper seems to capture the work they did. Best bet: "Starburst
Mid-Flight: As The Dust Clears", Haas, et al., TKDE 1990
Extensibility features:
- User-defined functions:
  - table expressions: queries or C functions
  - scalar functions
  - no dynamic linking
- Rule-based Query rewrite engine
  - a little internal rule system
  - conditions and actions are C functions that check and change QGM
  - some nifty rule control mechanisms (rule classes, rule budgets, multiple
    conflict-resolution schemes)
- Rule-based query optimizer
  - specify bottom-up combinations as a grammar ("backward-chaining rules")
- Extensible access methods (as in POSTGRES)
- "Attachments": routines to be automatically called before/after dealing
  with an access method
  - used by Starburst Rule System to generate transition logs
  - used to implement pre-computed joins (see below)
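The rewrite engine's shape (condition and action functions over a query graph, grouped into rule classes and limited by a budget) can be sketched as follows. Everything here is an assumption made for illustration: `QueryBox` is a drastically simplified stand-in for a QGM box, the two rules are textbook rewrites rather than Starburst's actual rule set, and Python functions replace the C functions mentioned above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class QueryBox:
    """Toy stand-in for one QGM box (single-table SELECT)."""
    columns: List[str]
    predicates: List[str]
    distinct: bool = False
    key_columns: List[str] = field(default_factory=list)


@dataclass
class RewriteRule:
    name: str
    condition: Callable[[QueryBox], bool]   # checks the query graph
    action: Callable[[QueryBox], None]      # changes the query graph


def has_duplicate_predicates(box: QueryBox) -> bool:
    return len(set(box.predicates)) < len(box.predicates)


def remove_duplicate_predicates(box: QueryBox) -> None:
    box.predicates = list(dict.fromkeys(box.predicates))   # keeps first occurrences


def distinct_is_redundant(box: QueryBox) -> bool:
    # DISTINCT adds nothing if the output already contains a key of the table.
    return (box.distinct and bool(box.key_columns)
            and set(box.key_columns) <= set(box.columns))


def drop_distinct(box: QueryBox) -> None:
    box.distinct = False


RULE_CLASSES: Dict[str, List[RewriteRule]] = {
    "predicate_cleanup": [
        RewriteRule("dedupe_predicates", has_duplicate_predicates,
                    remove_duplicate_predicates),
    ],
    "distinct_elimination": [
        RewriteRule("drop_redundant_distinct", distinct_is_redundant, drop_distinct),
    ],
}


def rewrite(box: QueryBox, class_order: List[str], budget: int) -> List[str]:
    """Fire rules class by class until nothing applies or the budget runs out."""
    fired: List[str] = []
    for rule_class in class_order:
        progress = True
        while progress and budget > 0:
            progress = False
            for rule in RULE_CLASSES[rule_class]:
                if budget > 0 and rule.condition(box):
                    rule.action(box)
                    fired.append(rule.name)
                    budget -= 1
                    progress = True
    return fired


box = QueryBox(columns=["id", "name"],
               predicates=["age > 30", "age > 30", "dept = 5"],
               distinct=True, key_columns=["id"])
fired = rewrite(box, ["predicate_cleanup", "distinct_elimination"], budget=10)
```

The budget caps the time spent rewriting: with `budget=1` only the first rule would fire and the query would go to the optimizer partially rewritten but still correct.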
"Proofs" of Extensibility:
- added "signature" attachment to automatically tag tuples with some derived
  values
  - though writing the attachment was hard, hooking it in took only 1 day
- added Outer Join
  - not so simple -- required adding things in QGM, optimizer, and executor
- IMS attachment
  - pre-computed joins using pointers, which are maintained
  - written by "outsiders" -- Mike Carey & Beau Shekita from Wisconsin --
    in a summer visit to Almaden
- 2nd Rule System called "Alert", based on infinitely-running queries. Lightweight
  and pretty effective.
- Recursive query processing!
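To make the attachment idea concrete, here is a rough sketch of hooks that run automatically before and after a storage operation. The `Table` class, the hook signature and the choice of a truncated SHA-256 as the "signature" are all assumptions for illustration; the notes do not say what Starburst's signature attachment actually computed.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Attachment = Callable[[str, Row], None]    # called with (operation, row)


class Table:
    """Toy storage object that calls attachments around each insert."""

    def __init__(self) -> None:
        self.rows: List[Row] = []
        self.before: List[Attachment] = []
        self.after: List[Attachment] = []

    def attach(self, fn: Attachment, when: str = "before") -> None:
        (self.before if when == "before" else self.after).append(fn)

    def insert(self, row: Row) -> None:
        for fn in self.before:
            fn("insert", row)              # may modify the row before it is stored
        self.rows.append(row)
        for fn in self.after:
            fn("insert", row)


def signature_attachment(columns: List[str]) -> Attachment:
    """Tag each inserted row with a value derived from the given columns."""
    def tag(op: str, row: Row) -> None:
        if op == "insert":
            payload = "|".join(str(row[c]) for c in columns)
            row["signature"] = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:8]
    return tag


def transition_log_attachment(log: List[Tuple[str, Row]]) -> Attachment:
    """Record every change, the kind of log a rule system could consume."""
    def record(op: str, row: Row) -> None:
        log.append((op, dict(row)))        # copy so later edits do not alter the log
    return record


people = Table()
transition_log: List[Tuple[str, Row]] = []
people.attach(signature_attachment(["name", "city"]), when="before")
people.attach(transition_log_attachment(transition_log), when="after")
people.insert({"name": "ann", "city": "sf"})
```

Hooking in an attachment is a single `attach` call, while the attachment body carries all the difficulty, which matches the experience reported above (hard to write, a day to hook in).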
Many of the Starburst folks took a "vacation" from research, and merged
Starburst technology into DB2 UDB.