Extensibility & Object-Relational Systems
DBMS vs. OS
Extensibility
OS extensibility raises the frightening question "what is an operating
system?"
-
is it UNIX with hooks to allow control
over certain policies?
-
is it a hardware abstraction with the
minimum functionality to do protection
and accounting? Why is that "extensibility", and not "minimality"?
-
David Boies and Bill Gates did little to clarify the question.
-
"If you have to ask, you'll never know." -- Louis Armstrong, when asked "What
is Jazz?"
In the end, OS extensibility is hard to pin down because OSes have no functional
spec.
By contrast, relational databases had a clear functional spec, SQL.
After SQL's initial success, there was interest in bringing database functionality
to new apps whose needs the current spec did not meet. This all got tied up in
a war about object-orientedness (the DB community's history of models and
languages became necessarily entwined with the system issues; OO-ness
was hot back then). This resulted
in two research thrusts:
-
Object-Relational DBMSs: Inject more functionality "through" the
relational spec, while keeping things declarative and high-performance.
-
Object-Oriented DBMSs: Take an OO programming language, add persistence
and transactions, and some declarative features. ("Making Smalltalk
a Database System", Copeland & Maier, SIGMOD 1984.)
Extensibility: The Ultimate API Design Challenge
Elegance of extensibility API drives both Performance and Functionality
Q: What's the difference between an API and a language?
Program as Proof, API as axioms
-
A good API is a minimal, sound, complete set of axioms
-
Minimality must include performance goals! i.e. functionally specify your
performance needs.
-
Example of non-minimal specification: Nemesis.
-
Nemesis folks said "servers cause QoS crosstalk, so we won't have servers"
-
Real minimal goal was to have accurate accounting to do proportional-share
allocation
-
They assumed that per-process accounting was a given, rather than designing
an extensible accounting API.
-
Getting the right API is perhaps the most tricky part of systems design.
Note the top-down design philosophy!
-
What are the basic hooks needed to do all variants of feature X that one
could imagine?
-
or more fundamentally, what is feature X really about, in a fundamental,
variant-independent way?
-
a wonderful question to ask -- the best part of doing extensibility research
is that you clean up ideas.
-
Not: how do I wedge in variants X1 and X2 (too specific, bottom-up)
-
Yet: such thinking always seems to come up
Basic engineering tension in extensible systems: flexibility vs. elegance
-
Overgeneralizing drives you to a meaningless lowest common denominator. The Turing machine
is NOT a nice extensibility API.
-
"Things should be made as simple as possible -- but no simpler"
-- Einstein
Why isn't extensibility more popular in industry?
-
Balancing taste/evolvability/manageability against the desire to satisfy
a short-term need (prove you can do X for marketing reasons -- happens
in research a lot!) or to satisfy a specific customer (who will pay $$
tomorrow for X).
-
Big systems have very few "users" who are knowledgeable enough to deal with
extensibility features. As a result, each such user is irrelevant
unless they're huge. And if they're huge, they merit a custom hack.
-
Extensibility is therefore hard to market and hard to monetize... a long-term
investment.
What's Wrong with Relational?
In the early 80’s, it became clear that relational systems were not powerful
enough for the non-administrative, data-intensive applications of the day.
Can roll the logic into the app, but there are two key problems:
-
would like function-shipping rather than data-shipping
-
impedance mismatch
The relational data model (as implemented!) is "semantically impoverished":
-
fixed set of base types (integer, float, etc)
-
only structuring allowed is normal-form relations
-
only operations are relational algebra, using comparators from base types
Instead, people want a "semantically rich" data model
-
extensible ADTs
-
complex types based on type constructors, and methods for those types
-
inheritance
-
"Object identity"
Much vision & politics ensued:
-
Various data models (NF2, ER, Functional, Semantic)
-
Object-Oriented DB System Manifesto (OO-ness).
-
Third-Generation DB System Manifesto (Extensibility)
-
Many query languages proposed
Systems were built, companies started, etc.
By now, the dust of battle has cleared:
-
Small market in "Object Oriented DBMSs" (e.g., persistent programming languages
-- EXODUS, ObjectStore, Objectivity, Versant, etc.)
-
Big market in "Object-Relational DBMSs" -- everybody's latest RDBMS version
has some of these features (except MS). Query-based systems with
OO features (e.g. Starburst->DB2, Postgres->Illustra->Informix, Oracle)
-
Essentially nobody does both well (though O2 got closest).
Or has it? XML looks kind of like a (sloppy, embarrassing) nested
object representation, and XPath/XSL looks like the declarative features
in some of the OODBs. ObjectStore (Excelon) and Poet now trying to
sell XML engines!
Systems History
Influential research systems:
-
Object-Relational: Starburst (IBM Almaden, now in DB2) and POSTGRES (Berkeley,
now in Informix)
-
Object-Oriented: Gemstone (Maier at OGI), EXODUS (Wisconsin), Genesis (Texas),
Thor (MIT), SHORE (Wisconsin)
-
Others include ObjectStore (Object Design), O2 (Altair), ORION (MCC), Iris
(HP)
Today: focus on POSTGRES, discuss Starburst briefly.
Object-Relational Systems
Informix's buzzphrase for extensible relational systems (courtesy Roger
Sippl). Didn't patent the name!
Query-based, extensible systems with some OO features like ADTs, inheritance,
complex types, OIDs.
SQL99 is very much like this.
Stonebraker’s application matrix:
             | Simple Data | Complex Data |
Query        | RDBMS       | ORDBMS       |
No Query     | File Sys.   | OODBMS       |
Argues that the upper right is growing, and will engulf upper left and
lower right.
Some Major Research Challenges in Obj-Rel Extensibility
ADTs and efficient declarative queries -- suddenly the semantics are
encapsulated even to the DBMS! Major theme of Postgres. Many
characteristic extensible API issues -- what MUST you know, what can you
generalize without knowledge of specifics?
-
How to efficiently index opaque types (recall that indexing = storage =
CC&R!)
-
How to optimize queries with opaque methods
-
expensive methods
-
new algebraic operators and transformations (extensible optimizers)
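A minimal sketch of one way an optimizer might order expensive, opaque predicates. It uses the rank metric from the expensive-predicate literature, rank = (selectivity - 1) / cost, applied in ascending order; this heuristic is swapped in for illustration and is not spelled out in these notes. It assumes predicates are independent, and the `Predicate` class, the predicate names, and all the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    name: str           # label only, e.g. a user-defined method call
    selectivity: float  # estimated fraction of tuples that pass (0..1)
    cost: float         # estimated per-tuple evaluation cost (> 0)

    @property
    def rank(self) -> float:
        # More negative rank = filters more tuples per unit of cost.
        return (self.selectivity - 1.0) / self.cost


def order_predicates(preds):
    """Return the predicates sorted by ascending rank."""
    return sorted(preds, key=lambda p: p.rank)


def expected_cost(preds):
    """Expected cost per input tuple of applying preds in the given order."""
    total, surviving = 0.0, 1.0
    for p in preds:
        total += surviving * p.cost   # only surviving tuples pay for p
        surviving *= p.selectivity
    return total


# Hypothetical predicates: one cheap built-in, two opaque methods.
preds = [
    Predicate("looks_like(img, 'sunset')", selectivity=0.1, cost=1000.0),
    Predicate("age > 30", selectivity=0.5, cost=1.0),
    Predicate("overlaps(region, box)", selectivity=0.2, cost=50.0),
]
ordered = order_predicates(preds)
```

The point for API design: the optimizer needs only two numbers per opaque method (a selectivity estimate and a per-call cost) to do this, without knowing anything about what the method computes.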
Complex Objects and efficiency. A mix of extensibility features with
"more hard problems in query optimization and indexing".
How to teach optimizer about new operators and their legal transformations.
(Exodus/Volcano, Starburst).
How to efficiently process queries on nested sets, arrays, etc.
How to do data modeling in this world.
Applications:
-
Time-series data
-
Non-real-time multimedia data (digital assets, fingerprints, signatures,
check images, etc.)
-
GIS
-
Complex web stuff (though got to market too slowly!)
-
Scientific apps
Players:
-
Informix (merged Illustra's extensibility features, + GiST??). First to
market, but Informix stumbled. Via multiple M&As, Informix now
owns O2 too (via Ardent).
-
IBM DB2. Slower to ship, very strong.
-
Oracle 8i & 9. Shipping now. Low-tech loose integration approach
solves the marketing problem, works for the low end.
-
NCR (Teradata) bought Wisconsin's Paradise ORDBMS (and DeWitt/Naughton/students).
Seem to have killed it.
-
Other big R vendors still late to the game (Sybase, Tandem, MS, etc.)
-
O vendors still not running queries, fleeing the field of battle.
Can they fight Oracle et al. on an XML front? (Doubt it!)
Stonebraker: Extensible ADTs
Seminal paper.
Idea: you should be able to add new atomic types to the system,
along with methods for the types, and new access methods.
Type is defined by:
-
storage size (can be variable)
-
input method
-
output method
-
any other methods a user wishes to provide
Then you could do standard relational processing over those types.
Example: add 2d spatial operators to RDBMS
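A rough sketch of the type definition above (storage size, input method, output method, plus any other methods), using a 2-D box as the new atomic type. Everything here is illustrative: `TypeDef`, `CATALOG`, `register_type`, `select`, and the box text format are hypothetical names and choices, not the interface from the paper or from any real system.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class TypeDef:
    name: str
    size: Optional[int]                 # bytes; None means variable length
    input_fn: Callable[[str], Any]      # external text -> internal value
    output_fn: Callable[[Any], str]     # internal value -> external text
    methods: Dict[str, Callable[..., Any]] = field(default_factory=dict)


# Table-driven catalog: the "system" only ever looks types up by name.
CATALOG: Dict[str, TypeDef] = {}


def register_type(t: TypeDef) -> None:
    CATALOG[t.name] = t


# A 2-D box stored as (xlo, ylo, xhi, yhi).
def box_in(text: str):
    xlo, ylo, xhi, yhi = (float(v) for v in text.strip("()").split(","))
    return (xlo, ylo, xhi, yhi)


def box_out(b) -> str:
    return "(" + ",".join(f"{v:g}" for v in b) + ")"


def box_overlaps(a, b) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


register_type(TypeDef("box", 32, box_in, box_out, {"overlaps": box_overlaps}))


def select(rows, column, type_name, method, arg_text):
    """Generic selection: the executor never inspects the box itself."""
    t = CATALOG[type_name]
    arg = t.input_fn(arg_text)
    fn = t.methods[method]
    return [r for r in rows if fn(r[column], arg)]


parcels = [
    {"id": 1, "extent": box_in("(0,0,2,2)")},
    {"id": 2, "extent": box_in("(5,5,6,6)")},
]
hits = select(parcels, "extent", "box", "overlaps", "(1,1,3,3)")
```

Note that `select` is ordinary relational selection; all type-specific knowledge sits behind the catalog entry.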
Engineering issues (some from the paper, some not):
-
Parsing
-
must know about user-defined types and methods (table-driven)
-
Optimization
-
must be able to compute selectivity for user-defined predicates
-
must know about cost of user-defined methods, and consider predicate pullup
-
must know how to match user-defined predicates to fancy new access methods
-
must know whether user-defined join predicates can be evaluated by hash
or merge
-
Execution:
-
must have dynamic linking (24x7 operation)
-
methods called via "function pointers", "functors" or some similar construct
-
support for "untrusted" functions (hot topic that DB people ignored to
their loss)
-
support for "large objects"
-
caching for expensive methods (a la subquery caching)
-
user-definable aggregations
-
3 functions: init, iter, end
-
Access Methods
-
Stonebraker: An access method is a generic object that provides
-
open, get-first, get-next, close, insert, delete
-
if it’s fancy, it takes SARGs and evaluates them quickly
-
needs to provide cost estimates to optimizer
-
Problems
-
integration with CC
-
can be solved by physical logging
-
can open up logging interface for AM-specific logging
-
integration with Buffer Manager
-
In practice, almost nobody used the Postgres access method extensibility.
Similar lack of buy-in for Informix's, DB2's and Oracle's extension mechanisms.
Wrong API (5-star wizards required)!
-
GiST!?
-
Buffer Management
-
Large objects require new schemes
-
Transactions
-
integration with access methods
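The generic access-method interface listed above (open, get-first, get-next, close, insert, delete, plus SARG handling and a cost estimate for the optimizer) might be sketched as follows. This is a toy under stated assumptions: the class names are hypothetical, a SARG is simplified to an inclusive `(lo, hi)` key range, and the "cost" is just the number of entries a scan would touch. Concurrency control, logging, and buffer management, which are the hard parts named above, are deliberately left out.

```python
import bisect
from abc import ABC, abstractmethod


class AccessMethod(ABC):
    """Hypothetical generic AM interface; every method must be supplied."""
    @abstractmethod
    def open(self, sarg=None): ...
    @abstractmethod
    def get_first(self): ...
    @abstractmethod
    def get_next(self): ...
    @abstractmethod
    def close(self): ...
    @abstractmethod
    def insert(self, key, tid): ...
    @abstractmethod
    def delete(self, key, tid): ...
    @abstractmethod
    def estimate_cost(self, sarg): ...


class SortedListAM(AccessMethod):
    """Toy ordered index: parallel sorted lists of keys and tuple ids."""

    def __init__(self):
        self.keys, self.tids = [], []
        self._start = self._pos = self._end = 0

    def insert(self, key, tid):
        i = bisect.bisect_right(self.keys, key)
        self.keys.insert(i, key)
        self.tids.insert(i, tid)

    def delete(self, key, tid):
        lo = bisect.bisect_left(self.keys, key)
        hi = bisect.bisect_right(self.keys, key)
        for i in range(lo, hi):
            if self.tids[i] == tid:
                del self.keys[i], self.tids[i]
                return True
        return False

    def _bounds(self, sarg):
        # Evaluate the SARG "quickly": binary search instead of filtering.
        lo, hi = sarg if sarg is not None else (None, None)
        start = 0 if lo is None else bisect.bisect_left(self.keys, lo)
        end = len(self.keys) if hi is None else bisect.bisect_right(self.keys, hi)
        return start, max(start, end)

    def open(self, sarg=None):
        self._start, self._end = self._bounds(sarg)
        self._pos = self._start

    def get_first(self):
        self._pos = self._start
        return self.get_next()

    def get_next(self):
        if self._pos >= self._end:
            return None
        item = (self.keys[self._pos], self.tids[self._pos])
        self._pos += 1
        return item

    def close(self):
        self._start = self._pos = self._end = 0

    def estimate_cost(self, sarg):
        start, end = self._bounds(sarg)
        return end - start  # entries the scan would touch


am = SortedListAM()
for tid, key in enumerate([30, 10, 20, 40]):
    am.insert(key, tid)

am.open(sarg=(15, 35))
results = []
item = am.get_first()
while item is not None:
    results.append(item)
    item = am.get_next()
am.close()
```

Even this toy shows why the API is hard to get right: the scan interface is easy, but nothing here says how the structure interacts with locking, recovery, or the buffer pool.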
Note: Extensible ADTs do not fundamentally change a relational system
-
they fit naturally with the relational model
-
as Stonebraker shows, they fit naturally into RDBMS query processing
-
for relational "believers" this is all you should need to solve all problems
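As one illustration of how such extensions slot into ordinary query processing, here is a sketch of the three-function aggregate interface (init, iter, end) mentioned under Execution. The `Aggregate` class, `run_aggregate`, and the two sample aggregates are hypothetical; a real system would call these per group inside its aggregation operator.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Aggregate:
    init: Callable[[], Any]           # produce the initial state
    iter: Callable[[Any, Any], Any]   # fold one value into the state
    end: Callable[[Any], Any]         # turn the final state into a result


def run_aggregate(agg, values):
    """What a generic aggregation operator would do for one group."""
    state = agg.init()
    for v in values:
        state = agg.iter(state, v)
    return agg.end(state)


# A built-in style aggregate: average, with (sum, count) as state.
average = Aggregate(
    init=lambda: (0.0, 0),
    iter=lambda s, v: (s[0] + v, s[1] + 1),
    end=lambda s: s[0] / s[1] if s[1] else None,
)


# A "user-defined" aggregate: second-largest value, state is (best, second).
def _top2_iter(state, v):
    best, second = state
    if best is None or v > best:
        return (v, best)
    if second is None or v > second:
        return (best, v)
    return state


second_largest = Aggregate(
    init=lambda: (None, None),
    iter=_top2_iter,
    end=lambda s: s[1],
)

avg_result = run_aggregate(average, [1, 2, 3, 6])
second_result = run_aggregate(second_largest, [5, 9, 7, 3])
```

The executor sees only opaque state passed between the three calls, so new aggregates need no change to the operator itself.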
POSTGRES
Stonebraker, Rowe, a few staff and many students, 1986-1994. Post-INGRES.
The Postgres Data Model
-
Co-opt the OO terminology
-
class = relation
-
instance = tuple
-
object-id = tuple-id
-
method = attribute or function of attributes
-
Support extensible ADTs
-
extensible procedures using C functions
-
binary "operators", which interface to extensible AM
-
Support type constructors
-
trick: use queries
-
columns can be parameterized Postquel functions (returns setof, or tuple)
-
queries can live in fields of a tuple (returns setof or tuple)
-
another exploitation of the view paradigm!
-
these derived objects can optionally be cached (never implemented)
-
nested-dots used to traverse complex object structures
-
leverages EXISTING techniques for relational processing. Cute!
-
added array support directly
-
added class inheritance (gives method inheritance and collection hierarchies)
Implementation Details
-
originally written in LISP, then ran Lisp2C, resulting in a horrible built-in
inheritance mechanism over C
-
only access methods added to Postgres were done "in house"
-
B-tree and R-tree early
-
linear hashing late
-
GiST added in the last couple years
-
"Fast Path" to AMs, as an alternative to "Persistent X"
-
never well-documented or used outside Berkeley -- difference between an
extensibility API and a hack.
-
ADTs as described above
-
No Overwrite Storage, time travel, etc
-
Research project was "shut down" in '94. Two Berkeley students did a
major cleanup (removed Lispisms, removed a number of theses), migrated to
SQL, and released Postgres95. This was picked up by freeware hackers
on the net, and now PostgreSQL seems to be the serious freeware DB of choice
(www.postgresql.org).
Postgres Rule System: Active Database support.
-
One of the recurring themes of extensible DBMS work. (And shows up
in limited ways in OS/Nets and they often don't know it -- keep your eyes
peeled! Rule systems are easy to specify and hard to reason about -- dangerous!)
-
Quick primer on rule systems:
-
An old AI idea (Production Rule Systems like OPS5 from 70's)
-
Active DB version: Event-Condition-Action (ECA) rules
-
Events: read, insert, update
-
Conditions: arbitrary queries that return true.
-
Actions: more queries.
-
Many semantic issues: what to do when 2 rules are triggered (conflict resolution),
what are the transaction semantics (immediate, nested, deferred)
-
Many implementation alternatives: locks, query rewrite, execution
-
Current DBMSs all support a limited version of this (via triggers),
customers want more, more, more.
-
But AI people know that rule systems were a failure.
-
A few rules go a long way?
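A toy sketch of Event-Condition-Action rules, to make the primer above concrete. It assumes immediate execution semantics and priority-based conflict resolution, which are only two of the many options listed; the `Rule` and `RuleEngine` classes, the event naming scheme, and the firing budget are all hypothetical and do not describe the Postgres rule system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    event: str            # here: "insert:<table>"
    condition: Callable   # (engine, row) -> bool
    action: Callable      # (engine, row) -> None; may cause more events
    priority: int = 0     # higher fires first (conflict resolution)


class RuleEngine:
    def __init__(self, max_firings=100):
        self.tables = {}
        self.rules = []
        self.fired = []                 # trace of rule names, in firing order
        self.max_firings = max_firings  # crude guard against runaway cascades

    def add_rule(self, rule):
        self.rules.append(rule)

    def insert(self, table, row):
        self.tables.setdefault(table, []).append(row)
        self._signal("insert:" + table, row)

    def _signal(self, event, row):
        candidates = [r for r in self.rules if r.event == event]
        for rule in sorted(candidates, key=lambda r: r.priority, reverse=True):
            if not rule.condition(self, row):
                continue
            if len(self.fired) >= self.max_firings:
                raise RuntimeError("firing budget exhausted; rules may not terminate")
            self.fired.append(rule.name)
            rule.action(self, row)      # immediate semantics: runs right now


engine = RuleEngine()
engine.add_rule(Rule(
    name="audit_big_orders",
    event="insert:orders",
    condition=lambda eng, row: row["qty"] > 100,
    action=lambda eng, row: eng.insert("audit", {"order_id": row["id"]}),
))
engine.add_rule(Rule(
    name="flag_customer",
    event="insert:orders",
    condition=lambda eng, row: row["qty"] > 100,
    action=lambda eng, row: eng.insert("flags", {"cust": row["cust"]}),
    priority=10,
))
engine.insert("orders", {"id": 1, "cust": "acme", "qty": 5})
engine.insert("orders", {"id": 2, "cust": "acme", "qty": 500})
```

The "easy to specify, hard to reason about" point shows up immediately: a single rule whose action re-raises its own event never terminates, which is why this sketch carries a firing budget at all.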
Other Postgres stuff
-
Shared-mem parallel version, with new optimization techniques (XPRS)
-
"Inversion" file system
-
PICASSO UI (Rowe & students)
-
Support for tertiary memory
-
Method Indexing
-
Partial Indexes
-
Expensive predicate optimization
Editorial Comments
-
Postgres code was a mess that got partially cleaned up over time.
-
Data model was sloppy but clever. Burst some OO bubbles.
-
As Roger King said: "My cat is object-oriented".
-
Missed an important distinction between Class and Collection.
-
No-overwrite storage was not as bad as you think. Expect the concept to
resurface.
-
Successful research project
-
written off as goofy research, then drove the whole DB industry forward
(another argument for tech transfer via startup)
-
ADT extensibility & dynamic linking very useful. Nested object
stuff less useful (though XML may drive it finally)
-
Many radical ideas in one system! Not all worked, but ambitious.
Starburst
Original goal: build a nice playpen for whatever comes next.
Extensible "in-house". Not by users!
No one survey paper seems to capture the work they did. Best bet: "Starburst
Mid-Flight: As The Dust Clears", Haas, et al., TKDE 1990
Extensibility features:
-
User-defined functions:
-
table expressions: queries or C functions
-
scalar functions
-
no dynamic linking
-
Rule-based Query rewrite engine
-
a little internal rule system
-
conditions and actions are C functions that check and change QGM
-
some nifty rule control mechanisms (rule classes, rule budgets, multiple
conflict res.)
-
Rule-based query optimizer
-
specify bottom-up combinations as a grammar ("backward-chaining rules")
-
Extensible access methods (as in POSTGRES)
-
"Attachments": routines to be automatically called before/after dealing
with an access method
-
used by Starburst Rule System to generate transition logs
-
used to implement pre-computed joins (see below)
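A small sketch of the rule-based rewrite idea above: conditions and actions are plain functions that check and change a query representation, with rule classes and a rule budget as control mechanisms. The nested-dict query below is only a stand-in for QGM, and `RewriteRule`, `RewriteEngine`, the three rules, and the class names "merge" and "simplify" are all hypothetical, not Starburst's actual structures.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RewriteRule:
    name: str
    rule_class: str                     # lets callers enable rules in groups
    condition: Callable[[dict], bool]   # does the rule apply to this node?
    action: Callable[[dict], dict]      # return the rewritten node


class RewriteEngine:
    def __init__(self, rules, budget):
        self.rules = rules
        self.budget = budget            # max rule firings; rewrite is optional work
        self.trace = []

    def run(self, node, classes):
        active = [r for r in self.rules if r.rule_class in classes]
        changed = True
        while changed and self.budget > 0:
            changed = False
            for rule in active:
                if self.budget > 0 and rule.condition(node):
                    node = rule.action(node)
                    self.budget -= 1
                    self.trace.append(rule.name)
                    changed = True
        if "input" in node:             # then rewrite the child, top-down
            node = dict(node, input=self.run(node["input"], classes))
        return node


def is_select(n):
    return n["op"] == "select"


RULES = [
    RewriteRule("merge_selects", "merge",
                lambda n: is_select(n) and is_select(n["input"]),
                lambda n: {"op": "select",
                           "preds": n["preds"] + n["input"]["preds"],
                           "input": n["input"]["input"]}),
    RewriteRule("drop_true_pred", "simplify",
                lambda n: is_select(n) and "TRUE" in n["preds"],
                lambda n: dict(n, preds=[p for p in n["preds"] if p != "TRUE"])),
    RewriteRule("drop_empty_select", "simplify",
                lambda n: is_select(n) and not n["preds"],
                lambda n: n["input"]),
]

query = {"op": "select", "preds": ["TRUE"],
         "input": {"op": "select", "preds": ["a > 1"],
                   "input": {"op": "scan", "table": "t"}}}

engine = RewriteEngine(RULES, budget=10)
result = engine.run(query, classes={"merge", "simplify"})
```

With a budget of 1 the same engine stops after the merge and returns a correct but less simplified query, which is the point of a budget: every intermediate form must still be a valid query.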
"Proofs" of Extensibility:
-
added "signature" attachment to automatically tag tuples with some derived
values
-
though writing the attachment was hard, hooking it in took only 1 day
-
added Outer Join
-
not so simple -- required adding things in QGM, optimizer, and executor
-
IMS attachment
-
pre-computed joins using pointers, which are maintained
-
written by "outsiders" -- Mike Carey & Beau Shekita from Wisconsin --
in a summer visit to Almaden
-
2nd Rule System called "Alert", based on infinitely-running queries. Lightweight
and pretty effective.
-
Recursive query processing!
Many of the Starburst folks took a "vacation" from research, and merged
Starburst technology into DB2 UDB.