Advanced Topics in Computer Systems
Joe Hellerstein & Eric Brewer
Generalized Search Trees
Extensible Access Methods
Access Method = {Heap files, Indexes}.
Extensible access methods are an old challenge -- goes back to Ingres.
Solution to date (Ingres, Postgres, spanking-new commercial solutions):
expose the iterator interface and some optimizer predicate-matching.
Leaves WAY too much to app writer (including concurrency and recovery logic!)
Indexing: The Big Picture
What is an Index? It's something that answers simple predicates on
a data set without requiring all the data to be examined. It entails
- A clustering of a data set: allows only a subset of the data to be examined.
- A lossy compression of the clusters: allows clusters to be chosen.
The clustering can be a strict partitioning of the data (as in B-trees,
hash indexes, R-trees, etc.) or not (as in inverted files, R+-trees, etc.)
The clusters can be a strict partitioning of the data domain (as
in B-trees, hash indexes, quad trees, etc.) or not (R-trees, etc.)
The lossy compression can be functional (e.g. a hash function) or stored
(e.g. explicitly stored predicates). Stored compressions are usually
arranged into a multi-resolution hierarchy -- a tree. (The hierarchy
can be exploited not only for search, but also for estimation and visualization,
and for concurrency control!)
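
To make the "clustering plus lossy compression" view concrete, here is a minimal sketch, assuming integer keys, fixed-size pages, and a stored (min, max) summary per page. The names build_index and range_query are hypothetical and not taken from any system discussed in these notes.

```python
from typing import List, Tuple

Page = Tuple[Tuple[int, int], List[int]]  # ((min, max) summary, clustered values)

def build_index(values: List[int], page_size: int) -> List[Page]:
    """Cluster sorted values into pages; summarize each page by its (min, max)."""
    ordered = sorted(values)
    pages = [ordered[i:i + page_size] for i in range(0, len(ordered), page_size)]
    return [((page[0], page[-1]), page) for page in pages]

def range_query(index: List[Page], lo: int, hi: int) -> Tuple[List[int], int]:
    """Return the matching values and the number of pages actually examined."""
    hits: List[int] = []
    pages_read = 0
    for (pmin, pmax), page in index:
        if pmax < lo or pmin > hi:
            continue  # the summary rules this cluster out without reading it
        pages_read += 1
        # The summary is lossy: a chosen page may still hold non-matching values.
        hits.extend(v for v in page if lo <= v <= hi)
    return hits, pages_read

index = build_index([5, 1, 9, 14, 3, 20, 17, 8], page_size=2)
hits, pages_read = range_query(index, 8, 14)
print(hits, pages_read)
```

Only two of the four pages are examined here. One of them also holds the non-matching value 5, which is the price of summarizing a cluster by its range alone.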
GiSTs for DBMSs
In databases, search trees are typically disk-block-per-node, meaning you
want guaranteed short height for few I/Os during search. Typical
scheme is a balanced tree that splits upwards (a la B-trees).
Invariants:
- Data stored in the leaves only
- Tree always balanced-height
- Tree always guides search to all the data matching a query
GiST encapsulates this logic in a data- and query-independent way,
exposing only 4 important methods:
- Consistent: given query predicate and subtree predicate, decide whether
  to traverse
- Penalty: given insertion value and subtree predicate, decide how bad it
  is to insert below.
- PickSplit: given a pageful of items, divide them between two pages.
- Union: given a set of items, generate a subtree predicate above them.
This seems simple. It is simple. Implementing these methods takes about
1/10th of the single-user R-tree logic.
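
The four methods can be sketched for one simple key type. The following is an illustrative sketch only, assuming keys are closed integer intervals (a point x is stored as (x, x)) and a toy in-memory tree made of dictionaries. The signatures, the penalty metric, and the split heuristic are assumptions chosen for brevity, not the interface of any particular GiST implementation.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # closed interval [lo, hi]; assumed key type

def consistent(subtree: Interval, query: Interval) -> bool:
    """Consistent: could anything below this predicate match the query?"""
    return subtree[0] <= query[1] and query[0] <= subtree[1]

def union(items: List[Interval]) -> Interval:
    """Union: the smallest interval covering every item."""
    return (min(lo for lo, _ in items), max(hi for _, hi in items))

def penalty(subtree: Interval, new_item: Interval) -> int:
    """Penalty: how far the subtree predicate must grow to admit new_item."""
    grown = union([subtree, new_item])
    return (grown[1] - grown[0]) - (subtree[1] - subtree[0])

def pick_split(items: List[Interval]) -> Tuple[List[Interval], List[Interval]]:
    """PickSplit: sort by lower bound and cut in the middle (one simple heuristic)."""
    ordered = sorted(items)
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]

def leaf(keys: List[Interval]) -> dict:
    return {"leaf": True, "keys": keys}

def search(node: dict, query: Interval) -> List[Interval]:
    """Generic search: the tree code only ever calls consistent()."""
    if node["leaf"]:
        return [k for k in node["keys"] if consistent(k, query)]
    found: List[Interval] = []
    for pred, child in node["children"]:
        if consistent(pred, query):
            found.extend(search(child, query))
    return found

# Build a two-level tree by splitting one overfull page.
left, right = pick_split([(7, 7), (1, 1), (10, 10), (4, 4)])
root = {"leaf": False,
        "children": [(union(left), leaf(left)), (union(right), leaf(right))]}
result = search(root, (3, 8))
print(result)
```

The search routine knows nothing about intervals. Swapping in a different key type means replacing the four functions and leaving the tree logic alone.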
Concurrency/Recovery logic in indexes adds the really hard complexity,
which can also be encapsulated into the GiST implementation (see Kornacker/Mohan/Hellerstein
SIGMOD '97).
GiST can also be extended to handle extensible kinds of search, e.g.
near-neighbor search. See Aoki, ICDE '99.
The subtree predicates in GiST can also be used for estimation/visualization:
see Aoki SSDBM '99.
GiST can be tuned to minimize calls (boundary crossings) to user-defined
code, and efficiently support user-specified page layout and key compression.
See Kornacker, VLDB '99.
Think extensibility is expensive? GiST R-trees were built in Informix
and shown to be faster than native R-trees; see Kornacker, VLDB '99.
Extensibility Lessons From GiST
You can take a big body of related ideas and encapsulate away the common
logic. Makes innovation easier.
More importantly, highlights the innovation, encourages separating the
basic issues during analysis and design. Not just "it works this
much better", but "it works better because...". E.g. SR-tree study.
Index performance: Workload-dependent! Boils down to
- excess coverage (i.e. bad lossy subtree predicates)
- poor clustering of data into leaves (doesn't match query co-retrieval)
- poor space utilization
See amdb and related papers (UIDIS '99, TODS submission) for the
"access method debugger and profiler" and a discussion
of the issues in designing and analyzing indexes.
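
The three factors above can be illustrated with a toy per-query profile. This is a simplified sketch under stated assumptions: one-dimensional leaves described by a bounding interval, a fixed page capacity, and leaf counts as the unit of cost. The function profile and its metric definitions are hypothetical simplifications for illustration, not the metrics that amdb actually computes.

```python
import math
from typing import Dict, List, Tuple

Leaf = Tuple[Tuple[int, int], List[int]]  # (bounding interval, stored values)

def profile(leaves: List[Leaf], query: Tuple[int, int], capacity: int) -> Dict[str, float]:
    qlo, qhi = query
    # Leaves whose predicate overlaps the query must all be read.
    visited = [lf for lf in leaves if lf[0][0] <= qhi and qlo <= lf[0][1]]
    # Of those, only some actually contain matching data.
    useful = [lf for lf in visited if any(qlo <= v <= qhi for v in lf[1])]
    hits = sum(1 for _, vals in useful for v in vals if qlo <= v <= qhi)
    # Fewest full pages that could hold the result if clustering were perfect.
    ideal = math.ceil(hits / capacity)
    stored = sum(len(vals) for _, vals in leaves)
    return {
        "excess_coverage": len(visited) - len(useful),  # reads that found nothing
        "clustering_loss": len(useful) - ideal,         # reads beyond the ideal
        "utilization": stored / (len(leaves) * capacity),
    }

leaves = [
    ((0, 10), [0, 10]),     # loose predicate: overlaps the query, holds no match
    ((3, 5), [3, 4]),
    ((6, 20), [6, 7, 20]),
]
report = profile(leaves, query=(3, 7), capacity=4)
print(report)
```

In this example the query reads three leaves, one of them purely because of a loose predicate, and the four matching values are spread over two leaves although they would fit on one page.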
Indexability theory: covers the clustering problem, including
consideration of space/time tradeoffs. (Hellerstein/Koutsoupias/Miranker/Papadimitriou/Samoladas/Taylor.
See PODS '97, '98, '99 and upcoming JACM.) Some workloads are "unindexable",
and this is easy to check! All this pops out when you divorce yourself from the
data structure details.
Danger in extensibility: trying to cover all possibilities. E.g.
"why can't GiST be extended to do linear hash indexes?"