Advanced Topics in Computer Systems
Joe Hellerstein & Eric Brewer

Generalized Search Trees

Extensible Access Methods

Access Method = {Heap files, Indexes}.
Extensible access methods are an old challenge -- goes back to Ingres.
Solution to date (Ingres, Postgres, brand-new commercial systems): expose the iterator interface and some optimizer predicate-matching. This leaves WAY too much to the app writer (including concurrency and recovery logic!)

Indexing: The Big Picture

What is an Index? It's something that answers simple predicates on a data set without requiring all the data to be examined. It entails two things:

A clustering of a data set: allows only a subset of the data to be examined.
A lossy compression of the clusters: allows clusters to be chosen.

The clustering can be a strict partitioning of the data (as in B-trees, hash indexes, R-trees, etc.) or not (as in inverted files, R+-trees, etc.). Separately, the clusters can be a strict partitioning of the data domain (as in B-trees, hash indexes, quad trees, etc.) or not (as in R-trees, etc.).

The lossy compression can be functional (e.g. a hash function) or stored (e.g. explicitly stored predicates). Stored compressions are usually arranged into a multi-resolution hierarchy -- a tree. (The hierarchy can be exploited not only for search, but also for estimation and visualization, and for concurrency control!)
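
To make the clustering-plus-lossy-compression view concrete, here is a minimal sketch (not from any of the cited systems). It assumes integer keys, fixed-size pages as the clusters, and a stored (min, max) pair per page as the lossy compression. The names build_pages and range_query are hypothetical.

```python
def build_pages(values, page_size):
    """Cluster sorted values into fixed-size pages and summarize each page."""
    data = sorted(values)
    pages = [data[i:i + page_size] for i in range(0, len(data), page_size)]
    # Lossy compression: keep only the bounds of each cluster.
    summaries = [(page[0], page[-1]) for page in pages]
    return pages, summaries


def range_query(pages, summaries, lo, hi):
    """Return the matching values and the number of pages examined."""
    hits, examined = [], 0
    for page, (pmin, pmax) in zip(pages, summaries):
        if pmax < lo or pmin > hi:
            continue  # the summary alone rules this page out
        examined += 1
        hits.extend(v for v in page if lo <= v <= hi)
    return hits, examined


pages, summaries = build_pages(range(100), page_size=10)
hits, examined = range_query(pages, summaries, 42, 57)
print(examined, "of", len(pages), "pages examined")
```

Because the summary is lossy, a page can be examined and still contribute nothing: a page holding only 1 and 100 has summary (1, 100) and will be read for the query 50..60.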

GiSTs for DBMSs

In databases, search trees are typically disk-block-per-node, meaning you want guaranteed short height for few I/Os during search. Typical scheme is a balanced tree that splits upwards (a la B-trees).

Invariants:

Data stored in the leaves only
Tree always balanced-height
Tree always guides search to all the data matching a query
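
The invariants can be written down as executable checks. This is a sketch under assumptions not in these notes: a node is a hypothetical tuple, either ("leaf", [values]) or ("inner", [(lo, hi, child), ...]), and subtree predicates are closed numeric ranges. The first invariant (data in leaves only) is built into that representation; the other two are checked below.

```python
def leaf_depths(node, depth=0):
    """Set of depths at which leaves occur; balanced means exactly one depth."""
    kind, entries = node
    if kind == "leaf":
        return {depth}
    depths = set()
    for _lo, _hi, child in entries:
        depths |= leaf_depths(child, depth + 1)
    return depths


def covered(node, bounds=()):
    """True if every stored value satisfies every predicate on its root path."""
    kind, entries = node
    if kind == "leaf":
        return all(lo <= v <= hi for v in entries for lo, hi in bounds)
    return all(covered(child, bounds + ((lo, hi),)) for lo, hi, child in entries)


good = ("inner", [(0, 5, ("leaf", [0, 3, 5])), (6, 9, ("leaf", [6, 9]))])
bad = ("inner", [(0, 5, ("leaf", [0, 7])),
                 (6, 9, ("inner", [(6, 9, ("leaf", [8]))]))])

print(len(leaf_depths(good)) == 1, covered(good))
print(len(leaf_depths(bad)) == 1, covered(bad))
```

In the bad tree, the value 7 sits under the predicate 0..5, so a search for 7 would never be guided to it; that tree also has leaves at two different depths.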

GiST encapsulates this logic in a data- and query-independent way, exposing only four important methods to the extension writer:

Consistent: given query predicate and subtree predicate, decide whether to traverse
Penalty: given insertion value and subtree predicate, decide how bad it is to insert below
PickSplit: given a pageful of items to split among two pages, decide which items go to which page
Union: given a set of items, generate a subtree predicate above them.
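
The division of labor between the four methods and the generic tree code can be sketched as follows. This is an illustrative toy, not the GiST implementation: it assumes in-memory nodes, keys that are closed intervals (lo, hi), and a halving PickSplit. The names IntervalKey, Node, GiST, and capacity are hypothetical.

```python
class IntervalKey:
    """Hypothetical extension: predicates are closed intervals (lo, hi)."""

    @staticmethod
    def consistent(pred, query):
        # May say yes too often, but must never miss a real match.
        return pred[0] <= query[1] and query[0] <= pred[1]

    @staticmethod
    def union(preds):
        return (min(p[0] for p in preds), max(p[1] for p in preds))

    @staticmethod
    def penalty(pred, new):
        # How much the subtree predicate would have to grow.
        u = IntervalKey.union([pred, new])
        return (u[1] - u[0]) - (pred[1] - pred[0])

    @staticmethod
    def pick_split(entries):
        # Simplistic choice: sort by predicate and cut in half.
        ordered = sorted(entries, key=lambda e: e[0])
        mid = len(ordered) // 2
        return ordered[:mid], ordered[mid:]


class Node:
    def __init__(self, leaf, entries=None):
        self.leaf = leaf
        self.entries = entries or []  # (predicate, child Node or data item)


class GiST:
    """Generic tree logic; knows nothing about intervals."""

    def __init__(self, ext, capacity=4):
        self.ext, self.capacity = ext, capacity
        self.root = Node(leaf=True)

    def search(self, query, node=None):
        if node is None:
            node = self.root
        for pred, item in node.entries:
            if self.ext.consistent(pred, query):
                if node.leaf:
                    yield item
                else:
                    yield from self.search(query, item)

    def insert(self, pred, value):
        split = self._insert(self.root, pred, value)
        if split:  # root split: the tree grows upward, staying balanced
            self.root = Node(leaf=False, entries=split)

    def _insert(self, node, pred, value):
        if node.leaf:
            node.entries.append((pred, value))
        else:
            # Descend into the subtree with the smallest penalty.
            i = min(range(len(node.entries)),
                    key=lambda j: self.ext.penalty(node.entries[j][0], pred))
            child = node.entries[i][1]
            split = self._insert(child, pred, value)
            if split:
                node.entries[i:i + 1] = split
            else:
                node.entries[i] = (self._cover(child), child)
        if len(node.entries) > self.capacity:
            left, right = self.ext.pick_split(node.entries)
            node.entries = left
            sibling = Node(node.leaf, right)
            return [(self._cover(node), node), (self._cover(sibling), sibling)]
        return None

    def _cover(self, node):
        return self.ext.union([p for p, _ in node.entries])


tree = GiST(IntervalKey, capacity=4)
for i in range(20):
    k = (i * 7) % 20  # insert 0..19 in a scrambled order
    tree.insert((k, k), k)
found = sorted(tree.search((5, 7)))
print(found)
```

Swapping IntervalKey for a class with the same four static methods over bounding boxes would give an R-tree-like index without touching the GiST class, which is the point of the encapsulation.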

This seems simple. It is simple. The extension code is about 1/10th of the single-user R-tree logic.

Concurrency/Recovery logic is where the really hard complexity in indexes comes from, and it too can be encapsulated in the GiST implementation (see Kornacker/Mohan/Hellerstein SIGMOD '97).

GiST can also be extended to handle extensible kinds of search, e.g. near-neighbor search. See Aoki, ICDE '99.

The subtree predicates in GiST can also be used for estimation/visualization: see Aoki SSDBM '99.

GiST can be tuned to minimize calls (boundary crossings) to user-defined code, and efficiently support user-specified page layout and key compression. See Kornacker, VLDB '99.

Think extensibility is expensive? GiST R-trees were built in Informix and shown to be faster than native R-trees; see Kornacker, VLDB '99.

Extensibility Lessons From GiST

You can take a big body of related ideas and encapsulate away the common logic. Makes innovation easier.

More importantly, it highlights the innovation and encourages separating the basic issues during analysis and design. Not just "it works this much better", but "it works better because...". E.g. SR-tree study.

Index performance: Workload-dependent! Performance loss boils down to

excess coverage (i.e. bad lossy subtree predicates)
poor clustering of data into leaves (doesn't match query co-retrieval)
poor space utilization
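
These three factors can be measured against a workload. The sketch below uses simplified definitions chosen for illustration; they are not amdb's exact metrics. It assumes leaves described as (lo, hi, items) with a range predicate, range queries (qlo, qhi), and an "ideal" leaf count of ceil(matches / capacity) per query. The function name profile is hypothetical.

```python
from math import ceil


def profile(leaves, queries, capacity):
    """Count leaf reads by cause, over a workload of range queries."""
    visited = useful = ideal = 0
    for qlo, qhi in queries:
        matches = 0
        for lo, hi, items in leaves:
            if hi < qlo or lo > qhi:
                continue  # predicate rules the leaf out; no I/O
            visited += 1
            n = sum(qlo <= v <= qhi for v in items)
            useful += n > 0  # leaf held at least one match
            matches += n
        # Fewest full leaves that could hold this query's answer.
        ideal += ceil(matches / capacity)
    stored = sum(len(items) for _, _, items in leaves)
    return {
        "excess_coverage": visited - useful,  # reads that returned nothing
        "clustering": useful - ideal,         # reads beyond the ideal packing
        "utilization": stored / (capacity * len(leaves)),
    }


leaves = [(0, 9, [0, 9]), (3, 6, [3, 4, 5, 6])]
report = profile(leaves, queries=[(4, 5), (0, 3)], capacity=4)
print(report)
```

Here the query 4..5 reads the first leaf for nothing (its predicate 0..9 is too loose), and the query 0..3 reads two leaves for two matches that would fit on one page.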

See amdb and related papers (UIDIS '99, TODS submission) for the "access method debugger and profiler" and a discussion of the issues in designing and analyzing indexes.

Indexability theory: covers the clustering problem, including consideration of space/time tradeoffs. (Hellerstein/Koutsoupias/Miranker/Papadimitriou/Samoladas/Taylor. See PODS '97, '98, '99 and upcoming JACM). Some workloads are "unindexable", and this is easy to check! All this pops out when you divorce yourself from the data structure details.

Danger in extensibility: trying to cover all possibilities. E.g. "why can't GiST be extended to do linear hash indexes?"