Advanced Topics in Computer Systems
9/17/01
Anthony Joseph & Joe Hellerstein

Buffer Management: DBMIN (Chou &
DeWitt)
Theme: There are just a few basic access patterns in
a query processing system. Make straightforward observations about locality behavior
of these access patterns, and expose them to your buffer manager. Then
the buffer manager can choose wisely.
- Some people might call this
"understanding application semantics", but there's no semantic
issue here, just performance issues.
- Note: the reason to read this
paper is not to understand how to build a DBMS buffer manager. It's
to (a) understand common access patterns in DBMSs, and (b) understand what
games you can play with extra information. DBMSs often have this
information in advance, as a side effect of query optimization and limited
workload types.
- Predictive versus stochastic
approach
Old Stuff:
- Background: buffers (groups
of frames), frames (groups of pages), pages, "pinning"
- Standard OS replacement
policies: LRU, MRU, FIFO, LIFO, Clock, etc. None is uniformly appropriate
for typical DBMS access patterns (see Stonebraker’s "OS Dump").
- Domain Separation: Split up
work/memory into categories, and do LRU within your category. If there are
no pages in your chunk of memory, borrow from somewhere else. Example
domains: a non-leaf level of a B+-tree. 8-10% improvement over global
LRU(?)
- domains are static
- pages belong to
partitions regardless of how they’re being used (e.g. no distinction
between a heap file page used for a sequential scan and one used for a
nested loop. Also indices are treated as more important than data)
- does not prevent
over-utilization of memory by multiple users, since there’s no notion of
"users" or "processes". Hence needs an
orthogonal load-control facility (probably admission control).
- Group LRU: Like DS, but
prioritize domains. Free buffers are stolen in order of priority (low → high)
- optimization: adjust
sizes of domains using a "working-set" judgment (i.e. pages in
last Ti refs are kept for domain i)
- no convincing evidence
that any of this works better than LRU or Clock.
- The "New"
Algorithm: Modification of INGRES.
- Two important
observations:
- priority not a
property of a page, but of a relation
- each relation needs a
working set
- Note: this is a
query-based intuition! Anticipates DBMIN.
- Multiple pools on a
per-relation basis
- Subdivide memory into
a chunk per relation, and prioritize chunks.
- Assign an empty
resident set per relation.
- Use MRU per relation,
but each relation gets one buffer at all times.
- Why MRU? In the INGRES
implementation it was beaten by plain LRU!
- Heuristics available
to reprioritize chunks.
- Arbitrary?
- Simulation study
looked good, but implementation in INGRES didn’t beat LRU.
- Also, how to handle
multiple users?
- Hot Set
- Also uses query-based
info
- A set of pages which
are accessed over and over form a "hot set".
- If you graph buffer
size against # of page faults, you see "hot points".
- If your hot set fits
in memory, you win! Otherwise you lose.
- Example: for a nested-loop
join, the hot set is |entire inner relation| + 1 (one page for the outer
relation)
- Admission control:
don’t let a query into the system unless its hot set fits in memory. Each
query is given its hot set worth of buffers.
- The idea assumes LRU
is going on. But MRU is better for looping access, and makes the
"hot point" go away
- Using LRU
over-allocates (i.e. under-utilizes) memory, since the "hot
point" analysis can be fixed with MRU.
DBMIN
Based on the Query Locality Set Model (QLSM), which characterizes DBMS
reference patterns in 3 ways:
- Sequential: Straight
Sequential (SS), Clustered Sequential (CS), & Looping Sequential (LS)
- SS
- Read something
sequentially once
- Ex. Select on
unordered relation – touch once, so use one page
- CS
- Repeatedly read a
sequential chunk
- Ex. Merge join (keep
together records with same key in inner relation)
- LS
- Read something
sequentially repeatedly
- Ex. Nested-loop join (keep
inner relation in memory with MRU replacement)
- Random: Independent Random
(IR) and Clustered Random (CR)
- IR
- Truly random access
- Ex. Index scan
through non-clustered index yields random data page access
- CR
- Random accesses which
happen to demonstrate locality
- Ex. Duplicate keys in
outer relation of join
- Hierarchical: Straight
Hierarchical (SH), Hierarchical with Straight Sequential (H/SS),
Hierarchical with Clustered Sequential (H/CS), Looping Hierarchical (LH)
- Traversal path down
from root to leaves of an index
- SH: Regular tree
traversal
- H/SS: Traversal
followed by a sequential scan
- H/CS: Traversal
followed by a clustered scan
- LH: Repeatedly
re-traverse an index
Questions: which query processing operators correspond to
each category? Do the categories cover all the operators?
The DBMIN Algorithm:
- Associate a chunk of memory
with each "file instance" (more like each table in the FROM
clause). This is called the file instance’s locality set. FINALLY --
the right notion of "domain separation", based on how query
processing uses relations.
- KEY IDEA: in
performance, study workload, not just content. Very often,
content is at best a stand-in for behavior!
- File systems work
routinely gets this wrong (e.g. cluster files that are in the same
directory.)
- See discussion of
indexing (GiST and amdb) next
spring.
- Estimate max locality set
sizes by looking at the query plan & database stats. A query is allowed
to run if the sum of its locality sets fits in free buffers.
- A global page table and
global free list is kept in addition to locality sets
- On page request
- if page in global
table & the locality set, just update usage stats of page
- else if page in memory
but not in LocSet, grab page, and if not in another LocSet put it in our
LocSet
- else read page into
LocSet (using a page from global free list)
- If locality set gets
bigger than max needed, choose a page to toss according to a
LocSet-specific policy (to be discussed next)
- Do admission control:
- If not enough memory
when a query is issued, make query wait. Why?
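The page-request logic and admission control above can be sketched as follows. This is a minimal toy model, not the real implementation; the class and method names are ours, not the paper's.

```python
# Toy sketch of DBMIN's bookkeeping. Each file instance gets a locality
# set with its own size bound and replacement policy; a global table
# tracks which locality set (if any) owns each resident page.

class LocalitySet:
    def __init__(self, max_size, evict):
        self.max_size = max_size   # estimated from query plan & stats
        self.evict = evict         # LocSet-specific victim chooser
        self.pages = []            # ordered by recency of use

class BufferManager:
    def __init__(self, n_frames):
        self.free_frames = n_frames
        self.global_table = {}     # page id -> owning LocalitySet, or None

    def admit(self, locality_sets):
        # Admission control: run the query only if the sum of its
        # locality-set sizes fits in free frames; otherwise it waits.
        return sum(ls.max_size for ls in locality_sets) <= self.free_frames

    def request(self, page, locset):
        if page in locset.pages:                  # already in our LocSet
            locset.pages.remove(page)
            locset.pages.append(page)             # just update usage stats
        elif page in self.global_table:           # resident, not ours
            if self.global_table[page] is None:   # unowned: adopt it
                self.global_table[page] = locset
                locset.pages.append(page)
                self._trim(locset)
        else:                                     # fault: read page in
            self.global_table[page] = locset
            locset.pages.append(page)
            self._trim(locset)

    def _trim(self, locset):
        # If the LocSet outgrew its bound, evict by its own policy and
        # return the victim to the global (unowned) pool.
        if len(locset.pages) > locset.max_size:
            victim = locset.evict(locset.pages)
            locset.pages.remove(victim)
            self.global_table[victim] = None
```

For example, a Clustered Sequential locality set would pass a FIFO/LRU-style victim chooser (`lambda pages: pages[0]`), while a Looping Sequential set would choose the most recent page.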
Locality Set size & replacement policies for different
reference patterns:
- Straight Sequential: LocSet
size = 1. Replace that page as soon as needed.
- Clustered Sequential: LocSet
size = (#tuples in largest cluster)/(# of tuples per page). FIFO or LRU
replacement work.
- Looping Sequential: LocSet
size = size of relation (“file”). MRU is best.
- Independent Random: odds of
revisit are low, so LocSet either 1, or the magic number b from the
"Yao" formula. Residual value r = (k - b)/b of a page can
be used to choose between 1 and b (i.e. k is number of
accesses, b is # of pages that will be referenced, so this is # of
revisits over # of pages). Replacement policy?
- Clustered Random: Just like
CS, but the clustered tuples are not packed onto contiguous pages. So the
LocSet size is just the # of tuples in the largest cluster. LRU or FIFO
replacement.
- Straight Hierarchical,
Hierarchical/Straight Sequential: like Straight Sequential.
- Hierarchical/Clustered
Sequential: like CS, but replace tuples with (key,ptr) pairs
- Looping Hierarchical: at each
level h of an index, you have random access among pages. Use Yao to figure
out how many pages you’ll access at each level in k lookups. LocSet
is sum of these over all levels that you choose to worry about (maybe only
the root!). LIFO with a few (4-5) buffers probably an OK replacement
strategy.
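The Yao estimate used for IR and LH sizing above can be computed directly. A sketch (function and parameter names are ours): with n records spread evenly over m pages, the expected number of distinct pages touched by k random record accesses (without replacement) is m times the probability that a given page is touched at least once.

```python
# Yao's formula: expected number of distinct pages referenced when
# k of n_records are selected at random without replacement, with the
# records spread evenly over n_pages.

def yao(n_records, n_pages, k):
    per_page = n_records / n_pages
    # Probability that a fixed page is touched by none of the k accesses:
    # C(n_records - per_page, k) / C(n_records, k), computed as a product.
    prob_untouched = 1.0
    for i in range(int(k)):
        prob_untouched *= (n_records - per_page - i) / (n_records - i)
    return n_pages * (1.0 - prob_untouched)

# Sanity checks: one access touches exactly one page; touching every
# record touches every page.
print(yao(100, 10, 1))      # 1 page
print(yao(100, 10, 100))    # all 10 pages
# With k much smaller than n_records, nearly every access lands on a
# fresh page, so the estimate approaches k from below:
print(yao(10000, 1000, 100))
```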
A Detailed Simulation Study (Welcome to Wisconsin!)
- Trace-based &
distribution-based (stochastic): traces used to model individual queries,
but workload synthesized based on different distributions. Traces done on
the database of a popular query-centric benchmark (the Wisconsin
Benchmark). Queries of 1 and 2 tables.
- Simulator models CPU, one I/O
device, and RAM access. Simulation tuned to micro-benchmark of WiSS.
Performance metric is query throughput.
- 3 levels of sharing modeled:
full, half, and no sharing
- Disk arm was jostled around
randomly to model costs of I/O in a shared system.
- Memory set big enough to hold
about 8 concurrent working sets.
- Statistical confidence
intervals on validity of results (how often do you see that in CS
performance studies these days?! Why not?)
- Comparison of RAND, FIFO,
CLOCK, HOT, Working Set, DBMIN
- Queries/second is the
metric. Alternatives?
- Bottom line:
- As expected, DBMIN is
the top line on every graph
- Heuristic techniques
are no good, though feedback-based admission control helps
- Later work on such
admission control shows it's VERY tricky in mixed workloads.
- Working Set not a big
winner either, though a popular OS choice
- DBMIN beats HOT by
7-13% more throughput
- Caveats:
- Is this a case of
test and train on the same data?
- Results are very
similar for partial sharing
- For full sharing,
many of the algorithms converge (Note that for query mix 3, there are
still noticeable differences). RAND and FIFO are still noticeably worse
Later work: LRU-K,
by O'Neil, O'Neil & Weikum, and a proof of its optimality under
certain assumptions.