Advanced Topics in Computer Systems
9/17/01
Anthony Joseph & Joe Hellerstein

Buffer Management: DBMIN (Chou &
DeWitt)
Theme: There are just a few basic access patterns in
a query processing system. Make straightforward observations about locality behavior
of these access patterns, and expose them to your buffer manager. Then
the buffer manager can choose wisely.
- Some people might call this
"understanding application semantics", but there's no semantic
issue here, just performance issues.
- Note: the reason to read this
paper is not to understand how to build a DBMS buffer manager. It's
to (a) understand common access patterns in DBMSs, and (b) understand what
games you can play with extra information. DBMSs often have this
information in advance, as a side effect of query optimization and limited
workload types.
- Predictive versus stochastic
approach
Old Stuff:
- Background: buffers (groups
of frames), frames (groups of pages), pages, "pinning"
- Standard OS replacement
policies: LRU, MRU, FIFO, LIFO, Clock, etc. None is uniformly appropriate
for typical DBMS access patterns (see Stonebraker’s "OS Dump").
- Domain Separation: Split up
work/memory into categories, and do LRU within your category. If there are
no pages in your chunk of memory, borrow from somewhere else. Example
domains: a non-leaf level of a B+-tree. 8-10% improvement over global
LRU(?)
- domains are static
- pages belong to
partitions regardless of how they’re being used (e.g. no distinction
between a heap file page used for a sequential scan and one used for a
nested loop. Also indices are treated as more important than data)
- does not prevent
over-utilization of memory by multiple users, since there’s no notion of
"users" or "processes". Hence needs an
orthogonal load-control facility (probably admission control).
- Group LRU: Like DS, but
prioritize domains. Free buffers are stolen in order of priority (low → high)
- optimization: adjust
sizes of domains using a "working-set" judgment (i.e. pages in
last Ti refs are kept for domain i)
- no convincing evidence
that any of this works better than LRU or Clock.
- The "New"
Algorithm: Modification of INGRES.
- Two important
observations:
- priority not a
property of a page, but of a relation
- each relation needs a
working set
- Note: this is a
query-based intuition! Anticipates DBMIN.
- Multiple pools on a
per-relation basis
- Subdivide memory into
a chunk per relation, and prioritize chunks.
- Assign an empty
resident set per relation.
- Use MRU per relation,
but each relation gets one buffer at all times.
- Why MRU? In the INGRES
implementation it was beaten by plain LRU!
- Heuristics available
to reprioritize chunks.
- Arbitrary?
- Simulation study
looked good, but implementation in INGRES didn’t beat LRU.
- Also, how to handle
multiple users?
- Hot Set
- Also uses query-based
info
- A set of pages which
are accessed over and over form a "hot set".
- If you graph buffer
size against # of page faults, you see "hot points".
- If your hot set fits
in memory, you win! Otherwise you lose.
- Example: for a nested-loop
join, the hot set is |entire inner relation| + 1 (one page for the outer
relation)
- Admission control:
don’t let a query into the system unless its hot set fits in memory. Each
query is given its hot set worth of buffers.
- The idea assumes LRU
is going on. But MRU is better for looping access, and makes the
"hot point" go away
- Using LRU
over-allocates (i.e. under-utilizes) memory, since the "hot
point" analysis can be fixed with MRU.
DBMIN
Based on the Query Locality Set Model (QLSM), which characterizes DBMS
reference patterns in 3 ways:
- Sequential: Straight
Sequential (SS), Clustered Sequential (CS), & Looping Sequential (LS)
- SS
- Read something
sequentially once
- Ex. Select on
unordered relation – touch once, so use one page
- CS
- Repeatedly read a
sequential chunk
- Ex. Merge join (keep
together records with same key in inner relation)
- LS
- Read something
sequentially repeatedly
- Ex. Nested-loop join (keep
inner relation in memory with MRU replacement)
- Random: Independent Random
(IR) and Clustered Random (CR)
- IR
- Truly random access
- Ex. Index scan
through non-clustered index yields random data page access
- CR
- Random accesses which
happen to demonstrate locality
- Ex. Duplicate keys in
outer relation of join
- Hierarchical: Straight
Hierarchical (SH), Hierarchical with Straight Sequential (H/SS),
Hierarchical with Clustered Sequential (H/CS), Looping Hierarchical (LH)
- Traversal path down
from root to leaves of an index
- SH: Regular tree
traversal
- H/SS: Traversal
followed by a sequential scan
- H/CS: Traversal
followed by a clustered scan
- LH: Repeatedly
re-traverse an index
Questions: which query processing operators correspond to
each category? Do the categories cover all the operators?
The DBMIN Algorithm:
- Associate a chunk of memory
with each "file instance" (more like each table in the FROM
clause). This is called the file instance’s locality set. FINALLY --
the right notion of "domain separation", based on how query
processing uses relations.
- KEY IDEA: in
performance, study workload, not just content. Very often,
content is at best a stand-in for behavior!
- File systems work
routinely gets this wrong (e.g. cluster files that are in the same
directory.)
- See discussion of
indexing (GiST and amdb) next
spring.
- Estimate max locality set
sizes by looking at the query plan & database stats. A query is allowed
to run if the sum of its locality sets fits in free buffers.
- A global page table and
global free list is kept in addition to locality sets
- On page request
- if page in global
table & the locality set, just update usage stats of page
- else if page in memory
but not in LocSet, grab page, and if not in another LocSet put it in our
LocSet
- else read page into
LocSet (using a page from global free list)
- If locality set gets
bigger than max needed, choose a page to toss according to a
LocSet-specific policy (to be discussed next)
- Do admission control:
- If not enough memory
when a query is issued, make query wait. Why?
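The page-request logic and admission control above can be sketched as follows. This is a minimal toy model, not the real implementation; the class and method names are ours, not the paper's.

```python
# Toy sketch of DBMIN's bookkeeping. Each file instance gets a locality
# set with its own size bound and replacement policy; a global table
# tracks which locality set (if any) owns each resident page.

class LocalitySet:
    def __init__(self, max_size, evict):
        self.max_size = max_size   # estimated from query plan & stats
        self.evict = evict         # LocSet-specific victim chooser
        self.pages = []            # ordered by recency of use

class BufferManager:
    def __init__(self, n_frames):
        self.free_frames = n_frames
        self.global_table = {}     # page id -> owning LocalitySet, or None

    def admit(self, locality_sets):
        # Admission control: run the query only if the sum of its
        # locality-set sizes fits in free frames; otherwise it waits.
        return sum(ls.max_size for ls in locality_sets) <= self.free_frames

    def request(self, page, locset):
        if page in locset.pages:                  # already in our LocSet
            locset.pages.remove(page)
            locset.pages.append(page)             # just update usage stats
        elif page in self.global_table:           # resident, not ours
            if self.global_table[page] is None:   # unowned: adopt it
                self.global_table[page] = locset
                locset.pages.append(page)
                self._trim(locset)
        else:                                     # fault: read page in
            self.global_table[page] = locset
            locset.pages.append(page)
            self._trim(locset)

    def _trim(self, locset):
        # If the LocSet outgrew its bound, evict by its own policy and
        # return the victim to the global (unowned) pool.
        if len(locset.pages) > locset.max_size:
            victim = locset.evict(locset.pages)
            locset.pages.remove(victim)
            self.global_table[victim] = None
```

For example, a Clustered Sequential locality set would pass a FIFO/LRU-style victim chooser (`lambda pages: pages[0]`), while a Looping Sequential set would choose the most recent page.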
Locality Set size & replacement policies for different
reference patterns:
- Straight Sequential: LocSet
size = 1. Replace that page as soon as needed.
- Clustered Sequential: LocSet
size = (#tuples in largest cluster)/(# of tuples per page). FIFO or LRU
replacement work.
- Looping Sequential: LocSet
size = size of relation (“file”). MRU is best.
- Independent Random: odds of
revisit are low, so LocSet either 1, or the magic number b from the
"Yao" formula. Residual value r = (k - b)/b of a page can
be used to choose between 1 and b (i.e. k is number of
accesses, b is # of pages that will be referenced, so this is # of
revisits over # of pages). Replacement policy?
- Clustered Random: Just like
CS, but the clustered tuples are not packed onto contiguous pages. So the
LocSet size is just the # of tuples in the largest cluster. LRU or FIFO
replacement.
- Straight Hierarchical,
Hierarchical/Straight Sequential: like Straight Sequential.
- Hierarchical/Clustered
Sequential: like CS, but replace tuples with (key,ptr) pairs
- Looping Hierarchical: at each
level h of an index, you have random access among pages. Use Yao to figure
out how many pages you’ll access at each level in k lookups. LocSet
is sum of these over all levels that you choose to worry about (maybe only
the root!). LIFO with a few (4-5) buffers probably an OK replacement
strategy.
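The Yao estimate used for IR and LH sizing above can be computed directly. A sketch (function and parameter names are ours): with n records spread evenly over m pages, the expected number of distinct pages touched by k random record accesses (without replacement) is m times the probability that a given page is touched at least once.

```python
# Yao's formula: expected number of distinct pages referenced when
# k of n_records are selected at random without replacement, with the
# records spread evenly over n_pages.

def yao(n_records, n_pages, k):
    per_page = n_records / n_pages
    # Probability that a fixed page is touched by none of the k accesses:
    # C(n_records - per_page, k) / C(n_records, k), computed as a product.
    prob_untouched = 1.0
    for i in range(int(k)):
        prob_untouched *= (n_records - per_page - i) / (n_records - i)
    return n_pages * (1.0 - prob_untouched)

# Sanity checks: one access touches exactly one page; touching every
# record touches every page.
print(yao(100, 10, 1))      # 1 page
print(yao(100, 10, 100))    # all 10 pages
# With k much smaller than n_records, nearly every access lands on a
# fresh page, so the estimate approaches k from below:
print(yao(10000, 1000, 100))
```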
A Detailed Simulation Study (Welcome to Wisconsin!)
- Trace-based &
distribution-based (stochastic): traces used to model individual queries,
but workload synthesized based on different distributions. Traces done on
the database of a popular query-centric benchmark (the Wisconsin
Benchmark). Queries of 1 and 2 tables.
- Simulator models CPU, one I/O
device, and RAM access. Simulation tuned to micro-benchmark of WiSS.
Performance metric is query throughput.
- 3 levels of sharing modeled:
full, half, and no sharing
- Disk arm was jostled around
randomly to model costs of I/O in a shared system.
- Memory set big enough to hold
about 8 concurrent working sets.
- Statistical confidence
intervals on validity of results (how often do you see that in CS
performance studies these days?! Why not?)
- Comparison of RAND, FIFO,
CLOCK, HOT, Working Set, DBMIN
- Queries/second is the
metric. Alternatives?
- Bottom line:
- As expected, DBMIN is
the top line on every graph
- Heuristic techniques
are no good, though feedback-based admission control helps
- Later work on such
admission control shows it's VERY tricky in mixed workloads.
- Working Set not a big
winner either, though a popular OS choice
- DBMIN beats HOT by
7-13% more throughput
- Caveats:
- Is this a case of
test and train on the same data?
- Results are very
similar for partial sharing
- For full sharing,
many of the algorithms converge (Note that for query mix 3, there are
still noticeable differences). RAND and FIFO are still noticeably worse
Later work: LRU-K,
by O'Neil, O'Neil & Weikum, and a proof of its optimality under
certain assumptions.