Advanced Topics in Computer Systems
Fall, 2001
Joe Hellerstein & Anthony Joseph

POSTGRES Storage System
An extremely simple solution to the complex recovery problem.
History:
- POSTGRES storage system (Stonebraker) discussed in Berkeley Tech Reports
from 1985
- LFS (Ousterhout & Douglis) discussed in Berkeley Tech Reports
from 1988
- I have been assured that the two had little or nothing to do with
each other
- though Stonebraker is thanked in the "Beating the I/O Bottleneck"
TR from 1988
- Was it the Peet's Coffee?
POSTGRES Overview
- Follow-on to INGRES
- The prototype for "Object-Relational" databases
- extensible OO types (and methods -- i.e. download code into the DBMS)
- extensible access methods
- "active" database (triggers and rules)
- a novel storage manager
- A second system
- The extensibility features worked, and were extremely influential
in research and industry (to be covered next spring)
- The active DB stuff mostly worked, and had big impact in research
- The storage manager worked slowly, and has had little impact.
- "When considering the POSTGRES storage system, we were guided by a
missionary zeal to do something different"
Problem:
- WAL recovery code is really complicated (witness ARIES)
- recovery code must be flawless
- failures can be arbitrary – testing is hard to pull off
What’s wrong with this picture?
        ________________
       |      DBMS      |
        ----------------
           /        \
          /          \
       -----        -----
        DB            Log
       -----        -----
Alternative: A no-overwrite storage system.
- Time travel comes for free
- instantaneous recovery
- no crash recovery code
Details
- Life of a xact:
- increment and grab current global XID
- do processing
- change status to committed in log (more on this below)
- FORCE data & log to stable storage (in that order!)
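A minimal sketch of that commit path in Python (the file handles and the log_offset_of_xid helper are hypothetical stand-ins, not from the paper; the status is written as one byte here rather than 2 bits for simplicity):

    import os

    def force(f):
        """Force a file's buffered contents to stable storage."""
        f.flush()
        os.fsync(f.fileno())

    def commit(xid, dirty_data_files, log_file, log_offset_of_xid):
        # mark this xact committed in the log tail (still in the OS cache)
        log_file.seek(log_offset_of_xid(xid))
        log_file.write(b"C")
        # force the data pages first ...
        for f in dirty_data_files:
            force(f)
        # ... then the log, so a "committed" status never refers to data
        # missing from stable storage; no REDO/UNDO pass is needed on restart
        force(log_file)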
- There is a log:
- tail of log (oldest active xact to present) needs 2 bits per transaction
to record state (committed, aborted, in progress)
- body of log needs only 1 bit per xact (committed or aborted)
- at 1 xact per second (~31.5M xacts/year x 1 bit each), a year of transactions fits in about 4 MB of log space!
- Detail: if this is still too big, use a Bloom filter to represent
aborted xacts (lossy compression)
- with just a little NVRAM, the log essentially never needs forcing
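A rough sketch of the log encoding (the bit packing, class names, and the choice of SHA-256 for the Bloom filter hashes are mine, not the paper's):

    import hashlib

    IN_PROGRESS, COMMITTED, ABORTED = 0, 1, 2    # 2-bit states for the log tail

    class StatusLog:
        """Two bits per transaction, packed four transactions per byte.
        The tail only spans oldest-active-xact to present, so it stays small."""
        def __init__(self, max_xid):
            self.bits = bytearray((max_xid + 3) // 4)

        def set_status(self, xid, status):
            byte, slot = divmod(xid, 4)
            shift = 2 * slot
            self.bits[byte] = (self.bits[byte] & ~(0b11 << shift)) | (status << shift)

        def get_status(self, xid):
            byte, slot = divmod(xid, 4)
            return (self.bits[byte] >> (2 * slot)) & 0b11

    class AbortedFilter:
        """Lossy compression of the log body: a Bloom filter over aborted XIDs.
        No false negatives (a miss means definitely not aborted); false
        positives are possible, which is what makes it lossy."""
        def __init__(self, nbits=1 << 20):
            self.nbits = nbits
            self.bits = bytearray(nbits // 8)

        def _positions(self, xid):
            h = hashlib.sha256(str(xid).encode()).digest()
            return [int.from_bytes(h[i:i + 4], "big") % self.nbits for i in (0, 4, 8)]

        def add(self, xid):                      # called when an xact aborts
            for p in self._positions(xid):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_be_aborted(self, xid):
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(xid))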
Each tuple has a bunch of system fields:
- OID: a database-wide unique ID across all time
- Xmin: XID of inserter
- Tmin: commit time of Xmin
- Cmin: command ID of inserter
- Xmax: XID of deleter (if any)
- Tmax: commit time of Xmax (if any)
- Cmax: command ID of deleter (if any)
- PTR: pointer to chain of deltas
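A sketch of that per-tuple header as a Python record (field types and widths are my guesses; the paper only names the fields):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class TupleHeader:
        oid:  int                      # database-wide unique ID, stable across versions
        xmin: int                      # XID of the inserting transaction
        tmin: Optional[float] = None   # commit time of xmin (filled in lazily)
        cmin: int = 0                  # command ID of the inserter
        xmax: Optional[int] = None     # XID of the deleter/updater, if any
        tmax: Optional[float] = None   # commit time of xmax, if any
        cmax: Optional[int] = None     # command ID of the deleter/updater, if any
        ptr:  Optional[int] = None     # pointer to the chain of delta records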
Updates work as follows:
- Xmax set to the updater's XID, Cmax to the updater's command ID
- new replacement tuple appended to DB with:
- OID of old record
- Xmin = updater's XID, Cmin = updater's command ID
- in fact, stored as a delta off the original tuple
Deleters simply set Xmax & Cmax to their XID & command ID
The first version of a record is called the Anchor Point, which has a chain
of associated delta records.
"Hopefully", delta records fit on the same page as their anchor point.
- Obvious optimization for read-intensive workloads?
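Roughly what the update and delete paths described above could look like, with versions modeled as plain dicts (the names, the delta representation, and the simplification that the "previous version" is always the anchor are mine):

    def update(db, anchor, changes, xid, cid):
        """No-overwrite update: stamp the old version, append a delta."""
        anchor["xmax"], anchor["cmax"] = xid, cid        # old version ends here
        delta = {"oid": anchor["oid"],                   # same OID across versions
                 "xmin": xid, "cmin": cid, "tmin": None,
                 "xmax": None, "cmax": None, "tmax": None,
                 "changes": changes}                     # only the fields that differ
        db.append(delta)                                 # ideally on the anchor's page
        anchor.setdefault("deltas", []).append(len(db) - 1)   # the PTR chain
        return delta

    def delete(anchor, xid, cid):
        """Deleters just stamp Xmax/Cmax; nothing is overwritten in place."""
        anchor["xmax"], anchor["cmax"] = xid, cid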
CC, Timestamps, Archiving:
If we actually got timestamps at xact start, we’d get timestamp ordering
CC.
Instead, do 2PL, and get timestamp at commit time.
How to set Tmin and Tmax if you don’t have the commit time?
- the XID doubles as the OID of that xact's tuple in the TIME relation
- at commit:
- update your appropriate TIME tuple with the wall-clock time.
- then force data pages to stable storage, change status to committed
in tail of log
- 3 levels of archiving
- no archive: old versions not needed
- light archive: old versions not to be accessed often
- heavy archive: old versions to be accessed regularly
- on first access to a tuple of a "heavy archive" relation, the XIDs (OIDs into
TIME) standing in for Tmin and Tmax are replaced with the actual commit times
from the TIME relation
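A sketch of that lazy timestamp fill-in, with the TIME relation modeled as a dict from XID to wall-clock commit time (all names are mine):

    TIME = {}    # the TIME relation: XID -> wall-clock commit time

    def record_commit_time(xid, now):
        TIME[xid] = now                  # done by each xact at commit

    def fill_timestamps(hdr):
        """First access to a heavy-archive tuple: swap the XID handles standing
        in for Tmin/Tmax for real commit times looked up in TIME."""
        if hdr["tmin"] is None and hdr["xmin"] in TIME:
            hdr["tmin"] = TIME[hdr["xmin"]]
        if hdr["tmax"] is None and hdr.get("xmax") in TIME:
            hdr["tmax"] = TIME[hdr["xmax"]]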
Time Travel
Allows queries over a table as of some wall-clock time in the past.
Rewrite queries to handle the system fields in tuples
Reading a Record: get record, follow delta chain until you’ve got the
appropriate version constructed.
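Putting the fields together, a time-travel read might look roughly like this (a sketch: anchors carry their full values under "fields", deltas are dicts of changed fields, and visibility is simplified to committed-versions-only):

    def read_as_of(anchor, deltas, t):
        """Reconstruct the version of a record visible at wall-clock time t:
        start from the anchor point, apply each delta committed by time t.
        deltas are ordered oldest to newest; uncommitted deltas (tmin None)
        and later ones are ignored."""
        if anchor["tmin"] is None or anchor["tmin"] > t:
            return None                          # record did not exist yet at t
        version = dict(anchor["fields"])
        current = anchor
        for d in deltas:
            if d["tmin"] is None or d["tmin"] > t:
                break                            # this and later versions postdate t
            version.update(d["changes"])
            current = d
        # if the version current at t had already been deleted by t, hide it
        if current.get("tmax") is not None and current["tmax"] <= t:
            return None
        return version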
Indexes all live on disk, and are updated in place (overwrites here)
Archiving
- historical data can be forced to archive via the vacuum cleaner
- write archive record(s)
- write new anchor record
- reclaim space of old anchor/deltas
- crash during vacuum?
- indexes may lose archive records: this will be discovered at runtime
and fixed via a Seq. Scan
- duplicate records may be forced to archive: OK because POSTGRES doesn’t
do multisets
- Can build R-trees over lifetime intervals of data on archive
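A sketch of the vacuum cleaner's move-to-archive step described above, ordered so that a mid-vacuum crash only leaves harmless leftovers (stores modeled as plain lists; names and the record layout are mine; assumes there is at least one delta to archive):

    def vacuum(record, main_store, archive):
        """record = {"anchor": {...}, "deltas": [...]}; both stores are lists."""
        superseded = [record["anchor"]] + record["deltas"][:-1]
        current = record["deltas"][-1]
        archive.extend(superseded)          # 1. write archive record(s)
        main_store.append(current)          # 2. write the new anchor record
        for v in superseded:                # 3. reclaim old anchor/delta space
            if v in main_store:
                main_store.remove(v)
        # crash in the middle? duplicates may land in the archive (harmless,
        # since POSTGRES has no multisets) and indexes may point at records
        # that have moved -- detected at runtime and fixed via a Seq. Scan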
Performance Study vs. WAL
Assumptions:
- records fit on a single page
- deltas live on the same page as anchors
- single-record xacts
- update-only workload (?!)
NVRAM required to make POSTGRES compete on even this benchmark.
The Real POSTGRES Story
- PostgreSQL still uses this storage manager (somewhat improved since
the paper was written)
- Tuple differencing never implemented
- R-trees over archive not used
- Stonebraker commercialized POSTGRES at Illustra
- Illustra never claimed to be a TP competitor
- Informix bought Illustra, and replaced the no-overwrite storage manager
with Informix’s WAL
- IBM bought Informix.
Ask Not What POSTGRES Can Do For You...
- If you made a handful of obvious design changes to POSTGRES, would
it be viable today?
- Could certainly be made better
- CS262 project there
- What if you throw in:
- a little NVRAM
- flush to remote memory in a cluster
- variable-sized storage units
- limited time travel
- The Telegraph Storage manager was marching down this path...