Advanced Topics in Computer Systems
Fall, 2001
Joe Hellerstein & Anthony Joseph

POSTGRES Storage System

An extremely simple solution to the complex recovery problem.

History:

POSTGRES Overview

Problem: What’s wrong with this picture?
           ________________
          |     DBMS       |
           ----------------
               /       \
              /         \
           -----      -----
            DB         Log
           -----      -----


Alternative: A no-overwrite storage system.

  1. Time travel comes for free
  2. instantaneous recovery
  3. no crash recovery code
Details

Each tuple has a bunch of system fields:

Updates work as follows:
  1. Xmax & Cmax set to updater’s XID
  2. new replacement tuple appended to DB with:
Deleters simply set Xmax & Cmax to their XID
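
The update and delete rules above can be sketched in a few lines. This is only an illustration, not the POSTGRES code: the names TupleVersion, heap, insert, delete and update are hypothetical, and the fields given to the replacement tuple (same oid, xmin and cmin set to the updater's XID) are an assumption, since the notes do not list them. Only the Xmax/Cmax steps come from the notes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TupleVersion:
    oid: int                     # assumed: logical record id shared by all versions
    data: dict
    xmin: Optional[int] = None   # assumed: XID of the transaction that created this version
    cmin: Optional[int] = None   # assumed: set alongside xmin
    xmax: Optional[int] = None   # XID of the updater or deleter (from the notes)
    cmax: Optional[int] = None   # set alongside xmax (from the notes)


def insert(heap: list, oid: int, data: dict, xid: int) -> TupleVersion:
    """Append a brand-new version; nothing already stored is touched."""
    version = TupleVersion(oid, data, xmin=xid, cmin=xid)
    heap.append(version)
    return version


def delete(old: TupleVersion, xid: int) -> None:
    """Deleters only stamp the old version; its data stays in place."""
    old.xmax = xid
    old.cmax = xid


def update(heap: list, old: TupleVersion, new_data: dict, xid: int) -> TupleVersion:
    """Step 1: stamp the old version. Step 2: append the replacement."""
    delete(old, xid)
    return insert(heap, old.oid, new_data, xid)
```

The old version's data is never overwritten, which is what makes the later time travel and recovery arguments possible.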

The first version of a record is called the Anchor Point, which has a chain of associated delta records.

"Hopefully", delta records fit on the same page as their anchor point.

CC, Timestamps, Archiving:

If we actually got timestamps at xact start, we’d get timestamp ordering CC.

Instead, do 2PL, and get timestamp at commit time.

How to set Tmin and Tmax if you don’t have the commit time?

    1. no archive: old versions not needed
    2. light archive: old versions not to be accessed often
    3. heavy archive: old versions to be accessed regularly

Time Travel

Allows queries over a table as of some wall-clock time in the past.

Rewrite queries to handle the system fields in tuples
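
A rough sketch of the predicate such a rewrite might add. It assumes Tmin and Tmax hold the commit times of the transactions that created and superseded a version, with None meaning "not set"; the function names and the dict-based rows are hypothetical.

```python
from typing import Optional


def visible_as_of(tmin: Optional[float], tmax: Optional[float], t: float) -> bool:
    """True if a version was the current one at wall-clock time t."""
    if tmin is None or tmin > t:
        return False             # not yet committed at time t
    return tmax is None or tmax > t


def as_of(rows: list, t: float) -> list:
    """Filter a table scan the way a rewritten time travel query would."""
    return [r for r in rows if visible_as_of(r["tmin"], r["tmax"], t)]
```

For example, a version with tmin=10 and tmax=20 is returned for t=15 but not for t=20, when its replacement takes over.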

Reading a Record: get record, follow delta chain until you’ve got the appropriate version constructed.
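
A minimal sketch of that read path, assuming each delta stores only the fields that changed and that the chain is ordered oldest first. The function name, the "changes" key and the applies callback are hypothetical.

```python
from typing import Callable


def reconstruct(anchor_fields: dict, deltas: list, applies: Callable[[dict], bool]) -> dict:
    """Start from the anchor point and apply deltas until one is too new."""
    record = dict(anchor_fields)           # copy, so the anchor is left untouched
    for delta in deltas:                   # chain order, oldest first
        if not applies(delta):
            break                          # the appropriate version has been built
        record.update(delta["changes"])
    return record
```

The cost of a read grows with the length of the chain, which is one reason the notes hope that deltas stay on the same page as their anchor.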

Indexes all live on disk, and are updated in place (overwrites here)

Archiving

    1. write archive record(s)
    2. write new anchor record
    3. reclaim space of old anchor/deltas
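
The three steps can be sketched as follows. This is an illustration under stated assumptions, not the actual archiver: a page is modeled as a Python list of records, each delta holds only changed fields, and the names archive_chain, page and archive are hypothetical. The point is the ordering: old versions reach the archive before anything is reclaimed.

```python
def archive_chain(page: list, archive: list, oid: int) -> None:
    """Move the old versions of one record to the archive, keep only the newest."""
    # Assumed layout: the anchor comes first, then its deltas in chain order.
    old = [r for r in page if r["oid"] == oid]
    if not old:
        return

    # Rebuild every version by applying the deltas on top of the anchor.
    versions, current = [], {}
    for r in old:
        current = {**current, **r["fields"]}
        versions.append(current)

    # 1. write archive record(s): every version except the newest
    archive.extend(versions[:-1])

    # 2. write new anchor record holding the newest version
    page.append({"oid": oid, "kind": "anchor", "fields": versions[-1]})

    # 3. reclaim space of old anchor/deltas (matched by identity, not equality)
    old_ids = {id(r) for r in old}
    page[:] = [r for r in page if id(r) not in old_ids]
```

With this ordering, stopping between steps leaves a version present in both places rather than in neither.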

Performance Study vs. WAL

Assumptions:

NVRAM required to make POSTGRES compete on even this benchmark.

The Real POSTGRES Story

Ask Not What POSTGRES Can Do For You...