Advanced Topics in Computer Systems
Joe Hellerstein & Eric Brewer

Distributed Transaction Management in R*

Unravels details of logging & messages sent.

Assumptions

Desired Characteristics In order to minimize logging and comm:

Normal Processing (2PC)

Coordinator Log 
Messages 
Subordinate Log 
 
PREPARE 
 
   
prepare*/abort*
 
VOTE YES/NO 
 
commit*/abort*
   
 
COMMIT/ABORT 
 
   
commit*/abort*
 
ACK 
 
end
   
     

since subords force abort (& commit) before ACKing, they never need to ask coord about final outcome.
Rule: never need to ask something you used to know; log before ACKing.
    Guarantees atomicity.

Total cost:
     subords: 2 forced log-writes (prepare/commit), 2 messages (YES/ACK)
     coord: 1 forced log write (commit), 1 async log write (end), 2 messages/subord (prepare/commit)

2PC & Failures

Recovery process per site handles xacts committing at crash, as well as incoming recovery messages.

  1. on restart, read log and accumulate committing xact info in main mem
  2. if you discover a local xact in the prepared state, contact coord to find out what to do
  3. if you discover a local xact that didn’t get prepared, UNDO it, write abort record, and forget
  4. if a local xact was committing (i.e. this is the coord), then send out COMMIT msgs to subords that haven’t ACKed. Similar for aborting.
Upon discovering a failure elsewhere:

    If a coord discovers that a subord is unreachable...
          while waiting for its vote: coord aborts xact as usual
          while waiting for an ACK: coord gives xact to recovery manager
    If a subord discovers that a coord is unreachable...
          if it hasn’t sent a YES vote yet: abort ("unilateral abort")
          if it has sent a YES vote, subord gives xact to recovery manager

 If a recovery mgr receives an inquiry from a subord in prepared state
          if main mem info says xact is committing or aborting, send COMMIT/ABORT
          if main mem info says nothing...?
 

An Aside: Hierarchical 2PC

If you have a tree-shaped process graph.

     root (which talks to user) is a coordinator
     leaves are subordinates
     interior nodes are both
          after receiving PREPARE, propagate it to children
          vote after children; any NO below causes a NO vote
          after receiving a COMMIT record, force-write log, ACK to parent, and propagate to children. Similar for ABORT.
 
 

Presumed Abort
     recall... if main-mem info say nothing, coord says ABORT
     SO... coord can forget a xact immediately after deciding to abort it! (write abort record, THEN forget)
     abort can be async write
     no ACKS required from subords on ABORT
     no need to remember names of subords in abort record, nor write end record after abort
     if coord sees subord has failed, need not pass xact to recovery system; can just ABORT.

Now, look at R/O xacts:
     subords who have only read send READ VOTEs instead of YES VOTEs, release locks, write no log records
     logic is: READ & YES = YES, READ & NO = NO, READ & READ = READ
     if all votes are READ, there’s no second phase
     commit record at coord includes only YES sites
     Tallying up the R/O work:
          nobody writes log records
          nonleaf processes send one message to children
          children send one message (to parent)

Presumed Commit

Idea: let’s invert the logic above, since commit is the fast path:

     require ACK for ABORT, not COMMIT
     subords force abort records, not commit
     no information? Presume commit!

Problem:

  1. subord prepares to commit
  2. coord crashes
  3. on restart, coord aborts transaction and forgets it
  4. subord asks about transaction, coord says "no info = commit!"
  5. subord commits, everyone else does not
Solution:
     coord records names of subords on stable storage before allowing them to prepare ("collecting" record)
     then it can tell them about aborts on restart
     everything else analogous (mirror) to Presumed Abort
     Tallying up the R/O work:
          nonleaf writes collecting (forced) and commit (async)
          nonleaf sends one message to all children (PREPARE)
          children send one message (to parent)

Performance analysis in paper:
     PA > 2PC (> = "beats")
     PA > PC for R/O transactions
     for xacts with only one write subord, PC > PA (equal log writes, PA needs an ACK from subord)
     for n-1 writing subords, PC >> PA (equal logging, but PA forces n-1 times when PC does not ­ commit records of subords.  Also PA send n extra messages)
     choice between PA and PC could be made on a transaction-by transaction basis