Some of the old hierarchical and network DBs had pretty clever systems
for handling reliability. BUT -- they had no formalisms to describe
the semantics clearly, and hence few lessons to transfer to other systems.
System R's RSS team, led by Jim Gray, codified the formal notions of
transactions and serializability, and System R delivered a working
(though inefficient) implementation. This work led to the 1998 Turing
Award for Gray.
Our own Christos Papadimitriou did early work formalizing theoretical results
on transactions (and has a book on the topic!)
Various companies slogged through the details of transactions over the
years, especially the complex issues of logging and recovery. In
the 80's and into the 90's, IBM published papers on ARIES, which is the
classic discussion of the dirty details of logging and recovery. Many
industry vets "already knew" this stuff and had done it in commercial systems.
There is an entire industry of Transaction Processing that exists outside
of DBs and relates to distributed applications and services. More
on this later in the semester.
Generally a field populated by industrial practitioners!
Since the early days, academic work has focused largely on esoteric issues
of extensions to the flat transaction model. Little of this has found
practical use.
Background
Transactions are a concept that allows a system to guarantee certain
semantic properties.
These guarantees must be rigorously defined so that people can build correct
systems above them.
Theory meets practice here in a nice way.
Kinds of Actions
Unprotected actions: you cannot count on these; nothing is guaranteed.
It is the responsibility of higher-level actions to check their effects
if they need to be sure.
Protected actions: actions enclosed inside of transactions.
Real actions: actions that are visible outside the computer.
Print to screen, send a web page, output money, drill hole, fire missile,
etc.
getting these right is very tricky!
easier if idempotent (a la "drill hole")
doable but tricky if you can check the state of the real world somehow
(see the testable-state sketch below).
we'll return to this later when we talk about transactional networking.
otherwise impossible to handle!
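To make the testable-state idea concrete, here is a minimal sketch (all
names are invented for illustration, and the real world is simulated by
a dict; in practice the check would be a sensor read): wrap the real
action in a check of the world's state, so a retry after a crash
repeats nothing.

    # Hypothetical sketch: making a real action safe to retry by
    # testing the state of the real world first. The "world" is
    # simulated here; in practice you'd query a sensor.
    world = {}

    def hole_exists(x, y):
        return world.get((x, y), False)

    def drill_hole(x, y):
        world[(x, y)] = True          # the real, externally visible action

    def ensure_hole(x, y):
        # Idempotent wrapper: safe to re-run after a crash, because it
        # checks the real world before acting.
        if not hole_exists(x, y):
            drill_hole(x, y)

    ensure_hole(1, 2)
    ensure_hole(1, 2)                 # the retry is a no-op: idempotence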
A.C.I.D.
A transaction should enjoy the following guarantees:
Atomicity: the "all or nothing" property. Programmer needn't
worry about partial states persisting.
Consistency: the database should start out "consistent", and at the
end of a transaction remain "consistent". What counts as "consistent"
is up to the database administrator to declare to the system; other
notions of consistency must be handled by the application.
Gray & Reuter are very confusing on this point.
Isolation: a transaction should not see the effects of other uncommitted
transactions.
Durability: once committed, the transaction's effects should not
disappear (though they may be overwritten by subsequent committed
transactions).
A and D are guaranteed by recovery (usually implemented
via logging). C and I are guaranteed by concurrency control
(usually implemented via locking).
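To see atomicity in action, a small sketch using Python's built-in
sqlite3 module (any transactional engine behaves similarly): a transfer
that fails midway leaves no partial state behind.

    # Sketch: atomicity via Python's sqlite3. The failed transfer rolls
    # back completely; no partial state persists.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE acct (name TEXT PRIMARY KEY, bal INT)")
    conn.execute("INSERT INTO acct VALUES ('a', 100), ('b', 0)")
    conn.commit()

    try:
        with conn:   # one transaction: commit on success, rollback on error
            conn.execute("UPDATE acct SET bal = bal - 50 WHERE name = 'a'")
            # simulate a crash before the matching credit to 'b' happens:
            raise RuntimeError("failure mid-transaction")
    except RuntimeError:
        pass

    # The debit above was rolled back along with the failed transaction:
    print(conn.execute("SELECT bal FROM acct WHERE name = 'a'").fetchone())
    # -> (100,)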
Concurrency Control & Serializability
CS186 material you might not know.
We want to allow multiple transactions to operate concurrently. (Why?)
Need a crisp definition of acceptable and unacceptable concurrency
build that on a notion of acceptable orders of operation
DB only understands simple data-oriented operations: read, write, begin,
commit, abort
order of operations captured in a transaction schedule:
transactions (Ti) and data objects (a...z)
all operations of a transaction occur in the order specified by the transaction
schedule captures the interleaving of the operations in time
e.g., R1(a), R2(a), W2(a), W1(a)
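For the code sketches later in this section, one convenient (made-up)
encoding of a schedule is a time-ordered list of (transaction,
operation, object) triples:

    # The example schedule above, encoded as data: R1(a), R2(a), W2(a), W1(a)
    schedule = [(1, "R", "a"), (2, "R", "a"), (2, "W", "a"), (1, "W", "a")]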
One reasonable definition of correct order of operation: serial schedules
rough defn: those schedules that have no interleaving of operations across
xactions
note: in this definition, any serial schedule is correct
-- order of transactions isn't guaranteed in any way!
What we'd like
to get the effect of serial schedules, but still get lots of concurrency
serializable schedules are those schedules that are "equivalent"
to serial schedules
definition of equivalence: produces the same final state as a serial schedule
What could mess up serializability?
conflicts: two transactions share a data item, and at least one
writes it. In time, we have RW (unrepeatable read), WR (dirty read),
and WW (lost write) conflicts.
These conflicts can be OK if you're careful!
Can draw a serializability graph (directed): nodes are transactions, and
there is an edge from Ti to Tj whenever an operation of Ti conflicts with
a later operation of Tj.
Definition: 2 schedules are conflict equivalent if they have the
same actions, and each pair of conflicting actions is ordered the same
way.
Definition: a schedule is conflict serializable if it is conflict
equivalent to a serial schedule.
Note: some serializable schedules are NOT conflict serializable!
Theorem: A schedule is conflict serializable if and only if its
serializability graph is acyclic.
Do you see the proof?
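To make the theorem concrete, here is a hedged sketch (stdlib only,
reusing the schedule encoding above) that builds the serializability
graph of a schedule and tests it for cycles:

    # Sketch: build the serializability (precedence) graph and test for
    # cycles. A conflict = same object, different xactions, at least one
    # write; the edge runs from the earlier xaction to the later one.
    from itertools import combinations

    def precedence_graph(schedule):
        edges = set()
        for (t1, op1, o1), (t2, op2, o2) in combinations(schedule, 2):
            if t1 != t2 and o1 == o2 and "W" in (op1, op2):
                edges.add((t1, t2))      # t1's operation came first in time
        return edges

    def is_acyclic(edges):
        # repeatedly strip nodes with no incoming edge; if we get stuck
        # with nodes remaining, there is a cycle
        nodes = {n for e in edges for n in e}
        while nodes:
            roots = {n for n in nodes if not any(d == n for _, d in edges)}
            if not roots:
                return False
            nodes -= roots
            edges = {(s, d) for s, d in edges if s not in roots}
        return True

    schedule = [(1, "R", "a"), (2, "R", "a"), (2, "W", "a"), (1, "W", "a")]
    g = precedence_graph(schedule)       # {(1, 2), (2, 1)} -- a cycle!
    print(is_acyclic(g))                 # False: not conflict serializable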
Now we have a definition, and a way to be sure we're serializable.
We need a mechanism to enforce serializability.
Two-Phase Locking (2PL): based on 3 rules
before reading an object, a transaction must get a shared lock on it
before writing an object, a xaction must get an exclusive lock on it
once a xaction releases one lock, it cannot request any more locks -- a
lock-growing phase followed by a lock-shrinking phase (hence the name 2PL)
Theorem: 2PL ensures that the precedence graph generated by transactions
will be acyclic -- so serializability is enforced.
Can you see the proof here?
Strict 2PL is a variant of 2PL in which a xaction holds all its locks
until it commits or aborts, and then drops them all at once (a toy
lock-table sketch follows below).
has nice features: avoids "cascading aborts", and guarantees "recoverable"
schedules.
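A minimal sketch of a Strict 2PL lock table (assumptions: single
thread, conflicting requests raise instead of blocking, and no deadlock
handling -- a real lock manager queues waiters):

    # Toy Strict 2PL lock manager. "S" = shared, "X" = exclusive.
    class LockManager:
        def __init__(self):
            self.locks = {}              # obj -> (mode, set of holders)

        def acquire(self, txn, obj, mode):
            held_mode, holders = self.locks.get(obj, (None, set()))
            if not holders:
                self.locks[obj] = (mode, {txn})
            elif holders == {txn}:
                # lone holder may upgrade; keep the stronger mode
                new_mode = "X" if "X" in (mode, held_mode) else "S"
                self.locks[obj] = (new_mode, holders)
            elif mode == "S" and held_mode == "S":
                holders.add(txn)         # shared locks are compatible
            else:
                raise RuntimeError(f"T{txn} must wait for {obj}")

        def release_all(self, txn):
            # Strict 2PL: called only at commit/abort, so every lock is
            # held to the end and then dropped all at once.
            for obj in list(self.locks):
                mode, holders = self.locks[obj]
                holders.discard(txn)
                if not holders:
                    del self.locks[obj]

    lm = LockManager()
    lm.acquire(1, "a", "S")              # T1 reads a
    lm.acquire(2, "a", "S")              # T2 may read a too
    # lm.acquire(2, "a", "X")            # would raise: T2 must wait for T1
    lm.release_all(1)                    # T1 commits, dropping all locks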
Recovery
Just you wait! We will dig (deep) into this subject in subsequent
reading on ARIES.