Some of the old hierarchical and network DBs had pretty clever systems
for handling reliability. BUT -- they had no formalisms to describe
the semantics clearly, and hence few lessons to transfer to other systems.
System R's RSS team, led by Jim Gray, codified the formal notions of
transactions and serializability, and System R delivered a working
(though inefficient) implementation. This work led to the 1998 Turing
Award for Gray.
Our own Christos Papadimitriou did early work formalizing theoretical results
on transactions (and has a book on the topic!)
Various companies slogged through the details of transactions over the
years, especially the complex issues of logging and recovery. In
the 80's and into the 90's, IBM published papers on ARIES, which is the
classic discussion of the dirty details of logging and recovery. Many
industry vets "already knew" this stuff and had done it in commercial systems.
There is an entire industry of Transaction Processing that exists outside
of DBs and relates to distributed applications and services. More
on this later in the semester.
Generally a field populated by industrial practitioners!
Since the early days, academic work has focused largely on esoteric issues
of extensions to the flat transaction model. Little of this has found
practical use.
Background
Transactions are a concept that allows a system to guarantee certain
semantic properties.
These guarantees must be rigorously defined so that people can build correct
systems above them.
Theory meets practice here in a nice way.
Kinds of Actions
Unprotected actions: you cannot count on these; nothing is guaranteed.
It is the responsibility of higher-level actions to check their effects
if they need to be sure.
Protected actions: actions enclosed inside of transactions.
Real actions: actions that are visible outside the computer.
Print to screen, send a web page, output money, drill hole, fire missile,
etc.
getting these right is very tricky!
easier if idempotent (a la "drill hole")
doable but tricky if you can check the state of the real world somehow
(see the testable-state sketch below).
we'll return to this later when we talk about transactional networking.
otherwise impossible to handle!
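To make the testable-state idea concrete, here is a minimal sketch (all
names are invented for illustration, and the real world is simulated by
a dict; in practice the check would be a sensor read): wrap the real
action in a check of the world's state, so a retry after a crash
repeats nothing.

    # Hypothetical sketch: making a real action safe to retry by
    # testing the state of the real world first. The "world" is
    # simulated here; in practice you'd query a sensor.
    world = {}

    def hole_exists(x, y):
        return world.get((x, y), False)

    def drill_hole(x, y):
        world[(x, y)] = True          # the real, externally visible action

    def ensure_hole(x, y):
        # Idempotent wrapper: safe to re-run after a crash, because it
        # checks the real world before acting.
        if not hole_exists(x, y):
            drill_hole(x, y)

    ensure_hole(1, 2)
    ensure_hole(1, 2)                 # the retry is a no-op: idempotence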
A.C.I.D.
A transaction should enjoy the following guarantees:
Atomicity: the "all or nothing" property. Programmer needn't
worry about partial states persisting.
Consistency: the database should start out "consistent", and at the
end of a transaction remain "consistent". What counts as "consistent"
is up to the database administrator to declare to the system; other
notions of consistency must be handled by the application.
Gray & Reuter are very confusing on this point.
Isolation: a transaction should not see the effects of other uncommitted
transactions.
Durability: once committed, the transaction's effects should not
disappear (though they may be overwritten by subsequent committed
transactions).
A and D are guaranteed by recovery (usually implemented
via logging). C and I are guaranteed by concurrency control
(usually implemented via locking).
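To see atomicity in action, a small sketch using Python's built-in
sqlite3 module (any transactional engine behaves similarly): a transfer
that fails midway leaves no partial state behind.

    # Sketch: atomicity via Python's sqlite3. The failed transfer rolls
    # back completely; no partial state persists.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE acct (name TEXT PRIMARY KEY, bal INT)")
    conn.execute("INSERT INTO acct VALUES ('a', 100), ('b', 0)")
    conn.commit()

    try:
        with conn:   # one transaction: commit on success, rollback on error
            conn.execute("UPDATE acct SET bal = bal - 50 WHERE name = 'a'")
            # simulate a crash before the matching credit to 'b' happens:
            raise RuntimeError("failure mid-transaction")
    except RuntimeError:
        pass

    # The debit above was rolled back along with the failed transaction:
    print(conn.execute("SELECT bal FROM acct WHERE name = 'a'").fetchone())
    # -> (100,)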
Concurrency Control & Serializability
CS186 material you might not know.
We want to allow multiple transactions to operate concurrently. (Why?)
Need a crisp definition of acceptable and unacceptable concurrency
build that on a notion of acceptable orders of operation
DB only understands simple data-oriented operations: read, write, begin,
commit, abort
order of operations captured in a transaction schedule:
transactions (Ti) and data objects (a...z)
all operations of a transaction occur in the order specified by the transaction
schedule captures the interleaving of the operations in time
e.g., R1(a), R2(a), W2(a), W1(a)
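For the code sketches later in this section, one convenient (made-up)
encoding of a schedule is a time-ordered list of (transaction,
operation, object) triples:

    # The example schedule above, encoded as data: R1(a), R2(a), W2(a), W1(a)
    schedule = [(1, "R", "a"), (2, "R", "a"), (2, "W", "a"), (1, "W", "a")]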
One reasonable definition of correct order of operation: serial schedules
rough defn: those schedules that have no interleaving of operations across
xactions
note: in this definition, any serial schedule is correct
-- order of transactions isn't guaranteed in any way!
What we'd like
to get the effect of serial schedules, but still get lots of concurrency
serializable schedules are those schedules that are "equivalent"
to serial schedules
definition of equivalence: produces the same final state as a serial schedule
What could mess up serializability?
conflicts: two transactions share a data item, and at least one
writes it. In time, we have RW (unrepeatable read), WR (dirty read),
and WW (lost write) conflicts.
These conflicts can be OK if you're careful!
Can draw a serializability graph (directed): nodes are transactions, and
there is an edge from Ti to Tj whenever an operation of Ti conflicts with
a later operation of Tj.
Definition: 2 schedules are conflict equivalent if they have the
same actions, and each pair of conflicting actions is ordered the same
way.
Definition: a schedule is conflict serializable if it is conflict
equivalent to a serial schedule.
Note: some serializable schedules are NOT conflict serializable!
Theorem: A schedule is conflict serializable if and only if its
serializability graph is acyclic.
Do you see the proof?
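To make the theorem concrete, here is a hedged sketch (stdlib only,
reusing the schedule encoding above) that builds the serializability
graph of a schedule and tests it for cycles:

    # Sketch: build the serializability (precedence) graph and test for
    # cycles. A conflict = same object, different xactions, at least one
    # write; the edge runs from the earlier xaction to the later one.
    from itertools import combinations

    def precedence_graph(schedule):
        edges = set()
        for (t1, op1, o1), (t2, op2, o2) in combinations(schedule, 2):
            if t1 != t2 and o1 == o2 and "W" in (op1, op2):
                edges.add((t1, t2))      # t1's operation came first in time
        return edges

    def is_acyclic(edges):
        # repeatedly strip nodes with no incoming edge; if we get stuck
        # with nodes remaining, there is a cycle
        nodes = {n for e in edges for n in e}
        while nodes:
            roots = {n for n in nodes if not any(d == n for _, d in edges)}
            if not roots:
                return False
            nodes -= roots
            edges = {(s, d) for s, d in edges if s not in roots}
        return True

    schedule = [(1, "R", "a"), (2, "R", "a"), (2, "W", "a"), (1, "W", "a")]
    g = precedence_graph(schedule)       # {(1, 2), (2, 1)} -- a cycle!
    print(is_acyclic(g))                 # False: not conflict serializable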
Now we have a definition, and a way to be sure we're serializable.
We need a mechanism to enforce serializability.
Two-Phase Locking (2PL): based on 3 rules
before reading an object, a transaction must get a shared lock on it
before writing an object, a xaction must get an exclusive lock on it
once a xaction releases one lock, it cannot request any more locks -- a
lock-growing phase followed by a lock-shrinking phase (hence the name 2PL)
Theorem: 2PL ensures that the precedence graph generated by transactions
will be acyclic -- so serializability is enforced.
Can you see the proof here?
Strict 2PL is a variant of 2PL in which a xaction holds all its locks
until it commits or aborts, and then drops them all at once (a toy
lock-table sketch follows below).
has nice features: avoids "cascading aborts", and guarantees "recoverable"
schedules.
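A minimal sketch of a Strict 2PL lock table (assumptions: single
thread, conflicting requests raise instead of blocking, and no deadlock
handling -- a real lock manager queues waiters):

    # Toy Strict 2PL lock manager. "S" = shared, "X" = exclusive.
    class LockManager:
        def __init__(self):
            self.locks = {}              # obj -> (mode, set of holders)

        def acquire(self, txn, obj, mode):
            held_mode, holders = self.locks.get(obj, (None, set()))
            if not holders:
                self.locks[obj] = (mode, {txn})
            elif holders == {txn}:
                # lone holder may upgrade; keep the stronger mode
                new_mode = "X" if "X" in (mode, held_mode) else "S"
                self.locks[obj] = (new_mode, holders)
            elif mode == "S" and held_mode == "S":
                holders.add(txn)         # shared locks are compatible
            else:
                raise RuntimeError(f"T{txn} must wait for {obj}")

        def release_all(self, txn):
            # Strict 2PL: called only at commit/abort, so every lock is
            # held to the end and then dropped all at once.
            for obj in list(self.locks):
                mode, holders = self.locks[obj]
                holders.discard(txn)
                if not holders:
                    del self.locks[obj]

    lm = LockManager()
    lm.acquire(1, "a", "S")              # T1 reads a
    lm.acquire(2, "a", "S")              # T2 may read a too
    # lm.acquire(2, "a", "X")            # would raise: T2 must wait for T1
    lm.release_all(1)                    # T1 commits, dropping all locks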
Recovery
Just you wait! We will dig (deep) into this subject in subsequent
reading on ARIES.