In this course we will study several dimensions of the reliability problem, starting with traditional fault-tolerance techniques and including the most recent research results in the area. The emphasis will be on the principles used to achieve high reliability, including algorithms, fault models, and techniques for reasoning about the behavior of distributed systems.
This course will be a project-oriented research seminar with scheduled speakers, substantial readings, and in-depth discussions. It will meet twice a week. The Tuesday meeting (3:30-5:00 PM, 380 Soda Hall) will involve lecture presentations, discussions of readings, and project brainstorming. The Thursday meeting (4:30-5:30 PM, 380 Soda Hall) will follow the Systems Seminar (CS298-1, 3:30-4:30 PM, 306 Soda Hall) and will often involve in-depth discussions with the weekly speaker or further explorations of the topics at hand.
Topics include: reliable communication, reasoning about concurrency and synchronization, transaction models, Byzantine faults, replication, coding-based replication, and programming techniques for reliability.