Course Description

Computing is heading toward a new era in which the focus shifts away from desktop machines and toward millions of small devices backed by large-scale service providers. In such a setting, distributed systems exist at many levels, from networks of tiny portable devices to servers built from commodity components. Potential applications extend beyond the traditional business and entertainment areas into fields such as medicine, transportation, and emergency response, but if computing is to become ubiquitous throughout society, it must be reliable.

In this course we will study several dimensions of the reliability problem, starting with traditional fault-tolerance techniques and including the most recent research results in the area. The emphasis will be on the principles used to achieve high reliability, including algorithms, fault models, and techniques for reasoning about the behavior of distributed systems.

This course will be a project-oriented research seminar with scheduled speakers, substantial readings, and in-depth discussions with speakers. It will meet twice a week. The Tuesday meeting (3:30-5:00 PM, 380 Soda) will involve lecture presentations, discussions of readings, and project brainstorming. The Thursday meeting will follow the Systems Seminar (CS298-1, 3:30-4:30, 306 Soda Hall) and will often involve in-depth discussions with the weekly speaker (4:30-5:30, 380 Soda Hall) or further explorations of the topics at hand.

Topics include: reliable communication, reasoning about concurrency and synchronization, transaction models, Byzantine faults, replication, coding-based replication, and programming techniques for reliability.

Class Meeting Times

Grading

Last Updated: August 26, 2000, Kathy Yelick