Dissertation Defense

Delivering Affordable Fault-tolerance to Commodity Computer Systems

Shuguang Feng

To meet an insatiable consumer demand for greater performance at lower power, silicon technology has scaled to unprecedented dimensions. This aggressive scaling has provided designers with an ever-increasing budget of cheaper and faster transistors. Unfortunately, this trend has also been accompanied by a decline in individual device reliability as transistors have become increasingly susceptible to a host of threats.

With each new technology generation, the challenges associated with process variation, wearout, and transient faults gain greater prominence. We are quickly approaching a new era where fault-tolerance is becoming a first-order design constraint, no longer a luxury reserved exclusively for high-reliability, mission-critical domains. Even commodity microprocessors used in mainstream computing will require protection.

However, just as the reliability needs of NASA and Apple differ dramatically, so does their ability to absorb the costs necessary to ensure fault-tolerance. Viable solutions targeting commodity systems must not only recognize this fact, but must embrace it. Simply stripping down techniques developed for enterprise servers may not result in the most appropriate designs for your laptop or cellphone. The best solutions will exploit the relaxed reliability constraints of commodity systems, judiciously sacrificing a small degree of fault tolerance to achieve far greater reductions in overhead costs.

This thesis proposes a collection of works that can be selectively mixed and matched to assemble reliability solutions tailor-fit for the commodity systems community. Although the works presented address a variety of issues, from wearout to transient faults and from prevention to detection, they were all motivated by the same observation: that much of the overhead cost associated with conventional fault tolerance mechanisms is spent in pursuit of the last few “nines” of reliability. This conclusion gave rise to the philosophy permeating the chapters of this work: that summarily dismissing techniques that cannot supply mission-critical fault tolerance is no longer acceptable. In presenting concrete solutions to a few of the more interesting challenges, namely proactive wear-leveling and software-only fault detection and recovery, we also establish fundamental principles that can be applied more broadly to formulate a comprehensive reliability strategy.

Sponsored by

Scott Mahlke