Dissertation Defense

Overcoming Hard-Faults in High-Performance Microprocessors

Amin Ansari

As device density grows, each transistor gets smaller and more fragile, leading to an overall higher susceptibility to hard-faults. These hard-faults manifest as permanent silicon defects and impact the manufacturing yield, performance, and lifetime of semiconductor devices. We first present a flexible cache architecture, ZerehCache, to protect regular SRAM structures against high failure rates. ZerehCache virtually reorganizes the cache data array using a permutation network to provide higher degrees of freedom for spare allocation. To study the impact of fault patterns on the redundancy requirements in a cache, we propose a methodology to model the collision patterns in caches as a graph problem. Given this model, a graph coloring scheme is employed to minimize the amount of additional redundancy required for protecting the cache.
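
The graph formulation above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the dissertation: cache lines are nodes, two lines "collide" when they have a fault at the same word offset, and lines that receive the same color are taken to be able to share one spare line. The names build_collision_graph and greedy_coloring and the fault map are hypothetical, and a simple greedy heuristic stands in for whatever coloring scheme ZerehCache actually uses.

```python
from itertools import combinations


def build_collision_graph(faults):
    """Build an undirected collision graph.

    faults: dict mapping a cache line id to the set of faulty word offsets
    in that line (hypothetical fault-map format).
    Two lines are connected when they share at least one faulty offset.
    """
    graph = {line: set() for line in faults}
    for a, b in combinations(faults, 2):
        if faults[a] & faults[b]:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def greedy_coloring(graph):
    """Color nodes greedily, highest degree first.

    Returns a dict node -> color index. Adjacent nodes never share a color.
    This is a heuristic, so the number of colors is not guaranteed minimal.
    """
    colors = {}
    for node in sorted(graph, key=lambda n: len(graph[n]), reverse=True):
        used = {colors[n] for n in graph[node] if n in colors}
        color = 0
        while color in used:
            color += 1
        colors[node] = color
    return colors


# Hypothetical fault map: line id -> faulty word offsets.
faults = {0: {1}, 1: {1, 3}, 2: {3}, 3: {0}}
graph = build_collision_graph(faults)
colors = greedy_coloring(graph)

# Under the stated assumption, one spare line is needed per color class.
spares_needed = max(colors.values()) + 1
```

In this toy fault map, lines 0 and 1 collide at offset 1 and lines 1 and 2 collide at offset 3, so at least two color classes are required; fewer collisions would mean fewer spares under the same assumption.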

Power efficiency is another key challenge in the design of modern microprocessors. Growing power consumption affects device lifetime and the cost of thermal packaging, cooling, and electricity. Dynamic voltage scaling is commonly used to reduce power consumption. However, the supply voltage cannot be reduced below a certain threshold without addressing SRAM failures. To enable operation below this threshold, a highly reconfigurable cache design, Archipelago, is presented. Since low-power operation is optional, Archipelago resizes the cache to provide spare elements. Furthermore, to maximize the effective cache capacity in low-power mode, a near-optimal minimum clique covering configuration algorithm is introduced.
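
As a rough illustration of clique covering in this setting, the sketch below groups cache lines so that every pair of lines within a group is compatible. The modeling is an assumption, not the Archipelago algorithm itself: two lines are treated as compatible when their faulty word offsets do not overlap, each group is a clique in the resulting compatibility graph, and fewer groups is taken to mean less capacity given up for spares. The function greedy_clique_cover and the fault map are hypothetical, and the greedy pass is a simple heuristic with no optimality guarantee.

```python
def greedy_clique_cover(nodes, compatible):
    """Partition nodes into cliques of the compatibility relation.

    nodes: iterable of node ids, visited in the given order.
    compatible: function (a, b) -> bool, assumed symmetric.
    Each node joins the first existing clique whose members are all
    compatible with it; otherwise it starts a new clique.
    """
    cliques = []
    for node in nodes:
        for clique in cliques:
            if all(compatible(node, member) for member in clique):
                clique.append(node)
                break
        else:
            cliques.append([node])
    return cliques


# Hypothetical low-voltage fault map: line id -> faulty word offsets.
fault_map = {0: {1}, 1: {1, 3}, 2: {3}, 3: {0}, 4: set()}


def compatible(a, b):
    """Two lines are compatible when no word offset is faulty in both."""
    return not (fault_map[a] & fault_map[b])


cliques = greedy_clique_cover(sorted(fault_map), compatible)
```

Covering the compatibility graph with cliques is equivalent to coloring its complement (the collision graph), which is why the two formulations lead to similar heuristics.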

With proper solutions in place for caches, a robust and heterogeneous core coupling execution scheme, Necromancer, is presented to protect the general core area against hard-faults. Although a faulty core cannot be trusted, we observe that for most defects, execution traces on a defective core coarsely resemble those of fault-free executions. Necromancer exploits a functionally dead core to improve system throughput by supplying hints regarding high-level program behavior. We partition the cores into multiple groups. Each group shares a lightweight core that can be substantially accelerated. However, due to the presence of defects, a perfect data or instruction stream cannot be provided by the dead core. This necessitates employing a low-cost recovery mechanism and generic hints that are more resilient to local abnormalities.

Sponsored by

S. Mahlke