Dissertation Defense

Decompose and Conquer: Addressing Evasive Errors in Systems on Chip

Doowon Lee

Modern computer chips comprise many system components, including microprocessor cores, memory modules, on-chip networks, and accelerators. Such system-on-chip (SoC) designs are deployed in a variety of computing devices, from internet-of-things devices to smartphones, personal computers, and data centers. In this dissertation, we address evasive errors in SoC designs caused by design bugs and permanent faults. We propose to leverage the principle of decomposition to lower the complexity of the software analysis or the hardware structures involved. We first focus on microprocessor cores, presenting a lightweight bug-masking analysis that decomposes a program into individual instructions to determine whether a design bug would be masked by the program's execution. We then move to memory subsystems, where we offer an efficient memory consistency testing framework that decomposes the memory-access-sequence graph into small components based on incremental differences. We also propose a microarchitectural patching solution for memory subsystem bugs, which augments each core cluster with small, distributed programmable logic. We then address on-chip networks, proposing two routing reconfiguration algorithms. The first computes short-term routes in a distributed fashion, localized to the fault region. The second decomposes application-aware routing computation into simple routing rules to quickly find deadlock-free, application-optimized routes in a fault-ridden network. Finally, we consider general accelerator modules in SoC designs, where many accelerator interactions must be verified. We decompose such interactions into basic interaction elements, which can be reassembled into new, interesting tests.

Overall, we show that decomposing complex software algorithms and hardware structures can significantly reduce overheads: by up to three orders of magnitude in the bug-masking analysis and the application-aware routing, and by five times on average in memory-access-sequence graph checking. These overhead reductions come at the cost of some error coverage, e.g., undetected bug-masking incidents and non-patchable memory bugs.

Sponsored by

Valeria Bertacco