Computer Engineering Seminar

Enabling dedicated single-cycle connections over a shared Network-on-Chip

Tushar Krishna, MIT

In the multicore era, scaling to hundreds or thousands of cores will only be possible if the interconnect between the cores does not become a performance or power bottleneck. Typical on-chip network designs, including commercial and research prototypes, use complex multi-stage router pipelines at each hop, adding delay and energy to every message. Conventional wisdom therefore holds that communication is expensive and that scalability is only possible if on-chip traversals are kept to a minimum.

In this talk, I will challenge this conventional wisdom. I will present network-on-chip (NoC) designs that achieve near-single-cycle traversals across the chip for both unicast and collective (1-to-Many and Many-to-1) communication flows, approaching the performance of an "ideal" but impractical all-to-all connected network. This reverses the trade-offs typically associated with local vs. remote cache access latencies, and with broadcast vs. directory-based coherence protocols.

The focus of my talk will be SMART*, a technique that enables messages to traverse multiple hops, potentially all the way from the source to the destination, within a single cycle over a NoC with shared links. SMART leverages repeated wires in the datapath, which can carry a signal 10+ mm in one clock cycle at GHz frequencies. I will present a network flow-control technique that allows messages to dynamically reserve multiple links (including turns) within one cycle and traverse them in the next cycle. On a 64-core chip, SMART reduces average network latency by 5-8X across traffic patterns compared to a state-of-the-art network with single-cycle routers at every hop. This translates to a 27% and 52% reduction in full-system runtime for private and shared L2 designs, respectively, within 12% of the reduction achieved by an ideal contention-free, all-to-all single-cycle network. If time permits, I will also present SMART FanOut (SFO) and SMART FanIn (SFI), which demonstrate near-single-cycle traversals over multicast (1-to-Many) and reduction (Many-to-1) trees, respectively, on a NoC. Going forward, the ideas in SMART can pave the way for locality-oblivious shared-memory design.
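To make the multi-hop idea concrete, here is a minimal zero-load latency sketch. It is not the SMART implementation; every parameter is a hypothetical assumption: an 8x8 mesh (64 cores), a 1 mm tile pitch, a 10 mm per-cycle wire reach (so at most HPC_MAX = 10 hops per cycle), a baseline that spends one router cycle plus one link cycle per hop, and a SMART-like network that spends one cycle reserving a multi-hop segment and one cycle traversing it. Contention is ignored, and all names (HPC_MAX, baseline_latency, smart_latency) are illustrative only.

```python
import math
from itertools import product

# Hypothetical parameters (assumptions, not figures from the talk)
MESH_DIM = 8                         # 8x8 mesh -> 64 cores
TILE_MM = 1.0                        # assumed tile pitch in mm
REACH_MM = 10.0                      # assumed repeated-wire reach per cycle
HPC_MAX = int(REACH_MM // TILE_MM)   # max hops a flit can cover per cycle


def manhattan(src, dst):
    """Minimal hop count between two mesh nodes (dimension-ordered route)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def baseline_latency(hops):
    """Assumed baseline: 1 cycle in a single-cycle router + 1 link cycle per hop."""
    return 2 * hops


def smart_latency(hops):
    """Assumed SMART-like model: each segment of up to HPC_MAX hops costs
    1 cycle to reserve the links and 1 cycle to traverse them."""
    segments = math.ceil(hops / HPC_MAX)
    return 2 * segments


def average(latency_fn):
    """Average zero-load latency under uniform random traffic (no self-traffic)."""
    nodes = list(product(range(MESH_DIM), repeat=2))
    pairs = [(s, d) for s in nodes for d in nodes if s != d]
    return sum(latency_fn(manhattan(s, d)) for s, d in pairs) / len(pairs)


if __name__ == "__main__":
    base = average(baseline_latency)
    smart = average(smart_latency)
    print(f"baseline: {base:.2f} cycles, SMART-like: {smart:.2f} cycles, "
          f"ratio: {base / smart:.1f}x")
```

Under these assumptions, any route of 10 or fewer hops completes in a single reserve-then-traverse step, so latency becomes nearly independent of distance; this toy model says nothing about contention, which the actual flow-control technique must handle.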

*Single-cycle Multi-hop Asynchronous Repeated Traversal

Tushar Krishna received a PhD in Electrical Engineering and Computer Science from MIT in 2014, where he worked with Prof. Li-Shiuan Peh. He holds an MSE from Princeton University and a BTech from IIT Delhi, both in Electrical Engineering. His research interests are in networks-on-chip for many-core systems, heterogeneous architectures, and reconfigurable computing. He is currently a researcher in the VSSAD group at Intel in Massachusetts.
