CSE researchers present five papers at ISCA 2022

Seventeen U-M researchers proposed a variety of techniques to speed up complex graph algorithms, encrypted cloud computing, memory-intensive matrix operations, and more.

Five papers authored or co-authored by CSE researchers at the University of Michigan were accepted to appear at the 2022 ACM/IEEE International Symposium on Computer Architecture (ISCA), the world’s leading conference in computer architecture. The 17 U-M researchers proposed architectures and techniques to speed up complex graph algorithms, encrypted cloud computing, and memory-intensive matrix operations. They also discovered and mitigated a new memory vulnerability and designed a new branch target buffer replacement technique to speed up data center applications.

Three researchers also chaired topic sessions throughout the event: Prof. Ron Dreslinski chaired “Embedded Systems and HW Synthesis,” Prof. Lingjia Tang chaired the conference’s third session on learning, and research fellow Yiping Kang chaired “Applications and Algorithms.”

Learn more about the projects:

CraterLake: A Hardware Accelerator for Efficient Unbounded Computation on Encrypted Data

Nikola Samardzic, Axel Feldmann, Aleksandar Krastev (Massachusetts Institute of Technology); Nathan Manohar (IBM T.J. Watson); Nicholas Genise (SRI International); Srinivas Devadas (Massachusetts Institute of Technology); Karim Eldefrawy (SRI International); Chris Peikert (University of Michigan); Daniel Sanchez (Massachusetts Institute of Technology)

Fully Homomorphic Encryption (FHE) enables offloading computation to untrusted servers with cryptographic privacy. Despite its attractive security, FHE is not yet widely adopted due to its prohibitive overheads, about 10,000X over unencrypted computation. The researchers presented CraterLake, the first FHE accelerator that enables FHE programs of unbounded size. The team evaluated CraterLake on deep FHE programs, including deep neural networks like ResNet and LSTMs, where prior work takes minutes to hours per inference on a CPU. CraterLake outperforms a CPU by a geometric mean of 4,600X and the best prior FHE accelerator by 11.2X under similar area and power budgets.
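
For readers unfamiliar with the geometric mean ("gmean") used to summarize speedups like the one above, here is a minimal Python sketch. The speedup values are made up for illustration and are not taken from the CraterLake paper; the function name is likewise an assumption.

```python
import math

def geometric_mean(speedups):
    """Return the geometric mean of a list of positive speedup factors."""
    if not speedups or any(s <= 0 for s in speedups):
        raise ValueError("speedups must be a non-empty list of positive numbers")
    # Average the logarithms, then exponentiate: exp(mean(log(x_i))).
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

# Hypothetical per-benchmark speedups, NOT measurements from the paper.
speedups = [100.0, 10_000.0]

print(geometric_mean(speedups))        # approximately 1000
print(sum(speedups) / len(speedups))   # arithmetic mean: 5050.0
```

The geometric mean is commonly preferred for ratios such as speedups because a single very large value pulls it up far less than it pulls up the arithmetic mean.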

MeNDA: A Near-Memory Multi-way Merge Solution for Sparse Transposition and Dataflows

Siying Feng, Xin He, Kuan-Yu Chen (University of Michigan); Liu Ke, Xuan Zhang (Washington University in St. Louis); David Blaauw, Trevor Mudge, Ronald Dreslinski (University of Michigan)

Near-memory processing has been extensively studied to optimize memory-intensive workloads. However, no currently proposed design addresses sparse matrix transposition, an important building block in sparse linear algebra applications. The researchers propose MeNDA, a scalable near-DRAM multi-way merge accelerator that eliminates the off-chip memory interface bottleneck and exposes the high internal memory bandwidth to improve performance and reduce energy consumption for sparse matrix transposition. Compared to two state-of-the-art implementations of sparse matrix transposition on a CPU and a sparse library on a GPU, MeNDA achieves speedups of 19.1X, 12.0X, and 7.7X, respectively. MeNDA also shows an efficiency gain of 3.8X over a recent sparse matrix-vector multiplication accelerator integrated with HBM.
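
To illustrate why sparse matrix transposition can be treated as a multi-way merge, the following Python sketch transposes a matrix in compressed sparse row (CSR) form by merging its rows. It is a software illustration only, not MeNDA's hardware design; the function name and the CSR array names (indptr, indices, data) are assumptions, and column indices are assumed to be sorted within each row.

```python
import heapq

def transpose_csr(indptr, indices, data, n_cols):
    """Transpose a CSR matrix by multi-way merging its rows.

    Each row is a stream of (col, row, value) tuples already sorted by
    column. Merging all streams yields the entries in column-major order,
    which is exactly the CSR layout of the transpose.
    """
    n_rows = len(indptr) - 1

    def row_stream(r):
        for k in range(indptr[r], indptr[r + 1]):
            yield (indices[k], r, data[k])

    merged = heapq.merge(*(row_stream(r) for r in range(n_rows)))

    t_indptr = [0] * (n_cols + 1)
    t_indices, t_data = [], []
    for col, row, val in merged:
        t_indptr[col + 1] += 1      # count entries per column
        t_indices.append(row)
        t_data.append(val)
    for c in range(n_cols):         # prefix sum turns counts into offsets
        t_indptr[c + 1] += t_indptr[c]
    return t_indptr, t_indices, t_data

# Example matrix:      Its transpose:
# [[1, 0, 2],          [[1, 0, 4],
#  [0, 3, 0],           [0, 3, 0],
#  [4, 0, 5]]           [2, 0, 5]]
t = transpose_csr([0, 2, 3, 5], [0, 2, 1, 0, 2], [1, 2, 3, 4, 5], n_cols=3)
print(t)  # ([0, 2, 3, 5], [0, 2, 1, 0, 2], [1, 4, 3, 2, 5])
```

The merge reads every input row sequentially and performs little arithmetic, so the work is dominated by memory traffic, which is the kind of workload near-memory designs target.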

MOESI-prime: Preventing Coherence-Induced Hammering in Commodity Workloads

Kevin Loughlin (University of Michigan); Stefan Saroiu, Alec Wolman (Microsoft); Yatin A. Manerkar, Baris Kasikci (University of Michigan)

Prior work shows that Rowhammer attacks, which flip bits in DRAM via frequent activations of the same row(s), are viable. Adversaries typically mount these attacks via instruction sequences that are carefully crafted to bypass CPU caches. However, the researchers have discovered a novel form of hammering that they refer to as coherence-induced hammering, caused by Intel’s implementations of cache coherent non-uniform memory access (ccNUMA) protocols. The team shows that this hammering occurs in commodity benchmarks on a major cloud provider’s production hardware, the first hammering found to be generated by non-malicious code. To address this vulnerability, they introduce MOESI-prime, a ccNUMA coherence protocol that mitigates coherence-induced hammering while retaining Intel’s state-of-the-art scalability.

NDMiner: Accelerating Graph Pattern Mining Using Near Data Processing

Nishil Talati, Haojie Ye, Yichen Yang, Leul Wuletaw Belayneh, Kuan-Yu Chen, David Blaauw, Trevor Mudge, Ronald Dreslinski (University of Michigan)

Graph Pattern Mining (GPM) algorithms are used to mine structural patterns in graphs. The performance of GPM workloads is bottlenecked by control flow and memory stalls, due to data-dependent branches used in the set intersection and difference operations that dominate the execution time. The researchers developed a new Near Data Processing (NDP) architecture called NDMiner to address four inefficiencies they identified in an initial analysis of GPM workloads. NDMiner outperforms software and hardware baselines by 6.4X and 2.5X on average, respectively, while incurring a negligible area overhead on CPU and DRAM.
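
The set operations mentioned above can be sketched in a few lines of Python. This is a generic textbook formulation over sorted neighbor lists, not code from NDMiner, and the function names are assumptions. Note that the branch taken in each loop iteration depends on the values being compared, which is what makes these loops hard for a processor to predict.

```python
def intersect_sorted(a, b):
    """Return the elements common to two sorted lists of vertex IDs."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:        # data-dependent branch
            i += 1
        elif a[i] > b[j]:      # data-dependent branch
            j += 1
        else:
            out.append(a[i])
            i += 1
            j += 1
    return out

def difference_sorted(a, b):
    """Return the elements of sorted list a that are not in sorted list b."""
    i = j = 0
    out = []
    while i < len(a):
        if j == len(b) or a[i] < b[j]:
            out.append(a[i])
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            i += 1
            j += 1
    return out

# Neighbor lists of two hypothetical vertices.
print(intersect_sorted([1, 3, 5, 7], [3, 4, 5, 8]))   # [3, 5]
print(difference_sorted([1, 3, 5, 7], [3, 4, 5, 8]))  # [1, 7]
```

In a pattern such as triangle counting, intersecting the neighbor lists of two adjacent vertices gives the vertices that close a triangle with them, so loops like these run many times per graph.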

Thermometer: Profile-Guided BTB Replacement for Data Center Applications

Shixin Song, Tanvir Ahmed Khan, Sara Mahdizadeh Shahri (University of Michigan); Akshitha Sriraman (Carnegie Mellon University / Google); Niranjan K Soundararajan, Sreenivas Subramoney (Intel Labs); Daniel A. Jiménez (Texas A&M University); Heiner Litz (University of California, Santa Cruz); Baris Kasikci (University of Michigan)

Modern processors employ a decoupled frontend with Fetch Directed Instruction Prefetching (FDIP) to avoid frontend stalls in data center applications. However, the large branch footprint of data center applications precipitates frequent Branch Target Buffer (BTB) misses that prohibit FDIP from eliminating more than 40% of all frontend stalls. The researchers present Thermometer, a novel BTB replacement technique that captures holistic branch behavior via a profile-guided analysis. Based on the collected profile, Thermometer generates useful BTB replacement hints that the underlying hardware can leverage. Evaluated with 13 widely used data center applications, Thermometer provides an average speedup of 8.7% (0.4%-64.9%) while outperforming the state-of-the-art BTB replacement techniques by 5.6X. The researchers also demonstrate that Thermometer achieves a performance speedup that is, on average, 83.6% of the speedup achieved by the optimal BTB replacement policy.
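
To give a rough sense of how replacement hints might help, here is a toy Python simulation of a tiny, fully associative BTB. It is not Thermometer's actual mechanism: the hint encoding (a per-branch "hotness" number), the function name, and the trace are all hypothetical. With no hints the buffer falls back to least-recently-used (LRU) eviction; with hints it evicts the entry with the lowest hint, breaking ties by LRU.

```python
from collections import OrderedDict

def simulate_btb(trace, capacity, hints=None):
    """Count BTB hits for a sequence of branch addresses.

    hints: optional dict mapping branch address -> hotness (higher = keep).
    Branches missing from hints are treated as hotness 0.
    """
    btb = OrderedDict()   # keys ordered from least to most recently used
    hits = 0
    for pc in trace:
        if pc in btb:
            hits += 1
            btb.move_to_end(pc)          # mark as most recently used
            continue
        if len(btb) >= capacity:
            if hints is None:
                victim = next(iter(btb))  # plain LRU
            else:
                # min() returns the first minimum, so ties go to the LRU entry.
                victim = min(btb, key=lambda p: hints.get(p, 0))
            del btb[victim]
        btb[pc] = True
    return hits

# Hypothetical trace: three branches cycling through a 2-entry BTB.
A, B, C = 0x400, 0x480, 0x500
trace = [A, B, C, A, B, C]

print(simulate_btb(trace, capacity=2))                # 0 (LRU thrashes)
print(simulate_btb(trace, capacity=2, hints={A: 1}))  # 1 (A is protected)
```

The example shows only the general idea that knowledge gathered offline can keep a useful entry from being evicted by a recency-based policy.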
