CSE authors present four papers at MICRO 2022

Thirteen CSE co-authors had work accepted at the conference, including one Best Paper nominee.

Four papers co-authored by thirteen CSE researchers were presented at the 2022 IEEE/ACM International Symposium on Microarchitecture (MICRO), the premier research forum for new microarchitecture techniques that advance computing and communication systems.

One of the submissions, co-authored by PhD student Tanvir Ahmed Khan, alumnus Muhammed Ugur, and assistant professor Baris Kasikci, was also nominated for the conference’s Best Paper Award.

The four papers covered eliminating branch mispredictions in data center applications, a novel accelerator architecture and programming model for mining temporal motifs efficiently, the first online code layout optimization system for unmanaged code, and multi-layer in-memory processing.

Whisper: Profile-Guided Branch Misprediction Elimination for Data Center Applications

Best Paper Nominee

Tanvir Ahmed Khan, Muhammed Ugur (University of Michigan); Krishnendra Nathella, Dam Sunwoo (Arm Research); Heiner Litz (University of California, Santa Cruz); Daniel A. Jiménez (Texas A&M University); Baris Kasikci (University of Michigan)

Abstract: Modern data center applications experience frequent branch mispredictions – degrading performance, increasing cost, and reducing energy efficiency in data centers. Even the state-of-the-art branch predictor, TAGE-SC-L, suffers from an average branch Mispredictions Per Kilo Instructions (branch-MPKI) of 3.0 (0.5-7.2) for these applications since their large code footprints exhaust TAGE-SC-L’s intended capacity. 

In this work, we propose Whisper, a novel profile-guided mechanism to avoid branch mispredictions. Whisper investigates the in-production profile of data center applications to identify precise program contexts that lead to branch mispredictions. Corresponding prediction hints are then inserted into code to strategically avoid those mispredictions during program execution. Whisper presents three novel profile-guided techniques: (1) hashed history correlation which efficiently encodes hard-to-predict correlations in branch history using lightweight Boolean formulas, (2) randomized formula testing which selects a locally optimal Boolean formula from a randomly selected subset of possible formulas to predict a branch, and (3) the extension of Read-Once Monotone Boolean Formulas with Implication and Converse Non-Implication to improve the branch history coverage of these formulas with minimal overhead. 

We evaluate Whisper on 12 widely-used data center applications and demonstrate that Whisper enables traditional branch predictors to achieve a speedup close to that of an ideal branch predictor. Specifically, Whisper achieves an average speedup of 2.8% (0.4%-4.6%) by reducing 16.8% (1.7%-32.4%) of branch mispredictions over TAGE-SC-L and outperforms the state-of-the-art profile-guided branch prediction mechanisms by 7.9% on average.
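The randomized formula testing idea above can be illustrated with a toy sketch. This is not Whisper's implementation: the history encoding, formula family, and correlation (the branch below is taken exactly when history bits 1 and 3 are both set) are all invented for illustration. The sketch samples random read-once monotone formulas over branch-history bits and keeps the one that best predicts the outcome.

```python
import random

# Toy branch history: each sample is (history_bits, taken_outcome).
# In Whisper, history bits come from hashed outcomes of prior branches;
# here we fabricate a correlation: taken iff bits 1 AND 3 are set.
random.seed(0)
samples = []
for _ in range(200):
    h = [random.randint(0, 1) for _ in range(8)]
    samples.append((h, h[1] & h[3]))

def make_formula(indices, op):
    """A read-once monotone Boolean formula over selected history bits."""
    if op == "and":
        return lambda h: int(all(h[i] for i in indices))
    return lambda h: int(any(h[i] for i in indices))

# Randomized formula testing: sample a random subset of candidate
# formulas and keep the locally optimal one (highest accuracy).
best, best_acc, best_desc = None, -1.0, None
for _ in range(50):
    idxs = random.sample(range(8), k=2)
    op = random.choice(["and", "or"])
    f = make_formula(idxs, op)
    acc = sum(f(h) == o for h, o in samples) / len(samples)
    if acc > best_acc:
        best, best_acc, best_desc = f, acc, (op, sorted(idxs))

print(best_desc, round(best_acc, 2))
```

The winning formula can then be compiled into a prediction hint attached to the branch; the paper's contribution is doing this search efficiently from in-production profiles.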

Mint: An Accelerator For Mining Temporal Motifs

Nishil Talati, Haojie Ye (University of Michigan); Sanketh Vedula (Technion); Kuan-Yu Chen, Yuhan Chen, Daniel Liu, Yichao Yuan, David Blaauw (University of Michigan); Alex Bronstein (Technion); Trevor Mudge, Ronald Dreslinski (University of Michigan)

Abstract: A variety of complex systems, including social and communication networks, financial markets, biology, and neuroscience, are modeled using temporal graphs that contain a set of nodes and directed, timestamped edges. Temporal motifs in temporal graphs are generalized from subgraph patterns in static graphs in that they also account for edge ordering and time duration, in addition to the graph structure. Mining temporal motifs is a fundamental problem used in several application domains. However, existing software frameworks offer suboptimal performance due to the high algorithmic complexity and irregular memory accesses of temporal motif mining.

This paper presents Mint—a novel accelerator architecture and a programming model for mining temporal motifs efficiently. We first divide this workload into three fundamental tasks: search, book-keeping, and backtracking. Based on this, we propose a task-centric programming model that enables decoupled, asynchronous execution. This model unlocks massive opportunities for parallelism, and allows storing task context information on-chip. To best utilize the proposed programming model, we design a domain-specific hardware accelerator whose data path and memory subsystem are tailored to the unique workload characteristics of temporal motif mining. To further improve performance, we propose a novel optimization called search index memoization that significantly reduces memory traffic. We comprehensively compare the performance of Mint with state-of-the-art temporal motif mining software frameworks (both approximate and exact) running on both CPU and GPU, and show a 9x–2576x performance benefit.
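To make the edge-ordering and time-duration constraints concrete, here is a toy software sketch (not Mint's hardware or programming model) that counts one simple temporal motif: a two-edge wedge u→v→w whose edges occur in time order within a window delta. The graph and delta are invented for illustration.

```python
from collections import defaultdict

# Toy temporal graph: directed, timestamped edges (src, dst, t).
edges = [(0, 1, 1), (1, 2, 2), (0, 1, 4), (1, 2, 9), (2, 0, 3)]

# Motif: a 2-edge wedge u->v->w with time-ordered edges inside a
# window delta. Edge ordering and duration are what distinguish a
# temporal motif from a static subgraph pattern.
def count_wedges(edges, delta):
    out = defaultdict(list)          # book-keeping: edges grouped by source
    for u, v, t in edges:
        out[u].append((v, t))
    count = 0
    for u, v, t1 in edges:           # search: extend each edge e1 = (u, v, t1)
        for w, t2 in out[v]:         # candidate second edges leaving v
            if t1 < t2 <= t1 + delta and w != u:
                count += 1
            # a rejected candidate is simply skipped: the toy analogue
            # of backtracking to try the next extension
    return count

print(count_wedges(edges, delta=3))
```

Even this tiny example shows the irregular, data-dependent memory accesses the abstract mentions: each matched edge triggers lookups keyed by its destination node.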

OCOLOS: Online COde Layout OptimizationS

Yuxuan Zhang (University of Pennsylvania); Tanvir Ahmed Khan (University of Michigan); Gilles A. Pokam (Intel); Baris Kasikci (University of Michigan); Heiner Litz (University of California, Santa Cruz); Joseph Devietti (University of Pennsylvania)

Abstract: The processor front-end has become an increasingly important bottleneck in recent years due to growing application code footprints, particularly in data centers. First-level instruction caches and branch prediction engines have not been able to keep up with this code growth, leading to more front-end stalls and lower Instructions Per Cycle (IPC). Profile-guided optimizations performed by compilers represent a promising approach, as they rearrange code to maximize instruction cache locality and branch prediction efficiency along a relatively small number of hot code paths. However, these optimizations require continuous profiling and rebuilding of applications to ensure that the code layout matches the collected profiles. If an application’s code is frequently updated, it becomes challenging to map profiling data from a previous version onto the latest version, leading to ignored profiling data and missed optimization opportunities. 

In this paper, we propose OCOLOS, the first online code layout optimization system for unmodified applications written in unmanaged languages. OCOLOS allows profile-guided optimization to be performed on a running process, instead of being performed offline and requiring the application to be relaunched. By running online, profile data is always relevant to the current execution and always maps perfectly to the running code. OCOLOS demonstrates how to achieve robust online code replacement in complex multithreaded applications like MySQL and MongoDB, without requiring any application changes. Our experiments show that OCOLOS can accelerate MySQL by up to 1.41x, the Verilator hardware simulator by up to 2.20x, and a build of the Clang compiler by up to 1.14x.
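The core transformation OCOLOS applies online—laying out code so hot paths become contiguous—can be sketched offline in a few lines. This is a generic greedy hot-path layout over a hypothetical branch profile, not OCOLOS's runtime code-replacement machinery; block names and counts are invented.

```python
# Sketch of profile-guided block layout: chain basic blocks along their
# hottest edges so frequently taken branches become fall-throughs,
# improving instruction cache locality on hot paths.
profile = {            # hypothetical (src_block, dst_block) -> execution count
    ("A", "B"): 90, ("A", "C"): 10,
    ("B", "D"): 85, ("C", "D"): 10, ("D", "E"): 95,
}

def hot_layout(profile):
    placed, layout = set(), []
    cur = "A"                        # assumed function entry block
    while cur is not None and cur not in placed:
        layout.append(cur)
        placed.add(cur)
        succs = {d: c for (s, d), c in profile.items()
                 if s == cur and d not in placed}
        # greedily fall through to the hottest unplaced successor
        cur = max(succs, key=succs.get) if succs else None
    for s, d in profile:             # append any cold leftover blocks
        for b in (s, d):
            if b not in placed:
                layout.append(b)
                placed.add(b)
    return layout

print(hot_layout(profile))
```

OCOLOS's contribution is performing this kind of reordering on a live process—patching the running code of applications like MySQL—so the profile always matches the code actually executing.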

Multi-Layer In-Memory Processing

Daichi Fujiki (Keio University); Alireza Khadem (University of Michigan); Scott Mahlke (University of Michigan / NVIDIA Research); Reetuparna Das (University of Michigan)

Abstract: In-memory computing provides revolutionary changes to computer architecture by fusing memory and computation, allowing data-intensive computations to reduce data communications. Despite promising results of in-memory computing in each layer of the memory hierarchy, an integrated approach to a system with multiple computable memories has not been examined. This paper presents a holistic and application-driven approach to building Multi-Layer In-Memory Processing (MLIMP) systems, enabling applications with variable computation demands to reap the benefits of heterogeneous compute resources in an integrated MLIMP system. By introducing concurrent task scheduling to MLIMP, we achieve improved performance and energy efficiency for graph neural networks and multiprogramming of data-parallel applications.
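The scheduling idea—matching tasks with different compute and data demands to heterogeneous in-memory compute layers—can be illustrated with a toy greedy scheduler. The layer names, rates, and cost model below are invented for illustration and are not the paper's scheduler.

```python
# Hypothetical two-layer MLIMP-style system: one layer with fast compute
# over small data, one with slower compute but high data bandwidth.
layers = {
    "sram_pim": {"compute_rate": 8.0, "data_rate": 1.0},
    "dram_pim": {"compute_rate": 1.0, "data_rate": 8.0},
}

def schedule(tasks):
    """Greedy: place each task on the layer with the earliest finish time."""
    ready = {name: 0.0 for name in layers}     # per-layer busy-until time
    plan = []
    for ops, mbytes in tasks:                  # (compute ops, data moved)
        cost = lambda n: (ops / layers[n]["compute_rate"]
                          + mbytes / layers[n]["data_rate"])
        best = min(layers, key=lambda n: ready[n] + cost(n))
        ready[best] += cost(best)
        plan.append(best)
    return plan

# Compute-heavy tasks land on the SRAM-level layer, data-heavy tasks on
# the DRAM-level layer, and the two layers run concurrently.
print(schedule([(16, 1), (1, 16), (16, 1), (1, 16)]))
```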
