Dissertation Defense
Efficiently Scaling Machine Learning Systems Across Heterogeneous Resources
This event is free and open to the public.

Hybrid Event: Zoom / 3941 BBB
Abstract: In recent years, machine learning (ML), particularly deep learning and Large Language Models (LLMs), has advanced rapidly, driven by scaling laws that indicate better performance with increased model size, data, and compute. However, deploying these powerful models presents significant challenges, especially during inference, due to their massive computational demands and model sizes. These challenges are compounded by the need to operate across highly heterogeneous computing resources, spanning resource-constrained edge devices to diverse hardware within datacenters, and by emerging inference-time scaling techniques that further increase computational needs. Designing ML systems that scale efficiently in these complex environments while optimizing for critical metrics such as latency, throughput, and Quality of Experience (QoE) requires a new approach.
My research introduces a systematic approach to address these challenges: Algorithm-System Co-design. By simultaneously optimizing ML algorithms and system components, we achieve significant efficiency gains and improved scalability for ML inference across heterogeneous resources.
In this talk, I will present three core contributions:
First, OASIS, a collaborative neural-enhanced video streaming system for mobile devices. OASIS intelligently coordinates video bitrate selection, adaptive super-resolution models, and resource scheduling among nearby mobile devices. It significantly enhances the video Quality of Experience (QoE) while drastically reducing energy consumption.
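To make the coordination concrete, here is a minimal, hypothetical sketch of the per-segment decision an OASIS-style system faces: choose a bitrate, a super-resolution model, and an executing device that maximize an estimated QoE score under a per-device energy budget. All names and the toy QoE/energy models below are illustrative assumptions, not OASIS's actual interface or policy.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical illustration only: names and the QoE/energy models below are
# placeholders, not OASIS's actual interface or policy.

@dataclass(frozen=True)
class Config:
    bitrate_kbps: int      # streaming bitrate for this segment
    sr_model: str          # super-resolution model applied after download
    device: str            # nearby device that runs the SR model

def estimate_qoe(cfg: Config) -> float:
    """Toy QoE model: higher bitrate and a heavier SR model yield higher quality."""
    sr_gain = {"none": 0.0, "small": 1.0, "large": 2.0}[cfg.sr_model]
    return cfg.bitrate_kbps / 1000.0 + sr_gain

def estimate_energy(cfg: Config) -> float:
    """Toy energy model: heavier SR models cost more energy on the device."""
    return {"none": 0.1, "small": 0.5, "large": 1.2}[cfg.sr_model]

def pick_config(bitrates, sr_models, devices, energy_budget):
    """Greedy per-segment choice: best estimated QoE within each device's budget."""
    candidates = (Config(b, m, d) for b, m, d in product(bitrates, sr_models, devices))
    feasible = [c for c in candidates if estimate_energy(c) <= energy_budget[c.device]]
    return max(feasible, key=estimate_qoe, default=None)

if __name__ == "__main__":
    choice = pick_config(
        bitrates=[800, 1500, 3000],
        sr_models=["none", "small", "large"],
        devices=["phone_a", "phone_b"],
        energy_budget={"phone_a": 0.6, "phone_b": 1.5},
    )
    print(choice)
```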
Second, Cake, a system designed to optimize latency for LLM inference by efficiently balancing computational and I/O resources during Key-Value (KV) cache loading. By dynamically combining GPU computations and storage operations, Cake substantially accelerates inference, reducing Time-To-First-Token (TTFT) by 2.6× on average compared to existing methods.
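The compute/I/O balancing can be pictured with a simplified sketch in which cached KV chunks are streamed from storage from one end of the prompt while the GPU recomputes chunks from the other end, and the two workers stop where they meet. The chunking, simulated latencies, and meet-in-the-middle scheme below are illustrative assumptions rather than Cake's actual implementation.

```python
import threading
import time

# Simplified illustration of overlapping KV-cache loading with GPU recomputation.
# The timings, chunking, and meet-in-the-middle scheme are assumptions for
# illustration, not Cake's actual implementation.

NUM_CHUNKS = 16
LOAD_LATENCY_S = 0.02     # simulated storage read per chunk
COMPUTE_LATENCY_S = 0.01  # simulated GPU prefill per chunk

done = [False] * NUM_CHUNKS
lock = threading.Lock()

def claim(idx: int) -> bool:
    """Atomically claim a chunk so the loader and the computer never duplicate work."""
    with lock:
        if done[idx]:
            return False
        done[idx] = True
        return True

def loader():
    """Stream cached KV chunks from storage, front to back."""
    for i in range(NUM_CHUNKS):
        if not claim(i):
            break  # met the compute worker; the rest is already claimed
        time.sleep(LOAD_LATENCY_S)  # stand-in for a storage read

def computer():
    """Recompute KV chunks on the GPU, back to front."""
    for i in reversed(range(NUM_CHUNKS)):
        if not claim(i):
            break  # met the loader
        time.sleep(COMPUTE_LATENCY_S)  # stand-in for a prefill kernel

t0 = time.perf_counter()
threads = [threading.Thread(target=loader), threading.Thread(target=computer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"all {NUM_CHUNKS} chunks ready in {time.perf_counter() - t0:.3f}s")
```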
Third, Plato, an adaptive framework for efficient and high-quality parallel decoding in LLMs. By decomposing complex prompts into structured sub-problems with logical dependencies, Plato significantly boosts inference throughput while maintaining or improving answer quality, achieving up to 68% speedup and superior accuracy compared to state-of-the-art methods.
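A minimal sketch of the idea: sub-problems extracted from a prompt form a small dependency graph, and any sub-problem whose dependencies have resolved is decoded concurrently. The example graph and the decode() stub are placeholders, not Plato's actual prompting or scheduling logic.

```python
import concurrent.futures as cf
import time

# Hypothetical illustration: sub-problems decomposed from a complex prompt form a
# dependency DAG; any sub-problem whose dependencies are finished is decoded in
# parallel. The graph and decode() stub are placeholders, not Plato's internals.

DEPENDENCIES = {            # sub-problem -> sub-problems it depends on
    "parse_question": [],
    "fact_a": ["parse_question"],
    "fact_b": ["parse_question"],
    "combine": ["fact_a", "fact_b"],
}

def decode(name: str, context: dict) -> str:
    """Stand-in for decoding one sub-problem with an LLM, given resolved deps."""
    time.sleep(0.1)  # simulate generation latency
    return f"answer({name})"

def run_parallel(deps: dict) -> dict:
    """Topologically schedule sub-problems, decoding independent ones concurrently."""
    results = {}
    remaining = dict(deps)
    with cf.ThreadPoolExecutor() as pool:
        while remaining:
            # Every sub-problem whose dependencies are all resolved is ready now.
            ready = [n for n, d in remaining.items() if all(p in results for p in d)]
            futures = {n: pool.submit(decode, n, {p: results[p] for p in remaining[n]})
                       for n in ready}
            for n, fut in futures.items():
                results[n] = fut.result()
                del remaining[n]
    return results

if __name__ == "__main__":
    start = time.perf_counter()
    print(run_parallel(DEPENDENCIES))
    print(f"elapsed: {time.perf_counter() - start:.2f}s")  # ~0.3s vs ~0.4s serial
```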