CSE Seminar
Unlocking the Next Level of Dependable Cloud Systems
This event is free and open to the public.
Zoom link, passcode: 165503
Abstract: As cloud infrastructure transforms and enables ubiquitous computing services in our daily lives, we can no longer settle for systems that “more or less” work. Even a seemingly small glitch can have severe consequences. Yet the unprecedented scale and complexity of cloud systems make it difficult for them to reach the desired level of dependability. Addressing this conundrum requires us to revisit conventional assumptions and fundamentally rethink how such systems should be designed.
In this talk, I will discuss an end-to-end approach to tackling these challenges through the holistic design of new abstractions, program analyses, runtimes, and data science methods. First, I will present a new failure detection method that captures the inherent observability in large distributed systems. I will introduce a framework, Panorama, that automatically converts any component into an in-situ observer. Panorama quickly detects complex real-world failures, such as gray failures, that escape existing detectors. Next, I will discuss how to further localize complex failures using an intrinsic software watchdog abstraction and a tool, OmegaGen, that generates customized watchdogs. OmegaGen uncovers new partial failures in popular distributed systems like ZooKeeper. Finally, I will discuss how to replace existing ad hoc, static fault mitigation actions with intelligent online experiment mechanisms. I will introduce Narya, a predictive and adaptive failure mitigation service. Narya has been deployed in production in Microsoft Azure for more than two years and has successfully prevented a large number of VM interruptions for real customers. I will conclude by outlining some future directions in designing next-generation dependable cloud systems.
Bio: Dr. Ryan Huang is an Assistant Professor at Johns Hopkins University, where he works broadly on computer systems, including distributed systems, operating systems, cloud computing, and mobile computing. His research focuses on designing principled methods to improve the reliability and performance of modern systems. His work has received best paper awards at OSDI 2016, ASPLOS 2019, NSDI 2020, and ATC 2021, as well as a best paper award nomination at MICRO 2018. He is a recipient of the NSF CAREER Award (2020) and a Facebook research award. He received his Ph.D. in Computer Science from UC San Diego and his B.S. in Computer Science and B.A. in Economics from Peking University. More information about him can be found at https://www.cs.jhu.edu/~huang.