Triaging and Debugging Failures in Deployed Software by Reverse Execution
Many software providers operate crash reporting services to automatically collect failures in deployed software from millions
of customers. Triaging and debugging such failures is critical because they impact real users and customers. However, it is notoriously hard
in practice because developers have to rely on limited information such as memory dumps. In this talk, I will present two systems we built at
Microsoft Research to address the challenges in triaging and debugging failures in deployed software. Both systems were deployed inside
Microsoft as a major solution for triaging and debugging software failures. I will also share our experiences in developing and deploying these
solutions in practice.
First, I will present RETracer, the first system to triage software failures based on program
semantics reconstructed from memory dumps. RETracer is designed to meet the requirements
of large-scale crash reporting services. It performs binary-level backward taint analysis without
a recorded execution trace to understand how functions on the stack contribute to the failure.
When comparing it with the previous crash triaging tool used by Microsoft, we find that RETracer
eliminates two-thirds of triage errors based on a manual analysis of 140 bugs fixed in Microsoft
Windows and Office.
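
As a rough illustration of the idea behind backward taint analysis, here is a toy sketch; it is not RETracer's implementation. It assumes a linear list of instructions leading to the crash is already available, which sidesteps the hard part: RETracer has no recorded trace and must reconstruct this information from the memory dump. The `Insn` type, the location names, and the function names are all hypothetical.

```python
from typing import NamedTuple, Tuple


class Insn(NamedTuple):
    func: str               # function containing the instruction (hypothetical names)
    dst: str                # location written (register or memory cell)
    srcs: Tuple[str, ...]   # locations read; empty for a constant load


def backward_taint(insns, bad_location):
    """Follow the bad value backwards from the crash to where it originated.

    Returns the functions that wrote a tainted location, nearest to the
    crash first, so the last entry is where the bad value was first created.
    """
    tainted = {bad_location}
    contributors = []
    for insn in reversed(insns):
        if insn.dst in tainted:
            # The written location got its value from the sources:
            # move the taint from the destination to the sources.
            tainted.discard(insn.dst)
            tainted.update(insn.srcs)
            if insn.func not in contributors:
                contributors.append(insn.func)
        if not tainted:
            break  # the value came from a constant; nothing left to follow
    return contributors


insns = [
    Insn("parse_header", "rax", ()),           # rax = constant (say, a null pointer)
    Insn("parse_header", "mem_a", ("rax",)),   # store rax to memory
    Insn("copy_fields", "rbx", ("mem_a",)),    # load it back in another function
    Insn("log_event", "rcx", ()),              # unrelated work
    Insn("render", "rdx", ("rbx",)),           # render then crashes dereferencing rdx
]
contributors = backward_taint(insns, "rdx")
print(contributors)
```

In this toy run the crash happens in `render`, but the chain of writes leads back to `parse_header`, which is the kind of distinction a stack-based triage heuristic would miss.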
Second, I will present REPT, a practical system that enables reverse debugging of failures in
deployed software. REPT reconstructs the execution history with high fidelity by combining
online lightweight hardware tracing of a program's control flow with offline binary analysis that
recovers its data flow. It is seemingly impossible to recover data values thousands of instructions
before the failure due to information loss and concurrent execution. REPT tackles these
challenges by iteratively performing forward and backward execution with error correction and
constructing a partial execution order with timestamps logged by hardware. When evaluating
it on 16 real-world bugs, we find that REPT can recover data values accurately (93% on average)
and efficiently (less than 20 seconds) for these bugs, and enables effective reverse debugging
for 14 of them.
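
The iterative forward and backward execution can be sketched on a toy register machine. This is an illustration under strong assumptions, not REPT's algorithm: the three operations (`const`, `add`, `mov`) and their tuple encoding are hypothetical, memory accesses are ignored, and error correction and concurrency are left out entirely. The control flow is assumed to be known (from the trace) and the final register values are assumed to come from the dump.

```python
def step_backward(insn, after):
    """Derive what is known before an instruction from the state after it."""
    op, dst, arg = insn
    before = {r: v for r, v in after.items() if r != dst}  # untouched registers
    if op == "add" and dst in after:
        before[dst] = after[dst] - arg       # add is reversible
    elif op == "mov" and dst in after and arg != dst:
        before[arg] = after[dst]             # the source held the copied value
    # For const (and mov), the old value of dst is destroyed: leave it unknown.
    return before


def step_forward(insn, before):
    """Derive what is known after an instruction from the state before it."""
    op, dst, arg = insn
    after = {r: v for r, v in before.items() if r != dst}
    if op == "const":
        after[dst] = arg
    elif op == "add" and dst in before:
        after[dst] = before[dst] + arg
    elif op == "mov" and arg in before:
        after[dst] = before[arg]
    return after


def recover(trace, dump):
    """Alternate backward and forward passes until no new value is learned."""
    states = [dict() for _ in trace] + [dict(dump)]  # states[i]: before insn i
    changed = True
    while changed:
        changed = False
        for i in reversed(range(len(trace))):
            for reg, val in step_backward(trace[i], states[i + 1]).items():
                if reg not in states[i]:
                    states[i][reg] = val
                    changed = True
        for i in range(len(trace)):
            for reg, val in step_forward(trace[i], states[i]).items():
                if reg not in states[i + 1]:
                    states[i + 1][reg] = val
                    changed = True
    return states


trace = [
    ("mov", "rbx", "rax"),   # rbx = rax
    ("const", "rax", 0),     # rax is overwritten, so its old value is lost
    ("add", "rbx", 8),       # rbx += 8
]
states = recover(trace, {"rax": 0, "rbx": 0x1008})
print(states)
```

The backward pass alone cannot say what `rax` held just before it was overwritten. It does recover `rax` at the start of the trace through the `mov`, and the following forward pass carries that value to the point before the overwrite, which is why the two directions are alternated.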
Weidong Cui is a Principal Researcher managing the Systems Security and Privacy
Research group in the Microsoft Research Redmond lab. Weidong enjoys building real-world
systems to tackle hard problems. His current passion is ensuring Microsoft Azure is the most
secure cloud. Weidong and his team built REPT, the first lightweight record and replay solution
that is widely deployed to enable reverse debugging of software failures. Weidong led the
development of RETracer, a technology that improves the triaging accuracy of access violations
significantly. Weidong and his collaborators introduced controlled-channel attacks that can steal
rich information from secure enclaves. Weidong also led the development of KOP, a Windows
kernel rootkit detection system that still represents the state of the art after many years.
Weidong is also known for his early work on automatic protocol reverse engineering. Weidong
received his PhD and MS degrees from UC Berkeley, and his ME and BE degrees from Tsinghua