Dissertation Defense
Scaling Causality Analysis for Production Systems

Causality analysis reveals how program values influence each other. It is important
for debugging, optimizing, and understanding the execution of programs. This thesis
scales causality analysis to production systems consisting of desktop and server applications as well as large-scale Internet services. This enables developers to employ
causality analysis to debug and optimize complex, modern software systems. This
thesis shows that it is possible to scale causality analysis to both fine-grained instruction-level analysis and analysis of Internet-scale distributed systems with thousands
of discrete software components by developing and employing automated methods to
observe and reason about causality.
First, we observe causality at a fine-grained instruction level by developing the
first taint tracking framework to support tracking millions of input sources. We also
introduce flexible taint tracking to allow for scoping different queries and dynamic
filtering of inputs, outputs, and relationships.
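As a rough illustration of the idea only, and not the framework developed in the thesis, the following sketch tracks taint at the level of individual values: each value carries the set of input labels that influenced it, operations union those sets, and a query filters the result down to the inputs of interest. The names `Tainted`, `source`, `add`, and `query` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tainted:
    """A value plus the set of input labels that influenced it (hypothetical type)."""
    value: int
    labels: frozenset = frozenset()


def source(value, label):
    """Mark a value as originating from a named input source."""
    return Tainted(value, frozenset([label]))


def add(a, b):
    """Propagate taint through an operation: the result depends on both operands."""
    return Tainted(a.value + b.value, a.labels | b.labels)


def query(result, wanted):
    """Scope a query: keep only the labels the caller asked about."""
    return result.labels & frozenset(wanted)


x = source(3, "input:0")
y = source(4, "input:1")
z = source(5, "input:2")  # never used, so it should not appear in the output's labels

out = add(add(x, y), Tainted(10))  # the constant 10 carries no taint
print(out.value, sorted(out.labels))
print(sorted(query(out, ["input:1", "input:2"])))
```

Storing a full label set per value is the simplest possible representation; it is shown here only to make the propagation rule concrete and would not by itself scale to millions of input sources.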
Next, we introduce the Mystery Machine, which uses a "big data" approach to
discover causal relationships between software components in a large-scale Internet
service. We leverage the fact that large-scale Internet services receive a large number
of requests in order to observe counterexamples to hypothesized causal relationships.
Using the discovered causal relationships, we identify the critical path for request execution and use critical path analysis to explore potential scheduling optimizations.
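The counterexample-driven approach can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the Mystery Machine's actual implementation: each trace is assumed to map a segment name to a (start, end) timestamp pair, the only relationship considered is "A finishes before B starts", and the function name and trace data are made up for the example.

```python
from itertools import permutations


def discover_happens_before(traces):
    """Start with every ordered pair as a hypothesis; drop any pair that a trace contradicts."""
    segments = set().union(*(t.keys() for t in traces))
    hypotheses = set(permutations(segments, 2))
    for trace in traces:
        for a, b in list(hypotheses):
            if a in trace and b in trace:
                a_end = trace[a][1]
                b_start = trace[b][0]
                # Counterexample: b started before a finished, so "a before b" cannot hold.
                if b_start < a_end:
                    hypotheses.discard((a, b))
    return hypotheses


# Hypothetical traces: segment name -> (start, end) in milliseconds.
traces = [
    {"server": (0, 10), "network": (10, 15), "render": (15, 30), "prefetch": (12, 20)},
    {"server": (0, 8), "network": (8, 14), "render": (14, 25), "prefetch": (20, 28)},
    {"server": (0, 9), "network": (9, 12), "render": (12, 22), "prefetch": (3, 7)},
]

relations = discover_happens_before(traces)
print(sorted(relations))
```

With only three traces, the surviving relationships are merely those not yet refuted; the more requests are observed, the more likely it is that a false hypothesis meets a counterexample.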
Finally, we explore using causality to make data-quality tradeoffs in Internet services. A data-quality tradeoff is an explicit decision by a software component to return
lower-fidelity data in order to improve response time or minimize resource usage. We
perform a study of data-quality tradeoffs in a large-scale Internet service to show the
pervasiveness of these tradeoffs. We develop DQBarge, a system that enables better data-quality tradeoffs by propagating critical information along the causal path
of request processing. Our evaluation shows that DQBarge helps Internet services
mitigate load spikes, improve utilization of spare resources, and implement dynamic
capacity planning.
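To make the notion of a data-quality tradeoff concrete, here is a small sketch, assuming a hypothetical component that ranks items and a hypothetical `RequestContext` that travels with the request. It is not DQBarge's interface; the fields, thresholds, and cost estimate are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class RequestContext:
    """Information propagated with a request (hypothetical fields)."""
    load: float         # upstream load estimate, 0.0 to 1.0
    deadline_ms: float  # time remaining to answer the request


def rank_items(items, ctx, full_cost_ms=50.0, load_limit=0.8):
    """Return (result, fidelity). Falls back to a cheaper answer under pressure."""
    if ctx.load > load_limit or ctx.deadline_ms < full_cost_ms:
        # Lower fidelity: skip the ranking step and return the first few items as-is.
        return items[:3], "reduced"
    # Full fidelity: rank every item.
    return sorted(items, reverse=True), "full"


items = [4, 9, 1, 7, 3]
print(rank_items(items, RequestContext(load=0.2, deadline_ms=200.0)))
print(rank_items(items, RequestContext(load=0.95, deadline_ms=200.0)))
```

The point of the sketch is that the decision uses information carried along with the request rather than only what the component can observe locally.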