Dissertation Defense

Scaling Causality Analysis for Production Systems

Michael Chow

Causality analysis reveals how program values influence each other. It is important for debugging, optimizing, and understanding the execution of programs. This thesis scales causality analysis to production systems consisting of desktop and server applications as well as large-scale Internet services. This enables developers to employ causality analysis to debug and optimize complex, modern software systems. This thesis shows that it is possible to scale causality analysis to both fine-grained instruction-level analysis and analysis of Internet-scale distributed systems with thousands of discrete software components by developing and employing automated methods to observe and reason about causality.

First, we observe causality at a fine-grained instruction level by developing the first taint tracking framework to support tracking millions of input sources. We also introduce flexible taint tracking to allow for scoping different queries and dynamic filtering of inputs, outputs, and relationships.

Next, we introduce the Mystery Machine, which uses a "big data" approach to discover causal relationships between software components in a large-scale Internet service. We leverage the fact that large-scale Internet services receive a large number of requests in order to observe counterexamples to hypothesized causal relationships. Using discovered causal relationships, we identify the critical path for request execution and use the critical path analysis to explore potential scheduling optimizations.
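The counterexample idea can be illustrated with a minimal sketch. This is not the Mystery Machine implementation. It assumes, purely for illustration, that each request trace maps a segment name to a (start, end) pair with positive duration, and the function names `infer_happens_before` and `critical_path` are hypothetical. Every ordered pair of segments starts as a hypothesized "happens before" relationship, any trace that violates a hypothesis refutes it, and the surviving relationships are then used to find the longest chain of dependent segments in a single trace.

```python
from itertools import permutations


def infer_happens_before(traces):
    """Return ordered pairs (a, b) that no trace contradicts.

    Each trace is a dict: segment name -> (start, end). These names and
    this trace format are assumptions made for this sketch.
    """
    segments = set().union(*(t.keys() for t in traces))
    # Hypothesize that every segment happens before every other segment.
    hypotheses = set(permutations(sorted(segments), 2))
    seen_together = set()
    for trace in traces:
        for a, b in permutations(trace, 2):
            seen_together.add((a, b))
            # Counterexample: a was still running when b started.
            if trace[a][1] > trace[b][0]:
                hypotheses.discard((a, b))
    # Keep only hypotheses that were actually observed at least once.
    return hypotheses & seen_together


def critical_path(trace, happens_before):
    """Longest-duration chain of dependent segments within one trace."""
    # Start-time order is a valid topological order for surviving edges,
    # because each surviving edge (a, b) has end(a) <= start(b) here.
    order = sorted(trace, key=lambda s: trace[s][0])
    best = {}  # segment -> (total duration, path ending at that segment)
    for seg in order:
        start, end = trace[seg]
        preds = [best[p] for p in best if (p, seg) in happens_before]
        prev_total, prev_path = max(preds, default=(0, []))
        best[seg] = (prev_total + (end - start), prev_path + [seg])
    return max(best.values())[1]


traces = [
    {"auth": (0, 2), "fetch": (2, 7), "ads": (2, 4), "render": (7, 9)},
    {"auth": (0, 1), "fetch": (3, 6), "ads": (1, 3), "render": (6, 8)},
    {"auth": (0, 2), "fetch": (2, 5), "ads": (4, 8), "render": (8, 9)},
]
relationships = infer_happens_before(traces)
path = critical_path(traces[0], relationships)
```

In the second trace `ads` happens to finish before `fetch` starts, but the first trace shows them overlapping, so that hypothesis is discarded. With only a handful of traces such coincidental orderings would survive, which is why a large request volume matters for this approach.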

Finally, we explore using causality to make data-quality tradeoffs in Internet services. A data-quality tradeoff is an explicit decision by a software component to return lower-fidelity data in order to improve response time or minimize resource usage. We perform a study of data-quality tradeoffs in a large-scale Internet service to show the pervasiveness of these tradeoffs. We develop DQBarge, a system that enables better data-quality tradeoffs by propagating critical information along the causal path of request processing. Our evaluation shows that DQBarge helps Internet services mitigate load spikes, improve utilization of spare resources, and implement dynamic capacity planning.

Sponsored by

Prof. Jason Flinn