Dissertation Defense
Efficient Large-Scale Graph Processing
Add to Google Calendar
The abundance of large graphs and the immense potential of insight extraction from them have fueled interest in large-scale graph processing systems. Despite significant enhancement in recent years, state-of-the-art large-scale graph processing systems yield suboptimal performance in common scenarios.
In this thesis, we address three categories of deficiency in graph processing systems.
First, in a distributed environment, the performance of graph processing systems is hindered by computation-communication coupling. We present a new programming abstraction called Compute-Sync-Merge that fully decouples computation and inter-host communication, leading to substantial performance gain.
The second and third categories of deficiency both stem from the lack of version awareness in existing graph processing systems. When multiple versions of a large graph"”among which there exists a substantial common subgraph"”are processed sequentially, fetching each version as a standalone graph from persistent storage leads to suboptimal performance. We mitigate such issue by introducing a graph caching layer to a graph processing system, enabling the construction of in-memory graph representation based on cached contents and significantly improving overall system performance.
When multiple versions of a large graph are processed in parallel, maintaining each version as a standalone graph in memory and processing it in isolation to other versions incur memory and computation overhead. We redesign the representation and the processing workflow of a multi-version graph, consolidating duplicated graph state across multiple versions via dynamic version splitting. Our approach improves the overall efficiency of a graph processing system by reducing memory footprint and eliminating computation related to redundant state transition.