Virtualizing Data Parallel Systems for Portability, Productivity, and Performance
Computer systems equipped with graphics processing units (GPUs) have become increasingly common over the last decade. To use the highly data parallel architecture of GPUs for general purpose applications, new programming models such as OpenCL and CUDA were introduced, showing that data parallel kernels on GPUs can achieve speedups of several orders of magnitude. Following this success, applications from a variety of domains have been converted to use multiple complex OpenCL/CUDA data parallel kernels in order to benefit from data parallel systems. At the same time, the software industry has experienced massive growth in the amount of data to process, demanding more powerful workhorses for data parallel computation. Consequently, additional parallel computing devices such as extra GPUs and co-processors are attached to the system in the expectation of higher performance and the capacity to process larger data.
However, these programming models expose hardware details to programmers, such as the number of computing devices, the interconnects, and the physical memory size of each device. This degrades productivity in the software development process, as programmers must manually split the workload according to hardware characteristics. The process is tedious and error-prone. Most importantly, it is hard to maximize performance at compile time, because programmers cannot know the runtime conditions that affect performance, such as input size and device availability. As a result, applications lack portability: on a different system they may fail to run due to limited physical memory, or they may run with suboptimal performance.
To cope with these challenges, we propose a dynamic compiler framework that provides OpenCL applications with an abstraction layer for physical devices. This abstraction layer virtualizes physical devices and memory sub-systems, and transparently orchestrates the execution of multiple data parallel kernels on multiple computing devices. The framework significantly improves productivity by providing hardware portability, allowing programmers to write an OpenCL program without being concerned with the target devices. Our framework also maximizes performance by balancing the data parallel workload according to factors such as kernel dependencies, device performance variation on workloads of different sizes, the data transfer cost over the interconnect between devices, and the physical memory limit of each device.
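As a rough illustration of the kind of balancing decision described above, the following sketch splits a one-dimensional workload across devices in proportion to an estimated per-item rate that includes transfer cost, and caps each share at the device's memory capacity. It is not the framework's actual algorithm: the `Device` fields, the `partition` function, and the simple cost model (fixed throughput, fixed link bandwidth, no kernel dependencies) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Device:
    # All fields are hypothetical, e.g. obtained by profiling at runtime.
    name: str
    throughput: float        # work-items processed per second
    mem_bytes: int           # physical memory available on the device
    link_bytes_per_s: float  # interconnect bandwidth to the device


def partition(total_items, bytes_per_item, devices):
    """Return {device name: item count}, balancing by rate under memory caps."""
    # Effective rate: one item costs compute time plus transfer time.
    rates = {
        d.name: 1.0 / (1.0 / d.throughput + bytes_per_item / d.link_bytes_per_s)
        for d in devices
    }
    # Memory cap: how many items fit on each device at once.
    caps = {d.name: d.mem_bytes // bytes_per_item for d in devices}
    if sum(caps.values()) < total_items:
        raise ValueError("workload does not fit in a single pass")

    shares = {d.name: 0 for d in devices}
    active = [d.name for d in devices]
    remaining = total_items
    while remaining > 0 and active:
        rate_sum = sum(rates[n] for n in active)
        # Devices whose proportional share would exceed their memory cap.
        over = [n for n in active
                if remaining * rates[n] / rate_sum > caps[n] - shares[n]]
        if over:
            # Fill those devices completely and rebalance the rest.
            for n in over:
                remaining -= caps[n] - shares[n]
                shares[n] = caps[n]
                active.remove(n)
            continue
        # Everyone fits: hand out proportional shares, rounded down.
        for n in active:
            shares[n] += int(remaining * rates[n] / rate_sum)
        # Give rounding leftovers to the fastest devices that have room.
        leftover = total_items - sum(shares.values())
        for n in sorted(active, key=rates.get, reverse=True):
            give = min(caps[n] - shares[n], leftover)
            shares[n] += give
            leftover -= give
        remaining = 0
    return shares


if __name__ == "__main__":
    devs = [
        Device("gpu0", throughput=3e6, mem_bytes=400_000, link_bytes_per_s=1e10),
        Device("gpu1", throughput=1e6, mem_bytes=1 << 30, link_bytes_per_s=1e10),
    ]
    print(partition(1_000_000, 4, devs))
```

In this example the faster device is limited by its small memory, so the slower device absorbs the remainder. A real system would also have to account for the other factors listed above, such as kernel dependencies and performance that varies with workload size, which this sketch ignores.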