Computer Engineering Seminar

Exploring the Limits of Prefetching

Thomas Puzak

In this talk, we formulate a new approach for evaluating any prefetching algorithm. We study the conditions under which prefetching can remove all the pipeline stalls due to cache misses. This approach involves an initial profiling of the application to identify all misses, as well as the corresponding locations in the program where prefetches for them can be initiated. Then we systematically control the number of misses that are prefetched, the timeliness of these prefetches, and the number of unused prefetches. We validate the accuracy of our method by comparing to a Markov prefetch algorithm. Hence, we can measure the potential benefit that any application can receive from prefetching, and we can analyze application behavior under conditions that cannot be explored with any known prefetching algorithm. Next, we analyze a system parameter that is vital to prefetching performance, the line transfer interval, which is the number of processor cycles required to transfer a cache line.

We show that under ideal conditions, prefetching can remove nearly all of the stalls associated with cache misses. Unfortunately, real processor implementations are far less than ideal. In particular, the trend in processor frequency is outrunning the on-chip and off-chip bandwidths. Today, it is not uncommon for the processor frequency to be three or four times the bus frequency. Under these conditions, we show that nearly all of the performance benefits derived from prefetching are eroded, and in many cases prefetching actually degrades performance. We do quantitative and qualitative analyses of these trade-off's, and show that there is a linear relationship between performance and three things: percent of misses prefetched, the percent of unused prefetches, and the bandwidth. I'll close with a few ideas on related areas of research.

Sponsored by