Here's the memory mountain for a recent Intel processor:
If you examine other memory mountains, you'll see this phenomenon in other cases as well, specifically on Intel processors that employ prefetching.
Let's investigate this phenomenon more closely. Quite honestly, though, I have no explanation for why it occurs.
In the above processor, a cache block contains eight 8-byte long integers. For a stride of S, considering only spatial locality, we would expect a miss rate of around S/8 for strides up to 8. For example, a stride of 1 would yield one miss followed by 7 hits, while a stride of 2 would yield one miss followed by 3 hits. For strides of 8 or more, the miss rate would be 100%. If a read that incurs a cache miss has delay M, and one that hits has delay H, then we would expect the average time per access to be M*S/8 + H*(1-S/8). The throughput should be the reciprocal of the average delay.
For the larger sizes, where data resides in the L3 cache, this predictive model holds fairly well. Here are the data for a size of 4 megabytes:
In this graph, the values of M and H were determined by matching the data for S=1 and S=8. So, it's no surprise that these two cases match exactly. But, the model also works fairly well for other values of S.
For sizes that fit in the L2 cache, however, the predictive model is clearly off:
I have experimented with the measurement code to see whether the bump is some artifact of how we run the tests, but I don't believe this is the case. Instead, some feature of the memory system seems to cause this phenomenon.
I would welcome any ideas on what might cause memory mountains to have this bump.