Friday, May 5, 2017

New CS:APP song!

Prof. Casey Deccio at BYU wrote this catchy ditty about CS:APP called Hello Summer Break. It's pretty awesome!

Tuesday, May 2, 2017

The Memory Mountain L2 Bump

In a previous post, I showed how the different editions of CS:APP have used the memory mountain as a way to visualize memory system performance.  It demonstrates how the cache hierarchy affects performance, knowledge that application programmers can use to write more efficient programs.

Here's the memory mountain for a recent Intel processor:

As the label indicates, there is a curious bump in the curve for cases where the data are held in the L2 cache: the throughput, as a function of the stride, does not drop off smoothly.  Looking back at the other mountains, you'll see this phenomenon in other cases as well, specifically for Intel processors that employ prefetching.

Let's investigate this phenomenon more closely.  Quite honestly, though, I have no explanation for why it occurs.

In the above processor, a cache block contains eight 8-byte long integers.  For a stride of S, considering only spatial locality, we would expect a miss rate of around S/8 for strides up to 8.  For example, a stride of 1 would yield one miss followed by 7 hits, while a stride of 2 would yield one miss followed by 3 hits.  For strides of 8 or more, the miss rate would be 100%.  If a read that misses in the cache incurs a delay M, and one that hits incurs a delay H, then we would expect the average time per access to be M*S/8 + H*(1-S/8).  The throughput should be the reciprocal of this average delay.
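To make the model concrete, here's a minimal sketch in C of how the predicted throughput would be computed for each stride.  The values of M and H below are made up for illustration; the real values would be fit to the measurements, as described next.

```c
#include <stdio.h>

#define BLOCK_ELEMS 8   /* 8-byte longs per cache block on this machine */

/* Predicted read throughput (MB/s) from the simple miss-rate model:
   average time per access = M*S/8 + H*(1-S/8), capped at a 100% miss
   rate for S >= 8, with each access reading one 8-byte long. */
double predicted_mbps(double M, double H, int stride) {
    double miss_rate = stride >= BLOCK_ELEMS ? 1.0 : (double) stride / BLOCK_ELEMS;
    double avg_ns = M * miss_rate + H * (1.0 - miss_rate);
    return 8.0 / (avg_ns * 1e-9) / 1e6;
}

int main(void) {
    double M = 10.0, H = 1.0;   /* illustrative delays in ns, not measured values */
    for (int s = 1; s <= 8; s++)
        printf("S=%d: %8.0f MB/s\n", s, predicted_mbps(M, H, s));
    return 0;
}
```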

For the larger sizes, where data resides in the L3 cache, this predictive model holds fairly well.  Here are the data for a size of 4 megabytes:

In this graph, the values of M and H were determined by matching the data for S=1 and S=8, so it's no surprise that these two cases match exactly.  But the model also works fairly well for the other values of S.
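Spelling out that fit (with t1 and t8 as hypothetical names for the measured average access times at strides 1 and 8): at S=8 the model says every access misses, so M = t8, and at S=1 it says t1 = M/8 + 7*H/8, which gives H = (8*t1 - M)/7.  For example:

```c
#include <stdio.h>

int main(void) {
    /* Made-up example values (ns per access), not measured data */
    double t1 = 1.5, t8 = 9.0;
    double M = t8;                      /* S=8: every access misses        */
    double H = (8.0 * t1 - M) / 7.0;    /* S=1: t1 = M/8 + 7*H/8, solve H  */
    printf("M = %.2f ns, H = %.2f ns\n", M, H);
    return 0;
}
```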

For sizes that fit in the L2 cache, however, the predictive model is clearly off:

We can see the bump.  The data track well for S=2, but for S=3 the actual memory system outperforms the predictive model.  The effect drops off for larger values of S.

I have experimented with the measurement code to see whether the bump is an artifact of how we run the tests, but I don't believe that's the case.  Some feature of the memory system appears to cause this phenomenon.

I would welcome any ideas on what might cause memory mountains to have this bump.



A Gallery of Memory Mountains

Through all 3 editions, CS:APP has used memory mountains to illustrate how the cache hierarchy affects memory system performance.  Here we compare the memory mountains of different processors over the years, revealing evolutionary changes in memory system design.

Here's the mountain from the First Edition, based on a 1999 Intel Pentium III Xeon:



The memory mountain shows the throughput achieved by a program repeatedly reading elements from an array of N elements with a stride of S (i.e., accessing elements 0, S, 2S, 3S, and so on, up to element N-1).  The throughput, measured in megabytes (MB) per second, varies according to how many of the elements are found in one of the processor's caches.  For small values of N, the elements can be held in the L1 cache, achieving the maximum read throughput.  For intermediate values of N, the elements can be held in the L2 cache, and the L1 cache may still help exploit spatial locality for smaller values of S.  For large values of N, the elements reside in main memory, but both the L1 and L2 caches can improve performance when S allows some degree of spatial locality.
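For reference, the inner loop being measured is essentially a strided read-and-sum over the array.  Here's a minimal sketch (names and sizes are illustrative, not the actual CS:APP code); the measurement times many runs of this loop and converts the result to MB/s:

```c
#define MAXELEMS (1 << 24)   /* enough 8-byte longs to cover the largest size tested */

long data[MAXELEMS];          /* the array being read */

/* Sum every stride-th element of the first nelems elements of data.
   Each iteration performs one 8-byte read. */
long sumvec(long nelems, long stride) {
    long acc = 0;
    for (long i = 0; i < nelems; i += stride)
        acc += data[i];
    return acc;
}
```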

By way of reference, the use of the memory mountain for visualizing memory performance was devised by Thomas Stricker while he was a PhD student at CMU in the 1990s, working for Prof. Thomas Gross.  Both of them now live in Switzerland, with Thomas Gross on the faculty at ETH.

Jumping forward about a decade, the above figure illustrates the performance of a 2009 iMac that I still use as a home computer.  It has the same qualitative characteristics as the Pentium III, with two levels of cache.  Overall, it achieves over 10X higher throughput.  You'll also notice how smooth the curves are.  That's partly because I rewrote the measurement code last year to give more reliable timing measurements.

For the second edition of CS:APP, we used a 2009 Intel Core i7 based on the Nehalem microarchitecture.  The data here are noisy, since they predate my improved timing code.  There are several important features in this mountain not found in the earlier ones.  First, there are 3 levels of cache.  There is also a striking change along the back edge: the performance for a stride of 1 stays high for sizes up to around 16MB, as if the data could still be held in the L2 cache.  This reflects the ability of the memory system to perform prefetching, where it observes the memory access pattern and predicts which memory blocks will be read in the future.  It copies these blocks from L3 up to L2 and L1, so that when the processor reads those locations, the data are already in the L1 cache.  The overall performance is well short of that measured for the contemporary (2008) Core Duo shown above.  This could partly be due to differences in the timing code: the newer version uses 4-way parallelism when reading the data, whereas the old code was purely sequential.
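By 4-way parallelism I mean something along these lines (a sketch only, not the exact timing code): the loop issues four independent reads per iteration, each feeding its own accumulator, so several memory reads can be in flight at once.

```c
/* 4-way unrolled strided read with separate accumulators (illustrative). */
long sumvec4(long *data, long nelems, long stride) {
    long acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    long sx2 = 2 * stride, sx3 = 3 * stride, sx4 = 4 * stride;
    long i;

    /* Main loop: four independent reads per iteration */
    for (i = 0; i + sx4 <= nelems; i += sx4) {
        acc0 += data[i];
        acc1 += data[i + stride];
        acc2 += data[i + sx2];
        acc3 += data[i + sx3];
    }
    /* Pick up any leftover elements */
    for (; i < nelems; i += stride)
        acc0 += data[i];
    return (acc0 + acc1) + (acc2 + acc3);
}
```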


For the third edition of CS:APP, we used a 2013 Intel Core i5 based on the Haswell microarchitecture.  The above figure shows measurements for this machine using the improved timing code.  Overall, though, the memory system is similar to that of the Nehalem processor from CS:APP2e: it has 3 levels of cache and uses prefetching.  Note how high the overall throughputs are.

The final mountain uses measurements from a Raspberry Pi 3.  The Pi uses a processor included as part of a Broadcom "system on a chip" designed for use in smartphones.  The processor is based on the ARM Cortex-A53 microarchitecture.  Performance-wise, it is much slower than a contemporary desktop processor, but it is very respectable for an embedded processor.  There is a clear 2-level cache structure, and it appears that some amount of prefetching may occur for both cache and main memory accesses.

Over the nearly 20-year time span represented by these machines, we can see that memory systems have undergone evolutionary changes.  More levels of cache have been added, and caches have become larger.  Throughputs have improved by over an order of magnitude.  Prefetching helps when access patterns are predictable.  It's interesting how the memory mountain visualization lets us see these qualitative and quantitative changes.