Over the past few days, this tweet caused some confusion and interest in how CPU caches work.
The tweet asks which of the following options is faster:
A.
    for (int i = 0; i < n; i += 256)
        a[i]++;
B.
    for (int i = 0; i < n; i += 257)
        a[i]++;
The question is taken verbatim from Algorithmica without credit, which is not cool. Algorithmica also provides a very good explanation, although in my opinion it is a little convoluted and misses an important part, so I will try to simplify it.
The obvious and simple answer is “Option B, because the step size is larger, therefore it does less work”.
If you understand a bit more about computers, you might think “256 is a nice round number, which may have something to do with SIMD or the memory system, so maybe A is faster.” [1]
The real answer is “the question did not give me enough information to make the correct decision”. This information is also suspiciously missing from Algorithmica, which suggests OP didn’t fully understand the article before posting to Twitter about these dark arts.
The 256 vs 257 example illustrates how pathological cache mappings can severely slow down your program. What’s missing is the required context: this behavior only appears when you iterate over the same array multiple times and the datatype of a is 4 bytes wide. (Other details like the CPU platform are also missing, but we can assume a standard x86 Intel/AMD processor.)
Understanding Cache Pathologies
This slowdown happens when your array elements get repeatedly loaded and unloaded from cache, forcing your program to access the much slower main system RAM. Pathological cache mappings are memory access patterns that trick the cache into behaving as if it’s much smaller than it actually is.
How Set Associative Caches Work
Why is this? CPU caches are mostly “set associative” [2] caches. The terminology might be confusing, but what it means is simple: there are X “sets” with Y lines each, each memory address maps to a single set, and its contents can reside in any of the lines of that set. When the set is full, some eviction policy [3] decides which line gets replaced. If somehow all your memory addresses map to the same set (or to a small number of sets), you are left with a much smaller effective cache.
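To make this concrete, here is a tiny toy model (my own sketch, not from Algorithmica) of a single set with two lines and LRU eviction. Three different tags competing for the set’s two lines miss on every single access, which is exactly the situation we will construct with real addresses below.

```c
#include <stdio.h>
#include <stdint.h>

#define WAYS 2

typedef struct {
    uint64_t tag[WAYS];
    int      valid[WAYS];
    int      lru;              /* which of the two ways is least recently used */
} cache_set;

/* Look up `tag` in the set: returns 1 on a hit, 0 on a miss
 * (on a miss the LRU way is evicted and the new tag installed). */
static int access_set(cache_set *s, uint64_t tag)
{
    for (int w = 0; w < WAYS; w++) {
        if (s->valid[w] && s->tag[w] == tag) {
            s->lru = 1 - w;    /* with 2 ways, the other one is now the LRU */
            return 1;
        }
    }
    int victim = s->lru;       /* miss: evict the least recently used way */
    s->tag[victim]   = tag;
    s->valid[victim] = 1;
    s->lru = 1 - victim;
    return 0;
}

int main(void)
{
    cache_set s = {0};
    /* three distinct tags fighting over two lines: every access misses */
    uint64_t tags[] = {0, 1, 2, 0, 1, 2};
    for (int i = 0; i < 6; i++)
        printf("tag %llu: %s\n", (unsigned long long)tags[i],
               access_set(&s, tags[i]) ? "hit" : "miss");
    return 0;
}
```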
Memory Address Structure
Most cache mappings work by splitting the memory address into three key sections:
Offset - Indicates position within the cache line. Since a cache line [4] is 64 bytes, accessing byte 70 will load bytes 64-127 into a cache line, with your byte at offset 6 (70 mod 64 = 6).
Set Number - This is where the core of our problem resides. This is the index of the cache set that this address will fall into. If all our memory addresses somehow have the same index, a single cache set will be used, instead of distributing the data throughout the cache, making it effectively have much lower capacity.
Tag - This just serves to identify the memory address and separate it from others. I believe you don’t have to worry about this for any practical purpose, it’s just internal bookkeeping.
A memory address on 64-bit systems has 64 bits (duh! In reality only 48 bits are relevant, surprise!). A CPU cache with 16384 sets (my CPU) will have 14 bits (log2(16384) = 14) dedicated to the Set Number, 6 bits (log2(64) = 6) dedicated to the Offset, and the rest will be the Tag. Let’s assume a 2-way associative cache for the examples, which means each Set has 2 cache lines.
Tag | Set Number | Offset |
---|---|---|
0:43 | 44:57 | 58:63 |
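In code, that split is just a couple of shifts and masks. A minimal sketch, assuming the geometry above (64-byte lines, 16384 sets) and using the byte-70 example from the list:

```c
#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 6    /* log2(64)    */
#define SET_BITS    14   /* log2(16384) */

int main(void)
{
    uint64_t addr = 70;  /* the example byte from the Offset description */

    uint64_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint64_t set    = (addr >> OFFSET_BITS) & ((1u << SET_BITS) - 1);
    uint64_t tag    = addr >> (OFFSET_BITS + SET_BITS);

    /* prints: addr 70 -> tag 0, set 1, offset 6 */
    printf("addr %llu -> tag %llu, set %llu, offset %llu\n",
           (unsigned long long)addr, (unsigned long long)tag,
           (unsigned long long)set, (unsigned long long)offset);
    return 0;
}
```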
Practical Address Examples
Let’s try some real addresses!
Id | Tag | Set Number | Offset |
---|---|---|---|
67 | 00000000000000000000000000000000000000000000 | 00000000000001 | 000011 |
65 | 00000000000000000000000000000000000000000000 | 00000000000001 | 000001 |
Here we can see that addresses 67 and 65 both fall in the same Set and have the same Tag. This means that when 67 is accessed, it will pull bytes 64 to 127 into a cache line, and when 65 is accessed afterwards, the cache controller will look in Set 1 for a line with Tag 00000(…), find it, and look at offset 1 to find byte 65.
Okay, what would be a problematic address pairing?
Id | Tag | Set Number | Offset |
---|---|---|---|
A67 | 00000000000000000000000000000000000000000000 | 00000000000001 | 000011 |
B65 | 00000000000000000000000000000000000000000001 | 00000000000001 | 000001 |
C65 | 00000000000000000000000000000000000000000010 | 00000000000001 | 000001 |
With these addresses, we first access Set 1 and cache bytes 64 to 127 as before, but now, when we try to access B65, we look for tag “000(…)1” in the Set and don’t find it. The controller then puts this line in the cache, filling up the second line of our 2-way Set. When C65 is accessed, we look for tag “000(…)10” in Set 1, don’t find it, and need to evict one of the previous cache lines to add the new one to the Set. If we keep up this access pattern, we will effectively only use a single cache set! Pathological cache accesses [5] happen when you change the Tag but keep the same Set Number! How does this occur? When your accesses are 2^(14+6) = 1048576 bytes apart (in the case of my processor)!
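You can check this with the addresses from the table above: A67 is byte 67, while B65 and C65 are byte 65 pushed forward by one and two multiples of num_sets × line_size = 2^20 bytes. A quick sketch (again assuming my CPU’s 16384 sets and 64-byte lines):

```c
#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE 64
#define NUM_SETS  16384
/* accesses this many bytes apart share a set but differ in tag */
#define PATHOLOGICAL_STRIDE ((uint64_t)LINE_SIZE * NUM_SETS)   /* 1048576 */

int main(void)
{
    uint64_t addrs[] = { 67,                            /* A67 */
                         65 + PATHOLOGICAL_STRIDE,      /* B65 */
                         65 + 2 * PATHOLOGICAL_STRIDE   /* C65 */ };
    for (int i = 0; i < 3; i++) {
        uint64_t a = addrs[i];
        printf("addr %7llu -> tag %llu, set %llu, offset %llu\n",
               (unsigned long long)a,
               (unsigned long long)(a / PATHOLOGICAL_STRIDE),
               (unsigned long long)((a / LINE_SIZE) % NUM_SETS),
               (unsigned long long)(a % LINE_SIZE));
    }
    return 0;
}
```

All three land in Set 1 but with tags 0, 1 and 2, so in a 2-way set the third access already forces an eviction.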
Circling back to the original question, you might notice that 256*4 = 1024 is not 1048576, but it is a perfect divisor of it! This means the access pattern isn’t as bad as a full 1048576-byte stride would be, but it still significantly limits our effective cache size. Here is what a loop incrementing an address by 1024 bytes each time looks like.
Id | Tag | Set Number | Offset |
---|---|---|---|
0 | 00000000000000000000000000000000000000000000 | 00000000000000 | 000000 |
1024 | 00000000000000000000000000000000000000000000 | 00000000010000 | 000000 |
2048 | 00000000000000000000000000000000000000000000 | 00000000100000 | 000000 |
3072 | 00000000000000000000000000000000000000000000 | 00000000110000 | 000000 |
4096 | 00000000000000000000000000000000000000000000 | 00000001000000 | 000000 |
You will notice that the bottom 4 bits of the Set Number are not touched! This effectively reduces the cache size by 2^4 = 16 times! Larger data types that are powers of 2 will make this effect worse!
And here is the problem! If your cache size is reduced 16x, you won’t be able to fit your array in it, meaning that subsequent iterations over that same array will need to be mostly fetched from system RAM, which is way slower than the L1/L2/L3 caches! On my CPU, this reduces the L1 data cache size from 32KB to 2KB, L2 from 512KB to 32KB and L3 from 16MB to 1MB, which is a huge difference.
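If you want to see that 16x reduction without trusting my bit-twiddling, here is a small sketch (again assuming 16384 sets and 64-byte lines) that simply counts how many distinct sets a strided walk over a 64 MiB region touches, comparing the 256×4 = 1024-byte stride with the 257×4 = 1028-byte one:

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define LINE_SIZE 64
#define NUM_SETS  16384
#define REGION    (64u * 1024 * 1024)   /* walk a 64 MiB region */

/* Count how many distinct cache sets a walk with the given byte stride hits. */
static int count_sets(uint64_t stride)
{
    static char touched[NUM_SETS];
    memset(touched, 0, sizeof touched);
    int count = 0;
    for (uint64_t addr = 0; addr < REGION; addr += stride) {
        uint64_t set = (addr / LINE_SIZE) % NUM_SETS;
        if (!touched[set]) { touched[set] = 1; count++; }
    }
    return count;
}

int main(void)
{
    printf("stride 1024 (256 * 4B): %d of %d sets\n", count_sets(1024), NUM_SETS);
    printf("stride 1028 (257 * 4B): %d of %d sets\n", count_sets(1028), NUM_SETS);
    return 0;
}
```

The 1024-byte stride only ever reaches 1024 of the 16384 sets, while the 1028-byte stride spreads over nearly all of them.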
Analyze Your Own CPU Cache
Want to understand your own CPU’s cache characteristics? Claude kindly created a script that analyzes the cache hierarchy of your CPU and identifies potential pathological strides for each cache level.
You can download the cache analyzer script here. Run it on your Linux system with:
curl https://blog.tteles.dev/scripts/cache_analyzer.sh | bash
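If you are curious what such a script actually needs to look at: on Linux the cache geometry is exposed under sysfs, and the pathological stride for each level is simply number_of_sets × line_size. Here is a minimal C sketch of that idea (mine, not the linked script; cpu0 and index0–3 are the usual layout on x86):

```c
#include <stdio.h>

/* Read a single integer from a sysfs file; returns -1 if unavailable. */
static long read_long(const char *path)
{
    FILE *f = fopen(path, "r");
    long v = -1;
    if (f) {
        if (fscanf(f, "%ld", &v) != 1)
            v = -1;
        fclose(f);
    }
    return v;
}

int main(void)
{
    char path[256];
    for (int idx = 0; idx < 4; idx++) {   /* index0..3 are typically L1d, L1i, L2, L3 */
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu0/cache/index%d/number_of_sets", idx);
        long sets = read_long(path);
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu0/cache/index%d/coherency_line_size", idx);
        long line = read_long(path);
        if (sets <= 0 || line <= 0)
            continue;
        printf("index%d: %ld sets x %ld-byte lines -> pathological stride %ld bytes\n",
               idx, sets, line, sets * line);
    }
    return 0;
}
```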
Finally, here is a C program that should allow you to test this behavior.
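A minimal sketch of such a benchmark (my own version; the author’s program may differ): it sweeps a 64 MiB int array several times with strides 256 and 257 and times each. The array size and pass count are assumptions; pick them so the touched lines fit in your L3 with the friendly stride but overflow the reduced capacity with the pathological one.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N      (16 * 1024 * 1024)   /* 16M ints = 64 MiB array                   */
#define PASSES 16                   /* iterate over the same array several times */

/* Time PASSES sweeps over the array with the given element stride. */
static double sweep(int *a, int stride)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int p = 0; p < PASSES; p++)
        for (int i = 0; i < N; i += stride)
            a[i]++;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    int *a = calloc(N, sizeof *a);
    if (!a)
        return 1;
    printf("stride 256: %.3f s\n", sweep(a, 256));
    printf("stride 257: %.3f s\n", sweep(a, 257));
    free(a);
    return 0;
}
```

Compile with optimizations (e.g. gcc -O2) and, on a cache layout like the one described above, expect the stride-256 run to be noticeably slower than the stride-257 one.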
I hope this was a useful explanation. If you have any questions, feel free to reach out to me on Twitter; my updated handle should be on this blog’s homepage.
References and Notes
[1] https://en.algorithmica.org/hpc/cpu-cache/associativity/