According to Jackie Stewart, a three-time F1 world champion, understanding how a car works made him a better driver.
“You don’t have to be an engineer to be a racing driver, but you do have to have Mechanical Sympathy”
Martin Thompson (the designer of the LMAX Disruptor) applied the concept of mechanical sympathy to programming. In a nutshell, understanding the underlying hardware should make us better developers when it comes to designing algorithms, data structures, and so on.
In this post, we will dig into the processor and see how understanding some of its concepts can help us when it comes to optimizing solutions.
Modern processors are based on the concept of symmetric multiprocessing (SMP). In an SMP system, the processor is designed so that two or more cores are connected to a shared memory (also called main memory or RAM). Also, to speed up memory accesses, the processor has different levels of cache, called L1, L2, and L3. The exact architecture varies depending on the vendor, the processor model, and so on. Yet, the most frequent model is to have L1 and L2 local to each core and L3 shared across all cores:
The closer a cache is to a CPU core, the smaller its access latency and its size:
Again, the exact figures depend on the processor model. Yet, to get a rough estimate: if accessing the main memory takes 60 ns, accessing L1 is about 50 times faster.
In the world of processors, there is an important concept called locality of reference. When a processor accesses a particular memory location, it is very likely that:
It will access the same location in the near future: this is the temporal locality principle.
It will access memory locations nearby: this is the spatial locality principle.
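To make spatial locality concrete, here is a minimal sketch (the function and size names are mine, not from the post): summing a matrix row by row follows the memory layout, so consecutive reads tend to fall on the same cache line, while summing column by column jumps to a different row on every access. Benchmarking the two typically shows the row-major version running noticeably faster, even though both compute the same result.

```go
package main

import "fmt"

const n = 256 // hypothetical matrix dimension, for illustration only

// newMatrix builds an n x n matrix filled with ones. Each row ([]int64)
// is a contiguous block of memory; the rows themselves may be scattered.
func newMatrix() [][]int64 {
	m := make([][]int64, n)
	for i := range m {
		m[i] = make([]int64, n)
		for j := range m[i] {
			m[i][j] = 1
		}
	}
	return m
}

// sumRowMajor reads the matrix in memory order (row by row), so
// consecutive accesses hit the same cache line: good spatial locality.
func sumRowMajor(m [][]int64) int64 {
	var s int64
	for i := 0; i < n; i++ {
		for j := 0; j < n; j++ {
			s += m[i][j]
		}
	}
	return s
}

// sumColMajor reads the same values column by column, jumping to a
// different row (and usually a different cache line) on every access.
func sumColMajor(m [][]int64) int64 {
	var s int64
	for j := 0; j < n; j++ {
		for i := 0; i < n; i++ {
			s += m[i][j]
		}
	}
	return s
}

func main() {
	m := newMatrix()
	// Same result, very different memory access patterns.
	fmt.Println(sumRowMajor(m), sumColMajor(m))
}
```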
Temporal locality is one of the reasons to have CPU caches. Yet, how do we leverage spatial locality? Instead of copying a single memory location to the CPU caches, the solution is to copy a cache line. A cache line is a contiguous segment of memory.
The size of a cache line depends on the cache level (and again of the processor model). For example, here is the size of the L1 cache line on my machine:
$ sysctl -a | grep cacheline
hw.cachelinesize: 64
Instead of copying a single variable to the L1 cache, the processor will copy a contiguous 64-byte segment. For example, instead of copying a single int64 element of a Go slice, it will copy this element plus the next seven int64 elements as well.
A Concrete Application of Cache Lines in Go
Let’s see a concrete example that will show us the benefits of CPU caches. In the following code, we will c