
Getting Friendly With CPU Caches


Introduction

When a CPU needs to access a piece of data, the data has to travel into the processor from main memory.

The architecture looks something like this:

Figure 1: CPU Cache

Figure 1 shows the different layers of memory a piece of data has to travel through to be accessible by the processor. Each CPU has its own L1 and L2 cache, and the L3 cache is shared among all CPUs. When the data finally makes its way inside the L1 or L2 cache, the processor can access it for execution purposes. On Intel architectures, the L3 cache maintains a copy of what's in L1 and L2.

Performance in the end is about how efficiently data can flow into the processor. As you can see from the diagram, main memory access is about 80 times slower than accessing the L1 cache, since the data needs to be moved and copied.

Note: Memory Performance in a Nutshell: The data is from 2016, but what's important are the latency ratios, which stay fairly constant.
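
You can see this effect from Go directly. Here is a minimal sketch (not from the original post): both loops perform the same 64M byte reads over a buffer far larger than any cache, but the second orders its reads so every cache line has been evicted before it is visited again, forcing the hardware to pull each line from main memory 64 times instead of once.

package main

import (
	"fmt"
	"time"
)

func main() {
	const size = 64 << 20 // 64 MiB, far larger than any cache level
	buf := make([]byte, size)

	var sum byte

	start := time.Now()
	for i := 0; i < size; i++ { // sequential walk: each cache line is fetched once
		sum += buf[i]
	}
	sequential := time.Since(start)

	start = time.Now()
	for off := 0; off < 64; off++ { // same reads, one cache line apart per pass
		for i := off; i < size; i += 64 {
			sum += buf[i]
		}
	}
	strided := time.Since(start)

	fmt.Println(sum, "sequential:", sequential, "strided:", strided)
}

The gap you measure between the two loops is the cost of moving data, not of executing instructions.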

An Example

This might be interesting, but how does it affect you as a developer?

Let’s take a look…

Say you have the following User struct:

Listing 1

03 type Image [128 * 128]byte
04
05 type User struct {
06 	Login   string
07 	Active  bool
08 	Icon    Image
09 	Country string
10 }

Listing 1 shows a struct named User. On line 08, the struct has a field named Icon, which is declared as an Image, an array of 128×128 bytes. That is a contiguous block of 16,384 (16k) bytes of memory.

You might be wondering why the Icon field is part of the User struct. One reason might be to save an API call when the client retrieves a user from the system. This way the client doesn't need to make a second call to get the icon image.

Now imagine you need to add an API that counts the number of active users in the system by country.

Listing 2

12 // CountryCount returns a map of country to the number of active users.
13 func CountryCount(users []User) map[string]int {
14 	counts := make(map[string]int) // country -> count
15 	for _, u := range users {
16 		if !u.Active {
17 			continue
18 		}
19 		counts[u.Country]++
20 	}
21
22 	return counts
23 }

Benchmarking the Code

This API works fine, but then you decide to benchmark the function.

Listing 3

05 var users []User
06
07 func init() {
08 	const size = 10_000
09
10 	countries := []string{
11 		"AD",
12 		"BB",
13 		"CA",
14 		"DK",
15 	}
16
17 	users = make([]User, size)
18 	for i := 0; i < size; i++ {
19 		users[i].Active = i%5 > 0 // 20% non active
20 		users[i].Country = countries[i%len(countries)]
21 	}
22 }
23
24 func BenchmarkCountryCount(b *testing.B) {
25 	for i := 0; i < b.N; i++ {
26 		m := CountryCount(users)
27 		if m == nil {
28 			b.Fatal(m)
29 		}
30 	}
31 }

Listing 3 shows the benchmark code. On lines 07-22, an init function is declared to initialize the data for the benchmark. On line 17, a slice is constructed to hold 10,000 users, where 20% of them are set to be inactive. On lines 24-31, the benchmark function is declared and written to call the CountryCount function in the benchmark loop.

Listing 4

$ go test -bench . -benchtime 10s -count 5

goos: linux
goarch: amd64
pkg: users
cpu: 12th Gen Intel(R) Core(TM) i7-1255U

BenchmarkCountryCount-12     4442     2630315 ns/op
BenchmarkCountryCount-12     4143     2814090 ns/op
BenchmarkCountryCount-12     3848     2642400 ns/op
BenchmarkCountryCount-12     4255     2639497 ns/op
BenchmarkCountryCount-12     4131     2661188 ns/op

PASS
ok      users     67.257s

Listing 4 shows how to run the benchmark and provides the results. On the first line, the go test command is used to run the benchmark function 5 times, each time for at least 10 seconds. The results of the benchmark average to 2,677,498 ns per operation, or ~2.67ms per operation.

Note: You should pay attention to units when running benchmarks. The Go tooling usually reports in nanoseconds, but you might need to convert to other units. Commonly used units are nanoseconds (ns), microseconds (μs), and milliseconds (ms). If you see the following number in nanoseconds, 123,456,789, then 789 is the ns part, 456 the μs part, and 123 the ms part. The comma is your friend.
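
One trick that helps with the unit juggling (a sketch, not from the post): Go's time.Duration treats a bare integer as nanoseconds, and its String method picks readable units for you.

package main

import (
	"fmt"
	"time"
)

func main() {
	fmt.Println(time.Duration(2_677_498))   // 2.677498ms
	fmt.Println(time.Duration(123_456_789)) // 123.456789ms
}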

You should have performance goals for your code, and if ~2.67ms meets those goals then don't touch the code to make it run any faster!

Let's assume you do need better performance. How can you begin to profile it? What should you be looking for?

Let's check cache misses to start.

Cache Misses

Let's take a look at the sizes of the CPU caches on the machine I'm using to run this code:

Listing 5

$ lscpu --caches

NAME  ONE-SIZE  ALL-SIZE  WAYS  TYPE        LEVEL   SETS  PHY-LINE  COHERENCY-SIZE
L1d        48K      352K    12  Data            1     64         1             64
L1i        32K      576K     8  Instruction     1     64         1             64
L2        1.3M      6.5M    10  Unified         2   2048         1             64
L3         12M       12M    12  Unified         3  16384         1             64

Listing 5 shows how to get the CPU cache information. You can see the sizes of each cache: L1 is 48K, L2 is 1.3M, and L3 is 12M.

Remember that the Icon field in the User struct requires ~16k bytes of contiguous memory. When you multiply that by the 10k users we created for the benchmark, the slice requires ~164 MB of contiguous memory (10,000 × 16,384 bytes). That means the slice, once it's constructed, can't fit in any of the caches in its entirety.

This is going to cause the hardware to thrash, as it constantly needs to move data from main memory to cache for every element in the slice that's read. The system will behave as if random memory access is occurring, even though the memory is laid out contiguously.
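
You can sanity-check this arithmetic before reaching for profiling tools. Here is a quick sketch, reusing the types from Listing 1, that asks the compiler for the static size of a User value:

package main

import (
	"fmt"
	"unsafe"
)

type Image [128 * 128]byte

type User struct {
	Login   string
	Active  bool
	Icon    Image
	Country string
}

func main() {
	fmt.Println(unsafe.Sizeof(Image{})) // 16384
	fmt.Println(unsafe.Sizeof(User{}))  // ~16.4k: the icon dominates the struct
}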

To verify that cache misses are keeping the program from running as fast as it could, you can use the Linux perf (https://perf.wiki.kernel.org/index.php/Main_Page) utility. The perf command works on an executable, so you need to run the test executable directly and not via go test.

By default, go test will delete the test executable at the end of the run, so use the -c flag to keep it around.

Listing 6

$ go test -c

$ ls *.test
users.test

Listing 6 shows how to build and keep the test executable. Once you have the test executable, you can run it under perf.

Listing 7

$ perf stat -e cache-misses ./users.test -test.bench . -test.benchtime=10s -test.count=5

goos: linux
goarch: amd64
pkg: users
cpu: 12th Gen Intel(R) Core(TM) i7-1255U

BenchmarkCountryCount-12     3799     3153389 ns/op
BenchmarkCountryCount-12     3460     3094920 ns/op
BenchmarkCountryCount-12     3662     3073355 ns/op
BenchmarkCountryCount-12     3774     3166086 ns/op
BenchmarkCountryCount-12     3688     3091225 ns/op

PASS

Performance counter stats for
    './users.test -test.bench . -test.benchtime=10s -test.count=5':

     14,075,274,489    cpu_core/cache-misses:u/    (99.43%)

      8,668,379,326    cpu_atom/cache-misses:u/     (0.77%)

     58.990510463 seconds time elapsed
     58.926667000 seconds user
      0.119521000 seconds sys

Listing 7 shows how to run the test executable under the perf command. You can see the number of cache misses that occurred on both the core and atom CPUs. There were ~14 billion cache misses during the execution of the benchmark on the core CPU and another ~8 billion on the atom CPU.

Note: On some Intel platforms (such as Alder Lake, which is a hybrid platform) there is an atom CPU and a core CPU. Each CPU has a dedicated event list. Some events are available only on the core CPU, some only on the atom CPU, and some on both.

Using a Slice

How can we reduce the number of cache misses? If we reduce the size of a User struct so that more user values fit in the cache at the same time, we can reduce the number of cache misses. Right now we can only fit a few user values in the L1 cache at a time. Using a slice will let us significantly reduce the size of a User value.

But why will this help? Let's look at the declaration of a slice value in the language runtime:

https://github.com/golang/go/blob/master/src/runtime/slice.go#L15

Listing 8

01 type slice struct {
02    array unsafe.Pointer
03    len   int
04    cap   int
05 }

Listing 8 shows the declaration of a slice in Go. On a 64-bit OS, an integer will be 64 bits, or 8 bytes, and so will the pointer. This means a slice value is a total of 24 bytes. By using a slice, the Icon field shrinks from 16,384 bytes to 24 bytes, taking a user value from ~16k bytes down to a few dozen. A big change.

Listing 9

03 type Image []byte
04
05 type User struct {
06 	Login   string
07 	Active  bool
08 	Icon    Image
09 	Country string
10 }

Listing 9 shows the new implementation of the Image and User types. The only change is on line 03, which now uses a slice as the underlying type.
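
As a sanity check, the same unsafe.Sizeof sketch against the Listing 9 types shows how dramatic the shrink is:

package main

import (
	"fmt"
	"unsafe"
)

type Image []byte

type User struct {
	Login   string
	Active  bool
	Icon    Image
	Country string
}

func main() {
	// 64 bytes on a 64-bit OS: two 16-byte string headers, a bool
	// padded to 8 bytes, and the 24-byte slice header for Icon.
	fmt.Println(unsafe.Sizeof(User{}))
}

At 64 bytes per value, a 48K L1 data cache can now hold hundreds of User values instead of roughly three.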

To be fair, let's change the benchmark code to allocate memory for the icons when the users are initialized. This memory is no longer allocated with the construction of a User.

Listing 10

07 func init() {
08 	const size = 10_000
09
10 	countries := []string{
11 		"AD",
12 		"BB",
13 		"CA",
14 		"DK",
15 	}
16
17 	users = make([]User, size)
18 	for i := 0; i < size; i++ {
19 		users[i].Active = i%5 > 0 // 20% non active
20 		users[i].Country = countries[i%len(countries)]
21 		users[i].Icon = make([]byte, 128*128)
22 	}
23 }

Listing 10 shows the new benchmark code. The only change is the addition of line 21, which allocates memory for the Icon field.

Now let's run the benchmark again.

Listing 11

$ go test -bench . -benchtime=10s -count=5

goos: linux
goarch: amd64
pkg: users
cpu: 12th Gen Intel(R) Core(TM) i7-1255U

BenchmarkCountryCount-12     189669     63774 ns/op
BenchmarkCountryCount-12     185011     63880 ns/op
BenchmarkCountryCount-12     188542     63865 ns/op
BenchmarkCountryCount-12     187938     64261 ns/op
BenchmarkCountryCount-12     186956     64297 ns/op

PASS
ok      users     63.364s

Listing 11 shows the second run of the benchmark. You can see the average is now 64,015.40 ns per operation, or ~64 microseconds. That's about 41.8 times faster than the previous version.

To verify there are fewer cache misses, let's run the code under perf again.

Listing 12

$ perf stat -e cache-misses ./users.test -test.bench . -test.benchtime=10s -test.count=5

goos: linux
goarch: amd64
pkg: users
cpu: 12th Gen Intel(R) Core(TM) i7-1255U

BenchmarkCountryCount-12     191288     62885 ns/op
BenchmarkCountryCount-12     193185     63890 ns/op
BenchmarkCountryCount-12     185040     63432 ns/op
BenchmarkCountryCount-12     190237     63356 ns/op
BenchmarkCountryCount-12     191358     63726 ns/op

PASS

Performance counter stats for
     './users.test -test.bench . -test.benchtime=10s -test.count=5':

         1,828,311    cpu_core/cache-misses:u/    (99.69%)

       306,533,948    cpu_atom/cache-misses:u/     (0.31%)

     63.630108076 seconds time elapsed
     63.610744000 seconds user
      0.103127000 seconds sys

Listing 12 shows the second run of perf. You can see there are now ~1.8 million cache misses on the core CPU and another ~306 million on the atom CPU. Back in Listing 7 you'll see the original values were in the billions.

Now that more users fit inside the cache, we're seeing fewer cache misses and better performance.

Summary

One of the reasons I like optimizing code for my clients is the amount of knowledge I need to learn. Not only data structures and algorithms, but also computer architecture, how the Go runtime works, networking, and much more. And I also get to play with cool new tools (such as perf).

You got a significant performance boost from a small code change, but there ain't no such thing as a free lunch.

The code is riskier since now Icon might be nil.
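
One way to contain that risk (a sketch, not from the original code) is to route reads through an accessor that falls back to a shared blank icon:

var blankIcon = make(Image, 128*128) // shared fallback, allocated once

// IconBytes returns the user's icon, substituting a blank
// image so callers never have to handle a nil Icon.
func (u *User) IconBytes() []byte {
	if u.Icon == nil {
		return blankIcon
	}
	return u.Icon
}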

You also need to allocate memory on the heap per User struct, and heap allocation takes more time.
Finally, the garbage collector needs to work harder since we have more data on the heap, but that is a discussion for another blog post.

If you want to read more on the subject, you can start by reading about Cache-Oblivious Algorithms and following the links there.

You can find the code for this blog post here:

https://github.com/353words/cpu_cache
