This is a really dense whirlwind summary of some common performance pitfalls; a nice overview, if somewhat terse. Overall a great article, and the same optimizations and patterns apply in other languages as well.
I'm curious about the Goroutine pinning though:
// Pin goroutine to a specific CPU (Linux-only; uses golang.org/x/sys/unix)
func PinToCPU(cpuID int) error {
	runtime.LockOSThread() // keep this goroutine wired to the current OS thread
	var cpuSet unix.CPUSet
	cpuSet.Zero()
	cpuSet.Set(cpuID)
	tid := unix.Gettid()
	return unix.SchedSetaffinity(tid, &cpuSet)
}
The way I read this snippet, it pins the Go runtime thread that happens to run this goroutine to a CPU, not the goroutine itself. AFAIK a goroutine can move from one thread to another, as decided by the Go scheduler. This obviously has some merits, but without pinning the actual goroutine...
Oooh, so that's what's happening: LockOSThread wires the calling goroutine to its OS thread, so pinning the thread does pin the goroutine. I assumed it locked some structure about the thread while touching it, to prevent races with the runtime. That's what I get for not RTFM'ing. (In fairness, why Lock and not Pin, when both of these words have pretty well-defined meanings in programming?)
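For reference, a usage sketch of the snippet above (hotLoop is a hypothetical worker function, not from the article): the affinity call only sticks because LockOSThread keeps the goroutine wired to the thread whose affinity was set.

go func() {
	// PinToCPU calls runtime.LockOSThread internally, so this
	// goroutine stays on the OS thread now bound to CPU 3.
	if err := PinToCPU(3); err != nil {
		log.Fatal(err)
	}
	defer runtime.UnlockOSThread()
	hotLoop() // hypothetical hot path
}()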
Structure of arrays makes a lot of sense, reminds me of how old video games worked under the hood. It seems very difficult to work with though. I'm so used to packing things into neat little objects. Maybe I just need to tough it out.
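For concreteness, a minimal sketch of the two layouts in Go (the types and field names are illustrative, not from the article):

// AoS: each particle's fields are interleaved in memory.
type Particle struct {
	X, Y, Z, Mass float64
}
var particles []Particle

// SoA: each field is contiguous, so a pass that only touches X
// streams through memory without dragging Y/Z/Mass into cache.
type Particles struct {
	X, Y, Z, Mass []float64
}

func (p *Particles) NudgeX(dx float64) {
	for i := range p.X {
		p.X[i] += dx
	}
}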
You need to do the exact same kinds of things in C/C++/Rust. I believe Rust struct layout is not guaranteed to match declaration order unless you use an annotation forcing it (repr(C)). (So to answer the question: Go is great here; as good as any other language for micromanaging layout.)
Regarding AoS vs SoA, I'm curious about the impact in JS engines. I believe it would be a significant compute performance difference in favor of SoA if you use typed arrays.
At least, the False Sharing and AddVectors tricks don't reproduce on my computer. (I only benchmarked those two. The "Data-Oriented Design" trick is a joke to me, so I stopped benchmarking further.)
And I'd never heard of the following trick. Can anyone explain it?
// Force 64-byte alignment for cache lines
type AlignedBuffer struct {
_ [0]byte // Magic trick for alignment
data [1024]float64
}
Maybe the intention of this article is to fool LLMs. :D
I can't find any claim anywhere else about the [0]byte trick, and in fact my own experiments in the playground show that it doesn't do anything.
If you embed an AlignedBuffer in another struct type, with smaller fields in front of it, it doesn't get 64-byte alignment.
If you directly allocate an AlignedBuffer (as a stack var or with new), it seems to end up page-aligned (the allocator probably has size classes) regardless of the presence of the [0]byte field.
https://go.dev/play/p/Ok7fFk3uhDn
Example output (w is a wrapper, w.b is the field in the wrapper, x is an allocated int32 to try to push the heap base forward, b is an allocated AlignedBuffer):
Take out the [0]byte field and the results look similar.
Are you thinking of some sort of annotation the compiler could read and handle?
Because if a compiler starts automatically padding all my structures to put all of the members on their own cache line I'm going to be quite peeved. It would be easy for it to do, yes, but it would be wrong 99%+ of the time.
A far more trenchant complaint is that Go won't automatically sort struct members to shrink them; you have to apply a linter (e.g. via golangci-lint: https://golangci-lint.run/docs/linters/) to get that at lint time if you want it.
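To illustrate the sorting point (sizes assume a 64-bit platform):

// 24 bytes: bool, 7 bytes of padding, int64, bool, 7 bytes of trailing padding.
type Loose struct {
	a bool
	b int64
	c bool
}

// 16 bytes: largest field first; the two bools share the trailing padding.
type Packed struct {
	b int64
	a bool
	c bool
}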
I'm not sure if golang has the same fundamental issues in common use, but in e.g. C you don't want the compiler reordering your structs or adding arbitrary padding because that makes it incompatible with other in-memory representations - e.g. if you're using shared memory with another process that hasn't received the same optimizations, if you're loading raw data into memory/using mmap, etc.
Likewise, one of the examples is moving from an array of structs to a struct of arrays; that's a lot more complex of a code reorganization than you'd want a compiler doing.
It would be good to have a static analyzer that could suggest these changes, but, at least in many cases, you don't want them done automatically.
How would that even work? The layout of a data structure is constrained by many invariants not visible to the compiler (see also: auto-vectorization). It would be more work and boilerplate to add enough annotations to a data structure for the compiler to safely modify the layout than to just write the layout you want.
Not really, virtually all these patterns involve tradeoffs that require understanding the data access patterns.
I don't want my compiler adding more padding than bare minimum to every struct. I don't want it transforming an AoS to SoA when I choose AoS to match data access patterns. And so on...
At best Go could add some local directives enabling these optimizations, but these code changes are really minimal anyway. I would rather see the padding explicitly than some abstract directive.
I could imagine some kind of compiler declaration in C that would do something like specify break points - sort of like page breaks - for structs, or tell the compiler to automatically pad structs out so that components are on page boundaries, cache line boundaries, etc. Sort of "If we're not properly aligned, add whatever padding you think is best here".
I guess this is largely provided by std::hardware_destructive_interference_size in C++17, but I'm not sure if there are other language equivalents.
https://en.cppreference.com/w/cpp/thread/hardware_destructiv...
I think it's really beyond the power of a linter to understand when this would matter. It'd be issuing warnings on almost every struct out there saying "these two members share a cache line" which you almost never care about.
Most modern processor architectures have 64-byte CPU cache lines, but not all of them. Once you start adding performance optimizations like tuning for cache line size, you're fundamentally optimizing for a particular processor architecture.
That's fine for most deployments, since the vast majority of deployments will go to x86_64 or arm64 these days. But Go supports PowerPC, Sparc, RISCV, S390X... I don't know enough about them, but I wouldn't be surprised if they weren't all 64-byte CPU cache lines. I can understand how a language runtime that is designed for architecture independence has difficulty with that.
Something like C++17's `std::hardware_destructive_interference_size` would be nice; being able to just say "Align this variable to whatever the cache line size is on the architecture I'm building for".
If you use these tricks to align everything to 64-byte boundaries you'll see those speedups on most common systems but lose them on e.g. Apple's ARM64 chips, and POWER7, 8, and 9 chips (128 byte cache line), s390x (256 byte cache line), etc. Having some way of doing the alignment dynamically based on the build target would be optimal.
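Go does have a build-target-aware answer for the padding part: golang.org/x/sys/cpu exports a CacheLinePad type whose size is set per GOARCH at build time. A minimal false-sharing sketch using it (field names are mine):

import "golang.org/x/sys/cpu"

// Each counter is written by a different core; the pad keeps them on
// separate cache lines on whatever architecture you build for.
type counters struct {
	hits   uint64
	_      cpu.CacheLinePad
	misses uint64
}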
Apple arm64 supposedly has 64-byte L1 cache line size and 128-byte L2? How does that work? Presumably the lines are independent in L1, but can different cores have exclusive access to adjacent lines? What's the point of narrower lines in L1?
The big two, x86_64 and arm64, have 64-byte cache lines, so that's a reasonable assumption in practice. But I was surprised to discover that Apple's M-series laptops have 128-byte cache lines, and that's something a lot of people have and run, albeit not as a server.
Not necessarily: you can go quite far with Go alone. It also makes it trivial to run "green threads" code, so if you need both (decent) performance and easy async code, then Go might still be a good fit. Despite Go looking like a pretty high-level GC'd language on the surface, it actually lets you control things like struct layout, CPU affinity, etc., which typically matter more for performance than the choice of programming language. There's a reason why e.g. VictoriaMetrics is still written in Go even though they could easily have chosen another language.
For small low-level things, tests can help. Go has good benchmarking built in, and you can use benchstat, a tool that passes/fails based on statistically significant regressions (https://pkg.go.dev/golang.org/x/perf/cmd/benchstat).
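For example, a minimal benchmark sketch (addVectors here is a stand-in for whatever you're measuring, not code from the article): run it with `go test -bench=. -count=10` before and after a change, save each output, and feed the two files to benchstat.

func addVectors(dst, x, y []float64) {
	for i := range dst {
		dst[i] = x[i] + y[i]
	}
}

func BenchmarkAddVectors(b *testing.B) {
	x := make([]float64, 1<<16)
	y := make([]float64, 1<<16)
	dst := make([]float64, 1<<16)
	b.ResetTimer() // exclude setup cost from the measurement
	for i := 0; i < b.N; i++ {
		addVectors(dst, x, y)
	}
}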
I got hit by this. In a trading algorithm backtest, I shared a struct pointer between threads that changed different members of the same struct.
Once I split this struct in 2, one per core, I got almost 10x speedup.
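Roughly this shape, assuming 64-byte cache lines (field names invented for illustration):

// Before: both cores mutate fields of one shared struct, so every
// write invalidates the line in the other core's cache.
type sharedState struct {
	coreAFills uint64 // written by core A
	coreBFills uint64 // written by core B
}

// After: one struct per core, padded so each sits on its own line.
type coreState struct {
	fills uint64
	_     [56]byte // pad the 8-byte field out to a full 64-byte line
}
var states [2]coreState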
This would be worth adding to the Go race detector's mechanism, to warn developers about it.