Algorithms for Modern Processor Architectures

(lemire.github.io)

284 points | by matt_d 2 days ago

11 comments

  • Tuna-Fish 1 day ago
    > Rules of thumb

    > Processors can 'learn' thousands of branches: benchmark over massive inputs.

    > Pick a solution without branches when it provides the same performance.

    Note that it is a terrible idea to benchmark small parts of your software individually and compose together the parts that are individually the fastest. Branch history tables and branch target buffers are a shared resource, and performance can crater horribly if you routinely exceed what's provided. You should generally favor straight-line code even when it's slightly slower in microbenchmarks.

    • ahartmetz 1 day ago
      > Note that it is a terrible idea to benchmark small parts of your software individually and compose together the parts that are individually the fastest

      Certainly beats composing together the parts that are individually the slowest or not profiling at all. When you have to start worrying about shared CPU resources (main memory is a different matter), you are already doing much better than most software.

    • dgacmu 1 day ago
      It's interesting how this happens with a lot of shared CPU resources, particularly cache, and even more particularly icache: when studying a small isolated component, it's really easy to end up at a solution that uses more total instructions, or relies on things like large lookup tables, and composes badly in a more complicated context.
  • appreciatorBus 2 days ago
    Looks like this was delivered earlier today at SEA 2025, I hope there's video that will be available soon!

    https://x.com/lemire/status/1947615932702200138

    • curiouscoding 1 day ago
      I don't think talks are being recorded, unfortunately.
  • cb321 1 day ago
    This is a good set of slides. Dan is a good guy. There are a few nits I would pick. Sqrt(N) convergence comes from independence, not normality -- independence gives linearity of variance. { So, N IID samples of any distribution have a sum with N times the variance, but after dividing by N you get the sqrt(N) shrinkage. } There is, of course, a higher-order relationship between the variance / "scale^2" of the distro and its tails, which statisticians refer to as "shape". He later goes on to mention the dependence problem, though, and the minimum-dt solution relied upon by, e.g., https://github.com/c-blake/bu/blob/main/doc/tim.md. So, it's all good. He may have covered it in voice, even.
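
    Spelled out, since it's just IID bookkeeping (writing sigma for the per-sample standard deviation; no normality needed anywhere):

      \operatorname{Var}\Big(\sum_{i=1}^{N} X_i\Big) = \sum_{i=1}^{N}\operatorname{Var}(X_i) = N\sigma^2
      \quad\Longrightarrow\quad \operatorname{Var}(\bar{X}) = \frac{\sigma^2}{N}
      \quad\Longrightarrow\quad \operatorname{sd}(\bar{X}) = \frac{\sigma}{\sqrt{N}}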

    He also mentions the Sattolo algorithm used by https://github.com/c-blake/bu/blob/main/doc/memlat.md to do his memory latency measurements. One weird thing was how he said that because 1 byte/cycle is 4 GB/s, things are "easily CPU bound", while I feel like I've been "fighting The Memory Wall for at least 3 decades now..." even just from super-scalar CPUs; but he later does some vectorization stuff. That more relates to what calcs you are doing, of course, but high-bandwidth memory is a big part of what nVidia is selling.

    • mananaysiempre 1 day ago
      > One weird thing was how he said that because 1 byte/cycle is 4 GB/s, things are "easily CPU bound", while I feel like I've been "fighting The Memory Wall for at least 3 decades now..." even just from super-scalar CPUs

      I mean, he’s right? For mostly-linear processing like Lemire’s projects that he mentions in his talk (simdutf, simdjson), memory latency is completely absorbed by the prefetcher, so only memory bandwidth matters. You can’t saturate DDR4 bandwidth, or even a tenth of DDR5 bandwidth (or, like Lemire is fond of mentioning, even PCIe Gen 4 SSD bandwidth) if your code (after autovectorization) is structured around doing a thing per byte. Probably not even a thing per UTF-16 code unit. So in those cases—no, you’re not hitting the memory wall. Not even close. You’re hitting the clock-speed wall.

      • cb321 1 day ago
        Well, like I said and you said - it depends on the calculations. :-) FWIW, I never meant to suggest he was wrong in any absolute sense, but I did do a double take reading that.

        Besides the fine distinction you made about byte-at-a-time, one other way I sometimes try to express the variety is "BLAS L1/L2/L3". These levels roughly correspond to how much "loop nesting" there is, or roughly how well we can amortize the cost of memory transfers. So, L1 would be scaling each element by 10.0X, say, while L3 would be matrix multiply. One might have to be steeped in numerical linear algebra to appreciate that kind of analogy/terminology, though. Latency can also be a big issue vs. BW in "it all depends."
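
        A toy sketch of the contrast (plain C, names mine): the L1-style loop does O(1) work per byte moved, while the L3-style triple loop does O(n) work per byte of C written, which is what lets the memory cost be amortized at all.

          #include <stddef.h>

          /* "BLAS L1"-ish: scale a vector. ~1 flop per 16 bytes moved,
             so it is bandwidth-bound essentially by construction. */
          void scale(double *x, size_t n, double a) {
              for (size_t i = 0; i < n; i++)
                  x[i] *= a;
          }

          /* "BLAS L3"-ish: naive matrix multiply. ~2n flops per element of C
             written, so (with blocking) memory traffic is amortized over many flops. */
          void matmul(const double *A, const double *B, double *C, size_t n) {
              for (size_t i = 0; i < n; i++)
                  for (size_t j = 0; j < n; j++) {
                      double s = 0.0;
                      for (size_t k = 0; k < n; k++)
                          s += A[i*n + k] * B[k*n + j];
                      C[i*n + j] = s;
                  }
          }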

        Ultimately, it comes down to "CPUs can do A LOT per clock cycle" these days - multi-way issue of up to 64-way vector instructions (byte-wise on avx512) and so on - and what 1 cycle means can just vary by orders of magnitude (not even including multi-core orders of magnitude and then distributed orders of magnitude). But "not always". And it all depends. Part of CPUs getting so "big" is that statements about them have a lot more "play/drift" in them. So, cross-talk about issues like this has also increased. As have, probably, double takes like the one I mentioned. :-)

  • adsharma 1 day ago
    The most relevant topic in this space: processors got wider, more speculative, etc., but if you measure IPC across a wide range of workloads, it is way less than 1 instruction per cycle.

    Talking about this to software engineers (this was in the pre vibe-code era), you get responses like "this is too low level and not worth it".

    More such talks needed!

  • scottgg 1 day ago
    If you’re already competent with systems languages and keen to _go on a wild deep dive_ into this sort of perf engineering, where’s a good place, or what are good resources, to get started?
    • matt_d 1 day ago
      Start here: "The Microarchitecture of Superscalar Processors," J.E. Smith and G.S. Sohi, Proc. IEEE, vol. 83 (1995) https://doi.org/10.1109/5.476078 https://courses.cs.washington.edu/courses/cse471/01au/ss_cgi...

      It's very brief (so if you only have time to read one thing you won't waste a single minute: it's to the point) and (perhaps surprisingly, given its publication date) highly relevant: In fact, right after reading Section IV.C (AMD K5) you should be able to immediately jump to and understand the microarchitectural diagrams of, say, AMD Zen 2: http://web.archive.org/web/20231213180041/https://en.wikichi... http://web.archive.org/web/20231213180041/https://en.wikichi... ...and any other contemporary CPU.

      The Smith & Sohi paper is going to give you great intuition for what's essentially implemented by all modern CPUs: the restricted dataflow (RDF) architecture.

      A great follow-up is a short blog post "A whirlwind introduction to dataflow graphs" by Fabian “ryg” Giesen: https://fgiesen.wordpress.com/2018/03/05/a-whirlwind-introdu...

      For context, all superscalar dynamically scheduled (out-of-order) CPUs implement RDF, introduced by Yale Patt's research group's work on High Performance Substrate (HPS). This work pretty much defined the modern concept of a restricted dataflow CPU: breaking complex instructions into smaller micro-operations and dynamically scheduling these to execute out-of-order (or, to be more specific, in a restricted dataflow order) on multiple execution units.

      (If you're curious, the "restricted" part comes from the finite instruction window delimiting the operations to be scheduled, which stems from the finite size of all the physical resources in real hardware, like the ROB/reorder buffer. The "dataflow" comes from having to respect data dependencies, like `A = 1` and `B = 2` having to execute before `C = A + B`, or `MOV A, 1` and `MOV B, 2` having to execute before `ADD C, A, B`; but note that you can freely reorder the aforementioned moves among themselves as long as you execute them before the add: a schedule/execution order of `MOV B, 2; MOV A, 1; ADD C, A, B` is just as valid).

      For the historical background, see "HPS papers: A retrospective", https://www.researchgate.net/publication/308371952_HPS_paper... (Minor pet peeve warning:) It also sets the record straight w.r.t. RISC being orthogonal to the historical development of modern superscalar out-of-order CPUs: In particular, it's worth noting that the aforementioned micro-operations have absolutely _nothing_ to do with RISC! Another great resource on that is Fabian's video "CPU uArch: Microcode", https://www.youtube.com/watch?v=JpQ6QVgtyGE (also worth noting that micro-operations and microcode are _very_ different concepts; that's also very well covered by the video).

      Another good, succinct description of the historical background is the 2024 IEEE CS Eckert-Mauchly Award (Wen-mei Hwu was a PhD student in the aforementioned Yale Patt's group): "Hwu was one of the original architects of the High-Performance Substrate (HPS) model that pioneered superscalar microarchitecture, introducing the concepts of dynamic scheduling, branch prediction, speculative execution, a post-decode cache, and in-order retirement." - https://www.acm.org/articles/bulletins/2024/june/eckert-mauc...

      On a side note, the load-store architecture introduced by the CDC 6600 (designed by Seymour Cray in the 1960s) is sometimes mistaken for RISC (which came a decade+ later, arguably introduced in the IBM 801 designed by John Cocke in the late 1970s/early 1980s), https://en.wikipedia.org/wiki/Load%E2%80%93store_architectur...

      One could say the load-store architecture does have an influence on compiler backend implementation, after a fashion (thinking of instruction scheduling, where a complex operation combining, say, a LOAD with an arithmetic operation OP is broken down in the scheduling models into separate LOAD and OP operations/effects).

      For more, I'd recommend:

      "Performance Analysis and Tuning on Modern CPU" by Denis Bakhvalov https://github.com/dendibakh/perf-book/releases

      Agner Fog's Optimization Manuals https://www.agner.org/optimize/

      Prof. Onur Mutlu's Lecture Videos and Materials https://people.inf.ethz.ch/omutlu/lecture-videos.html

      These are absolutely fantastic--I followed all of these a few years back and can vouch for their quality and for being up to date, or at least decades ahead of the general computer architecture textbooks: the lectures and readings cover contemporary work, including ISCA and MICRO papers that have only recently found their way into production CPUs (e.g., the hashed perceptron predictor that's one of the branch predictor units in AMD Zen 2, http://web.archive.org/web/20231213180041/https://en.wikichi...).

      There are more, topic-specific texts that are very good, e.g., A Primer on Memory Consistency and Cache Coherence, Second Edition; Synthesis Lectures on Computer Architecture 15:1 (2020); Vijay Nagarajan, Daniel J. Sorin, Mark D. Hill, David A. Wood; open access (freely available) at https://doi.org/10.2200/S00962ED2V01Y201910CAC049 but when you are at this point you're likely going to be good at finding the relevant resources yourself, so I'm going to leave it to you to explore further :-)

      • scottgg 1 day ago
        Hey Matt, this is comprehensive - thanks! Excited to get stuck in.
  • benob 1 day ago
    Are there efforts to include the necessary context in compilers to autovectorize?
    • yvdriess 1 day ago
      What do you mean by necessary context?

      Modern compilers all autovectorize really well. Usually writing plain canonical loops with plain C arrays is a good way to write portable optimal SIMD code. The usual workflow I use is to translate the vector notation (RIP Cilk+ array syntax) in my paper notes to plain C loops. The compiler's optimization report (-qopt-report for icx, gcc has -fopt-info-vec and -fopt-info-vec-missed) gives feedback on what optimizations it considered and why it did not apply them. In more complex scenarios it can be helpful to add `#pragma omp simd` pragmas or similar to overrule the C semantics.
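
      For example, loops of this shape (a minimal sketch; the function names are mine) vectorize cleanly, and the pragma is only there to allow the FP reduction to be reordered without going full -ffast-math:

        #include <stddef.h>

        /* Canonical loop over plain C arrays: unit stride, no aliasing (restrict),
           runtime trip count. gcc -O3 -fopt-info-vec / icx -qopt-report will
           report it as vectorized. */
        void saxpy(float *restrict y, const float *restrict x, float a, size_t n) {
            for (size_t i = 0; i < n; i++)
                y[i] += a * x[i];
        }

        /* The pragma (honored with -fopenmp-simd or -fopenmp) tells the compiler
           that reassociating the floating-point sum is fine for this loop only. */
        float sum(const float *x, size_t n) {
            float s = 0.0f;
            #pragma omp simd reduction(+:s)
            for (size_t i = 0; i < n; i++)
                s += x[i];
            return s;
        }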

      • Sesse__ 1 day ago
        > Modern compilers all autovectorize really well. Usually writing plain canonical loops with plain C arrays is a good way to write portable optimal SIMD code.

        I don't think I've seen this happen once at work (always using recent Clang) the last… three years? And I've written a fair bunch of SIMD code.

        Autovectorization works fairly well when all of your code is very vertical (no interference between the elements) _and_ the compiler knows that your N is divisible by some pretty number (or else, that code bloat is OK enough that it can unroll and deal with the tail). And in the occasional very standard horizontal case (e.g., sum a bunch of bytes, N % 16 == 0). Otherwise, it's just a sad, sad story IME. For most of the algorithms in the presentation at hand, the compiler will have no chance at autovectorizing them.

        • camel-cdr 1 day ago
          I think you can coax most of them into autovectorizing via OpenMP and a bit of restructuring. Can you give 1-3 examples of these types of problems where you think autovectorization is lacking? I want to give it a try.
          • Sesse__ 1 day ago
            Let me give you the last couple of routines that I was involved in vectorizing, no cherry-picking:

              - Skip to the first unmatched } (i.e., if there are two { in there, you are looking for the third }), ignoring anything inside "", and allowing that any character may be escaped using a backslash. (I've simplified away a bunch of complicating cases, or it would just sound unreasonable. The optimal path depends on whether you have e.g. PCLMUL.)
              - Skip to the next <, \r, \0 or & (note that the optimal path on SSSE3+ uses pshufb, or similarly VTBL on NEON) and return its position. A typical case is that it's ~10 bytes away, but it could be kilobytes.
              - Parse the decimal part of a double, given that only accuracy up to 1e-7 is needed. (The optimal solution differs somewhat between NEON and SSE2.)
            
            The second one is the only one where I believe that autovectorization could have reasonably done a fair job, since it's so standard (but it didn't in practice).
            • camel-cdr 1 day ago
              So I think I managed to do all of the above with varying degrees of success: https://godbolt.org/z/WY99vxs76

              * parse_fract: I got that one down to 23 instructions on icx, although gcc took 81 and clang 91

              Since both of the other two return an index, I decided to keep it simple and use a 128-iteration inner loop that accumulates the index into an 8-bit integer, so I don't have to widen. 128 instead of 256, because I needed a sentinel value.

              * find_unmatched: obviously the compiler couldn't figure out the clmul trick. icx: 0.86 instr/byte, gcc: 0.625 instr/byte; clang failed to vectorize the +-scan.

              * find_special: The LUT didn't end up working that well, so I'm doing the four comparisons separately. icx: 0.45 instr/byte, gcc: 0.30 instr/byte, clang: 0.25 instr/byte

              (I used znver5 as the target for gcc and clang, but znver4 for icx)

              These were more painful to do than they needed to be; somebody should try it with ISPC and see how that compares.

              I didn't know about the inclusive scan support in OpenMP before writing this. It's almost good, but the implementations are slightly buggy, and it seems to be designed with threading, not SIMD, in mind - in the sense that you have to write the scan into an array. In SIMD you don't need that, while in multi-threading you need the buffer to do a tree scan-reduction.

              The other problem is early-exit loops, which should totally be permissible. icc also had support for early_exit, but icx doesn't support it anymore. Wouldn't you "just" need to do an OR reduction on the condition mask and break if any bit was set?

              Thanks for the suggestions. Sounds like you are working on some kind of parser?

              • Sesse__ 1 day ago
                So I looked at these briefly without AVX512, because only a tiny fraction of people have anything like that and the claim was that this would be a great way of making portable SIMD :-) Also, obviously you cannot use -ffast-math in real code.

                parse_fract seems really, really inefficient. Even on plain SSE2, you can make do with unpack + sub + muladd + shuffle + add and then some scalar cleanup (plus the loads, of course). icx looks to be, what, 40 SIMD instructions?

                find_unmatched is just scalar code cosplaying SIMD; 150 instructions and most of them do one byte at a time.

                find_special seemingly generates _thousands_ of instructions! What you need, for each 16-byte loop, is a pshufb + compare + pmovmskb (0x80.pl is down now, but IIRC it's explained there). It helps that all of the values in question differ in one of the nibbles.
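
                Roughly this shape (SSSE3 intrinsics; an untested sketch with the tail handling omitted). It works because '<', '\r', '\0' and '&' all have distinct low nibbles:

                  #include <stddef.h>
                  #include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */

                  /* lut[low nibble] is the one target byte with that nibble, 0x80 otherwise.
                     pshufb indexes by the low nibble (and zeroes lanes whose high bit is set),
                     so a lane compares equal to the input exactly for the target bytes. */
                  static const unsigned char lut[16] = {
                      0x00, 0x80, 0x80, 0x80, 0x80, 0x80, 0x26, 0x80,
                      0x80, 0x80, 0x80, 0x80, 0x3C, 0x0D, 0x80, 0x80
                  };

                  size_t find_special(const unsigned char *p, size_t n) {  /* assumes n % 16 == 0 */
                      const __m128i table = _mm_loadu_si128((const __m128i *)lut);
                      for (size_t i = 0; i < n; i += 16) {
                          __m128i in  = _mm_loadu_si128((const __m128i *)(p + i));
                          __m128i hit = _mm_cmpeq_epi8(_mm_shuffle_epi8(table, in), in);
                          int mask = _mm_movemask_epi8(hit);
                          if (mask)
                              return i + (size_t)__builtin_ctz(mask);
                      }
                      return n;  /* not found */
                  }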

                I am not that convinced by this being a usable universal technique :-) Even after switching back to supporting AVX512, the generated code is a huge mess. Seemingly these three functions together clock in at 6+ kB, which is ~10% of your entire icache.

                > Sounds like you are working on some kind of parser?

                All of these are for HTML/CSS, yes.

                • camel-cdr 1 day ago
                  Yeah, I wouldn't use this type of thing for these problems; this was a huge mess. But I think code designed for autovec is reasonable as a scalar base implementation for a larger set of problems than people think.

                  I've seen the problem before, where people explicitly vectorized something (SSE), but the autovectorized code (AVX) outperformed the explicitly vectorized version because the compiler could take advantage of newer extensions.

                  You really should be able to use an ISPC-like subset in your C code; OpenMP goes in the right direction, but it's not designed with SIMD as the highest priority.

                  • Sesse__ 20 hours ago
                    Yes, I'd say that the strongest part of autovectorization is that you can get more-or-less automatic support for wider/newer instruction sets than what you had when you wrote the code; older SIMD, like all other code, tends to rot. Of course, this is predicated on either having function multiversioning (works, but very rare) or being able to frequently raise the minimum target for your binary.
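
                    (On GCC and recent Clang, multiversioning is e.g. the target_clones attribute; a minimal sketch, the function is made up:)

                      #include <stddef.h>

                      /* One clone is emitted per listed target, plus an ifunc
                         resolver that picks the best one at load time. */
                      __attribute__((target_clones("default", "avx2", "avx512f")))
                      long sum_bytes(const unsigned char *p, size_t n) {
                          long s = 0;
                          for (size_t i = 0; i < n; i++)
                              s += p[i];
                          return s;
                      }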
                    • janwas 20 hours ago
                      You also get the automatic support for newer instructions (and multiversioning) with a wrapper library such as our Highway :)
  • anonymousDan 1 day ago
    Is there a talk to go with the slides by any chance?
  • DrNosferatu 1 day ago
    I’m surprised so much branching isn’t more costly.
    • yvdriess 1 day ago
      Branch predictors have gotten really good, and it often now makes more sense to rely on them rather than working away the branches.

      For example, modern compilers will very rarely introduce conditional moves (cmov) on x86, because they are nearly always slower than simply branching. It might be counterintuitive, but a predicted branch breaks the dependencies between the micro-ops of the conditional and those of the clause. So if your cmov's conditional depends on a load, you need to wait for that load to complete before it can execute.

      Always benchmark with at-scale data and measure.

      • TuxSH 1 day ago
        > For example, modern compilers will very rarely introduce conditional moves

        For conditionally-selected data that lives in registers (and occasionally, on the stack), GCC seems to always use cmov (as it is much cheaper than a branch with possibly p=0.5 after all)

        You do have a very good point about data dependencies.

        Here's an Aarch32 vs x86 vs Aarch64 vs x64 comparison (a rough sketch of the compared expressions follows the list): https://godbolt.org/z/hEac4sz7h

        - for "c ? a : b" (where a, b, c are func args), all 4 versions use their version of cmov

        - for "c ? *a : *b", x64 version uses cmov on the address whereas Aarch64 uses a "full" branch

        - Aarch32 always uses conditional instructions in these 2 expressions, and additionally, "a * (b & 1)" gets optimized into "a & ((b & 1) ? ~0 : (b & 1) /* = 0 */)"
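
        The source behind the comparison is roughly this (my reconstruction, not the exact godbolt snippet):

          /* The three cases compared, approximately. */
          int sel(int c, int a, int b)                     { return c ? a : b; }    /* cmov/csel everywhere */
          int sel_deref(int c, const int *a, const int *b) { return c ? *a : *b; }  /* x64: cmov on the address */
          unsigned mask_mul(unsigned a, unsigned b)        { return a * (b & 1); }  /* Aarch32: conditional select */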

    • mycatisblack 1 day ago
      Depends on the branch predictor: on a correctly predicted branch, everything's loaded and set; on a mispredicted branch, flush it all and load again.

      If you know the branch predictor algorithm you can optimise for it.

      Edit: it’s on p27

  • NooneAtAll3 2 days ago
    apple still uses utf16?
    • vanderZwan 1 day ago
      JavaScript does, so the web does, so by extension Apple probably does care about utf16.
      • jiggawatts 1 day ago
        Also: Java, DotNet, and Windows all use 2-byte char types.
        • looperhacks 1 day ago
          Akchyually! These days, Java uses Latin-1 if no characters outside Latin-1 are used. Only if full Unicode is necessary does it use UTF-16.
          • josephg 1 day ago
            Apple does something similar for strings in objc and Swift. They do lots of other optimisations besides - like small string optimisation for short strings to avoid heap allocations.
    • flohofwoe 1 day ago
      NSString is UTF-16 internally: https://developer.apple.com/documentation/foundation/nsstrin...

      ...it's trivial to get UTF-8 strings into and out of an NSString though so the internal representation doesn't matter all that much.

      More importantly, all of the actual user-facing side of macOS is UTF-8 (e.g. you can simply copy-paste a UTF-8 encoded Unicode string literal into a C source file, printf() it, and it will show up correctly on the terminal without tinkering with text editor or locale settings).

    • markasoftware 2 days ago
      is this talk about apple? Regardless, lots of language runtimes still use utf16 (eg java, qt, haskell), and windows certainly still uses utf16.
  • phkahler 2 days ago
    Pentium 4 didn't hit 3.8GHz. It melted at 1.4 or so.
    • wtallis 2 days ago
      The Pentium 3 is what eventually topped out at 1.4 GHz, for the 130nm Tualatin parts introduced in 2001. The Pentium 4 started at 1.4GHz and 1.5GHz with the 180nm Willamette parts introduced in 2000. Those were eventually released with speeds up to 2.0GHz. The 130nm Pentium 4 Northwood reached 3.4GHz in 2004, and the 90nm Pentium 4 Prescott hit 3.8GHz later in 2004.
    • necubi 2 days ago
      The Pentium 4 HT 670, released in 2005, came factory-clocked at 3.8 (https://www.techpowerup.com/cpu-specs/pentium-4-ht-670.c20)

      Netburst lasted a long time as Intel was floundering, before Core Duo was released in 2006.

    • bayindirh 1 day ago
      Intel released a couple of Pentium 4's from different cores topping out at 3.8GHz [0].

      Tom's Hardware overclocked one of these Northwood Pentium 4's to 5 GHz with liquid nitrogen and a compressor [1].

      Those were the days, honestly.

      [0]: https://en.wikipedia.org/wiki/Pentium_4

      [1]: https://www.youtube.com/watch?v=z0jQZxH7NgM

    • vrighter 1 day ago
      Funny... my pentium 4 ran at 1.6GHz
    • anthk 1 day ago
      It hit 3.8, and for a while it surpassed multi-core chips' performance because games were built to work on single cores rather than being multithreaded/multicore. The same happened with some emulators.
  • IgnaciusMonk 2 days ago
    I do not want to be rude, but this is exactly why LLVM being in the hands of the same entity that controls access to / owns the platform is insane.

    edit - #64 E! Also, I always say, the human body is the most error-prone measuring device humans have at their disposal.

    • bayindirh 1 day ago
      Both LLVM and GCC are supported by processor manufacturers directly. Yes, Apple and Intel have their own LLVM versions, but as long as they don't break compatibility with GCC and don't explicitly prevent porting, I don't see a problem.

      I personally use the GCC suite exclusively though, and while LLVM is not my favorite compiler, we can thank them for spurring the GCC team into action to improve their game.

      • dazzawazza 1 day ago
        > ... and while LLVM is not my favorite compiler, we can thank them for spurring GCC team into action for improving their game.

        Exactly. I think people have forgotten just how poor GCC was 15 years ago. Both teams are doing excellent work. Even M$ has been upping its game with its compiler!

    • almostgotcaught 2 days ago
      Whose hands exactly is LLVM in?
    • gleenn 2 days ago
      Can you be more explicit? Is it because they are optimizing too much to a single platform that isn't generalizable to other compilers or architectures? What's your specific gripe?
    • IgnaciusMonk 2 days ago
      Also, to be more controversial: Red Hat deprecated x86_64-v1 & x86_64-v2, and people were crying because of that....
      • volf_ 2 days ago
        A commercial enterprise is dropping support for older cpu architectures in their newer OSs so they can improve the average performance of the deployed software?

        Don't see how that's controversial. It's something that doesn't matter to their customers or their business.

      • bayindirh 1 day ago
        The newest x86_64-v1 server is older than a decade now, and I'm not sure -v2 is deprecated. RockyLinux 9 is running happily on -v2 hardware downstairs.

        Oh, -v2 is deprecated for RH10. Not a big deal, honestly.

        From a fleet perspective, I prefer that more code uses more advanced instructions on my processors. Efficiency possibly goes up on hot code paths. What's not to love?

        • IgnaciusMonk 1 day ago
          Rocky Linux is in cahoots with Oracle; do not support that with anything, not even with words. Go with Alma Linux if you need Red Hat under a different name, but for the love of anything good in this world, boycott everyone friendly with Ellison.
        • homebrewer 1 day ago
          One more reason to switch to a better alternative:

          https://lwn.net/Articles/1010868/

          https://almalinux.org/blog/2025-05-27-welcoming-almalinux-10...

          tl;dr: AlmaLinux will support v2 in EL10 as a separate rebuild in the near future.

          • bayindirh 1 day ago
            In our case OS selection is done on a case-by-case basis, and we don't take sides. The deprecation of v2 has no practical implications for us, either.

            This is also the same on the personal level. I use the OS that is most suitable for the task at hand, and the root OS (Debian / Red Hat / etc.) doesn't matter; I'm equally comfortable with all of them.

          • jonathanspw 1 day ago
            We already support it - the build is live.

            https://repo.almalinux.org/almalinux/10/isos/x86_64_v2/

            We even do a full EPEL rebuild for it as well.

        • scns 1 day ago
          > The newest x86_64-v1 server is older than a decade now

          Did you mean v3?

          • bayindirh 1 day ago
            No, v1. I mean, you haven't been able to buy an x86_64-v1 server for a decade now, and if you have one, there's a very slim chance it's still working unless it's new old stock.

            If it has seen any decent amount of workload during its lifetime, it possibly has a couple of ICs which have reached the end of their electronic life and are malfunctioning.

      • anthk 1 day ago
        Gemini Lake runs pretty well. If that happens, bye Fedora Bazzite with Linux-Libre on top.