By request, my usual "the least interesting part about AVX-512 is the 512-bit vector width" infodump in thread form.
So here goes, a laundry list of things introduced with AVX-512 that I think are way more important to typical use cases than the 512-bit vectors are:
* Unsigned integer compares, at long last (used to be either 3 or 2 instructions, depending on what you were comparing against and how much of the prep work you could hoist out)
* Unsigned int <-> float conversions. Better late than never.
* Down converts (narrowing), not just up - these are super-awkward in AVX2 since the packs etc. can't cross 128b boundaries
* VPTERNLOGD (your swiss army railgun for bitwise logic, can often fuse 2 or even 3 ops into one)
* VPTESTM[BWDQ] (logical test instead of arithmetic compare for vectors - another one that replaces what used to be a 2- or 3-instruction sequence)
* VRANGEPS (meant for range reduction, can do multiple things including max-abs of multiple operands - another one that folds what is typically 2-4 instructions into one)
* Predicates/masks and "free" predication
* VRNDSCALE (round to fixed point with specified precision, yet another one that replaces 3 ops - mul, round, mul)
* VPRO[LR] (bitwise rotate on vectors, super-valuable for hashing; yet another one that replaces what used to be 3 ops: shift, shift, or)
* VPERM[IT]2* (permute across two source registers instead of one - PPC/SPUs had this, ARM has this, and it's yet another one that replaces usually 3 instructions with one: two permutes and an OR or similar. And 2 of those 3 replaced instructions go to the often-contended full shuffle network)
* Broadcast load-operands (4-16x reduction in L1D$ space for all-lane constants for free)
* Compress/expand (these replace what is typically an awkward movemask + big table + permute combination; super-useful to have, and saves the ~1k-4k [sometimes more] of tables you'd otherwise keep warm in L1D for no good reason)
* Disp8*N encoding (purely a code size thing; this alone does a very good job of offsetting the extra cost of EVEX)
* Variable shifts not just on DWords/QWords, but also on Words. (THANK YOU)
* (VBMI2) VPSH[LR]D(V?) - double-wide shifts. Yet another one that replaces what would usually be 3 instructions.
Notice a theme here? There's just a ton of efficiency stuff in there that patches holes that have been in the ISA forever, replacing 3-instruction sequences with a single one - and it's often a 3-insn sequence with 2 uops on a contended port going down to 1.
More niche stuff:
* VNNI - VPDPBUSD is yet another "replaces 3 insns with one" example, in this case PMADDUBSW, PMADDWD, PADDD
* 52-bit IFMA, mostly interesting for crypto/big int folks
* VPMULTISHIFTQB in VBMI for bit wrangling
* The bfloat16 additions
* Fixed versions of RCP and RSQRT, namely RCP14 and RSQRT14, with an exact spec so they don't differ between Intel and AMD
* Vector leading zero count and pop count
* Math-library focused insns like VFIXUPIMM and VFPCLASS.
And last but not least, AVX-512 doubles the FP/SIMD architectural register count from 16 to 32.
This, combined with the 512b vectors that are the "-512" part, quadruples the amount of architectural FP/SIMD state, which is one of the main reasons we _don't_ get any AVX-512 in the "small" cores.
Now the 32 arch regs can pay dividends for all SIMD code; the 512 bits only pay off for code that can actually go that wide efficiently, which is not _that_ frequent.
In short, the 512-bit-ness that's part of the name is probably what provides the least utility to most workloads (outside of things like HPC), it's what has hindered proliferation the most, and it's the main reason we still keep getting new x86 designs without it.
I wish that around Skylake Intel had defined an "AVX-256" subset that was in consumer SKUs, which is "all of AVX-512 except for the actual 512b-wide vectors".
Because if they had, we'd now be 8 years into "AVX-256" on clients and would probably have a very decent percentage of machines with it.
Then we would have all that goodness I just listed, never mind the 512b vectors that are honestly pretty niche.
@rygorous Or the double-pumped stuff that was done by AMD for AVX-512 (and one or both did for AVX2 as well IIRC???)
Get the instruction set out there, but not necessarily with full performance benefits.
@scottmichaud No that doesn't help.
The problem is the amount of state, not the width of the implementation.
@rygorous Oh really?
@scottmichaud I like to dig out die shots of the original AMD Zen for a scale reference.
Wikichip has die shots of the original Zen 1: https://en.wikichip.org/wiki/File:amd_zen_core.png
Here's my annotated version with the FP/SIMD register file marked, and capacities noted.
@scottmichaud AVX-512 _architectural_ state, with absolutely no extra registers for renaming, just the bare minimum you need for a single hardware thread (can't do 2 HW threads either), is 32 regs * 512 bits = 2KiB.
Why are RFs this big compared to more compact structures like the L1D cache? Mainly because they need a ton of ports. I think Zen1's FP RF needs around 8-9 read plus 4 write ports per cycle. A banked L1D is either single- or dual-ported usually.
@scottmichaud Zen 1 isn't super-dense either, that's still a part targeting quite high frequencies.
The main reason you don't see AVX-512 in smaller cores is basically that: they literally don't have the area budget for a register file so huge you can see it from space.
@scottmichaud Note the "mold" growing to either side of the FP/SIMD RF is the actual SIMD unit logic. Programmers tend to overestimate how much area is spent on data path (i.e. the stuff that does the actual computation) vs. logistics. In this case, coarsely eyeballing it, looks like more than 1/3rd of the entire FP/SIMD unit is just that register file.
@scottmichaud also pretty much anything you see in that image that has this obvious regular structure is memory of one kind or another - usually caches.
@rygorous @scottmichaud Yup, 1/3 RF, 1/3 ALU, 1/3 "everything else" is pretty normal (just for the FP unit itself)
@rygorous @scottmichaud It's called die area because you die when you realize how much it is.
@rygorous Yeah, an AVX256 with all the instructions but only 16 registers and 256 bits would be 95% of the goodness.
We only had 32 registers because it was an in-order machine and latency-hiding was the compiler's job. If it's an even mildly-OOO machine, 16 is just fine - it can add more behind your back according to perf requirements.
@rygorous And instead of just adding some new feature bit to allow AVX-512VL without AVX-512F (i.e. 128/256-bit AVX-512 instructions without 512-bit), Intel seems to prefer piecemeal backporting EVEX instructions to VEX (e.g. AVX-VNNI, AVX-IFMA).
@rygorous As a proud father, I endorse this list. We chose the width before AVX existed, so both 256b and 512b were on the table, and it was a very close decision! The deciding factor was (a) wider was better and (b) 512b is 16x float32, and there was a nice 4x4 symmetry to 16 lanes that we thought would be more important than it turned out to be (because in practice nobody cared about porting SSE code as-is).
More info on the origins here:
https://tomforsyth1000.github.io/papers/LRBNI%20origins%20v4%20full%20fat.pdf
@rygorous how good are compilers at emitting all of that? Or is all this goodness locked away for people hand tuning their code?
@BuschnicK pretty much everything I've noted is extremely easy for compilers to use and they've been doing it well since around 2015-16
@BuschnicK Half the things I mentioned are filling in potholes in SSE*/AVX* that compilers have had to work around for decades when vectorizing. E.g. unsigned vector compares, uint<->float conversions, vector rotates, word-sized variable shifts, two-input shuffles, and good downconverts are all operations compilers have had in their IR for over a decade, but were forced to codegen as multi-instruction expansions because the actual ISA didn't have them.
@BuschnicK These are all things that were already in, say, LLVM IR long before AVX-512 came around; they just needed to be lowered into multiple instructions (sometimes quite a lot of them, e.g. for float<->uint conversions), which made things like cost/benefit analysis for autovectorization pre-AVX-512 much harder because the instruction set was so irregular.