By request, my usual "the least interesting part about AVX-512 is the 512-bit vector width" infodump in thread form.

So here goes, a laundry list of things introduced with AVX-512 that I think are way more important to typical use cases than the 512-bit vectors are:

* Unsigned integer compares, at long last (used to be either 3 or 2 instructions, depending on what you were comparing against and how much of the prep work you could hoist out)
* Unsigned int <-> float conversions. Better late than never.
* Down converts (narrowing), not just up - these are super-awkward in AVX2 since the packs etc. can't cross 128b boundaries
* VPTERNLOGD (your Swiss Army railgun for bitwise logic, can often fuse 2 or even 3 ops into one; see the sketch below)
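
(Not part of the original thread - a minimal C sketch of the unsigned-compare and VPTERNLOGD points, assuming AVX-512F + AVX512VL at 256-bit width; the function names are just for illustration:)

```c
#include <immintrin.h>

/* Old AVX2 idiom for unsigned a < b: flip the sign bits, then do a signed
   compare - 3 ops (2 if the bias XOR can be hoisted onto one operand). */
__m256i cmplt_epu32_avx2(__m256i a, __m256i b)
{
    const __m256i bias = _mm256_set1_epi32((int)0x80000000u);
    return _mm256_cmpgt_epi32(_mm256_xor_si256(b, bias),
                              _mm256_xor_si256(a, bias));
}

/* AVX-512: one instruction, result goes straight into a mask register. */
__mmask8 cmplt_epu32_avx512(__m256i a, __m256i b)
{
    return _mm256_cmplt_epu32_mask(a, b);
}

/* VPTERNLOGD: any 3-input bitwise function in one op.
   Immediate 0x96 is the truth table for a ^ b ^ c. */
__m256i xor3(__m256i a, __m256i b, __m256i c)
{
    return _mm256_ternarylogic_epi32(a, b, c, 0x96);
}
```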

* VPTESTM[BWDQ] (logical compare instead of arithmetic for vectors - another one that replaces what used to be either 2- or 3-instruction sequences)
* VRANGEPS (meant for range reduction, can do multiple things including max-abs of multiple operands - another one that folds what is typically 2-4 instructions into one)
* Predicates/masks and "free" predication
* VRNDSCALE (round to fixed point with specified precision, yet another one that replaces 3 ops - mul, round, mul; sketch below)
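
(Again not from the original thread - a sketch of the VRNDSCALE and masking points under the same AVX-512F + AVX512VL assumption; the 1/16 precision and function names are arbitrary:)

```c
#include <immintrin.h>

/* AVX2 way to round to the nearest 1/16: scale up, round, scale back down (3 ops). */
__m256 round_to_16ths_avx2(__m256 x)
{
    __m256 scaled  = _mm256_mul_ps(x, _mm256_set1_ps(16.0f));
    __m256 rounded = _mm256_round_ps(scaled,
                         _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    return _mm256_mul_ps(rounded, _mm256_set1_ps(1.0f / 16.0f));
}

/* AVX-512 VRNDSCALEPS: imm[7:4] = 4 selects 2^-4 precision, the low bits pick
   round-to-nearest - one instruction. */
__m256 round_to_16ths_avx512(__m256 x)
{
    return _mm256_roundscale_ps(x,
               (4 << 4) | _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
}

/* "Free" predication: add only in the lanes where m is set, keep src elsewhere. */
__m256 masked_add(__m256 src, __mmask8 m, __m256 a, __m256 b)
{
    return _mm256_mask_add_ps(src, m, a, b);
}
```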

* VPRO[LR] (bitwise rotate on vectors, super-valuable for hashing; yet another one that replaces what used to be 3 ops: shift, shift, or - sketch below)
* VPERM[IT]2* (permute across two source registers instead of one - PPC/SPUs had this, ARM has this, and it's yet another one that usually replaces 3 instructions with one: two permutes and an OR or similar. And 2 of those 3 go to the often-contended full shuffle network)
* Broadcast load-operands (4-16x reduction in L1D$ space for all-lane constants for free)
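
(Illustrative sketch, not from the thread - the rotate and two-source permute, again assuming AVX-512F + AVX512VL; rotate amount and function names are arbitrary:)

```c
#include <immintrin.h>

/* AVX2 rotate-left by 13: shift, shift, or - 3 ops, a staple of hash inner loops. */
__m256i rotl13_avx2(__m256i x)
{
    return _mm256_or_si256(_mm256_slli_epi32(x, 13),
                           _mm256_srli_epi32(x, 32 - 13));
}

/* AVX-512: a single VPROLD. */
__m256i rotl13_avx512(__m256i x)
{
    return _mm256_rol_epi32(x, 13);
}

/* VPERMT2D: one shuffle that picks dwords from two source registers;
   index values 0..7 select from a, 8..15 select from b. */
__m256i pick_from_two(__m256i a, __m256i idx, __m256i b)
{
    return _mm256_permutex2var_epi32(a, idx, b);
}
```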

* Compress/expand (these replace what is typically an awkward movemask + big table + permute combination - super-useful to have, and saves the ~1k-4k [sometimes more] of tables you'd otherwise keep warm in L1D for no good reason; see the sketch below)
* Disp8*N encoding (purely a code size thing; this alone does a very good job at offsetting the extra cost of EVEX)
* Variable shifts on not just DWords/QWords, but also words. (THANK YOU)
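
(Sketch only, not from the thread - compress-store as a table-free filter/left-pack, assuming AVX-512F + AVX512VL plus POPCNT; the function is hypothetical:)

```c
#include <immintrin.h>
#include <stdint.h>

/* Store the lanes of v that are greater than threshold contiguously to dst and
   return how many were written - no movemask, no multi-KB permute table. */
int filter_gt(int32_t *dst, __m256i v, __m256i threshold)
{
    __mmask8 keep = _mm256_cmpgt_epi32_mask(v, threshold); /* surviving lanes    */
    _mm256_mask_compressstoreu_epi32(dst, keep, v);        /* VPCOMPRESSD to mem */
    return _mm_popcnt_u32(keep);                           /* count of survivors */
}
```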

* (VBMI2) VPSH[LR]D(V?) - double-wide shifts. Yet another one that replaces what would usually be 3 instructions; sketch below.
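
(Sketch, not from the thread - a per-lane funnel shift, assuming AVX-512VBMI2 + AVX512VL; shift amount and names are arbitrary:)

```c
#include <immintrin.h>

/* Pre-VBMI2 idiom for a 32-bit funnel shift: shift, shift, or (3 ops). */
__m256i funnel_shl5_avx2(__m256i hi, __m256i lo)
{
    return _mm256_or_si256(_mm256_slli_epi32(hi, 5),
                           _mm256_srli_epi32(lo, 32 - 5));
}

/* VBMI2 VPSHLDD: concatenate hi:lo per lane, shift left by 5, keep the upper
   half - one instruction. */
__m256i funnel_shl5_vbmi2(__m256i hi, __m256i lo)
{
    return _mm256_shldi_epi32(hi, lo, 5);
}
```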

Notice a theme here? There's just a ton of efficiency stuff in there that patches holes that have been in the ISA forever and replaces 3-instruction sequences with a single instruction - and it's often a 3-insn sequence with 2 uops on a contended port going down to 1.

More niche stuff:
* VNNI - VPDPBUSD is yet another "replaces 3 insns with one" example, in this case PMADDUBSW, PMADDWD, PADDD (sketch after this list)
* 52-bit IFMA, mostly interesting for crypto/big int folks
* VPMULTISHIFTQB in VBMI for bit wrangling
* The bfloat16 additions
* Fixed versions of RCP and RSQRT, namely RCP14 and RSQRT14, with an exact spec so they don't differ between Intel and AMD
* Vector leading zero count and pop count
* Math-library focused insns like VFIXUPIMM and VFPCLASS.
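
(Sketch for the VNNI bullet above, not from the thread - assumes AVX512VNNI + AVX512VL; note the old PMADDUBSW path saturates at int16 while VPDPBUSD accumulates in full, so the two aren't bit-identical in overflow cases:)

```c
#include <immintrin.h>

/* Pre-VNNI: u8 x s8 dot product accumulated into s32 lanes takes 3 instructions. */
__m256i dot_accum_avx2(__m256i acc, __m256i a_u8, __m256i b_s8)
{
    __m256i prod16 = _mm256_maddubs_epi16(a_u8, b_s8);                /* u8*s8 pairs -> s16 (saturating) */
    __m256i prod32 = _mm256_madd_epi16(prod16, _mm256_set1_epi16(1)); /* s16 pairs -> s32 */
    return _mm256_add_epi32(acc, prod32);
}

/* VNNI: one VPDPBUSD. */
__m256i dot_accum_vnni(__m256i acc, __m256i a_u8, __m256i b_s8)
{
    return _mm256_dpbusd_epi32(acc, a_u8, b_s8);
}
```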

And last but not least, AVX-512 doubles the FP/SIMD architectural register count from 16 to 32.

This, combined with the 512b vectors that are the "-512" part, quadruples the amount of architectural FP/SIMD state, which is one of the main reasons we _don't_ get any AVX-512 in the "small" cores.

Now, the 32 architectural registers can pay dividends for all SIMD code; the 512 bits only pay off for code that can actually go that wide efficiently, which is not _that_ frequent.

In short, the 512-bit-ness that's part of the name is probably what provides the least utility to most workloads (outside of things like HPC), it's what has hindered its proliferation the most, and it's the main reason we still keep getting new x86 designs without it.

I wish that around Skylake, Intel had defined an "AVX-256" subset for consumer SKUs: "all of AVX-512 except for the actual 512b-wide vectors".

Because if they had, we'd now be 8 years into "AVX-256" on clients and would probably have a very decent percentage of machines with it.

Then we would have all that goodness I just listed, never mind the 512b vectors, which are honestly pretty niche.

Tom Forsyth

@rygorous Yeah, an AVX256 with all the instructions but only 16 registers and 256 bits would be 95% of the goodness.

We only had 32 registers because it was an in-order machine and latency-hiding was the compiler's job. If it's an even mildly-OOO machine, 16 is just fine - it can add more behind your back according to perf requirements.