Fun to see just how good Apple CPUs are in single-core performance, especially if you tune the code.
The index codec on Zen4 runs into some perf issues with clang though - don’t think Zen4 should necessarily be slower here. That said, MSVC is unfortunately even worse… we’ll see if these can be fixed by tweaking the code further.
The MSVC issue was due to some seriously broken codegen; thankfully I was able to work around that, and index decoder perf is now closer to clang.
Clang has reasonable codegen, and it turns out gcc is the outlier here: it uses a surprising scalar-to-vector promotion that ends up saving time by increasing IPC. Not sure I want to replicate that for now…