My team started looking at subgroup support for #WebGPU again. We pushed it out of the initial #WGSL feature set due to staffing constraints and suspected non-portability. Revisiting it now.
Sadly, and frustratingly, implementations don't do what programmers think should happen. We are seeing very, very non-portable behaviour.
Still collecting data across devices and platforms that we will share soon enough.
@dneto This is a major topic of my research.
Many vectorisation papers, including some from my university, were written with a world in mind that had no non-uniform intrinsics like these.
This led to using heuristics to make reconvergence decisions, heuristics that are often unstable under optimisations. That used to not matter semantically. Now it does!
The structured interpretation is the correct one: it's the one used by ISPC and others, it's stable under optimisations, and it's the one programmers actually have the right intuition for.
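To make the stakes concrete, here is a minimal sketch, in plain Python with the lanes and active masks of a hypothetical 4-lane subgroup modelled explicitly (the lane count and helper names are mine, not from any API), of why the participant set of a ballot depends on reconvergence semantics:

```python
# Hypothetical 4-lane subgroup simulator illustrating why non-uniform
# subgroup intrinsics are sensitive to reconvergence decisions.
def ballot(active_lanes):
    """Return a bitmask of the lanes that participate in the ballot."""
    mask = 0
    for lane in active_lanes:
        mask |= 1 << lane
    return mask

lanes = [0, 1, 2, 3]
cond = [True, True, False, False]  # non-uniform branch condition

# Structured interpretation: inside `if (cond)`, exactly the lanes for
# which the condition holds are active, so the ballot sees only them.
taken = [lane for lane in lanes if cond[lane]]
structured_result = ballot(taken)           # 0b0011

# A compiler with weak reconvergence semantics may (legally, under those
# semantics) reconverge before the ballot, running it with all lanes on.
early_reconverged_result = ballot(lanes)    # 0b1111

assert structured_result == 0b0011
assert early_reconverged_result == 0b1111
```

Same source program, two different ballot results; only the structured interpretation gives the answer the programmer can predict from the code.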
@dneto There are broadly two classes of implementations: LLVM-based shader compilers, which have arbitrary reconvergence behaviour, and NIR, which honors code structure throughout the pipeline.
LLVM, and everything downstream from it, faces a very tough challenge: addressing decades of development under scalar control-flow assumptions. I'm not sure it's fixable in a timely fashion.
IRs that lack such information, like DXIL, are fundamentally flawed and need to be retrofitted with structured info.
@dneto If stable reconvergence semantics cannot be guaranteed, then non-uniform subgroup intrinsics are basically ill-defined nonsense.
Some people have argued that which threads participate in those operations should not be a correctness issue, and that code should be written defensively.
To that I answer that if programmers are expected to respect uniformity guarantees by the API, then they must be given reliable tools to manipulate divergence and reconvergence.
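For illustration, a sketch (plain Python again, with SIMD lanes modelled as a list; the function names are hypothetical, not from any real API) of what "written defensively" amounts to: the defensive style threads the participant mask through explicitly, so the result no longer depends on where the implementation reconverges.

```python
# Two styles of subgroup reduction over a hypothetical 4-lane subgroup.

def subgroup_sum_implicit(values, active_lanes):
    # "Implicit" style: sums whatever lanes the compiler left active.
    # The result depends on the implementation's reconvergence choices.
    return sum(values[lane] for lane in active_lanes)

def subgroup_sum_defensive(values, mask):
    # "Defensive" style: the programmer supplies the participant mask,
    # so the result is the same regardless of reconvergence behaviour.
    return sum(v for lane, v in enumerate(values) if mask & (1 << lane))

values = [1, 2, 4, 8]
# Under one reconvergence choice only lanes 0 and 1 are active; under
# another, all four are. The implicit result changes; the defensive
# result (with mask 0b0011) does not.
assert subgroup_sum_implicit(values, [0, 1]) == 3
assert subgroup_sum_implicit(values, [0, 1, 2, 3]) == 15
assert subgroup_sum_defensive(values, 0b0011) == 3
```

The defensive style works, but it pushes onto the programmer exactly the divergence bookkeeping the intrinsics were supposed to make usable.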
@dneto If no reliable guarantees are provided, then non-uniform subgroup operations are conceptually a flawed idea and will only ever work on an ad-hoc basis in small programs with limited scope.
And then I think they should just not be available at all, and the programming model should switch back to a purely scalar one, with SIMD execution being an implementation detail under the sole control of the compiler.
@gob The SPIRV-Tools optimizer stack is another that always takes uniformity into account.
It's not the formalism that is decisive. It's possible to use an LLVM-based compiler safely, but only if you use transforms that don't degrade the reconvergence properties of the code. This requires extreme care and vigilance, and LLVM is a fast-moving codebase, so doing so is very difficult. I think that's part of why so many GPU compilers are longtime *forks* of LLVM.
@gob
The nonuniform subgroup built-ins are really useful.
Nobody ever said you had to use LLVM in your compiler stack. That is just not a requirement.
My team has built three GPU compilers:
Clspv: Clang plus only hand-picked LLVM passes, then a SPIR-V backend.
We added the SPIR-V backend to Microsoft's DXC: it goes straight from the Clang AST, through its own glue, to SPIR-V. It never touches LLVM IR.
Tint: WGSL to SPIR-V, HLSL, MSL. No LLVM or Clang at all.
It can be done.
@dneto I'm not saying that. I've worked on a fair share of stuff which is also not LLVM-based.
I'm saying that, whether we like it or not, many drivers are still LLVM-based and therefore cannot support the non-uniform behaviours programmers expect out of those subgroup intrinsics.
I would love a world where this is not a problem anymore and non-uniform subgroup ops honour structure portably. In fact I've built a whole compiler that tries to force that behaviour...
Where we disagree is on the strength of the "cannot".