@aras That second half looks different if you have other component counts. It's still prefix-sum-y in that case, but summing across larger groups is normally still substantially cheaper than the full log2(nelems)-step reduction you need when you sum across all lanes in a vector.
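
To make the step-count argument concrete, here is a rough scalar model in Python. It is only a sketch of one reading of the above: it assumes the components are interleaved in a single vector and that each shift-and-add round costs about the same. The function name `strided_prefix_sum` and its parameters are made up for illustration and are not from any real SIMD library.

```python
def strided_prefix_sum(lanes, components):
    """Inclusive prefix sum per interleaved component, modelled on plain lists.

    Hypothetical helper: `lanes` stands in for one SIMD vector, `components`
    for the number of interleaved channels. Returns (result, rounds), where
    rounds counts the shift-and-add steps a SIMD version would need.
    """
    n = len(lanes)
    out = list(lanes)
    rounds = 0
    shift = components
    while shift < n:
        # One shift-and-add round: lane i adds lane i - shift (zeros shifted in).
        out = [out[i] + (out[i - shift] if i >= shift else 0) for i in range(n)]
        shift *= 2
        rounds += 1
    return out, rounds


ones = [1] * 16  # pretend this is a 16-lane vector

# One component: the full log2(16) = 4 rounds.
full, full_rounds = strided_prefix_sum(ones, 1)

# Four interleaved components: only log2(16 / 4) = 2 rounds.
grouped, grouped_rounds = strided_prefix_sum(ones, 4)

print(full_rounds, grouped_rounds)
```

In this model the round count is log2(nelems / components), so each doubling of the component count drops one round. Real hardware costs depend on which shuffles are cheap, so treat the counts as a proxy rather than a measurement.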