mastodon.gamedev.place is one of the many independent Mastodon servers you can use to participate in the fediverse.
Mastodon server focused on game development and related topics.

Server stats:

5.3K
active users

Hmm, generating good hashes for use in simplex noise is surprisingly expensive. Has anyone ever tried using the CRC32 instructions that come with SSE4.2 for this? They only take 1 clock cycle, much faster than anything I've come up with.

I tried. It's bad. It *is* faster than permutation tables though.

Bottom center panel is my attempt.

AES instructions don't fare any better.

I guess I should bring out a profiler and actually see where the bottlenecks are.

Meanwhile, if someone knows a fast hash that takes 128 bits as input (x: i32, y: i32, seed: u64), works in AVX __mm256i registers and has good entropy in the lower bits, I'm all ears.

@thomastc what needs to be the output width of your hash? Otherwise, XOROSHIRO128+ / XOROSHIRO128++ have 128 bit state, generate 64 bit high-entropy pseudo-random numbers, and can very heavily be optimized. I suspect you don't need a single random number at a time, and there's AVX2 implementations out there that can generate four 64 bit values at a time.

@thomastc (note that in your "hash" use case, you don't simply return the sum of the lower and upper half of the 128 bits of state, but first do the state modification, i.e.,
```
uint64 state[2] = {x << 32 | y, SEED};
uint64_t tmp = state[1] ^ state[0];
return (rot_left(state[0], 55) ^ tmp ^ (tmp<<14)) + rot_left(tmp 36);
```
with `rot_left(x, k)` simply being `(x<<k) | (x >> (64-k))`.

@funkylab Thanks! I'm already tied closely to 32-bit numbers because I'm using __mm256i AVX registers with 8 lanes. Splitting up all the 64-bit operations would probably negate any performance gain.

I could instead try xoshiro128+ which has a 128-bits state (conveniently equal to my input size) and is specifically targeted at generating floats. prng.di.unimi.it/xoshiro128plu

I need two floats, so I could use s[0]+s[3] for one, and s[1]+s[2] for the other.

@thomastc what I coded above **is** Xoroshiro128+. The state *has* 128 bit (2× uint64_t, which I just wrote that way for ease of understanding).

@thomastc and as said, there's AVX2 impls of it, which do 4× generations (of 64 bit each) at once. The world is good :) (the ops don't need many _mm256 regs, so the intermediate steps should not be a problem – quite the contrary, if your CPU is dual issue on these ops, you get that for free)

@funkylab Huh, I don't know why I thought you wrote 256. Been staring too much at powers of 2 lately 😅

In any case, it turns out I need to run about 10 rounds of this before the output appears sufficiently random. Which makes sense, because adjacent grid points only differ by 1 bit, and the hash doesn't have enough avalanching going on to make that 1 bit affect everything.

And if I have to do 10 rounds, it turns out PCG2D wins the performance battle hands down.

@thomastc that must be wrong. The entropy of xoroshiro128+ is quite good, like, measured in large-scale statistic terms. Do you perhaps forgo to update what you call "seed" in your original question from the value you generate? The code snipped I sent was explicitly just for a single value :)

@thomastc aaaah now I understand what you're trying to achieve!
Maybe then a simple splitmix64 is nicer for you:
godbolt.org/z/b6Yx4fWGs
Note that you could shorten the constants considering you're only keeping the lower 32 bit. Also note that I'm disappointed in GCC missing the chance to put all four array elements in the vector version into the same 256 bit SIMD register automatically.

godbolt.orgCompiler Explorer - C++ (x86-64 gcc 14.2) uint32_t hash(int32_t x, int32_t y, uint64_t seed) { // based on splitmix64, but do let x << k overlap with y, to resolve local // similarity uint64_t z = ((uint64_t(x) << 30) | y) | (seed + 0x9e3779b97f4a7c15); z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9; z = (z ^ (z >> 27)) * 0x94d049bb133111eb; return z ^ (z >> 31); } // wish we had std::simd std::array<uint32_t, 4> hash(std::array<int32_t, 4> x, std::array<int32_t, 4> y, uint64_t seed) { std::array<uint64_t, 4> z; for (char i = 0; i < 4; ++i) { z[i] = ((uint64_t(x[i]) << 30) | y[i]) | (seed + 0x9e3779b97f4a7c15); } for (char i = 0; i < 4; ++i) { z[i] = (z[i] ^ (z[i] >> 30)) * 0xbf58476d1ce4e5b9; } for (char i = 0; i < 4; ++i) { z[i] = (z[i] ^ (z[i] >> 27)) * 0x94d049bb133111eb; } for (char i = 0; i < 4; ++i) { z[i] ^= z[i] >> 31; } return {static_cast<uint32_t>(z[0]), static_cast<uint32_t>(z[1]), static_cast<uint32_t>(z[2]), static_cast<uint32_t>(z[3])}; } int main() { uint64_t seed = 14131954677241054293ull; std::print("These are way too similar:\n"); std::print("{0:032b}|{0:08x}\n", hash(0, 0, seed)); std::print("{0:032b}|{0:08x}\n", hash(1, 0, seed)); std::print("And it doesn't get better farther away from the origin:\n"); std::print("{0:032b}|{0:08x}\n", hash(3891095228, 2465519915, seed)); std::print("{0:032b}|{0:08x}\n", hash(3891095229, 2465519915, seed)); std::print("{:—<50}\n", ""); // more at once std::array<int32_t, 4> x = {0, 1, static_cast<int32_t>(3891095228), static_cast<int32_t>(3891095229)}; std::array<int32_t, 4> y = {0, 0, static_cast<int32_t>(2465519915), static_cast<int32_t>(2465519915)}; auto res = hash(x, y, seed); std::print("{0:032b}|{0:08x}\n", res[0]); std::print("{0:032b}|{0:08x}\n", res[1]); std::print("{0:032b}|{0:08x}\n", res[2]); std::print("{0:032b}|{0:08x}\n", res[3]); }

@funkylab Yep, that's the ticket!

Meanwhile, I've found that Inigo Quilez's Integer Hash21 method has sufficient quality for my purposes: shadertoy.com/view/4tXyWN I just mixed the seed into the input vector. At the end I'm splitting the output into 2 parts of 16 bits, then making floats out of that, then normalizing that vector... it doesn't seem to matter that the lower bits are lower quality.

I even slimmed it down a bit by removing the (p>>28) and ^(n>>15) and it's still fine.