thomastc | frozenfractal

Hmm, generating good hashes for use in simplex noise is surprisingly expensive. Has anyone ever tried using the CRC32 instructions that come with SSE4.2 for this? They only take 1 clock cycle, much faster than anything I've come up with.
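For anyone who hasn't used the instruction: the SSE4.2 CRC32 intrinsic is exposed in Rust as `_mm_crc32_u32`, and folding the coordinates and seed through it might look roughly like this minimal scalar sketch (my illustration, not code from the thread):

```
#[cfg(target_arch = "x86_64")]
fn crc32_hash(x: i32, y: i32, seed: u64) -> u32 {
    use std::arch::x86_64::_mm_crc32_u32;
    // SAFETY: only call this when SSE4.2 is available, e.g. checked via
    // is_x86_feature_detected!("sse4.2").
    unsafe {
        // Chain the CRC through seed and both coordinates.
        let mut h = _mm_crc32_u32(seed as u32, x as u32);
        h = _mm_crc32_u32(h, (seed >> 32) as u32);
        _mm_crc32_u32(h, y as u32)
    }
}
```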

I tried. It's bad. It *is* faster than permutation tables though.

Bottom center panel is my attempt.

AES instructions don't fare any better.
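"AES instructions" here means running the packed coordinates through one or more `aesenc` rounds as a mixer; a hypothetical sketch of that idea (the packing and the round key are my own choices, not the author's code):

```
#[cfg(target_arch = "x86_64")]
fn aes_hash(x: i32, y: i32, seed: u64) -> u32 {
    use std::arch::x86_64::*;
    // SAFETY: requires the "aes" target feature.
    unsafe {
        // Pack (x, y, seed) into a single 128-bit lane.
        let input = _mm_set_epi32(x, y, seed as i32, (seed >> 32) as i32);
        // Arbitrary round key; any odd-looking constant will do for mixing.
        let key = _mm_set1_epi32(0x9E3779B9u32 as i32);
        // A single AES round mixes poorly; two rounds is a common minimum.
        let mixed = _mm_aesenc_si128(_mm_aesenc_si128(input, key), key);
        _mm_cvtsi128_si32(mixed) as u32
    }
}
```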

I guess I should bring out a profiler and actually see where the bottlenecks are.

Meanwhile, if someone knows a fast hash that takes 128 bits as input (x: i32, y: i32, seed: u64), works in AVX `__m256i` registers, and has good entropy in the lower bits, I'm all ears.

#[inline] ALL THE THINGS!

From 82 ms to 34 ms, that's an insane 2.4x speedup for such a trivial change. That'll teach me to trust the compiler to optimize properly.
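To make the change concrete, this is the kind of thin intrinsic wrapper being talked about; `mix_step` is a hypothetical example, not a function from the actual code:

```
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{__m256i, _mm256_mullo_epi32, _mm256_srli_epi32, _mm256_xor_si256};

// Without #[inline], a non-generic function like this is not eligible for
// inlining across crate boundaries (unless LTO is on), so every call pays
// full call overhead around a couple of one-cycle instructions.
#[cfg(target_arch = "x86_64")]
#[inline]
pub fn mix_step(v: __m256i, k: __m256i) -> __m256i {
    // SAFETY: requires AVX2.
    unsafe {
        let v = _mm256_mullo_epi32(v, k);
        _mm256_xor_si256(v, _mm256_srli_epi32::<15>(v))
    }
}
```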

My implementation now beats most of the others, with the exception of `fastnoise2` (C++) and `simdnoise` (pure Rust). But I have something they don't: tiling 😃

@thomastc best camo pattern of the set.

I mean.. you weren't planning on it, but there you go. =)

@thomastc does xxhash fit your need?

@bitinn Thanks, that's the sort of thing I was looking for. XXH3 doesn't beat the permutation table for performance (38 ms total runtime instead of 34 ms), but the visual quality is good.
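For reference, the XXH3 attempt might have looked something like this; the choice of the `xxhash-rust` crate and the byte packing are my assumptions:

```
// Cargo.toml (assumption): xxhash-rust = { version = "0.8", features = ["xxh3"] }
use xxhash_rust::xxh3::xxh3_64_with_seed;

fn hash_point(x: i32, y: i32, seed: u64) -> u32 {
    // Feed the two coordinates as 8 little-endian bytes and let XXH3 mix in the seed.
    let mut buf = [0u8; 8];
    buf[..4].copy_from_slice(&x.to_le_bytes());
    buf[4..].copy_from_slice(&y.to_le_bytes());
    xxh3_64_with_seed(&buf, seed) as u32
}
```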

@thomastc wouldn't inlining always speed everything up, but result in a larger executable?

@mark No, because it can cause instruction cache thrashing.

Honestly I was surprised that these functions weren't all inlined automatically by the compiler, because most of them consist of just one call to a compiler intrinsic.

doc.rust-lang.org/stable/refer (Code generation - The Rust Reference)

@thomastc what does the output width of your hash need to be? Otherwise, xoroshiro128+ / xoroshiro128++ have 128-bit state, generate 64-bit high-entropy pseudo-random numbers, and can be optimized very heavily. I suspect you don't need just a single random number at a time, and there are AVX2 implementations out there that can generate four 64-bit values at a time.

@thomastc (note that in your "hash" use case, you don't simply return the sum of the lower and upper half of the 128 bits of state, but first do the state modification, i.e.,
```
uint64_t state[2] = {((uint64_t)(uint32_t)x << 32) | (uint32_t)y, SEED};
// one xoroshiro128+ state update, then the sum of the two new halves
uint64_t tmp = state[1] ^ state[0];
return (rot_left(state[0], 55) ^ tmp ^ (tmp << 14)) + rot_left(tmp, 36);
```
with `rot_left(x, k)` simply being `(x<<k) | (x >> (64-k))`.
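A rough Rust transcription of that snippet, for anyone following along in the OP's language (my translation, not @funkylab's code):

```
fn xoroshiro_hash(x: i32, y: i32, seed: u64) -> u64 {
    // Seed the 128-bit state directly from the inputs.
    let s0 = ((x as u32 as u64) << 32) | (y as u32 as u64);
    let s1 = seed;
    // One xoroshiro128+ update (the 55/14/36 variant), returning the sum of
    // the two updated state halves.
    let tmp = s1 ^ s0;
    (s0.rotate_left(55) ^ tmp ^ (tmp << 14)).wrapping_add(tmp.rotate_left(36))
}
```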

@funkylab Thanks! I'm already tied closely to 32-bit numbers because I'm using `__m256i` AVX registers with 8 lanes. Splitting up all the 64-bit operations would probably negate any performance gain.

I could instead try xoshiro128+, which has a 128-bit state (conveniently equal to my input size) and is specifically targeted at generating floats. prng.di.unimi.it/xoshiro128plu

I need two floats, so I could use s[0]+s[3] for one, and s[1]+s[2] for the other.
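A sketch of that idea: seed the four 32-bit state words from x, y, and the seed, run one xoshiro128+ update, then take s[0]+s[3] and s[1]+s[2]. The seeding and the float scaling are my guesses, and as noted just below, a single round turns out not to be enough:

```
fn xoshiro128_pair(x: i32, y: i32, seed: u64) -> (f32, f32) {
    // Seeding the state words straight from the inputs is my own guess.
    let mut s = [x as u32, y as u32, seed as u32, (seed >> 32) as u32];
    // One xoshiro128+ state update.
    let t = s[1] << 9;
    s[2] ^= s[0];
    s[3] ^= s[1];
    s[1] ^= s[2];
    s[0] ^= s[3];
    s[2] ^= t;
    s[3] = s[3].rotate_left(11);
    // Two sums, keeping the top 24 bits of each for the float conversion.
    let a = s[0].wrapping_add(s[3]) >> 8;
    let b = s[1].wrapping_add(s[2]) >> 8;
    (a as f32 / 16777216.0, b as f32 / 16777216.0)
}
```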

@thomastc what I coded above **is** xoroshiro128+. The state *has* 128 bits (2× uint64_t, which I just wrote that way for ease of understanding).

@thomastc and as said, there are AVX2 impls of it, which do 4 generations (of 64 bits each) at once. The world is good :) (the ops don't need many _mm256 regs, so the intermediate steps should not be a problem; quite the contrary, if your CPU is dual-issue on these ops, you get that for free)

@funkylab Huh, I don't know why I thought you wrote 256. Been staring too much at powers of 2 lately 😅

In any case, it turns out I need to run about 10 rounds of this before the output appears sufficiently random. Which makes sense, because adjacent grid points only differ by 1 bit, and the hash doesn't have enough avalanching going on to make that 1 bit affect everything.

And if I have to do 10 rounds, it turns out PCG2D wins the performance battle hands down.

@thomastc that must be wrong. The entropy of xoroshiro128+ is quite good, as measured in large-scale statistical terms. Do you perhaps forget to update what you call "seed" in your original question with the value you generate? The code snippet I sent was explicitly just for a single value :)

@thomastc aaaah now I understand what you're trying to achieve!
Maybe then a simple splitmix64 is nicer for you:
godbolt.org/z/b6Yx4fWGs
Note that you could shorten the constants considering you're only keeping the lower 32 bits. Also note that I'm disappointed in GCC missing the chance to put all four array elements in the vector version into the same 256-bit SIMD register automatically.

godbolt.org: Compiler Explorer - C++ (x86-64 gcc 14.2)

```
#include <array>
#include <cstdint>
#include <print>

uint32_t hash(int32_t x, int32_t y, uint64_t seed) {
    // based on splitmix64, but do let x << k overlap with y, to resolve local
    // similarity
    uint64_t z = ((uint64_t(x) << 30) | y) | (seed + 0x9e3779b97f4a7c15);
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9;
    z = (z ^ (z >> 27)) * 0x94d049bb133111eb;
    return z ^ (z >> 31);
}

// wish we had std::simd
std::array<uint32_t, 4> hash(std::array<int32_t, 4> x, std::array<int32_t, 4> y,
                             uint64_t seed) {
    std::array<uint64_t, 4> z;
    for (char i = 0; i < 4; ++i) {
        z[i] = ((uint64_t(x[i]) << 30) | y[i]) | (seed + 0x9e3779b97f4a7c15);
    }
    for (char i = 0; i < 4; ++i) {
        z[i] = (z[i] ^ (z[i] >> 30)) * 0xbf58476d1ce4e5b9;
    }
    for (char i = 0; i < 4; ++i) {
        z[i] = (z[i] ^ (z[i] >> 27)) * 0x94d049bb133111eb;
    }
    for (char i = 0; i < 4; ++i) {
        z[i] ^= z[i] >> 31;
    }
    return {static_cast<uint32_t>(z[0]), static_cast<uint32_t>(z[1]),
            static_cast<uint32_t>(z[2]), static_cast<uint32_t>(z[3])};
}

int main() {
    uint64_t seed = 14131954677241054293ull;
    std::print("These are way too similar:\n");
    std::print("{0:032b}|{0:08x}\n", hash(0, 0, seed));
    std::print("{0:032b}|{0:08x}\n", hash(1, 0, seed));
    std::print("And it doesn't get better farther away from the origin:\n");
    std::print("{0:032b}|{0:08x}\n", hash(3891095228, 2465519915, seed));
    std::print("{0:032b}|{0:08x}\n", hash(3891095229, 2465519915, seed));
    std::print("{:-<50}\n", "");
    // more at once
    std::array<int32_t, 4> x = {0, 1, static_cast<int32_t>(3891095228),
                                static_cast<int32_t>(3891095229)};
    std::array<int32_t, 4> y = {0, 0, static_cast<int32_t>(2465519915),
                                static_cast<int32_t>(2465519915)};
    auto res = hash(x, y, seed);
    std::print("{0:032b}|{0:08x}\n", res[0]);
    std::print("{0:032b}|{0:08x}\n", res[1]);
    std::print("{0:032b}|{0:08x}\n", res[2]);
    std::print("{0:032b}|{0:08x}\n", res[3]);
}
```

@funkylab Yep, that's the ticket!

Meanwhile, I've found that Inigo Quilez's Integer Hash21 method has sufficient quality for my purposes: shadertoy.com/view/4tXyWN I just mixed the seed into the input vector. At the end I'm splitting the output into 2 parts of 16 bits, then making floats out of that, then normalizing that vector... it doesn't seem to matter that the lower bits are lower quality.

I even slimmed it down a bit by removing the (p>>28) and ^(n>>15) and it's still fine.
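The "split into two 16-bit halves, make floats, normalize" step might look like this; `h` stands in for the output of whichever point hash is in use:

```
// Turn one 32-bit hash value into a unit gradient vector, skipping the
// gradient table entirely.
fn gradient_from_hash(h: u32) -> (f32, f32) {
    // Map each 16-bit half to roughly [-1, 1).
    let gx = (h & 0xffff) as f32 / 32768.0 - 1.0;
    let gy = (h >> 16) as f32 / 32768.0 - 1.0;
    // Normalize; the degenerate all-zero case is ignored here for brevity.
    let len = (gx * gx + gy * gy).sqrt();
    (gx / len, gy / len)
}
```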

@thomastc Again in GLSL, I like to use the pcg4d hash from jcgt.org/published/0009/03/02/

Perhaps you could adapt it to SSE on the CPU. Or try applying a little bit-swizzling to the CRC32 input or output.

@scdollins Hmm, interesting. I had seen that paper, but was a bit overwhelmed with the number of options. I think I'd want pcg3d, and pass the seed as the third dimension, and then XOR the three outputs together to get the index into the gradient table?

@scdollins Orrr maybe I could interpret the output as a vec2 of floats and normalize it, doing away with the gradient table altogether.

@scdollins In terms of performance, pcg2d is very promising. Total running time went from 29 ms to 21 ms. I think that would pretty much eliminate the hashing as a bottleneck.

The output still leaves something to be desired though 😅 But who's counting?
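For reference, this is the pcg2d construction being benchmarked, transcribed to scalar Rust from memory of the JCGT paper (Jarzynski & Olano, "Hash Functions for GPU Rendering"); verify the constants against the paper before relying on it, and note that the seed still has to be folded into x and y beforehand:

```
// pcg2d: two cross-fed 32-bit LCG steps with xorshift finalisation.
fn pcg2d(mut x: u32, mut y: u32) -> (u32, u32) {
    x = x.wrapping_mul(1664525).wrapping_add(1013904223);
    y = y.wrapping_mul(1664525).wrapping_add(1013904223);
    x = x.wrapping_add(y.wrapping_mul(1664525));
    y = y.wrapping_add(x.wrapping_mul(1664525));
    x ^= x >> 16;
    y ^= y >> 16;
    x = x.wrapping_add(y.wrapping_mul(1664525));
    y = y.wrapping_add(x.wrapping_mul(1664525));
    x ^= x >> 16;
    y ^= y >> 16;
    (x, y)
}
```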