thomastc | frozenfractal

Hmm, generating good hashes for use in simplex noise is surprisingly expensive. Has anyone ever tried using the CRC32 instructions that come with SSE4.2 for this? They only take 1 clock cycle, much faster than anything I've come up with.
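For anyone who hasn't used the instruction: the SSE4.2 CRC32 intrinsic is exposed in Rust as `_mm_crc32_u32`, and folding the coordinates and seed through it might look roughly like this minimal scalar sketch (my illustration, not code from the thread):

```
#[cfg(target_arch = "x86_64")]
fn crc32_hash(x: i32, y: i32, seed: u64) -> u32 {
    use std::arch::x86_64::_mm_crc32_u32;
    // SAFETY: only call this when SSE4.2 is available, e.g. checked via
    // is_x86_feature_detected!("sse4.2").
    unsafe {
        // Chain the CRC through seed and both coordinates.
        let mut h = _mm_crc32_u32(seed as u32, x as u32);
        h = _mm_crc32_u32(h, (seed >> 32) as u32);
        _mm_crc32_u32(h, y as u32)
    }
}
```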

I tried. It's bad. It *is* faster than permutation tables though.

Bottom center panel is my attempt.

AES instructions don't fare any better.
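"AES instructions" here means running the packed coordinates through one or more `aesenc` rounds as a mixer; a hypothetical sketch of that idea (the packing and the round key are my own choices, not the author's code):

```
#[cfg(target_arch = "x86_64")]
fn aes_hash(x: i32, y: i32, seed: u64) -> u32 {
    use std::arch::x86_64::*;
    // SAFETY: requires the "aes" target feature.
    unsafe {
        // Pack (x, y, seed) into a single 128-bit lane.
        let input = _mm_set_epi32(x, y, seed as i32, (seed >> 32) as i32);
        // Arbitrary round key; any odd-looking constant will do for mixing.
        let key = _mm_set1_epi32(0x9E3779B9u32 as i32);
        // A single AES round mixes poorly; two rounds is a common minimum.
        let mixed = _mm_aesenc_si128(_mm_aesenc_si128(input, key), key);
        _mm_cvtsi128_si32(mixed) as u32
    }
}
```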

I guess I should bring out a profiler and actually see where the bottlenecks are.

Meanwhile, if someone knows a fast hash that takes 128 bits as input (x: i32, y: i32, seed: u64), works in AVX `__m256i` registers, and has good entropy in the lower bits, I'm all ears.

#[inline] ALL THE THINGS!

From 82 ms to 34 ms, that's an insane 2.4x speedup for such a trivial change. That'll teach me to trust the compiler to optimize properly.
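To make the change concrete, this is the kind of thin intrinsic wrapper being talked about; `mix_step` is a hypothetical example, not a function from the actual code:

```
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{__m256i, _mm256_mullo_epi32, _mm256_srli_epi32, _mm256_xor_si256};

// Without #[inline], a non-generic function like this is not eligible for
// inlining across crate boundaries (unless LTO is on), so every call pays
// full call overhead around a couple of one-cycle instructions.
#[cfg(target_arch = "x86_64")]
#[inline]
pub fn mix_step(v: __m256i, k: __m256i) -> __m256i {
    // SAFETY: requires AVX2.
    unsafe {
        let v = _mm256_mullo_epi32(v, k);
        _mm256_xor_si256(v, _mm256_srli_epi32::<15>(v))
    }
}
```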

My implementation now beats most of the others, with the exception of `fastnoise2` (C++) and `simdnoise` (pure Rust). But I have something they don't: tiling 😃

@thomastc best camo pattern of the set.

I mean.. you weren't planning on it, but there you go. =)

@thomastc does xxhash fit your need?

@bitinn Thanks, that's the sort of thing I was looking for. XXH3 doesn't beat the permutation table for performance (38 ms total runtime instead of 34 ms), but the visual quality is good.
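For reference, the XXH3 attempt might have looked something like this; the choice of the `xxhash-rust` crate and the byte packing are my assumptions:

```
// Cargo.toml (assumption): xxhash-rust = { version = "0.8", features = ["xxh3"] }
use xxhash_rust::xxh3::xxh3_64_with_seed;

fn hash_point(x: i32, y: i32, seed: u64) -> u32 {
    // Feed the two coordinates as 8 little-endian bytes and let XXH3 mix in the seed.
    let mut buf = [0u8; 8];
    buf[..4].copy_from_slice(&x.to_le_bytes());
    buf[4..].copy_from_slice(&y.to_le_bytes());
    xxh3_64_with_seed(&buf, seed) as u32
}
```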

@thomastc wouldn't inlining always speed everything up, but result in a larger executable?

@mark No, because it can cause instruction cache thrashing.

Honestly I was surprised that these functions weren't all inlined automatically by the compiler, because most of them consist of just one call to a compiler intrinsic.

doc.rust-lang.org/stable/refer (Code generation - The Rust Reference)

@thomastc what does the output width of your hash need to be? Otherwise, xoroshiro128+ / xoroshiro128++ have 128-bit state, generate 64-bit high-entropy pseudo-random numbers, and can be optimized very heavily. I suspect you don't need just a single random number at a time, and there are AVX2 implementations out there that can generate four 64-bit values at a time.

@thomastc (note that in your "hash" use case, you don't simply return the sum of the lower and upper half of the 128 bits of state, but first do the state modification, i.e.,
```
uint64_t state[2] = {((uint64_t)(uint32_t)x << 32) | (uint32_t)y, SEED};
// one xoroshiro128+ state update, then the sum of the two new halves
uint64_t tmp = state[1] ^ state[0];
return (rot_left(state[0], 55) ^ tmp ^ (tmp << 14)) + rot_left(tmp, 36);
```
with `rot_left(x, k)` simply being `(x<<k) | (x >> (64-k))`.
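A rough Rust transcription of that snippet, for anyone following along in the OP's language (my translation, not @funkylab's code):

```
fn xoroshiro_hash(x: i32, y: i32, seed: u64) -> u64 {
    // Seed the 128-bit state directly from the inputs.
    let s0 = ((x as u32 as u64) << 32) | (y as u32 as u64);
    let s1 = seed;
    // One xoroshiro128+ update (the 55/14/36 variant), returning the sum of
    // the two updated state halves.
    let tmp = s1 ^ s0;
    (s0.rotate_left(55) ^ tmp ^ (tmp << 14)).wrapping_add(tmp.rotate_left(36))
}
```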

@funkylab Thanks! I'm already tied closely to 32-bit numbers because I'm using `__m256i` AVX registers with 8 lanes. Splitting up all the 64-bit operations would probably negate any performance gain.

I could instead try xoshiro128+, which has a 128-bit state (conveniently equal to my input size) and is specifically targeted at generating floats. prng.di.unimi.it/xoshiro128plu

I need two floats, so I could use s[0]+s[3] for one, and s[1]+s[2] for the other.
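A sketch of that idea: seed the four 32-bit state words from x, y, and the seed, run one xoshiro128+ update, then take s[0]+s[3] and s[1]+s[2]. The seeding and the float scaling are my guesses, and as noted just below, a single round turns out not to be enough:

```
fn xoshiro128_pair(x: i32, y: i32, seed: u64) -> (f32, f32) {
    // Seeding the state words straight from the inputs is my own guess.
    let mut s = [x as u32, y as u32, seed as u32, (seed >> 32) as u32];
    // One xoshiro128+ state update.
    let t = s[1] << 9;
    s[2] ^= s[0];
    s[3] ^= s[1];
    s[1] ^= s[2];
    s[0] ^= s[3];
    s[2] ^= t;
    s[3] = s[3].rotate_left(11);
    // Two sums, keeping the top 24 bits of each for the float conversion.
    let a = s[0].wrapping_add(s[3]) >> 8;
    let b = s[1].wrapping_add(s[2]) >> 8;
    (a as f32 / 16777216.0, b as f32 / 16777216.0)
}
```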

@thomastc what I coded above **is** xoroshiro128+. The state *has* 128 bits (2× uint64_t, which I just wrote that way for ease of understanding).

@thomastc and as said, there are AVX2 impls of it, which do 4 generations (of 64 bits each) at once. The world is good :) (the ops don't need many _mm256 regs, so the intermediate steps should not be a problem; quite the contrary, if your CPU is dual-issue on these ops, you get that for free)

@funkylab Huh, I don't know why I thought you wrote 256. Been staring too much at powers of 2 lately 😅

In any case, it turns out I need to run about 10 rounds of this before the output appears sufficiently random. Which makes sense, because adjacent grid points only differ by 1 bit, and the hash doesn't have enough avalanching going on to make that 1 bit affect everything.

And if I have to do 10 rounds, it turns out PCG2D wins the performance battle hands down.

@thomastc that must be wrong. The entropy of xoroshiro128+ is quite good, as measured in large-scale statistical terms. Do you perhaps forget to update what you call "seed" in your original question with the value you generate? The code snippet I sent was explicitly just for a single value :)

@thomastc aaaah now I understand what you're trying to achieve!
Maybe then a simple splitmix64 is nicer for you:
godbolt.org/z/b6Yx4fWGs
Note that you could shorten the constants considering you're only keeping the lower 32 bits. Also note that I'm disappointed in GCC missing the chance to put all four array elements in the vector version into the same 256-bit SIMD register automatically.

godbolt.org: Compiler Explorer - C++ (x86-64 gcc 14.2)

```
#include <array>
#include <cstdint>
#include <print>

uint32_t hash(int32_t x, int32_t y, uint64_t seed) {
    // based on splitmix64, but do let x << k overlap with y, to resolve local
    // similarity
    uint64_t z = ((uint64_t(x) << 30) | y) | (seed + 0x9e3779b97f4a7c15);
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9;
    z = (z ^ (z >> 27)) * 0x94d049bb133111eb;
    return z ^ (z >> 31);
}

// wish we had std::simd
std::array<uint32_t, 4> hash(std::array<int32_t, 4> x, std::array<int32_t, 4> y,
                             uint64_t seed) {
    std::array<uint64_t, 4> z;
    for (char i = 0; i < 4; ++i) {
        z[i] = ((uint64_t(x[i]) << 30) | y[i]) | (seed + 0x9e3779b97f4a7c15);
    }
    for (char i = 0; i < 4; ++i) {
        z[i] = (z[i] ^ (z[i] >> 30)) * 0xbf58476d1ce4e5b9;
    }
    for (char i = 0; i < 4; ++i) {
        z[i] = (z[i] ^ (z[i] >> 27)) * 0x94d049bb133111eb;
    }
    for (char i = 0; i < 4; ++i) {
        z[i] ^= z[i] >> 31;
    }
    return {static_cast<uint32_t>(z[0]), static_cast<uint32_t>(z[1]),
            static_cast<uint32_t>(z[2]), static_cast<uint32_t>(z[3])};
}

int main() {
    uint64_t seed = 14131954677241054293ull;
    std::print("These are way too similar:\n");
    std::print("{0:032b}|{0:08x}\n", hash(0, 0, seed));
    std::print("{0:032b}|{0:08x}\n", hash(1, 0, seed));
    std::print("And it doesn't get better farther away from the origin:\n");
    std::print("{0:032b}|{0:08x}\n", hash(3891095228, 2465519915, seed));
    std::print("{0:032b}|{0:08x}\n", hash(3891095229, 2465519915, seed));
    std::print("{:-<50}\n", "");
    // more at once
    std::array<int32_t, 4> x = {0, 1, static_cast<int32_t>(3891095228),
                                static_cast<int32_t>(3891095229)};
    std::array<int32_t, 4> y = {0, 0, static_cast<int32_t>(2465519915),
                                static_cast<int32_t>(2465519915)};
    auto res = hash(x, y, seed);
    std::print("{0:032b}|{0:08x}\n", res[0]);
    std::print("{0:032b}|{0:08x}\n", res[1]);
    std::print("{0:032b}|{0:08x}\n", res[2]);
    std::print("{0:032b}|{0:08x}\n", res[3]);
}
```

@funkylab Yep, that's the ticket!

Meanwhile, I've found that Inigo Quilez's Integer Hash21 method has sufficient quality for my purposes: shadertoy.com/view/4tXyWN I just mixed the seed into the input vector. At the end I'm splitting the output into 2 parts of 16 bits, then making floats out of that, then normalizing that vector... it doesn't seem to matter that the lower bits are lower quality.

I even slimmed it down a bit by removing the (p>>28) and ^(n>>15) and it's still fine.
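The "split into two 16-bit halves, make floats, normalize" step might look like this; `h` stands in for the output of whichever point hash is in use:

```
// Turn one 32-bit hash value into a unit gradient vector, skipping the
// gradient table entirely.
fn gradient_from_hash(h: u32) -> (f32, f32) {
    // Map each 16-bit half to roughly [-1, 1).
    let gx = (h & 0xffff) as f32 / 32768.0 - 1.0;
    let gy = (h >> 16) as f32 / 32768.0 - 1.0;
    // Normalize; the degenerate all-zero case is ignored here for brevity.
    let len = (gx * gx + gy * gy).sqrt();
    (gx / len, gy / len)
}
```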

@thomastc Again in GLSL, I like to use the pcg4d hash from jcgt.org/published/0009/03/02/

Perhaps you could adapt it to SSE on the CPU. Or try applying a little bit-swizzling to the CRC32 input or output.

@scdollins Hmm, interesting. I had seen that paper, but was a bit overwhelmed with the number of options. I think I'd want pcg3d, and pass the seed as the third dimension, and then XOR the three outputs together to get the index into the gradient table?

@scdollins Orrr maybe I could interpret the output as a vec2 of floats and normalize it, doing away with the gradient table altogether.

@scdollins In terms of performance, pcg2d is very promising. Total running time went from 29 ms to 21 ms. I think that would pretty much eliminate the hashing as a bottleneck.

The output still leaves something to be desired though 😅 But who's counting?
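For reference, this is the pcg2d construction being benchmarked, transcribed to scalar Rust from memory of the JCGT paper (Jarzynski & Olano, "Hash Functions for GPU Rendering"); verify the constants against the paper before relying on it, and note that the seed still has to be folded into x and y beforehand:

```
// pcg2d: two cross-fed 32-bit LCG steps with xorshift finalisation.
fn pcg2d(mut x: u32, mut y: u32) -> (u32, u32) {
    x = x.wrapping_mul(1664525).wrapping_add(1013904223);
    y = y.wrapping_mul(1664525).wrapping_add(1013904223);
    x = x.wrapping_add(y.wrapping_mul(1664525));
    y = y.wrapping_add(x.wrapping_mul(1664525));
    x ^= x >> 16;
    y ^= y >> 16;
    x = x.wrapping_add(y.wrapping_mul(1664525));
    y = y.wrapping_add(x.wrapping_mul(1664525));
    x ^= x >> 16;
    y ^= y >> 16;
    (x, y)
}
```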