#programming question, TL;DR: How do you test for an (approximately) uniform
#distribution?
Today at work, I created a piece of code that should
#partition a stream of data entities based on string keys of unknown format. The only requirements were that the same key must always be assigned to the same partition, and that the distribution should be approximately uniform (IOW, all partitions should end up roughly the same size). My approach was to apply a non-cryptographic
#hash function to the keys (defaulting to
#xxhash3), XOR-fold the hash down to 32 bits, and then take the result as an unsigned integer modulo the desired number of partitions.
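In case it helps to see the idea concretely, here's a rough sketch of that hash-fold-modulo scheme. Note the function name is made up, and since xxhash3 isn't in the Python standard library, the first 8 bytes of SHA-256 stand in for "some stable 64-bit hash" — the folding and modulo steps are the point:

```python
import hashlib

def partition_of(key: str, num_partitions: int) -> int:
    """Map a string key to a stable partition index."""
    # Stand-in for xxhash3: any stable 64-bit hash works for this sketch;
    # here the first 8 bytes of SHA-256 (stdlib) play that role.
    h64 = int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")
    # XOR-fold the 64-bit value down to 32 bits
    h32 = (h64 >> 32) ^ (h64 & 0xFFFFFFFF)
    return h32 % num_partitions
```

Same key in, same partition out, every time — the uniformity part is what the rest of this post is about.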
I normally only write code for my private projects (as a software architect, I rarely have the time to touch any code at work, unfortunately), and there I'd certainly test something like this on some large samples of input data, but probably just once, manually.

But for work, I felt this should be covered by a
#unittest. I also think at least one set of input data should be somehow "random" (while others should contain "patterns"). My issue is with unit-testing the "random" input case. One test I wrote feeds 16k GUIDs (in string representation) to my partitioner configured for 13 partitions and checks that the ratio between the largest and smallest partition stays < 2, so a
very relaxed check. Still, doubt remains, because there's no way to guarantee this test won't go "red"
eventually.
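For reference, the relaxed test I described looks roughly like this (self-contained sketch: the stand-in partitioner below uses a stdlib hash instead of xxhash3, and all names are mine, not the real code's):

```python
import hashlib
import uuid
from collections import Counter

def partition_of(key: str, num_partitions: int) -> int:
    # Stand-in partitioner for the sketch (the real code uses xxhash3)
    h64 = int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")
    return ((h64 >> 32) ^ (h64 & 0xFFFFFFFF)) % num_partitions

def test_guid_distribution_is_roughly_uniform():
    # 16k random GUIDs into 13 partitions
    counts = Counter(partition_of(str(uuid.uuid4()), 13) for _ in range(16_000))
    assert len(counts) == 13  # every partition received keys
    # very relaxed bound: largest partition less than twice the smallest
    assert max(counts.values()) / min(counts.values()) < 2
```

With ~1230 expected keys per partition, the factor-2 bound is many standard deviations away from typical fluctuation — but, as said, nothing *guarantees* it.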
I now see several possible options:
- just ignore this because hell freezing is more likely than that test going red ...
- don't even attempt to test the resulting distribution on "random" input
- bite the bullet and write some extra code that creates "random" strings (unicode, random length within some limits) from a seeded PRNG, so the sequence is predictable
What do you think?

The latter option kind of sounds best, but then the complexity of the test will probably exceed the complexity of the code under test.
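For what it's worth, that generator doesn't have to be much code. A sketch of the seeded-PRNG idea (names, seed, and length limits are my own assumptions):

```python
import random

def deterministic_random_strings(n: int, seed: int = 20240101,
                                 min_len: int = 1, max_len: int = 64):
    """Yield n pseudo-random unicode strings, identically for a given seed."""
    rng = random.Random(seed)  # fixed seed -> same sequence on every run
    for _ in range(n):
        length = rng.randint(min_len, max_len)
        # stay below the surrogate range (0xD800) so every code point is valid
        yield "".join(chr(rng.randrange(0x20, 0xD800)) for _ in range(length))
```

Because the sequence is fully determined by the seed, the test either always passes or always fails — no "eventually red" risk, at the cost of only exercising one fixed sample.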
