Part 6 of floating point data compression adventures! Lame attempts at trying to speed up the data filtering part. https://aras-p.info/blog/2023/02/18/Float-Compression-6-Filtering-Optimization/
@aras *whispering into ear* try multiple passes but not over the whole data, un-delta ballpark 1k-4k data at a time to the stack and reorder from there
@aras *more whispering* splitting into 16 streams (and going back, it's self-inverse) is a 16x16 byte matrix transpose and I wrote about those 10 years ago https://fgiesen.wordpress.com/2013/07/09/simd-transposes-1/ https://fgiesen.wordpress.com/2013/08/29/simd-transposes-2/
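A scalar sketch of that observation, for one 256-byte block (the linked posts cover the actual SIMD unpack-based transposes; `split16` is a name I made up here):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Split a 256-byte block into 16 streams of 16 bytes each. Viewed as a
   16x16 byte matrix, this is exactly a transpose: dst[row][col] = src[col][row].
   Since a square transpose is its own inverse, applying split16 twice
   returns the original block -- no separate "unsplit" is needed. */
static void split16(const uint8_t src[256], uint8_t dst[256])
{
    for (int stream = 0; stream < 16; stream++)
        for (int i = 0; i < 16; i++)
            dst[stream * 16 + i] = src[i * 16 + stream];
}
```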
@aras if you want good SIMD perf, _the_ go-to strat is to chain multiple passes, each of which is dumb as a rock, chunked into pieces that fit comfortably in L1
@rygorous what seems to be killing perf in my case is scattered writes into memory, since each byte out of, say, stride=16 data gets scattered really far away from the others. Not reordering "everything", but rather working in ~1MB blocks, gets the perf up quite a bit, at a tiny loss of compression ratio.
But then trying your "un-delta 1k-4k into stack" on top of that has not gotten faster so far. Though I might very well be overlooking something.
@aras you shouldn't need to be introducing blocks in the format at all for this
@aras (or maybe I'm misunderstanding what your format is)
@aras Your stores of the output data should always be writing large contiguous chunks (preferably 64B+) at a time. No individual byte stores anywhere, 4+ contiguous SIMD stores please.
@aras The main work loop should be un-delta-ing something like 256B-1K each from the 16 streams, and then you do the transpose+write
@aras ..actually, for this particular task (with 16 byte "channels") there really shouldn't be any prefix sums in this at all
@aras you're already transposing; it's better to do transpose first and then you're left with a much simpler "vertical" sum (bog-standard 16-lane byte vector add).
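A scalar sketch of that point (my naming; the "vertical" inner loop is what becomes a single 16-lane byte vector add once vectorized), assuming each of the 16 byte positions was delta-coded independently:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Matching encoder for the round-trip check: per-byte-position deltas. */
static void delta_vertical(uint8_t *data, size_t nelems)
{
    uint8_t prev[16] = {0};
    for (size_t e = 0; e < nelems; e++)
        for (int i = 0; i < 16; i++) {
            uint8_t cur = data[e * 16 + i];
            data[e * 16 + i] = (uint8_t)(cur - prev[i]);
            prev[i] = cur;
        }
}

/* Once data is back in 16-byte elements (i.e. after the transpose),
   resolving the deltas is just prev += delta in each of the 16 byte
   lanes: one vertical byte add per element, no prefix sum across lanes. */
static void undelta_vertical(uint8_t *data, size_t nelems)
{
    uint8_t prev[16] = {0};
    for (size_t e = 0; e < nelems; e++)
        for (int i = 0; i < 16; i++)
            data[e * 16 + i] = prev[i] = (uint8_t)(prev[i] + data[e * 16 + i]);
}
```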
Now, reading from 16 streams at the same time for the transpose is generally fraught with peril depending on the stride between streams, mainly for cache associativity reasons. (For strides of k*2^l, for the right integer k and l, it will fall off a cliff.)
@aras I would still go for a two-pass approach here, conceptually:
for (chunks of N elements) {
    for (groups of 4 streams) {
        read and interleave N values from 4 streams each, store to stack
    }
    for (elements) {
        read and interleave groups of 4 streams from stack, sum into running total, store to dest
    }
}
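Sketching that loop structure in scalar C (the names `undelta_16streams` and `CHUNK`, and the exact staging layout, are mine, not necessarily Fabian's scheme; a SIMD version would turn the inner loops into unpack/interleave ops and one 16-byte vector add per element):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { CHUNK = 256 };  /* elements per chunk; staging fits easily in L1 */

/* streams[s] holds nelems delta bytes for byte position s of each 16-byte
   element. dst receives nelems contiguous 16-byte elements, deltas resolved. */
static void undelta_16streams(const uint8_t *streams[16], uint8_t *dst,
                              size_t nelems)
{
    uint8_t running[16] = {0};     /* per-stream running totals */
    uint8_t stage[4][CHUNK * 4];   /* stack staging: 4 groups of 4 streams */

    for (size_t base = 0; base < nelems; base += CHUNK) {
        size_t n = nelems - base < CHUNK ? nelems - base : CHUNK;

        /* pass 1: interleave each group of 4 streams into 32-bit units */
        for (int g = 0; g < 4; g++)
            for (size_t e = 0; e < n; e++)
                for (int i = 0; i < 4; i++)
                    stage[g][e * 4 + i] = streams[g * 4 + i][base + e];

        /* pass 2: gather the four 32-bit groups per element, resolve
           deltas, store one contiguous 16-byte element at a time */
        for (size_t e = 0; e < n; e++) {
            uint8_t *out = dst + (base + e) * 16;
            for (int g = 0; g < 4; g++)
                for (int i = 0; i < 4; i++) {
                    int s = g * 4 + i;
                    running[s] = (uint8_t)(running[s] + stage[g][e * 4 + i]);
                    out[s] = running[s];
                }
        }
    }
}
```

The output side only ever writes whole 16-byte elements to contiguous destinations, matching the "no individual byte stores" rule above.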
@aras the first pass produces 32-bit values consisting of byte-wise deltas (if you have groups other than 16B this part stays the same); the second pass assembles the 32-bit groups back into 4-element vectors and resolves the deltas.
@aras That second half is different if you have other component counts. It's still prefix-sum-y in that case, but summing across larger groups is normally still substantially cheaper than the full log2(nelems) reduction when you sum across all lanes in a vector.
@rygorous got it, thanks! A mistake I made is: I split N byte structs into N byte streams and then delta all of that, instead of “delta each of N streams”. With the latter, I can do exactly what you wrote, so I’m gonna try that!
(EXR image format has the same kinda-mistake, it splits halfs/floats into byte streams but then delta encodes that one resulting large sequence)
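A small sketch of the difference (the 4-stream layout and `delta_encode` are hypothetical, just to show the boundary effect): delta over one big concatenated sequence differs from per-stream deltas exactly at stream starts, where the former carries the previous stream's last byte across the boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Delta-encode n bytes against a running predecessor, starting from 0. */
static void delta_encode(const uint8_t *src, uint8_t *dst, size_t n)
{
    uint8_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        dst[i] = (uint8_t)(src[i] - prev);
        prev = src[i];
    }
}
```

Calling this once over the whole split buffer is the EXR-style variant; calling it once per stream is the variant that lets each stream be un-deltaed independently (and hence chunked).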
@aras @rygorous Catching up on the thread
I started applying the filters to texture data, but the extra buffer I needed for the shuffle wasn't great, so I ended up applying the shuffle in 32K blocks. That lost a bit of compression, but nothing horrible. This wasn't about perf at all, just saving the big allocation.
Now, doing both filters in chunks and thinking of them as byte streams (deltaed separately) seems like what a grown-up would do, and makes lots of sense.