@aras *whispering into ear* try multiple passes but not over the whole data, un-delta ballpark 1k-4k data at a time to the stack and reorder from there

@aras if you want good SIMD perf, _the_ go-to strat is to chain multiple passes, each of which is dumb as a rock, chunked into pieces that fit comfortably in L1

@rygorous what seems to be killing perf in my case is scattered writes into memory, since each byte of, say, stride=16 data gets scattered far away from the others. Just not reordering "everything", but doing it in ~1MB-sized blocks instead, gets the perf up quite a bit, at a tiny loss of compression ratio.

But then trying your "un-delta 1k-4k into stack" on top of that so far did not get faster. But I very well might be overlooking smth.

@aras you shouldn't need to be introducing blocks in the format at all for this

@aras (or maybe I'm misunderstanding what your format is)

@aras Your stores of the output data should always be writing large contiguous chunks (preferably 64B+) at a time. No individual byte stores anywhere, 4+ contiguous SIMD stores please.

@aras The main work loop should be un-delta-ing something like 256B-1K each from the 16 streams, and then you do the transpose+write

@aras ..actually, for this particular task (with 16 byte "channels") there really shouldn't be any prefix sums in this at all

@aras you're already transposing; it's better to do transpose first and then you're left with a much simpler "vertical" sum (bog-standard 16-lane byte vector add).

Now, reading from 16 streams at the same time for the transpose is generally fraught with peril depending on the stride between streams, mainly for cache associativity reasons. (For strides of k*2^l, with integer k and the right l, it will fall off a cliff.)

@aras I would still go for a two-pass approach here, conceptually:

for (chunks of N elements) {
    for groups of 4 streams {
        read and interleave N values from 4 streams each, store to stack
    }
    for elements {
        read and interleave groups of 4 streams from stack, sum into running total, store to dest
    }
}

@aras the first pass produces 32-bit values consisting of byte-wise deltas (if you have groups other than 16B this part stays the same); the second pass assembles the 32-bit groups back into 4-element vectors and resolves the deltas.
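
To make the shape of that concrete, here is a minimal C/SSE2 sketch of the two-pass loop for the 16-stream case (a sketch, not code from the thread; it assumes the 16 delta-encoded byte streams sit back to back in memory, that the element count is a multiple of 16, and a made-up KCHUNK chunk size):

#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

// Hypothetical chunk size: 256 elements per stream -> 4KB of stack scratch total.
enum { KCHUNK = 256 };

// src: 16 delta-encoded byte streams of `count` bytes each, stored back to back.
// dst: `count` reconstructed 16-byte elements. `count` assumed to be a multiple of 16.
void undelta_16_streams(const uint8_t* src, uint8_t* dst, size_t count)
{
    __m128i prev = _mm_setzero_si128();            // running totals, one byte lane per stream

    for (size_t base = 0; base < count; base += KCHUNK)
    {
        size_t n = (count - base < KCHUNK) ? (count - base) : KCHUNK;
        uint32_t groups[4][KCHUNK];                // pass 1 output lives on the stack

        // pass 1: for each group of 4 streams, interleave bytes into 32-bit values
        for (int g = 0; g < 4; g++)
        {
            const uint8_t* s0 = src + (size_t)(g * 4 + 0) * count + base;
            const uint8_t* s1 = src + (size_t)(g * 4 + 1) * count + base;
            const uint8_t* s2 = src + (size_t)(g * 4 + 2) * count + base;
            const uint8_t* s3 = src + (size_t)(g * 4 + 3) * count + base;
            for (size_t i = 0; i < n; i += 16)
            {
                __m128i a = _mm_loadu_si128((const __m128i*)(s0 + i));
                __m128i b = _mm_loadu_si128((const __m128i*)(s1 + i));
                __m128i c = _mm_loadu_si128((const __m128i*)(s2 + i));
                __m128i d = _mm_loadu_si128((const __m128i*)(s3 + i));
                __m128i ab0 = _mm_unpacklo_epi8(a, b), ab1 = _mm_unpackhi_epi8(a, b);
                __m128i cd0 = _mm_unpacklo_epi8(c, d), cd1 = _mm_unpackhi_epi8(c, d);
                __m128i* out = (__m128i*)&groups[g][i];
                _mm_storeu_si128(out + 0, _mm_unpacklo_epi16(ab0, cd0));
                _mm_storeu_si128(out + 1, _mm_unpackhi_epi16(ab0, cd0));
                _mm_storeu_si128(out + 2, _mm_unpacklo_epi16(ab1, cd1));
                _mm_storeu_si128(out + 3, _mm_unpackhi_epi16(ab1, cd1));
            }
        }

        // pass 2: reassemble each element from the four 32-bit groups, resolve deltas
        // with a plain "vertical" byte add, and write full 16B contiguous stores.
        // (A tuned version would handle 4 elements at a time with another unpack step.)
        for (size_t e = 0; e < n; e++)
        {
            __m128i v = _mm_set_epi32((int32_t)groups[3][e], (int32_t)groups[2][e],
                                      (int32_t)groups[1][e], (int32_t)groups[0][e]);
            prev = _mm_add_epi8(prev, v);
            _mm_storeu_si128((__m128i*)(dst + (base + e) * 16), prev);
        }
    }
}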

@aras That second half is different if you have other component counts. It's still prefix-sum-y in that case, but summing across larger groups is normally still substantially cheaper than the full log2(nelems) reduction when you sum across all lanes in a vector.
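
As an illustration of that difference (a sketch, assuming SSE2 and 4-byte groups): the within-group sum only needs two shift+add steps per vector, versus four for the full 16-lane prefix sum.

#include <emmintrin.h>

// Byte-wise prefix sum within each 4-byte group of v: afterwards, byte i of each
// 32-bit group holds the sum of that group's bytes 0..i.
static inline __m128i prefix_sum_u8_groups4(__m128i v)
{
    v = _mm_add_epi8(v, _mm_slli_epi32(v, 8));    // add the byte 1 to the left (within the group)
    v = _mm_add_epi8(v, _mm_slli_epi32(v, 16));   // add the bytes 2-3 to the left
    return v;
}

// For comparison, the full 16-lane version needs log2(16) = 4 steps.
static inline __m128i prefix_sum_u8_all16(__m128i v)
{
    v = _mm_add_epi8(v, _mm_slli_si128(v, 1));
    v = _mm_add_epi8(v, _mm_slli_si128(v, 2));
    v = _mm_add_epi8(v, _mm_slli_si128(v, 4));
    v = _mm_add_epi8(v, _mm_slli_si128(v, 8));
    return v;
}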

@rygorous got it, thanks! A mistake I made is: I split N-byte structs into N byte streams and then delta-encoded all of that as one sequence, instead of "delta each of the N streams". With the latter, I can do exactly what you wrote, so I'm gonna try that!

(The EXR image format has the same kinda-mistake: it splits halves/floats into byte streams but then delta encodes that one resulting large sequence.)
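
For reference, the per-stream version of the encode-side filter looks something like this (plain C sketch, not the actual blog code; kStride and the function name are made up):

#include <stdint.h>
#include <stddef.h>

enum { kStride = 16 };   // assumed struct size

// Split `count` structs of kStride bytes into kStride byte streams and delta-encode
// each stream on its own, so the decoder can un-delta each stream independently
// (rather than delta-encoding one long concatenated stream).
void filter_split_delta(const uint8_t* src, uint8_t* dst, size_t count)
{
    for (size_t s = 0; s < kStride; s++)
    {
        uint8_t prev = 0;
        uint8_t* out = dst + s * count;            // stream s
        for (size_t e = 0; e < count; e++)
        {
            uint8_t v = src[e * kStride + s];      // byte s of struct e
            out[e] = (uint8_t)(v - prev);          // delta within this stream only
            prev = v;
        }
    }
}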

@aras @rygorous Catching up on the thread 🧑‍🏫

I was starting to apply the filters on texture data, but the extra buffer I needed for the shuffle wasn't super nice, so I ended up applying the shuffle in 32K blocks. Which was a bit less compression, but not horrible. I wasn't thinking about perf at all, just saving the big alloc.

Now, doing both filters in chunks and thinking about it in terms of byte streams (and delta-ing those separately) seems like what a grown-up would do and makes lots of sense 😊

@rygorous so doing what you suggest is a large speedup, noice. But I see practically no difference between fetching 16B from each stream vs. fetching any larger amount (256B, 1K, etc.). I guess that should not happen, and I did some stupids?

@aras 16B is fine most of the time but you should see some really nasty anomalies with 16B fetches when the distances between streams are at or near multiples of medium to large powers of 2 (multiples of ~4KB to ~64KB are where the danger zone starts) that should mostly disappear with the larger amounts.

@aras E.g. for the 16-stream case, the risk when doing 16 loads, a 16x16 transpose, then a sum all in one go is that if the distances are a multiple of 4k or similar, all the source streams want to go into the same set in the L1D cache, and possibly clash in higher cache levels as well depending on how the physical addresses work out.

@aras with the now-common 64B or 128B cache lines, a 16B fetch means that each cache line will get fetched, 16B of it used, and then before you use it next time it might get forcibly evicted again, so you get these really bad perf cliffs around certain magic numbers.

@aras Splitting it into groups of 4 streams and doing larger element counts within them gets you out of the woods on any L1D that's 4-way associative or higher no matter what. Even if there's thrashing at some levels that kicks out that data earlier than you'd like, you can greatly reduce the number of cache lines getting re-fetched.
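
A concrete illustration of the aliasing, with assumed cache parameters (32KB, 8-way L1D, 64B lines, i.e. 64 sets): the set index comes from address bits 6..11, so streams spaced a multiple of 4KB apart all map to the same set, and 16 of them fight over 8 ways, while a group of 4 fits even in a 4-way cache.

#include <stdint.h>
#include <stdio.h>

// Assumed L1D geometry for illustration only: 32KB, 8-way, 64B lines -> 64 sets.
enum { kLineSize = 64, kNumSets = 64, kNumWays = 8 };

static unsigned l1d_set(uintptr_t addr)
{
    return (unsigned)((addr / kLineSize) % kNumSets);   // bits 6..11 pick the set
}

int main(void)
{
    uintptr_t base = 0x100000;
    size_t stream_dist = 4096;                     // streams spaced 4KB apart
    for (int s = 0; s < 16; s++)
        printf("stream %2d -> set %u\n", s, l1d_set(base + s * stream_dist));
    // All 16 streams print the same set: 16 lines competing for kNumWays = 8 ways,
    // whereas touching only 4 streams at a time needs just 4 ways per set.
    return 0;
}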

@aras Anyway, for this approach, you should expect to end up reasonably close to memcpy speed for large inputs, i.e. the main contribution to cost should just be the cost of doing a whole extra pass over the data.

@aras There is another big step-function change in decode perf available after this one, if you're OK with a lower compression ratio but want faster encode/decode.

Namely, cut the source data into chunks, then filter and compress each chunk independently. 64k-256k is about the right ballpark these days; in this case, since the transform isn't in-place, I'd stay <=128k.

@aras That costs you compression ratio since you lose far matches and increase constant overheads, but it makes ~all the LZ matches hit L1 or L2 and your filter likewise has all the data in L2, so instead of streaming decomp data into memory, back into cache for pass 2, then un-filtered data back out into mem, you only stream out decomp data once, when it's done.
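
A sketch of that chunked layout (the lz_compress function, the chunk size constant, and the missing per-chunk headers are all stand-ins, not any real compressor's API):

#include <stdint.h>
#include <stddef.h>

enum { kChunkSize = 128 * 1024 };   // <=128K since the transform isn't in-place

// Hypothetical compressor entry point; stands in for whatever LZ is actually used.
size_t lz_compress(const uint8_t* src, size_t src_size, uint8_t* dst);

// Encode-side filter from the sketch earlier in the thread.
void filter_split_delta(const uint8_t* src, uint8_t* dst, size_t count);

// Filter and compress each chunk independently: loses far matches and adds
// per-chunk overhead, but keeps the LZ window and the filter working set in L1/L2.
size_t compress_chunked(const uint8_t* src, size_t size,
                        uint8_t* filtered_tmp, uint8_t* dst)
{
    size_t out_pos = 0;
    for (size_t pos = 0; pos < size; pos += kChunkSize)
    {
        size_t chunk = (size - pos < kChunkSize) ? (size - pos) : kChunkSize;
        filter_split_delta(src + pos, filtered_tmp, chunk / 16);     // 16B structs; chunk assumed a multiple of 16
        out_pos += lz_compress(filtered_tmp, chunk, dst + out_pos);  // plus a chunk header in a real format
    }
    return out_pos;
}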

@aras In the specific case of Oodle where you can (and should for ideal perf) give us workspace memory for the decoders, it's also a very good idea to use the same memory for other passes that run on that data after decomp is done. Basically have one block of workspace and keep using that for all temporary storage that's a bit too large to put on the stack (if any).

@aras That's because that memory, especially near the front (we use a linear allocator and the workspace mem estimate is for the worst case), is already warm in your L2 or possibly even L1D caches after the decoder finishes - and if you keep using it for your temps (if any), it also stays warm until the next chunk decode comes around.

@aras If you set things up nicely, you end up with a de facto ~128k-256k or so of thread-local scratchpad memory that never leaves its core's L2, and pretty much the only L3 or memory accesses in there are when streaming compressed data in or fully decompressed data out.
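
A sketch of that workspace reuse (decode_chunk is a hypothetical stand-in, not the actual Oodle API; the worst-case workspace size is made up):

#include <stdint.h>
#include <stddef.h>

// One thread-local scratch block, reused for every per-chunk temporary
// (decoder workspace, filter scratch, ...) so it stays warm in that core's L1/L2.
typedef struct {
    uint8_t* base;
    size_t   size;
    size_t   used;
} Scratch;

static void* scratch_alloc(Scratch* s, size_t n)
{
    n = (n + 15) & ~(size_t)15;                   // keep 16B alignment
    void* p = (s->used + n <= s->size) ? s->base + s->used : NULL;
    if (p) s->used += n;
    return p;
}

// Hypothetical decoder entry point; `workspace` stands in for the explicit
// decoder/workspace memory a real API (e.g. Oodle) lets you pass in.
size_t decode_chunk(const uint8_t* comp, size_t comp_size,
                    uint8_t* raw, size_t raw_size,
                    void* workspace, size_t workspace_size);

// Un-delta pass from the sketch earlier in the thread.
void undelta_16_streams(const uint8_t* src, uint8_t* dst, size_t count);

void decode_and_unfilter_chunk(const uint8_t* comp, size_t comp_size,
                               uint8_t* out, size_t raw_size, Scratch* s)
{
    s->used = 0;                                   // linear allocator: reset per chunk
    size_t ws_size = 64 * 1024;                    // made-up worst-case workspace estimate
    void* workspace = scratch_alloc(s, ws_size);
    uint8_t* filtered = scratch_alloc(s, raw_size);

    decode_chunk(comp, comp_size, filtered, raw_size, workspace, ws_size);
    undelta_16_streams(filtered, out, raw_size / 16);   // filter pass reuses the still-warm memory
}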

@aras a very desirable characteristic that we in The Biz, by which I mean mainly Jeff, who has never entirely left the 80s, like to call "The Sweet Action"

@rygorous yeah I discovered this one already :)

@aras In that case, you are now almost out of major levers to pull (except for throwing wider vectors at it, where available)

@rygorous you forgot the major remaining ultimate optimization: encode data to json, send to the cloud, where we process it via node.js! *ducks*

@aras *stands in the rain, looking furious* You have made an enemy today. GOOD DAY, sir.