@aras *whispering into ear* try multiple passes but not over the whole data, un-delta ballpark 1k-4k data at a time to the stack and reorder from there

@aras if you want good SIMD perf, _the_ go-to strat is to chain multiple passes, each of which is dumb as a rock, chunked into pieces that fit comfortably in L1

@rygorous what seems to be killing perf in my case is scattered writes into memory, since each byte of, say, stride=16 data gets scattered far away from the others. Just not reordering "everything", but doing it in ~1MB-sized blocks instead, gets the perf up quite a bit, at a tiny loss of compression ratio.

But then trying your "un-delta 1k-4k into stack" on top of that so far did not get faster. But I very well might be overlooking smth.

@aras you shouldn't need to be introducing blocks in the format at all for this

@aras (or maybe I'm misunderstanding what your format is)

@aras Your stores of the output data should always be writing large contiguous chunks (preferably 64B+) at a time. No individual byte stores anywhere, 4+ contiguous SIMD stores please.

@aras The main work loop should be un-delta-ing something like 256B-1K each from the 16 streams, and then you do the transpose+write

@aras ..actually, for this particular task (with 16 byte "channels") there really shouldn't be any prefix sums in this at all

@aras you're already transposing; it's better to do transpose first and then you're left with a much simpler "vertical" sum (bog-standard 16-lane byte vector add).

Now, reading from 16 streams at the same time for the transpose is generally fraught with peril depending on the stride between streams, mainly for cache associativity reasons. (For strides of k*2^l, with integer k and the right l, it will fall off a cliff.)

@aras I would still go for a two-pass approach here, conceptually:

for (chunks of N elements) {
    for groups of 4 streams {
        read and interleave N values from 4 streams each, store to stack
    }
    for elements {
        read and interleave groups of 4 streams from stack, sum into running total, store to dest
    }
}

@aras the first pass produces 32-bit values consisting of byte-wise deltas (if you have groups other than 16B this part stays the same); the second pass assembles the 32-bit groups back into 4-element vectors and resolves the deltas.
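
To make the shape of that concrete, here is a minimal C/SSE2 sketch of the two-pass loop for the 16-stream case (a sketch, not code from the thread; it assumes the 16 delta-encoded byte streams sit back to back in memory, that the element count is a multiple of 16, and a made-up KCHUNK chunk size):

#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

// Hypothetical chunk size: 256 elements per stream -> 4KB of stack scratch total.
enum { KCHUNK = 256 };

// src: 16 delta-encoded byte streams of `count` bytes each, stored back to back.
// dst: `count` reconstructed 16-byte elements. `count` assumed to be a multiple of 16.
void undelta_16_streams(const uint8_t* src, uint8_t* dst, size_t count)
{
    __m128i prev = _mm_setzero_si128();            // running totals, one byte lane per stream

    for (size_t base = 0; base < count; base += KCHUNK)
    {
        size_t n = (count - base < KCHUNK) ? (count - base) : KCHUNK;
        uint32_t groups[4][KCHUNK];                // pass 1 output lives on the stack

        // pass 1: for each group of 4 streams, interleave bytes into 32-bit values
        for (int g = 0; g < 4; g++)
        {
            const uint8_t* s0 = src + (size_t)(g * 4 + 0) * count + base;
            const uint8_t* s1 = src + (size_t)(g * 4 + 1) * count + base;
            const uint8_t* s2 = src + (size_t)(g * 4 + 2) * count + base;
            const uint8_t* s3 = src + (size_t)(g * 4 + 3) * count + base;
            for (size_t i = 0; i < n; i += 16)
            {
                __m128i a = _mm_loadu_si128((const __m128i*)(s0 + i));
                __m128i b = _mm_loadu_si128((const __m128i*)(s1 + i));
                __m128i c = _mm_loadu_si128((const __m128i*)(s2 + i));
                __m128i d = _mm_loadu_si128((const __m128i*)(s3 + i));
                __m128i ab0 = _mm_unpacklo_epi8(a, b), ab1 = _mm_unpackhi_epi8(a, b);
                __m128i cd0 = _mm_unpacklo_epi8(c, d), cd1 = _mm_unpackhi_epi8(c, d);
                __m128i* out = (__m128i*)&groups[g][i];
                _mm_storeu_si128(out + 0, _mm_unpacklo_epi16(ab0, cd0));
                _mm_storeu_si128(out + 1, _mm_unpackhi_epi16(ab0, cd0));
                _mm_storeu_si128(out + 2, _mm_unpacklo_epi16(ab1, cd1));
                _mm_storeu_si128(out + 3, _mm_unpackhi_epi16(ab1, cd1));
            }
        }

        // pass 2: reassemble each element from the four 32-bit groups, resolve deltas
        // with a plain "vertical" byte add, and write full 16B contiguous stores.
        // (A tuned version would handle 4 elements at a time with another unpack step.)
        for (size_t e = 0; e < n; e++)
        {
            __m128i v = _mm_set_epi32((int32_t)groups[3][e], (int32_t)groups[2][e],
                                      (int32_t)groups[1][e], (int32_t)groups[0][e]);
            prev = _mm_add_epi8(prev, v);
            _mm_storeu_si128((__m128i*)(dst + (base + e) * 16), prev);
        }
    }
}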

@aras That second half is different if you have other component counts. It's still prefix-sum-y in that case, but summing across larger groups is normally still substantially cheaper than the full log2(nelems) reduction when you sum across all lanes in a vector.
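
As an illustration of that difference (a sketch, assuming SSE2 and 4-byte groups): the within-group sum only needs two shift+add steps per vector, versus four for the full 16-lane prefix sum.

#include <emmintrin.h>

// Byte-wise prefix sum within each 4-byte group of v: afterwards, byte i of each
// 32-bit group holds the sum of that group's bytes 0..i.
static inline __m128i prefix_sum_u8_groups4(__m128i v)
{
    v = _mm_add_epi8(v, _mm_slli_epi32(v, 8));    // add the byte 1 to the left (within the group)
    v = _mm_add_epi8(v, _mm_slli_epi32(v, 16));   // add the bytes 2-3 to the left
    return v;
}

// For comparison, the full 16-lane version needs log2(16) = 4 steps.
static inline __m128i prefix_sum_u8_all16(__m128i v)
{
    v = _mm_add_epi8(v, _mm_slli_si128(v, 1));
    v = _mm_add_epi8(v, _mm_slli_si128(v, 2));
    v = _mm_add_epi8(v, _mm_slli_si128(v, 4));
    v = _mm_add_epi8(v, _mm_slli_si128(v, 8));
    return v;
}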

@rygorous got it, thanks! A mistake I made is: I split N-byte structs into N byte streams and then delta-encoded all of that as one sequence, instead of "delta each of the N streams". With the latter, I can do exactly what you wrote, so I'm gonna try that!

(The EXR image format has the same kinda-mistake: it splits halves/floats into byte streams but then delta encodes that one resulting large sequence.)
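
For reference, the per-stream version of the encode-side filter looks something like this (plain C sketch, not the actual blog code; kStride and the function name are made up):

#include <stdint.h>
#include <stddef.h>

enum { kStride = 16 };   // assumed struct size

// Split `count` structs of kStride bytes into kStride byte streams and delta-encode
// each stream on its own, so the decoder can un-delta each stream independently
// (rather than delta-encoding one long concatenated stream).
void filter_split_delta(const uint8_t* src, uint8_t* dst, size_t count)
{
    for (size_t s = 0; s < kStride; s++)
    {
        uint8_t prev = 0;
        uint8_t* out = dst + s * count;            // stream s
        for (size_t e = 0; e < count; e++)
        {
            uint8_t v = src[e * kStride + s];      // byte s of struct e
            out[e] = (uint8_t)(v - prev);          // delta within this stream only
            prev = v;
        }
    }
}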

@aras @rygorous Catching up on the thread 🧑‍🏫

I was starting to apply the filters on texture data, but the extra buffer I needed for the shuffle wasn't super nice, so I ended up applying the shuffle in 32K blocks. Which was a bit less compression, but not horrible. I wasn't thinking about perf at all, just saving the big alloc.

Now, doing both filters in chunks and thinking about it in terms of byte streams (and delta-ing those separately) seems like what a grown-up would do and makes lots of sense 😊

@rygorous so doing what you suggest is a large speedup, noice. But I see practically no difference between fetching 16B from each stream vs. fetching any larger amount (256B, 1K, etc.). I guess that should not happen, and I did some stupids?

@aras 16B is fine most of the time but you should see some really nasty anomalies with 16B fetches when the distances between streams are at or near multiples of medium to large powers of 2 (multiples of ~4KB to ~64KB are where the danger zone starts) that should mostly disappear with the larger amounts.

@aras E.g. for the 16-stream case, the risk when doing 16 loads, a 16x16 transpose, then a sum all in one go is that if the distances are a multiple of 4k or similar, all the source streams want to go into the same set in the L1D cache, and possibly clash in higher cache levels as well depending on how the physical addresses work out.

@aras with the now-common 64B or 128B cache lines, a 16B fetch means that each cache line will get fetched, 16B of it used, and then before you use it next time it might get forcibly evicted again, so you get these really bad perf cliffs around certain magic numbers.

@aras Splitting it into groups of 4 streams and doing larger element counts within them gets you out of the woods on any L1D that's 4-way associative or higher no matter what. Even if there's thrashing at some levels that kicks out that data earlier than you'd like, you can greatly reduce the number of cache lines getting re-fetched.
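
A concrete illustration of the aliasing, with assumed cache parameters (32KB, 8-way L1D, 64B lines, i.e. 64 sets): the set index comes from address bits 6..11, so streams spaced a multiple of 4KB apart all map to the same set, and 16 of them fight over 8 ways, while a group of 4 fits even in a 4-way cache.

#include <stdint.h>
#include <stdio.h>

// Assumed L1D geometry for illustration only: 32KB, 8-way, 64B lines -> 64 sets.
enum { kLineSize = 64, kNumSets = 64, kNumWays = 8 };

static unsigned l1d_set(uintptr_t addr)
{
    return (unsigned)((addr / kLineSize) % kNumSets);   // bits 6..11 pick the set
}

int main(void)
{
    uintptr_t base = 0x100000;
    size_t stream_dist = 4096;                     // streams spaced 4KB apart
    for (int s = 0; s < 16; s++)
        printf("stream %2d -> set %u\n", s, l1d_set(base + s * stream_dist));
    // All 16 streams print the same set: 16 lines competing for kNumWays = 8 ways,
    // whereas touching only 4 streams at a time needs just 4 ways per set.
    return 0;
}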

@aras Anyway, for this approach, you should expect to end up reasonably close to memcpy speed for large inputs, i.e. the main contribution to cost should just be the cost of doing a whole extra pass over the data.

@aras There is another big step-function change in decode perf available after this one, if you're OK with a lower compression ratio but want faster encode/decode.

Namely, cut the source data into chunks, then filter and compress each chunk independently. 64k-256k is about the right ballpark these days; in this case, since the transform isn't in-place, I'd stay <=128k.

@aras That costs you compression ratio since you lose far matches and increase constant overheads, but it makes ~all the LZ matches hit L1 or L2 and your filter likewise has all the data in L2, so instead of streaming decomp data into memory, back into cache for pass 2, then un-filtered data back out into mem, you only stream out decomp data once, when it's done.
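
A sketch of that chunked layout (the lz_compress function, the chunk size constant, and the missing per-chunk headers are all stand-ins, not any real compressor's API):

#include <stdint.h>
#include <stddef.h>

enum { kChunkSize = 128 * 1024 };   // <=128K since the transform isn't in-place

// Hypothetical compressor entry point; stands in for whatever LZ is actually used.
size_t lz_compress(const uint8_t* src, size_t src_size, uint8_t* dst);

// Encode-side filter from the sketch earlier in the thread.
void filter_split_delta(const uint8_t* src, uint8_t* dst, size_t count);

// Filter and compress each chunk independently: loses far matches and adds
// per-chunk overhead, but keeps the LZ window and the filter working set in L1/L2.
size_t compress_chunked(const uint8_t* src, size_t size,
                        uint8_t* filtered_tmp, uint8_t* dst)
{
    size_t out_pos = 0;
    for (size_t pos = 0; pos < size; pos += kChunkSize)
    {
        size_t chunk = (size - pos < kChunkSize) ? (size - pos) : kChunkSize;
        filter_split_delta(src + pos, filtered_tmp, chunk / 16);     // 16B structs; chunk assumed a multiple of 16
        out_pos += lz_compress(filtered_tmp, chunk, dst + out_pos);  // plus a chunk header in a real format
    }
    return out_pos;
}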

@aras In the specific case of Oodle where you can (and should for ideal perf) give us workspace memory for the decoders, it's also a very good idea to use the same memory for other passes that run on that data after decomp is done. Basically have one block of workspace and keep using that for all temporary storage that's a bit too large to put on the stack (if any).

@aras That's because that memory, especially near the front (we use a linear allocator and the workspace mem estimate is for the worst case), is already warm in your L2 or possibly even L1D caches after the decoder finishes - and if you keep using it for your temps (if any), it also stays warm until the next chunk decode comes around.

@aras If you set things up nicely, you end up with a de facto ~128k-256k or so of thread-local scratchpad memory that never leaves its core's L2, and pretty much the only L3 or memory accesses in there are when streaming compressed data in or fully decompressed data out.
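
A sketch of that workspace reuse (decode_chunk is a hypothetical stand-in, not the actual Oodle API; the worst-case workspace size is made up):

#include <stdint.h>
#include <stddef.h>

// One thread-local scratch block, reused for every per-chunk temporary
// (decoder workspace, filter scratch, ...) so it stays warm in that core's L1/L2.
typedef struct {
    uint8_t* base;
    size_t   size;
    size_t   used;
} Scratch;

static void* scratch_alloc(Scratch* s, size_t n)
{
    n = (n + 15) & ~(size_t)15;                   // keep 16B alignment
    void* p = (s->used + n <= s->size) ? s->base + s->used : NULL;
    if (p) s->used += n;
    return p;
}

// Hypothetical decoder entry point; `workspace` stands in for the explicit
// decoder/workspace memory a real API (e.g. Oodle) lets you pass in.
size_t decode_chunk(const uint8_t* comp, size_t comp_size,
                    uint8_t* raw, size_t raw_size,
                    void* workspace, size_t workspace_size);

// Un-delta pass from the sketch earlier in the thread.
void undelta_16_streams(const uint8_t* src, uint8_t* dst, size_t count);

void decode_and_unfilter_chunk(const uint8_t* comp, size_t comp_size,
                               uint8_t* out, size_t raw_size, Scratch* s)
{
    s->used = 0;                                   // linear allocator: reset per chunk
    size_t ws_size = 64 * 1024;                    // made-up worst-case workspace estimate
    void* workspace = scratch_alloc(s, ws_size);
    uint8_t* filtered = scratch_alloc(s, raw_size);

    decode_chunk(comp, comp_size, filtered, raw_size, workspace, ws_size);
    undelta_16_streams(filtered, out, raw_size / 16);   // filter pass reuses the still-warm memory
}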

@aras a very desirable characteristic that we in The Biz, by which I mean mainly Jeff, who has never entirely left the 80s, like to call "The Sweet Action"

@rygorous yeah I discovered this one already :)

@aras In that case, you are now almost out of major levers to pull (except for throwing wider vectors at it, where available)

@rygorous you forgot the major remaining ultimate optimization: encode data to json, send to the cloud, where we process it via node.js! *ducks*

@aras *stands in the rain, looking furious* You have made an enemy today. GOOD DAY, sir.