Part 6 of floating point data compression adventures! Lame attempts at trying to speed up the data filtering part. https://aras-p.info/blog/2023/02/18/Float-Compression-6-Filtering-Optimization/
@aras *whispering into ear* try multiple passes but not over the whole data, un-delta ballpark 1k-4k data at a time to the stack and reorder from there
@aras *more whispering* splitting into 16 streams (and going back, it's self-inverse) is a 16x16 byte matrix transpose and I wrote about those 10 years ago https://fgiesen.wordpress.com/2013/07/09/simd-transposes-1/ https://fgiesen.wordpress.com/2013/08/29/simd-transposes-2/
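A scalar sketch of that observation, for one 256-byte block (the linked posts cover the actual SIMD unpack-based transposes; `split16` is a name I made up here):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Split a 256-byte block into 16 streams of 16 bytes each. Viewed as a
   16x16 byte matrix, this is exactly a transpose: dst[row][col] = src[col][row].
   Since a square transpose is its own inverse, applying split16 twice
   returns the original block -- no separate "unsplit" is needed. */
static void split16(const uint8_t src[256], uint8_t dst[256])
{
    for (int stream = 0; stream < 16; stream++)
        for (int i = 0; i < 16; i++)
            dst[stream * 16 + i] = src[i * 16 + stream];
}
```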
@aras if you want good SIMD perf, _the_ go-to strat is to chain multiple passes, each of which is dumb as a rock, chunked into pieces that fit comfortably in L1
@rygorous what seems to be killing perf in my case is scattered writes into memory, since each byte out of, say, stride=16 data gets scattered really far away from the others. Not reordering "everything", but rather working in ~1MB blocks, gets the perf up quite a bit, at a tiny loss of compression ratio.
But then trying your "un-delta 1k-4k into stack" on top of that has not gotten faster so far. Though I might very well be overlooking something.
@aras you shouldn't need to be introducing blocks in the format at all for this
@aras (or maybe I'm misunderstanding what your format is)
@aras Your stores of the output data should always be writing large contiguous chunks (preferably 64B+) at a time. No individual byte stores anywhere, 4+ contiguous SIMD stores please.
@aras The main work loop should be un-delta-ing something like 256B-1K each from the 16 streams, and then you do the transpose+write
@aras ..actually, for this particular task (with 16 byte "channels") there really shouldn't be any prefix sums in this at all
@aras you're already transposing; it's better to do transpose first and then you're left with a much simpler "vertical" sum (bog-standard 16-lane byte vector add).
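A scalar sketch of that point (my naming; the "vertical" inner loop is what becomes a single 16-lane byte vector add once vectorized), assuming each of the 16 byte positions was delta-coded independently:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Matching encoder for the round-trip check: per-byte-position deltas. */
static void delta_vertical(uint8_t *data, size_t nelems)
{
    uint8_t prev[16] = {0};
    for (size_t e = 0; e < nelems; e++)
        for (int i = 0; i < 16; i++) {
            uint8_t cur = data[e * 16 + i];
            data[e * 16 + i] = (uint8_t)(cur - prev[i]);
            prev[i] = cur;
        }
}

/* Once data is back in 16-byte elements (i.e. after the transpose),
   resolving the deltas is just prev += delta in each of the 16 byte
   lanes: one vertical byte add per element, no prefix sum across lanes. */
static void undelta_vertical(uint8_t *data, size_t nelems)
{
    uint8_t prev[16] = {0};
    for (size_t e = 0; e < nelems; e++)
        for (int i = 0; i < 16; i++)
            data[e * 16 + i] = prev[i] = (uint8_t)(prev[i] + data[e * 16 + i]);
}
```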
Now, reading from 16 streams at the same time for the transpose is generally fraught with peril depending on the stride between streams, mainly for cache associativity reasons. (For strides of k*2^l, for the right integer k and l, it will fall off a cliff.)
@aras I would still go for a two-pass approach here, conceptually:
for (chunks of N elements) {
    for (groups of 4 streams) {
        read and interleave N values from 4 streams each, store to stack
    }
    for (elements) {
        read and interleave groups of 4 streams from stack, sum into running total, store to dest
    }
}
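Sketching that loop structure in scalar C (the names `undelta_16streams` and `CHUNK`, and the exact staging layout, are mine, not necessarily Fabian's scheme; a SIMD version would turn the inner loops into unpack/interleave ops and one 16-byte vector add per element):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { CHUNK = 256 };  /* elements per chunk; staging fits easily in L1 */

/* streams[s] holds nelems delta bytes for byte position s of each 16-byte
   element. dst receives nelems contiguous 16-byte elements, deltas resolved. */
static void undelta_16streams(const uint8_t *streams[16], uint8_t *dst,
                              size_t nelems)
{
    uint8_t running[16] = {0};     /* per-stream running totals */
    uint8_t stage[4][CHUNK * 4];   /* stack staging: 4 groups of 4 streams */

    for (size_t base = 0; base < nelems; base += CHUNK) {
        size_t n = nelems - base < CHUNK ? nelems - base : CHUNK;

        /* pass 1: interleave each group of 4 streams into 32-bit units */
        for (int g = 0; g < 4; g++)
            for (size_t e = 0; e < n; e++)
                for (int i = 0; i < 4; i++)
                    stage[g][e * 4 + i] = streams[g * 4 + i][base + e];

        /* pass 2: gather the four 32-bit groups per element, resolve
           deltas, store one contiguous 16-byte element at a time */
        for (size_t e = 0; e < n; e++) {
            uint8_t *out = dst + (base + e) * 16;
            for (int g = 0; g < 4; g++)
                for (int i = 0; i < 4; i++) {
                    int s = g * 4 + i;
                    running[s] = (uint8_t)(running[s] + stage[g][e * 4 + i]);
                    out[s] = running[s];
                }
        }
    }
}
```

The output side only ever writes whole 16-byte elements to contiguous destinations, matching the "no individual byte stores" rule above.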
@aras the first pass produces 32-bit values consisting of byte-wise deltas (if you have groups other than 16B this part stays the same); the second pass assembles the 32-bit groups back into 4-element vectors and resolves the deltas.
@aras That second half is different if you have other component counts. It's still prefix-sum-y in that case, but summing across larger groups is normally still substantially cheaper than the full log2(nelems) reduction when you sum across all lanes in a vector.
@rygorous got it, thanks! A mistake I made is: I split N byte structs into N byte streams and then delta all of that, instead of “delta each of N streams”. With the latter, I can do exactly what you wrote, so I’m gonna try that!
(EXR image format has the same kinda-mistake, it splits halfs/floats into byte streams but then delta encodes that one resulting large sequence)
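A small sketch of the difference (the 4-stream layout and `delta_encode` are hypothetical, just to show the boundary effect): delta over one big concatenated sequence differs from per-stream deltas exactly at stream starts, where the former carries the previous stream's last byte across the boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Delta-encode n bytes against a running predecessor, starting from 0. */
static void delta_encode(const uint8_t *src, uint8_t *dst, size_t n)
{
    uint8_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        dst[i] = (uint8_t)(src[i] - prev);
        prev = src[i];
    }
}
```

Calling this once over the whole split buffer is the EXR-style variant; calling it once per stream is the variant that lets each stream be un-deltaed independently (and hence chunked).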
@aras @rygorous Catching up on the thread
I started applying the filters to texture data, but the extra buffer I needed for the shuffle wasn't great, so I ended up applying the shuffle in 32K blocks. That lost a bit of compression, but nothing horrible. This wasn't about perf at all, just saving the big allocation.
Now, doing both filters in chunks and thinking of them as byte streams (deltaed separately) seems like what a grown-up would do, and makes lots of sense.