https://gcc.godbolt.org/z/YhrEcb7nz
Am I correct in assuming that the fmadd and vmul/vadd versions might not give me the exact same results (down in Test)? (Double checking..)
... thanks for the confirmations. Practice agrees as well, and sometimes the sort will explode :) (this was a copy-pasted bit of some of the <algorithm> sort code)
@msinilo I think you'd probably have to check the manual for your specific CPU to know 100%, but it's probably safe to assume there's a potential precision loss due to a rounding step between the vmul/vadd that may not exist in the fmadd.
@msinilo One does round the intermediate result, the other one does not.
@msinilo
It is my understanding / assumption that that's the case, given that the fused multiply-add keeps the full-precision product internally rather than rounding it to a separate result first. So there's no rounding applied to the intermediate value.
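A minimal sketch of the difference (my own toy values, not from the godbolt link above); compile with contraction disabled (e.g. `-ffp-contract=off` on GCC/Clang) so the plain expression isn't itself turned into an FMA:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // a = 1 + 2^-12, so a*a = 1 + 2^-11 + 2^-24 exactly, which needs more bits than a float has.
    volatile float a = 1.0f + 1.0f / 4096.0f;

    float mul_then_sub = a * a - 1.0f;          // a*a is rounded first, so the 2^-24 term is lost
    float fused        = std::fma(a, a, -1.0f); // single rounding of the exact a*a - 1

    std::printf("mul+sub: %.9g\n", mul_then_sub); // 0.00048828125
    std::printf("fma    : %.9g\n", fused);        // ~0.000488340855
}
```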
@msinilo is fp:fast generally worth the trouble it causes? It can lead to even simpler cases, like the compiler turning (a+b)+c into a+(b+c), which is not the same thing. And yeah, sorting predicates can and will go wrong with things like that.
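For instance, a minimal sketch (toy values of my own) of why reassociation is not value-preserving:

```cpp
#include <cstdio>

int main() {
    double a = 1e20, b = -1e20, c = 1.0;
    std::printf("(a+b)+c = %g\n", (a + b) + c); // 1: the huge terms cancel first, then c is added
    std::printf("a+(b+c) = %g\n", a + (b + c)); // 0: c vanishes into b before the cancellation
}
```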
@solidangle @aras @msinilo /fp:broken is probably a tough sell for the team, but maybe /fp:imprecise or /fp:nondeterministic or /fp:imfeelinglucky.
@msinilo @TomF @aras @rygorous I argued for the inclusion of fast math in Rust for a while, but it's the wrong feature for the problem it's trying to solve; instead of a hammer across the whole codebase, I think more localized solutions would be better. We currently run our Rust as-is with strict math ops, and our shaders too. I suspect we take a hit for it, but I'll only give that up if we're ever /really/ desperate for performance. For now, sidestepping a whole class of bugs is much preferable.
@JasperBekkers @msinilo @aras @rygorous It might be interesting to have a compiler look at your code and tell you where it thinks something could be improved, and then you'd manually refactor things until it was happy that it couldn't improve it any more?
@TomF @JasperBekkers @msinilo @aras @rygorous my take on fast math is similar (evil!), but this proposed solution would IMO be strictly worse (for anything that is not write-once or for experts) - leading to unreadable, unclear code. I want something like local fast math with local decorators.
Halide has some partial solutions for this, where you can ask for some functions / expressions to be evaluated at a specific point and folded out.
@BartWronski @TomF @JasperBekkers @msinilo @aras I'd be happy with something local, like function- or even scope-level annotations of "feel free to reassociate this", but "fast math" is an incredibly blunt tool and violates most intuitive notions of what a function even is, in a way that no other optimizations do
@BartWronski @TomF @JasperBekkers @msinilo @aras e.g. with fast math, you can have a pure function f, x==y, but f(x)!=f(y) when they're evaluated in different contexts, which is a huge departure from language semantics we would not tolerate elsewhere
@rygorous @BartWronski @TomF @msinilo @aras Fully scoped fast math has some issues; e.g. `#[fast_math]{ sqrt(1.0) }` would still be problematic in its results. Yes, you've been explicit about the operation, but still... Scoping it down to specific optimizations (contraction/reassociation/auto reciprocals) may be useful, but could be done manually or as a lint. (1/2)
@rygorous @BartWronski @TomF @msinilo @aras That way it's more explicit for the reader. Some others (approximations) could be done through explicit function calls; however, others (no signed zeros, assuming no NaN/Inf) feel to me like they'd be appropriate for scope blocks. (2/2)
@JasperBekkers @BartWronski @TomF @msinilo @aras I was assuming you would specify what was allowed, fast math is too big an umbrella anyhow.
@rygorous @BartWronski @TomF @msinilo @aras My point exactly, and I'm also trying to figure out a way to work with it; fast math's usefulness is in large part as a "code is slow, please make it fast with minimal investment on my part" kind of tool. Giving up some of that convenience might pave the way to also trading away some of its downsides.
@JasperBekkers @rygorous @BartWronski @msinilo @aras Hence my suggestion of "compiler-guided optimisations". It suggests stuff, you decide whether or not to add those annotations.
@JasperBekkers @rygorous @BartWronski @msinilo @aras We sort of have some of that already with the compiler saying "hey did you actually mean to use a double here?"
@TomF @rygorous @BartWronski @msinilo @aras I see where you're coming from, sort of, on a high level, but I'm not sure how it would actually turn out in practice. E.g. for reassociation, reciprocals or contraction, maybe it can suggest a way to rewrite the equations for you, but beyond that it feels like this quickly falls apart. Would you want the compiler to suggest approximations for functions whose inputs it likely can't know? (1/2)
@TomF @rygorous @BartWronski @msinilo @aras Can it know that, if you tell it "please assume no NaNs for this code", things will get significantly faster? How would you prevent floods of false positives or massive compiler spam? (2/2)
@JasperBekkers @rygorous @BartWronski @msinilo @aras I'm suggesting that anything the compiler would normally decide to do for you with fastmath, it instead suggests adding an annotation to allow. That's all.
@BartWronski @JasperBekkers @msinilo @aras @rygorous The improvements it can make are really only interesting in tuned kernels anyway, so becoming write-only doesn't seem like that big a deal. I just don't see any alternative - it either becomes write-only in an explicit way, or it has Magical Transforms You Can't Predict - which is really just another way of being write-only.
@TomF @JasperBekkers @msinilo @aras @rygorous I referenced Halide not by accident - it's the only language I know of that makes writing hot kernels that are readable and rewritable somewhat possible. Not perfect, and it doesn't fully deliver what it promised - but sadly, I don't know of any other non-academic, shippable languages that even try to solve this problem.
@aras @msinilo absolutely not, absolutely not by default. OK on a case-by-case basis on pieces of code, but the practice of "it's 'fast', so let's turn it on brrrr" is super evil and causes coworkers or your future self pain. Flaky, nondeterministic tests, subtle or not-so-subtle bugs, changes of functionality upon compiler upgrade or change of platform.
@aras @msinilo but the compiler using fma instead of mul and add is not necessarily a /fp:fast or -ffast-math thing - the C/C++ standards allow that kind of contraction, and for example GCC will do it by default (without -ffast-math); it must be explicitly disabled with `-ffp-contract=off` if you don't want it
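A quick way to see this (my own snippet; assumes GCC or Clang on a target with FMA instructions, e.g. AArch64 or x86-64 with -mfma): the printed value of a plain multiply-add can change with nothing but the contraction setting, no -ffast-math involved.

```cpp
#include <cstdio>

int main() {
    volatile float a = 1.0f + 1.0f / 4096.0f; // 1 + 2^-12
    volatile float c = -1.0f;
    // Written as a plain multiply then add; the compiler may legally contract it into one FMA.
    float r = a * a + c;
    std::printf("%.9g\n", r); // 0.00048828125 if not contracted, ~0.000488340855 if contracted
}
```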
@msinilo Separately from fp:fast, it’s so very wrong that std::sort (most implementations on most platforms) can do out of bounds reads or writes on predicates that don’t conform to ordering requirements…
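A minimal sketch (my own illustration, not code from the thread) of two ways a float comparator can fail std::sort's strict-weak-ordering requirement - the same kind of inconsistency fp:fast can introduce when a comparator's key computation rounds differently between calls:

```cpp
#include <cassert>
#include <limits>

int main() {
    // 1) '<=' is not a strict ordering: comp(x, x) must be false, but here it is true.
    auto bad_leq = [](float a, float b) { return a <= b; };
    assert(bad_leq(1.0f, 1.0f)); // violates irreflexivity

    // 2) With NaN in the data, '<' stops being a strict *weak* ordering: NaN is "not less"
    //    than both 1.0f and 2.0f (so all three look mutually equivalent), yet 1.0f < 2.0f,
    //    so the induced equivalence is not transitive.
    float nan = std::numeric_limits<float>::quiet_NaN();
    assert(!(nan < 1.0f) && !(1.0f < nan));
    assert(!(nan < 2.0f) && !(2.0f < nan));
    assert(1.0f < 2.0f);
    // Feed either comparator (or NaN-containing data) to std::sort and many implementations
    // can walk past the ends of the range - hence "sometimes the sort will explode".
}
```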
@msinilo Correct - "fused" multiply-add does no intermediate rounding between the multiply and the add. Now, whether or not your compiler chooses to use the fused version - that's all compiler magic to figure out.
Also, dot products are fun even without fused multiply-add, because you don't know if it's doing (X*X+Y*Y)+(Z*Z) or (X*X)+(Y*Y+Z*Z) (or some other combo), which, again because of rounding differences, can produce different results.
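A minimal sketch (toy values of my own) of the same dot product grouped two ways; compile with contraction disabled (e.g. `-ffp-contract=off`) so only the grouping differs:

```cpp
#include <cstdio>

int main() {
    volatile float x = 10000.0f, y = 1.0f, z = 2.0f;
    float left  = (x * x + y * y) + z * z; // 100000000
    float right = x * x + (y * y + z * z); // 100000008
    std::printf("%.9g vs %.9g\n", left, right);
}
```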
@dominikg @msinilo Sometimes I also think this: https://cohost.org/tomforsyth/post/943070-a-matter-of-precisio