https://gcc.godbolt.org/z/YhrEcb7nz
Am I correct in assuming that the fmadd and vmul/vadd versions might not give me the exact same results (down in Test)? (Double checking..)
... thanks for the confirmations. Practice agrees as well, and sometimes the sort will explode :) (this was a copy-pasted bit of some of the <algorithm> sort code)
@msinilo I think you'd probably have to check the manual for your specific CPU to know 100%, but it's probably safe to assume there's a potential precision loss due to a rounding step between the vmul/vadd that may not exist in the fmadd.
@msinilo One does round the intermediate result, the other one does not.
@msinilo
It is my understanding / assumption that that's the case, given that the fused multiply-add keeps the full-precision product internally rather than rounding it to a separate result first. So there's no rounding applied to the intermediate value.
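A minimal sketch of the difference (my own toy values, not from the godbolt link above); compile with contraction disabled (e.g. `-ffp-contract=off` on GCC/Clang) so the plain expression isn't itself turned into an FMA:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // a = 1 + 2^-12, so a*a = 1 + 2^-11 + 2^-24 exactly, which needs more bits than a float has.
    volatile float a = 1.0f + 1.0f / 4096.0f;

    float mul_then_sub = a * a - 1.0f;          // a*a is rounded first, so the 2^-24 term is lost
    float fused        = std::fma(a, a, -1.0f); // single rounding of the exact a*a - 1

    std::printf("mul+sub: %.9g\n", mul_then_sub); // 0.00048828125
    std::printf("fma    : %.9g\n", fused);        // ~0.000488340855
}
```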
@msinilo is fp:fast generally worth the trouble it causes? It can lead to even simpler cases, like the compiler turning (a+b)+c into a+(b+c), which is not the same thing. And yeah, sorting predicates can and will go wrong with things like that.
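For instance, a minimal sketch (toy values of my own) of why reassociation is not value-preserving:

```cpp
#include <cstdio>

int main() {
    double a = 1e20, b = -1e20, c = 1.0;
    std::printf("(a+b)+c = %g\n", (a + b) + c); // 1: the huge terms cancel first, then c is added
    std::printf("a+(b+c) = %g\n", a + (b + c)); // 0: c vanishes into b before the cancellation
}
```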
@solidangle @aras @msinilo /fp:broken is probably a tough sell for the team, but maybe /fp:imprecise or /fp:nondeterministic or /fp:imfeelinglucky.
@msinilo @TomF @aras @rygorous I argued for the inclusion of fast math in Rust for a while, but it's the wrong feature for the problem it's trying to solve; instead of a hammer across the whole codebase, I think more localized solutions would be better. We currently run our Rust as-is with strict math ops, and our shaders too. I suspect we take a hit for it, but I'll only give that up if we're ever /really/ desperate for performance. For now, sidestepping a whole class of bugs is much preferable.
@JasperBekkers @msinilo @aras @rygorous It might be interesting to have a compiler look at your code and tell you where it thinks something could be improved, and then you'd manually refactor things until it was happy that it couldn't improve it any more?
@TomF @JasperBekkers @msinilo @aras @rygorous my take on fast math is similar (evil!), but this proposed solution would IMO be strictly worse (for anything that is not write-once or for experts) - leading to unreadable, unclear code. I want something like local fast math with local decorators.
Halide has some partial solutions for this, where you can ask for some functions / expressions to be evaluated at a specific point and folded out.
@BartWronski @TomF @JasperBekkers @msinilo @aras I'd be happy with something local, like function- or even scope-level annotations of "feel free to reassociate this", but "fast math" is an incredibly blunt tool and violates most intuitive notions of what a function even is, in a way that no other optimizations do
@BartWronski @TomF @JasperBekkers @msinilo @aras e.g. with fast math, you can have a pure function f, x==y, but f(x)!=f(y) when they're evaluated in different contexts, which is a huge departure from language semantics we would not tolerate elsewhere
@rygorous @BartWronski @TomF @msinilo @aras Fully scoped fast math has some issues; e.g. `#[fast_math]{ sqrt(1.0) }` would still be problematic in its results. Yes, you've been explicit about the operation, but still... Scoping it down to specific optimizations (contraction/reassociation/auto reciprocals) may be useful, but could be done manually or as a lint. (1/2)
@rygorous @BartWronski @TomF @msinilo @aras That way it's more explicit for the reader. Some others (approximations) could be done through explicit function calls; however, others (no signed zeros, assuming no NaN/Inf) feel to me like they'd be appropriate for scope blocks. (2/2)
@JasperBekkers @BartWronski @TomF @msinilo @aras I was assuming you would specify what was allowed, fast math is too big an umbrella anyhow.
@rygorous @BartWronski @TomF @msinilo @aras My point exactly, and I'm also trying to figure out a way to work with it; fast math's usefulness is in large part as a "code is slow, please make it fast with minimal investment on my part" kind of tool. Giving up some of that convenience might pave the way to also trading away some of its downsides.
@JasperBekkers @rygorous @BartWronski @msinilo @aras Hence my suggestion of "compiler-guided optimisations". It suggests stuff, you decide whether or not to add those annotations.
@JasperBekkers @rygorous @BartWronski @msinilo @aras We sort of have some of that already with the compiler saying "hey did you actually mean to use a double here?"
@TomF @rygorous @BartWronski @msinilo @aras I see where you're coming from, sort of, on a high level, but I'm not sure how it would actually turn out in practice. E.g. for reassociation, reciprocals or contraction, maybe it can suggest a way to rewrite the equations for you, but beyond that it feels like this quickly falls apart. Would you want the compiler to suggest approximations for functions whose inputs it likely can't know? (1/2)
@TomF @rygorous @BartWronski @msinilo @aras Can it know that, if you tell it "please assume no NaNs for this code", things will get significantly faster? How would you prevent floods of false positives or massive compiler spam? (2/2)
@JasperBekkers @rygorous @BartWronski @msinilo @aras I'm suggesting that anything the compiler would normally decide to do for you with fastmath, it instead suggests adding an annotation to allow. That's all.
@BartWronski @JasperBekkers @msinilo @aras @rygorous The improvements it can make are really only interesting in tuned kernels anyway, so becoming write-only doesn't seem like that big a deal. I just don't see any alternative - it either becomes write-only in an explicit way, or it has Magical Transforms You Can't Predict - which is really just another way of being write-only.
@TomF @JasperBekkers @msinilo @aras @rygorous I referenced Halide not by accident - it's the only language I know of that makes writing hot kernels that are readable and rewritable somewhat possible. Not perfect, and it doesn't fully deliver what it promised - but sadly, I don't know of any other non-academic, shippable languages that even try to solve this problem.
@aras @msinilo absolutely not, absolutely not by default. OK on a case-by-case basis on pieces of code, but the practice of "it's 'fast', so let's turn it on brrrr" is super evil and causes coworkers or your future self pain. Flaky, nondeterministic tests, subtle or not-so-subtle bugs, changes of functionality upon compiler upgrade or change of platform.
@aras @msinilo but the compiler using fma instead of mul and add is not necessarily a /fp:fast or -ffast-math thing - the C/C++ standards allow that kind of contraction, and for example GCC will do it by default (without -ffast-math); it must be explicitly disabled with `-ffp-contract=off` if you don't want it
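A quick way to see this (my own snippet; assumes GCC or Clang on a target with FMA instructions, e.g. AArch64 or x86-64 with -mfma): the printed value of a plain multiply-add can change with nothing but the contraction setting, no -ffast-math involved.

```cpp
#include <cstdio>

int main() {
    volatile float a = 1.0f + 1.0f / 4096.0f; // 1 + 2^-12
    volatile float c = -1.0f;
    // Written as a plain multiply then add; the compiler may legally contract it into one FMA.
    float r = a * a + c;
    std::printf("%.9g\n", r); // 0.00048828125 if not contracted, ~0.000488340855 if contracted
}
```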
@msinilo Separately from fp:fast, it’s so very wrong that std::sort (most implementations on most platforms) can do out of bounds reads or writes on predicates that don’t conform to ordering requirements…
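A minimal sketch (my own illustration, not code from the thread) of two ways a float comparator can fail std::sort's strict-weak-ordering requirement - the same kind of inconsistency fp:fast can introduce when a comparator's key computation rounds differently between calls:

```cpp
#include <cassert>
#include <limits>

int main() {
    // 1) '<=' is not a strict ordering: comp(x, x) must be false, but here it is true.
    auto bad_leq = [](float a, float b) { return a <= b; };
    assert(bad_leq(1.0f, 1.0f)); // violates irreflexivity

    // 2) With NaN in the data, '<' stops being a strict *weak* ordering: NaN is "not less"
    //    than both 1.0f and 2.0f (so all three look mutually equivalent), yet 1.0f < 2.0f,
    //    so the induced equivalence is not transitive.
    float nan = std::numeric_limits<float>::quiet_NaN();
    assert(!(nan < 1.0f) && !(1.0f < nan));
    assert(!(nan < 2.0f) && !(2.0f < nan));
    assert(1.0f < 2.0f);
    // Feed either comparator (or NaN-containing data) to std::sort and many implementations
    // can walk past the ends of the range - hence "sometimes the sort will explode".
}
```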
@msinilo Correct - "fused" multiply-add does no intermediate rounding between the multiply and the add. Now, whether or not your compiler chooses to use the fused version - that's all compiler magic to figure out.
Also, dot products are fun even without fused multiply-add, because you don't know if it's doing (X*X+Y*Y)+(Z*Z) or (X*X)+(Y*Y+Z*Z) (or some other combo), which, again because of rounding differences, can produce different results.
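A minimal sketch (toy values of my own) of the same dot product grouped two ways; compile with contraction disabled (e.g. `-ffp-contract=off`) so only the grouping differs:

```cpp
#include <cstdio>

int main() {
    volatile float x = 10000.0f, y = 1.0f, z = 2.0f;
    float left  = (x * x + y * y) + z * z; // 100000000
    float right = x * x + (y * y + z * z); // 100000008
    std::printf("%.9g vs %.9g\n", left, right);
}
```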
@dominikg @msinilo Sometimes I also think this: https://cohost.org/tomforsyth/post/943070-a-matter-of-precisio