yoanlcq / vek

Generic 2D-3D math swiss army knife for game engines, with SIMD support and focus on convenience.
https://docs.rs/vek
Apache License 2.0
282 stars 32 forks source link

Add a feature that weakens the `MulAdd` bound for matrix multiplication #69

Open zesterer opened 3 years ago

zesterer commented 3 years ago

MulAdd is an internal trait but may be implemented in terms of traits in core. The ability to relax this bound would allow Mat4 to work with non-standard numeric types.

yoanlcq commented 3 years ago

That reminds me of something. Very recently, for a reason I don't remember, I ended up reading the definition of FMA on Wikipedia, and I realized something important :
Strictly speaking, mul_add(a, b, c) is not the same as a * b + c. I used to believe that this was mostly a matter of performance, but in fact mul_add provides an extra guarantee about the precision of the result which is enough to treat it as a different operation.

The important part I realized is, when a user calls mul_add(), they expect the guarantees of a real FMA operation (at the hardware-level); because it they didn't, they would have written a * b + c as usual.

So I need to check what core's MulAdd opinion is on this, but if my assumption is correct, MulAdd does not make sense for types which do not have hardware support for FMA.

I'm not fundamentally against a generic MulAdd, but I realized it may become a matter of "are we misinforming the user". I'll look into this.

zesterer commented 3 years ago

Perhaps there could be a convenience type like struct MAdd<T>(T); that adds a MulAdd implementation in terms of Mul and Add to allow use with non-standard types?

yoanlcq commented 3 years ago

Btw, since the fix for #73 that I've published just now, vek just re-export's num-trait's MulAdd instead. I think it's for the better...

Both std and num-traits seem to agree on the following :

Computes (self * a) + b with only one rounding error, yielding a more accurate result than an unfused multiply-add. Using mul_add may be more performant than an unfused multiply-add if the target architecture has a dedicated fma CPU instruction

According to this, it looks like they guarantee the only one rounding error part, what they don't guarantee is acceleration.

I didn't take the time to dig into intrinsics::fmaf64() to see what it really does, but as far as I can infer from this documentation (and that's probably what we should stick to), they would be willing to have a software implementation of FMA if the CPU doesn't have it; and that implementation is probably not a * b + c since it may cause two rounding errors.

Though, looking back at your initial proposal, I wonder : is it only Mat4 that suffers from the issue?

Also, if the goal if for MulAdd to work with non-standard types, then I would suggest poking at the authors of these types to make them implement num-trait's MulAdd, and it should just work, right ?

zesterer commented 3 years ago

Thanks for making this switch! The reason I opened the issue is that I'm working on a platform without any real acceleration hardware or intrinsics at all, to falling back to a soft impl seems perfectly reasonable to me. I think this will now work fine.