yoanlcq / vek

Generic 2D-3D math swiss army knife for game engines, with SIMD support and focus on convenience.
https://docs.rs/vek
Apache License 2.0

Nested calls make vek unusable in debug builds for real-time applications. #53

Closed koalefant closed 4 years ago

koalefant commented 4 years ago

Hi, sorry for the overly negative tone, but I am profiling some code using vek and I can't believe my eyes:

 - 75.82% basic::sd_terrain
      - 39.16% basic::sd_line
         - 21.87% vek::vec::repr_c::vec2::Vec2<T>::dot
            - 19.60% vek::vec::repr_c::vec2::Vec2<T>::sum
               - 17.80% core::iter::traits::iterator::Iterator::sum
                  - <f32 as core::iter::traits::accum::Sum>::sum
                     - core::iter::traits::iterator::Iterator::fold
                        - 11.89% core::iter::traits::iterator::Iterator::try_fold
                           - 7.72% <vek::vec::repr_c::vec2::IntoIter<T> as core::iter::traits::iterator::Iterator>::next
                              - 2.72% <vek::vec::repr_c::vec2::Vec2<T> as core::ops::deref::Deref>::deref
                                 - vek::vec::repr_c::vec2::Vec2<T>::as_slice
                                    - 1.25% vek::vec::repr_c::vec2::Vec2<T>::as_ptr_priv
                                       - vek::vec::repr_c::vec2::Vec2<T>::is_packed
                                            0.56% core::ptr::const_ptr::<impl *const T>::wrapping_offset

Are there really 11 nested calls in a Vec2::dot product, which could be a.x * b.x + a.y * b.y?

yoanlcq commented 4 years ago

Hi,

It's fine. Put this way, I must agree that it is quite surprising; it looks a bit like those bloated Java enterprise frameworks with crazy call stacks for the simplest tasks.

It is also true that I didn't profile the library; I focused first on features, occasionally using the Godbolt compiler explorer on small snippets to confirm that the optimized release assembly was as expected.

First of all I ought to get rid of the assert!(self.is_packed()) call, which is nonsensical here. I don't even remember why I wrote it; it's only actually important for SIMD-enabled matrices, not vectors.

Second, and this is the tricky part, there appears to be overhead due to the Sum trait. This reminds me of the time when Rust's for loops using iterators were slower than loops over plain integer indices; then the Rust team did some magic, and iterators are now considered zero-cost.

The dot() function (and the whole library really) was written in a way that trusts that Rust's abstractions are "zero-cost" in that they would match, or outperform, equivalent hand-written code.
If you can see all of those nested calls in a profile, then they were not inlined, and performance is therefore not as intended (or at least, not as I intended).
I believe this is to be expected in debug mode, hopefully not in release mode.

It is definitely possible to avoid the Sum trait and hand-write the sum part. However, that would defeat the purpose of Sum, which was designed for this exact operation and is part of what the language advertises as "zero-cost"...
Of course, we'd rather it was, well, as fast as a.x * b.x + a.y * b.y. It's a pity we can't be both elegant and fast, is all.
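
For the record, the shape in question is roughly this (a sketch, not vek's exact source):

use vek::Vec2;

// The Sum-based version funnels each component through an iterator and a fold;
// in a debug build, every one of those steps is a real function call.
fn dot_via_sum(a: Vec2<f32>, b: Vec2<f32>) -> f32 {
    (a * b).into_iter().sum()
}

// The hand-written version is two multiplies and an add.
fn dot_handmade(a: Vec2<f32>, b: Vec2<f32>) -> f32 {
    a.x * b.x + a.y * b.y
}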

Which optimization settings are you using for profiling? Not to doubt your benchmark; I am just asking because, as mentioned, the calls do not seem to be inlined as intended.

yoanlcq commented 4 years ago

Sorry, I didn't see you mentioned Debug builds in the issue's title.

Unfortunately, from what I gather, a lot of libraries in Rust have this problem. I do not remember where I saw this; it was some post by tomaka on the Rust forums, or Twitter. The situation appears to be, basically, that debug builds are too slow for real-time use across much of the ecosystem, and people work around it rather than fix it.

I've never got to that stage myself, i.e. I've never encountered a situation where something in a Debug build was way too slow. But I'm not that surprised that it happens, given the state of the language.

I figure that if we want to "fix" the dot product, then we might as well "fix" the whole language, as in "stop using iterators and go back to plain old pointer offsets";

OR, if it is possible, tell Rust that vek should always be compiled in super-fast release mode; after all, it is performance-critical and definitely doesn't need debug information. I am not aware of a way to do that, but I'm certain it has been a requested feature.

yoanlcq commented 4 years ago

I think this is the way one should solve this: Optimizing dependencies

Just override the optimization level of vek. Thus you still get to test debug builds, but with vek running at the intended speed.
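
Concretely, in your application's Cargo.toml (the exact level is up to you):

[profile.dev.package.vek]
opt-level = 3 # optimize vek even in debug builds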

We can't (easily) get rid of the overhead incurred by nested calls in debug builds; dot is definitely not the only function suffering from this issue. Anything using traits and generics is potentially affected, and there's no real way around it: at opt-level 0 the compiler does essentially no inlining, so every one of these small calls stays an actual call. This is somewhat taken for granted because nobody ships debug builds.

Please do let me know if this solves your problem.

yoanlcq commented 4 years ago

I'll also mention this here: the repository includes an example to inspect the generated optimized assembly for a snippet (documented in the source): examples/inspect-asm

EDIT: Results with:

use vek::vec::repr_c::Vec2;
use vek::vec::repr_simd;

#[no_mangle] pub fn v2_f32_dot_reprc(a: Vec2<f32>, b: Vec2<f32>) -> f32 { a.dot(b) }
#[no_mangle] pub fn v2_f32_dot_simd(a: repr_simd::Vec2<f32>, b: repr_simd::Vec2<f32>) -> f32 { a.dot(b) }
#[no_mangle] pub fn v2_f32_dot_reprc_handmade(a: Vec2<f32>, b: Vec2<f32>) -> f32 { a.x * b.x + a.y * b.y }
v2_f32_dot_reprc:
    mulss   xmm0, xmm2
    mulss   xmm1, xmm3
    xorps   xmm2, xmm2
    addss   xmm0, xmm2
    addss   xmm0, xmm1
    ret

v2_f32_dot_simd:
    movss   xmm1, dword ptr [rcx]
    movss   xmm2, dword ptr [rcx + 4]
    mulss   xmm1, dword ptr [rdx]
    mulss   xmm2, dword ptr [rdx + 4]
    xorps   xmm0, xmm0
    addss   xmm0, xmm1
    addss   xmm0, xmm2
    ret

v2_f32_dot_reprc_handmade:
    mulss   xmm0, xmm2
    mulss   xmm1, xmm3
    addss   xmm0, xmm1
    ret

To be honest, it hurts to look at. Even in release mode, the "handmade" version beats the current dot implementation, which is unfortunate; I hoped they would generate the same code. (The extra xorps/addss pair in v2_f32_dot_reprc is the fold's 0.0 starting accumulator, which LLVM may not elide under strict floating-point semantics, since adding +0.0 is not a no-op for -0.0.)
I should take some time one day to really look into low-level optimizations like these, but there has never been a real incentive for me to do so.

Re-EDIT (sorry, I'm re-discovering stuff as I go): it looks like the compiler is messing with calling conventions. If we enforce the "C" calling convention like so:

#[no_mangle] pub extern "C" fn v2_f32_dot_reprc(a: Vec2<f32>, b: Vec2<f32>) -> f32 { a.dot(b) }
#[no_mangle] pub extern "C" fn v2_f32_dot_simd(a: repr_simd::Vec2<f32>, b: repr_simd::Vec2<f32>) -> f32 { a.dot(b) }
#[no_mangle] pub extern "C" fn v2_f32_dot_reprc_handmade(a: Vec2<f32>, b: Vec2<f32>) -> f32 { a.x * b.x + a.y * b.y }

then we get these fellows:

v2_f32_dot_reprc:
    movd    xmm0, ecx
    shr rcx, 32
    movd    xmm1, ecx
    movd    xmm2, edx
    mulss   xmm2, xmm0
    shr rdx, 32
    movd    xmm3, edx
    mulss   xmm3, xmm1
    xorps   xmm0, xmm0
    addss   xmm0, xmm2
    addss   xmm0, xmm3
    ret

v2_f32_dot_simd:
    movaps  xmm0, xmmword ptr [rcx]
    mulps   xmm0, xmmword ptr [rdx]
    xorps   xmm1, xmm1
    addss   xmm1, xmm0
    shufps  xmm0, xmm0, 229
    addss   xmm0, xmm1
    ret

v2_f32_dot_reprc_handmade:
    movd    xmm0, ecx
    shr rcx, 32
    movd    xmm1, ecx
    movd    xmm2, edx
    mulss   xmm2, xmm0
    shr rdx, 32
    movd    xmm0, edx
    mulss   xmm0, xmm1
    addss   xmm0, xmm2
    ret

What I make of it is that the generated assembly varies greatly depending on what the compiler can assume, and I think we can trust its decisions. It remains true that I should put some real effort into optimization.

koalefant commented 4 years ago

I am sorry to have caused so much concern; the issue was written out of frustration and intends no harm.

I have now measured the exact time difference between builds with my workload (if you are curious, it is this loop: https://github.com/koalefant/circle2d/blob/911399563bf1c893ff49c9d2a56e1865a927f1ea/examples/basic.rs#L537, which generates the background image in https://koalefant.github.io/circle2d/):

Debug - 4794.9 ms
Debug with optimized deps - 4743.1 ms
Release - 47.4 ms

Deps are optimized with these options:

[profile.dev.package."*"]
opt-level = 2
debug = 1

I realize that it may be a bit too much to demand comparable performance in Debug, but for interactive applications a 100x difference pretty much robs me of Debug builds. If I target 60 FPS and am CPU-bound, a 100x slowdown turns a 16.7 ms frame into roughly 1.7 s, i.e. well below 1 FPS. It stops being interactive and I can't use it for testing. You may argue that opt-level = 1 would improve the situation, but the longer compilation hurts iteration times. Currently I end up disabling debug symbols for debug builds just so my build times are lower. :disappointed:

I realize that situation is common for other libraries, but I hope we can do better than 100x.

yoanlcq commented 4 years ago

Seeing the code in context does indeed help.

I guess you have considered that option already, but I would try something like:

[profile.dev.package.vek] # Note, no wildcard
opt-level = 3
debug = false

# How this interacts with the existing wildcard entry, I'm not sure, but it can be figured out

Because the timings you obtained...

Debug - 4794.9 ms
Debug with optimized deps - 4743.1 ms
Release - 47.4 ms

... lead me to believe that "optimized deps" had little to no effect. Isn't there an option somewhere you could have missed? I suspect debug = 1 is the culprit; I understand you might need it for other dependencies, but it's really not needed for vek (or any "simple enough" math library, I guess).

You may argue that opt-level = 1 would improve the situation, but the longer compilation hurts iteration times

I believe you meant opt-level = 3? Either way, I am surprised this would lead to longer compilations; I mean, compilations other than the first one after a settings change. Once a crate is compiled for a project, it is not recompiled every time, or is it?

I realize that situation is common for other libraries, but I hope we can do better than 100x.

I would suggest the following: try using your own vec2_dot() function which really does just a.x * b.x + a.y * b.y (just for profiling your main bottleneck, not the whole project); see the sketch below. Then profile again and see if there are any other major bottlenecks. From intuition, I would suspect there are.
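
Something like this, as a throwaway helper purely for the measurement (note that #[inline(always)] is honored even at opt-level 0):

// Hypothetical helper: the same math as Vec2::dot, with no trait machinery in the way.
#[inline(always)]
fn vec2_dot(a: vek::Vec2<f32>, b: vek::Vec2<f32>) -> f32 {
    a.x * b.x + a.y * b.y
}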

In the meantime, I'll see what I can do. Since replacing Sum with Add is a breaking change, I'm a bit reluctant to do it, but I guess it can't be helped.

I guess you could also clone vek locally, use it in your project, and try removing the assert!(self.is_packed()) lines, if any, in src/vec.rs. That one is an easy speedup at least, and it will be in the next version.
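
Pointing Cargo at the local clone is a one-line change, assuming the clone sits next to your project:

[dependencies]
vek = { path = "../vek" } # local, patched clone instead of the crates.io release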

yoanlcq commented 4 years ago

I have published version 0.11, which should fix your issue. This time, sum() literally expands to v.x + v.y + v.z + ... thanks to macro magic. This directly benefits dot(), which is implemented as (u * v).sum(), and in turn makes a lot of dependent functions faster.
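
Concretely, for Vec2 the generated code now amounts to something like this (a sketch of the expansion; the real code comes out of a macro covering all vector types):

// Inside vek (sketch, f32 case shown):
impl Vec2<f32> {
    pub fn sum(self) -> f32 { self.x + self.y }           // plain adds: no iterator, no fold
    pub fn dot(self, v: Self) -> f32 { (self * v).sum() } // same shape as before
}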

A lot of similar functions (anything that looks like a reduce operation) have benefitted from the same optimization. As a result, they are definitely as fast as possible (at least for repr_c; I make no guarantees that repr_simd is super clever and figures out the best CPU instruction to use).

Please let me know if that helped.

koalefant commented 4 years ago

Hi, thanks for the quick update! I have tried implementing manual dot/magnitude; here is what I got:

Debug:
vek-0.10.4: 4759.1 ms
vek-0.11.0: 1610.0 ms
manual: 781.1 ms

Release:
vek-0.10.4: 46.8 ms
vek-0.11.0: 43.3 ms
manual: 24.8 ms

A 3x improvement in 0.11.0 for Debug! Not bad!

[profile.dev.package.vek] # Note, no wildcard
opt-level = 3
debug = false

Does not seem to have a measurable effect. Perhaps inlined code gets monomorphized in my crate? Not sure.

yoanlcq commented 4 years ago

[profile.dev.package.vek] # Note, no wildcard
opt-level = 3
debug = false

Does not seem to have a measurable effect. Perhaps inlined code gets monomorphized in my crate? Not sure.

That would make sense... It's a bit of a pity in any case.

One thing that could still be done, I guess (but is less than desirable), is to split your own package into two crates: one being the "math core" (or whatever), which does all the intensive operations and does not really need debug information (you would build it with opt-level = 3), and the other being the "main package", which re-exports and uses the "math core", and which could be compiled with or without optimizations.

I could see that working, because the "math core" code instantiates vek's generics with concrete types, so there is no way the compiler could miss optimization opportunities there.
Then, if you ever do end up needing to debug the "math core", you can always temporarily change the build setting; but the intent is that it almost never happens.

Typically, the whole for loop in https://github.com/koalefant/circle2d/blob/911399563bf1c893ff49c9d2a56e1865a927f1ea/examples/basic.rs#L537, along with the functions it calls, would be moved into the "math core".
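
The profile override from earlier would then target only that crate (math-core being a hypothetical name here):

[profile.dev.package.math-core] # hypothetical name for your own "math core" crate
opt-level = 3
debug = false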

It's definitely a possibility. Where I work, the physics engine and similar math-intensive projects are always compiled in release mode; that means we can't debug them, but OTOH if we didn't do that, the games would simply be unplayable in debug mode. We've come to accept it as a fact of life, and that's not even Rust, it's C++.

In any case, I see no way to further optimize dot() and friends, except possibly adding the #[inline] attribute, which I'll try next.
There is also a way to optimize SIMD operations via the platform-intrinsics feature (which allows using simd_add and friends), which I haven't integrated yet; that is quite a big change in infrastructure, though.
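
For reference, it roughly looks as follows on nightly Rust (a sketch using the unstable repr_simd and platform-intrinsics features, not vek's actual code):

#![feature(repr_simd, platform_intrinsics)]

// A SIMD vector type; operations on it map to vector instructions.
#[repr(simd)]
#[derive(Copy, Clone)]
struct F32x2(f32, f32);

extern "platform-intrinsic" {
    fn simd_add<T>(x: T, y: T) -> T;
    fn simd_mul<T>(x: T, y: T) -> T;
}

fn mul_add(a: F32x2, b: F32x2, c: F32x2) -> F32x2 {
    // One vector mulps/addps pair instead of per-lane mulss/addss.
    unsafe { simd_add(simd_mul(a, b), c) }
}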

Next, you'll definitely have to consider other "standard" optimizations: caching results (and not recomputing what hasn't changed), accelerating signed-distance queries with a BVH (for instance), parallelizing work across multiple threads, or even moving to fragment/compute shaders.
These all look suitable for what you are doing, but I haven't looked that far into the provided code.

I mean, of course it's better when the math library is fast to begin with :) but do be aware of these options.

koalefant commented 4 years ago

I already do such splitting, but I always end up with code that uses a lot of math and still wants to be debuggable. The initialization code by itself is not a problem; it is more of an illustration of other performance degradation that is harder to demonstrate. Thank you for your time!

yoanlcq commented 4 years ago

Anytime. I'll close the issue now, but feel free to add to the discussion as you see fit.