
Poor code generation for std.mem.eql #8689

Open ManDeJan opened 3 years ago

ManDeJan commented 3 years ago

I discovered this while benchmarking some parser code I was working on. As the title states, the current implementation of std.mem.eql is very naive:

https://github.com/ziglang/zig/blob/bf67a3fdc9efea7126058b21e687c24868bc1268/lib/std/mem.zig#L485-L493

It iterates over the slices byte by byte, which causes the compiler to generate a long chain of compare-and-jump instructions: https://godbolt.org/z/Mobabsq8a
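For reference, the linked lines amount to roughly the following shape (rewritten here in current Zig syntax; the linked source is authoritative):

```zig
// Roughly the shape of the linked std.mem.eql: an early length check,
// then an element-by-element comparison.
pub fn eqlNaive(comptime T: type, a: []const T, b: []const T) bool {
    if (a.len != b.len) return false;
    for (a, b) |a_elem, b_elem| {
        if (a_elem != b_elem) return false;
    }
    return true;
}
```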

One of the most frequent use cases (95%+ in the standard library) for std.mem.eql is comparing a string with a compile-time-known string. Since comparing memory is so common, especially in high-performance code like parsers (and in a bunch of other standard library functions), I think it is worth having a less naive, more optimized implementation of std.mem.eql. I identified a few points which could improve the implementation:

I was working on an implementation that did some of the above but haven't been able to finish it yet. There are many cases to account for, and some of the optimizations are only possible if the length of the comparison is known at compile time. I think having a dedicated @memcmp builtin akin to @memcpy and @memset might be a solution, though I'm not sure how much work adding that would be :-)
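To illustrate the direction I have in mind, here is a minimal sketch (my own, not a finished patch; eqlBytesWordwise is just a name for illustration) that compares byte slices one usize word at a time and falls back to bytes for the tail:

```zig
const std = @import("std");

// Sketch only: compare byte slices a usize word at a time, then
// byte-by-byte for the remaining tail. A real patch would also need to
// handle element types other than u8 and comptime-known lengths.
fn eqlBytesWordwise(a: []const u8, b: []const u8) bool {
    if (a.len != b.len) return false;
    if (a.ptr == b.ptr) return true;

    const word = @sizeOf(usize);
    var i: usize = 0;
    while (i + word <= a.len) : (i += word) {
        const wa: usize = @bitCast(a[i..][0..word].*);
        const wb: usize = @bitCast(b[i..][0..word].*);
        if (wa != wb) return false;
    }
    while (i < a.len) : (i += 1) {
        if (a[i] != b[i]) return false;
    }
    return true;
}

test "wordwise eql agrees with std.mem.eql" {
    try std.testing.expect(eqlBytesWordwise("hello world!", "hello world!"));
    try std.testing.expect(!eqlBytesWordwise("hello world!", "hello world?"));
    try std.testing.expect(eqlBytesWordwise("short", "short") == std.mem.eql(u8, "short", "short"));
}
```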

marler8997 commented 3 years ago

If you decide to work on this, I think you can leverage this repo to measure any performance improvements you may gain: https://github.com/ziglang/gotta-go-fast

Also, if there's one thing I've learned, it's that modern processors are very unpredictable when it comes to performance. Things you think might make something faster may actually make it slower. For example, a simple implementation, even if it does more work, may run faster if it's encoded in fewer instructions because of the icache. One of the slowest things a computer does nowadays is fetch memory from RAM, so any extra instructions and/or data you need to access, especially if they're not within the same cache line, can have drastic effects on performance.

In any case, that's why it's good for us to use benchmarks where we can measure whether something results in a performance improvement. In your case here, the compiler test suite might be a good benchmark for this.
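For a quick local sanity check before running the real benchmarks, something like the following (an assumed micro-benchmark harness, not part of gotta-go-fast or the compiler test suite) gives a rough nanoseconds-per-call number:

```zig
const std = @import("std");

// Assumed micro-benchmark sketch: time repeated std.mem.eql calls on a
// runtime copy of a string and report nanoseconds per call. Treat the
// number as a rough signal only; the real benchmarks should decide.
pub fn main() !void {
    const key = "a moderately long key used for comparison benchmarking";
    var buf: [key.len]u8 = undefined;
    @memcpy(&buf, key);

    const iterations: usize = 10_000_000;
    var hits: usize = 0;
    var timer = try std.time.Timer.start();
    for (0..iterations) |_| {
        if (std.mem.eql(u8, key, &buf)) hits += 1;
        std.mem.doNotOptimizeAway(&buf);
    }
    const ns = timer.read();
    std.debug.print("{d} hits, {d:.2} ns per eql\n", .{
        hits,
        @as(f64, @floatFromInt(ns)) / @as(f64, @floatFromInt(iterations)),
    });
}
```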

Jarred-Sumner commented 3 years ago

Relevant snippet from @fengb: https://github.com/fengb/fundude/blob/ad1783281625f2b7b05fbced8ca358d79ff51b7e/src/util.zig#L105-L110

cryptocode commented 2 years ago

I just ran into this and had to call out to __builtin_memcmp to get performance on par with the ported C code (after noticing std.mem.eql standing out prominently in the profiler).

In short, swapping std.StringHashMap for std.HashMap with a context whose eql implementation is a __builtin_memcmp wrapper increased performance by 20%.
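Roughly what that looks like, as a sketch (names like FastStringContext are mine, this assumes the current std.HashMap context API, and linking libc is required for the memcmp symbol):

```zig
const std = @import("std");

// Sketch of the workaround described above: a string hash map whose
// eql goes through libc memcmp instead of std.mem.eql.
extern "c" fn memcmp(a: [*]const u8, b: [*]const u8, n: usize) c_int;

const FastStringContext = struct {
    pub fn hash(_: @This(), s: []const u8) u64 {
        return std.hash.Wyhash.hash(0, s);
    }
    pub fn eql(_: @This(), a: []const u8, b: []const u8) bool {
        return a.len == b.len and memcmp(a.ptr, b.ptr, a.len) == 0;
    }
};

/// Drop-in stand-in for std.StringHashMap(V) using the custom context.
fn FastStringHashMap(comptime V: type) type {
    return std.HashMap([]const u8, V, FastStringContext, std.hash_map.default_max_load_percentage);
}
```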

wooster0 commented 2 years ago

FWIW, there's a memcmp.zig (https://github.com/ziglang/zig/blob/master/lib/compiler_rt/memcmp.zig) that appears to be completely unreferenced anywhere in the codebase (a dead file); it could perhaps serve as the basis for a @memcmp builtin.
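For reference, the contract such a routine implements is the usual C memcmp one; a minimal Zig rendering of that contract (not a copy of the compiler_rt file, just an illustration) looks like:

```zig
// Sketch of the C memcmp contract: return a negative, zero, or positive
// value depending on the first differing byte (compared as unsigned).
fn memcmpSketch(a: [*]const u8, b: [*]const u8, n: usize) i32 {
    var i: usize = 0;
    while (i < n) : (i += 1) {
        if (a[i] != b[i]) {
            return @as(i32, a[i]) - @as(i32, b[i]);
        }
    }
    return 0;
}
```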