zeebo / xxh3

XXH3 algorithm in Go
BSD 2-Clause "Simplified" License

fix: mark Hash arg as noescape #19

Closed vmihailenco closed 2 years ago

vmihailenco commented 2 years ago

Go code that uses unsafe.Pointer requires this hack, otherwise the []byte arg always escapes to the heap. cespare/xxhash avoids the problem by using asm with //go:noescape instead.

See https://github.com/golang/go/issues/23382 for an example from the stdlib.
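For reference, the hack is the usual noescape helper, along the lines of the one used inside the Go runtime. A minimal sketch (the exact code in this PR may differ):

import "unsafe"

// noescape hides a pointer from escape analysis. The uintptr round-trip and
// the XOR with 0 are no-ops at runtime, but they stop the compiler from
// tracking the pointer, so the caller's []byte no longer has to be
// heap-allocated. USE CAREFULLY: this deliberately lies to the compiler.
//go:nosplit
func noescape(p unsafe.Pointer) unsafe.Pointer {
    x := uintptr(p)
    return unsafe.Pointer(x ^ 0)
}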

benchmark                  old ns/op     new ns/op     delta
BenchmarkHashEscape-12     27.1          4.61          -83.03%

benchmark                  old allocs     new allocs     delta
BenchmarkHashEscape-12     1              0              -100.00%

benchmark                  old bytes     new bytes     delta
BenchmarkHashEscape-12     8             0             -100.00%
zeebo commented 2 years ago

Can you demonstrate with a test case or benchmark how this leaks the argument? Note that the Hash and HashString functions are carefully designed to be inlined by the compiler.

vmihailenco commented 2 years ago

I've added a benchmark. PTAL
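Roughly, an escape benchmark looks like the following. This is a sketch, not necessarily the exact code in the PR; sink is an illustrative package-level variable that keeps the result live:

import (
    "encoding/binary"
    "testing"
)

var sink uint64

func BenchmarkHashEscape(b *testing.B) {
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        buf := make([]byte, 8) // heap-allocated whenever the compiler thinks it escapes into Hash
        binary.BigEndian.PutUint64(buf, uint64(i))
        sink = Hash(buf)
    }
}

With the argument treated as non-escaping, buf stays on the stack and allocs/op drops to zero, matching the numbers above. The compiler's decision can also be inspected directly with go build -gcflags='-m', which reports which values escape to the heap.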

vmihailenco commented 2 years ago

How about adding it to HashSeed, Hash128 and Hash128Seed?

@zeebo what do you think?

vcabbage commented 2 years ago

To my understanding, this change is actually unsafe. It intentionally breaks the unsafe.Pointer rules, specifically rule 3, to hide a pointer. In https://go.dev/cl/86976/ the resulting pointer is only used for the copy check and is never dereferenced.

I'll note that the escape seems to be caused by the function indirection and can be avoided by restructuring as below, though it's unclear to me whether the indirection is itself an optimization.

func Hash(b []byte) uint64 {
    if len(b) <= 16 {
        return hashSmall(*(*str)(ptr(&b)))
    }
    return hashMed(*(*str)(ptr(&b)))
}
klauspost commented 2 years ago

It seems this is a no-win situation. With the current implementation the argument escapes; with @vcabbage's restructuring the call is not inlined, which penalizes byte slices.

What about going in another direction and merging hashSmall and hashMed? We already have a length check, and it has to be performed anyway, ending up with...

func Hash(b []byte) uint64 {
    return hashMed(*(*str)(ptr(&b)))
}

// HashString returns the hash of the string.
func HashString(s string) uint64 {
    return hashMed(*(*str)(ptr(&s)))
}

func hashMed(s str) (acc u64) {
    p, l := s.p, s.l

    switch {
    case l <= 16:
        switch {
        case l > 8:
...

This works fine to avoid the escape:

BenchmarkHashEscape-32      376313569            3.205 ns/op           0 B/op          0 allocs/op

So far so good. However, it appears to cause a slight regression for small payloads on my machine:

Before:
BenchmarkFixed64/8/default-32           423928652            2.781 ns/op    2876.96 MB/s

After:
BenchmarkFixed64/8/default-32           389967436            3.086 ns/op    2592.35 MB/s

Inspecting the asm doesn't reveal anything, to me at least. I tried aligning the jump target for the < 16 case, so I can't really explain it. It could also just be my specific processor.

zeebo commented 2 years ago

I'm playing around with removing the call through the function pointer as well. One thing I'm going to experiment with: in go1.18, switches can sometimes be compiled as jump tables. Maybe some of the losses can be recouped there.

vmihailenco commented 2 years ago

It's intentionally breaking the unsafe.Pointer rules

My understanding is that this change can't interfere with the unsafe rules, because it does not change the generated code in this package. It only hacks escape analysis on the calling side of the Hash func so that the passed arg is not allocated on the heap. If the arg is not used/referenced by xxh3 after the Hash function exits, this change should be safe.

vcabbage commented 2 years ago

I'm not an authority on this, so take my thoughts with the appropriate grain of salt 🙂

My understanding is that this change can't interfere with unsafe rules, because it does not change the generated code in this package.

The noescape function doesn't generate any code directly, but changing escape analysis can change the generated code. Heap allocating or not is part of the code.

It only hacks escape analysis on the calling side of Hash func to not allocate the passed arg on the heap. If the arg is not used/referenced by xxh3 after Hash function exits, this change should be safe.

If there were no more references to b in the calling code, could it end up being collected while the hash function is executing? I don't know with any certainty, but with the unsafe rules broken it wouldn't surprise me.


Apparently the usage in strings.Builder has caused an issue. I imagine the experienced folks who used it in strings.Builder weren't expecting that, which highlights that the USE CAREFULLY! comment shouldn't be overlooked.

At the very least, if it's possible to avoid the allocation without using such a tricky function that seems preferable.

zeebo commented 2 years ago

So I experimented quite a bit with shuffling the functions around to avoid the call through the pointer, and here's a table summarizing the results:

size | 64 | 64 seed | 128 | 128 seed
---- | -- | ------- | --- | --------
0 | -6.79% | ~ | -7.37% | ~
1 | +16.13% | +6.59% | +6.16% | +0.13%
2 | +16.51% | +6.26% | +6.02% | +5.19%
3 | +20.22% | +7.20% | +8.84% | +2.49%
4 | +8.36% | +5.62% | +8.23% | +2.47%
8 | +8.63% | +5.36% | +8.18% | +2.50%
9 | +7.67% | +3.54% | +5.13% | +6.04%
16 | +7.70% | +4.29% | +5.07% | +5.99%
17 | -14.78% | -5.22% | -2.67% | -2.50%
32 | -14.78% | -5.12% | -4.10% | -3.07%
33 | -7.71% | -0.28% | -3.61% | -5.13%
64 | -7.66% | ~ | -3.64% | -4.94%
65 | -1.84% | +1.93% | -3.36% | -3.98%
96 | -2.06% | +2.09% | -4.21% | -4.14%
97 | -1.11% | +1.99% | -3.66% | -7.57%
128 | -1.39% | +1.98% | -5.44% | -5.47%
129 | -7.96% | +2.09% | -6.50% | -7.87%
240 | -7.21% | -1.12% | -5.80% | -6.59%
241 | -10.54% | -10.31% | -11.72% | -11.94%
241-AVX2 | -5.55% | -9.49% | -5.37% | -11.21%
241-SSE2 | -5.63% | -9.81% | -4.93% | -9.89%
512 | -14.08% | -11.17% | -12.21% | -10.57%
512-AVX2 | -4.99% | -9.42% | -5.16% | -8.47%
512-SSE2 | -4.51% | -8.58% | -3.14% | -6.90%
1024 | -13.21% | -10.98% | -13.44% | -11.05%
1024-AVX2 | -3.41% | -8.51% | -2.62% | -7.92%
1024-SSE2 | -3.06% | -8.32% | -2.60% | -5.57%
8192 | -9.90% | -10.41% | -12.90% | -11.01%
8192-AVX2 | -1.96% | -2.78% | -0.67% | -1.68%
8192-SSE2 | -0.48% | -1.92% | -0.93% | -1.86%
102400 | -5.89% | -6.81% | -10.05% | -6.98%
102400-AVX2 | -1.74% | -1.95% | -1.59% | -0.87%
102400-SSE2 | -3.77% | -0.62% | +1.17% | -0.59%
1024000 | -5.62% | -6.46% | -8.96% | -5.89%
1024000-AVX2 | -1.41% | -1.55% | +0.96% | +1.23%
1024000-SSE2 | -3.07% | -1.22% | +3.10% | +2.82%
10240000 | -4.24% | -6.19% | -9.74% | -7.33%
10240000-AVX2 | -1.86% | +6.80% | -5.89% | ~
10240000-SSE2 | ~ | ~ | -3.93% | -4.88%
102400000 | -4.29% | -5.04% | -10.29% | -6.98%
102400000-AVX2 | -8.43% | -11.67% | ~ | -5.40%
102400000-SSE2 | -3.56% | -2.71% | -5.17% | -5.29%

There's a lot to like about that: sizes above 16 are pretty much entirely wins, sometimes on the order of 15%. On the other hand, sizes smaller than 16 are sometimes worse. But while the percentages are large, in absolute terms they account for less than 1ns. For example, the size-3 case that regressed by 20% went from 2.8ns/op to 3.35ns/op.

So I think this is acceptable, and I'm gonna go with it rather than the "noescape" trick. As noted above, the output code being the same doesn't mean the runtime metadata the garbage collector uses is the same, and the fact that the trick has been implicated in invalid GC state makes me even more wary.

Thanks for noticing this and providing the test case to demonstrate it.

vmihailenco commented 2 years ago

Thanks for fixing this.

Hard to say if this would help with performance, but I wonder why you convert everything to str instead of using []byte as is? That would remove unsafe entirely for the common case of hashing bytes.

For strings, you could use unsafe to cast string to []byte.

zeebo commented 2 years ago

The code used to pass two parameters: the data pointer and the length. It was changed to pass the str struct because that made the inliner happy. Now that the bodies of the functions are simpler again, it could go back to doing more "normal" unsafe stuff, but there is no difference.

It would be incorrect to use unsafe to interpret a string as a []byte. A []byte is larger than a string: it has a data pointer, a length, and a capacity, whereas a string only has the data pointer and the length.

str has the same memory layout as a string and the same memory layout as the first two fields of a []byte, so unsafe casting both to the same thing allows sharing the same code to process both.
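To make the layout argument concrete, a sketch (the helper names and exact field types here are illustrative):

// str mirrors the two-word string header. It also matches the first two
// words of a slice header, which has an extra third word for the capacity.
type str struct {
    p unsafe.Pointer // data pointer
    l uint           // length
}

// Both conversions reinterpret only the two words that string and []byte
// have in common, so the same hashing code can serve both entry points.
func strFromString(s string) str { return *(*str)(unsafe.Pointer(&s)) }
func strFromBytes(b []byte) str  { return *(*str)(unsafe.Pointer(&b)) }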

vmihailenco commented 2 years ago

Now that the body of the functions are simpler again

I meant to use []byte and restore the optimization for small sizes: use []byte, no unsafe, and no indirection.

vmihailenco commented 2 years ago

It would be incorrect to use unsafe to interpret a string as a []byte

Every second library has a function to cast a string to a []byte using unsafe. Perhaps you mean slower rather than incorrect.

// Bytes converts a string to a byte slice without copying.
// The anonymous struct mirrors the slice header: the embedded string supplies
// the data pointer and the length, and Cap supplies the capacity word.
func Bytes(s string) []byte {
    return *(*[]byte)(unsafe.Pointer(
        &struct {
            string
            Cap int
        }{s, len(s)},
    ))
}
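For what it's worth, newer Go versions expose a supported way to build the same view without hand-rolling the header (unsafe.Slice needs Go 1.17, unsafe.StringData needs Go 1.20):

// Bytes converts a string to a byte slice without copying. The result aliases
// the string's read-only data, so it must not be modified.
func Bytes(s string) []byte {
    return unsafe.Slice(unsafe.StringData(s), len(s))
}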
zeebo commented 2 years ago

I don't understand. The main performance loss is because the check for smaller/larger than 16 bytes is no longer inlined. It was inlined because of the way it called through a function pointer. None of that has to do with unsafe, and I don't see how keeping the value as a []byte would help.

I meant doing *(*[]byte)(unsafe.Pointer(&someString)) is incorrect. I see no benefit to using a []byte as opposed to a string in this case because the function does not mutate it and does not need to pass it to something that takes a []byte.
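To spell out why that cast is incorrect, a deliberately wrong snippet (for illustration only, do not use):

s := "hello"
// The slice header is three words (data, len, cap), but only two words exist
// at &s, so the cap field is read from whatever memory happens to follow the
// string header. cap(b) is garbage and may even be smaller than len(b).
b := *(*[]byte)(unsafe.Pointer(&s))
_ = b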