Closed cdvr1993 closed 1 year ago
Are there any benchmarks you could run?
Added the benchmark code. The result is here:
@cdvr1993 I ran benchmarks with and without graph changes (keeping just call & zap changes) on my laptop.
Without graph changes:
$ go test -bench=BenchmarkMiddlewareHandle -benchmem
goos: darwin
goarch: amd64
pkg: go.uber.org/yarpc/internal/observability
cpu: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
BenchmarkMiddlewareHandle-8 2292955 517.4 ns/op 192 B/op 2 allocs/op
PASS
ok go.uber.org/yarpc/internal/observability 2.240s
With graph changes (this PR as-is)
$ go test -bench=BenchmarkMiddlewareHandle -benchmem
goos: darwin
goarch: amd64
pkg: go.uber.org/yarpc/internal/observability
cpu: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
BenchmarkMiddlewareHandle-8 2327730 515.1 ns/op 192 B/op 2 allocs/op
PASS
ok go.uber.org/yarpc/internal/observability 2.289s
There isn't a significant difference between the two.
@DheerendraRathor Ok, I updated it. It seems the benchmark is faster as you said. Although stack usage is greater:
middleware.go:183 0x62d6f 4881ec60020000 SUBQ $0x260, SP
vs
middleware.go:183 0x5c728 4881ec50010000 SUBQ $0x150, SP
If we want more conventional and readable code, removing only the fields from call is the way to do it.
If we want to reduce stack usage in the middleware function as much as possible, the original PR is the best option. I'll let you decide.
@cdvr1993 could you post the final results?
@cdvr1993 I think there is some miscommunication between the comments here. I have no objection to converting all call methods to pointer receivers, unless value receivers give a huge performance gain, which I don't think is the case. This will also help you reduce stack usage.
My objection was to the changes to the graph.begin API, which IMO reduced code readability and maintainability.
I can't seem to look at the history of the PR, but what is the difference in the benchmarking numbers between the current changes vs with pointer receivers?
Thanks for this @cdvr1993!
Negligible. In terms of performance they are pretty much the same. The only advantage would be reduced stack usage, but most of that usage is already gone after removing the zap fields.
@jronak 0x988 vs 0x260, so 1.8KB
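For reference, the hex immediates quoted in this thread (the operands of the `SUBQ $imm, SP` prologue instructions) convert to bytes as sketched below. The constant and function names are made up for illustration, and the labels on each value are a reading of the comments above rather than something the disassembly states.

```go
package main

import "fmt"

// Frame sizes taken from the SUBQ $imm, SP lines quoted in the thread.
// The names and the meaning attached to each value are assumptions.
const (
	originalFrame    = 0x988 // before this PR
	valueRecvFrame   = 0x260 // value receivers, zap fields removed from call
	pointerRecvFrame = 0x150 // earlier revision of the PR
)

// savedBytes returns how many bytes of stack frame are saved going from before to after.
func savedBytes(before, after int) int {
	return before - after
}

func main() {
	fmt.Println(originalFrame)    // 2440
	fmt.Println(valueRecvFrame)   // 608
	fmt.Println(pointerRecvFrame) // 336

	saved := savedBytes(originalFrame, valueRecvFrame)
	fmt.Println(saved) // 1832
	fmt.Printf("%.1f KB\n", float64(saved)/1024)
}
```

So 0x988 vs 0x260 is 2440 vs 608 bytes, a difference of 1832 bytes, which is where the roughly 1.8KB figure comes from.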
@rabbbit last benchmark results
go test ./internal/observability/... -bench . -count=10
goos: linux
goarch: amd64
pkg: go.uber.org/yarpc/internal/observability
cpu: AMD EPYC 7B13
BenchmarkMiddlewareHandle-96 1429688 844.9 ns/op
BenchmarkMiddlewareHandle-96 1443886 828.4 ns/op
BenchmarkMiddlewareHandle-96 1442847 828.4 ns/op
BenchmarkMiddlewareHandle-96 1451000 821.1 ns/op
BenchmarkMiddlewareHandle-96 1442511 833.8 ns/op
BenchmarkMiddlewareHandle-96 1436607 834.3 ns/op
BenchmarkMiddlewareHandle-96 1462096 815.1 ns/op
BenchmarkMiddlewareHandle-96 1425666 816.2 ns/op
BenchmarkMiddlewareHandle-96 1450243 824.1 ns/op
BenchmarkMiddlewareHandle-96 1458702 815.5 ns/op
PASS
ok go.uber.org/yarpc/internal/observability 20.577s
@rabbbit done.
Do you know why the allocations were reduced too?
@biosvs updated it. Thanks.
TL;DR: Reduce the stack usage of the RPC handler function from 2520 bytes to 608 bytes.
UPDATE: We decided to keep the methods with value receivers, which increased stack usage a bit (~200 bytes) but still gave us ~1.9KB savings. The only change made was to remove the [10]zap.Field from the call struct and allocate it directly in the method that uses it.
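A minimal sketch of that pattern, not the actual YARPC code: the `field` type is a hypothetical stand-in for `zap.Field`, and `callBefore`, `callAfter`, and `summary` are illustrative names. The scratch array moves out of the struct and into the one method that needs it, so copies of the struct made for value-receiver calls no longer carry it.

```go
package main

import (
	"fmt"
	"strings"
)

// field is a hypothetical stand-in for zap.Field.
type field struct {
	Key   string
	Value string
}

// callBefore carries its scratch buffer by value, so every copy of the
// struct (for example, one per value-receiver call) includes the array.
type callBefore struct {
	procedure string
	fields    [10]field
}

// callAfter drops the buffer; only the data that must outlive a single
// method call stays in the struct.
type callAfter struct {
	procedure string
}

// summary declares the scratch array locally, in the only method that uses it.
func (c callAfter) summary(outcome string) string {
	var buf [10]field
	fields := append(buf[:0],
		field{Key: "procedure", Value: c.procedure},
		field{Key: "outcome", Value: outcome},
	)
	parts := make([]string, 0, len(fields))
	for _, f := range fields {
		parts = append(parts, f.Key+"="+f.Value)
	}
	return strings.Join(parts, " ")
}

func main() {
	c := callAfter{procedure: "echo"}
	fmt.Println(c.summary("success")) // procedure=echo outcome=success
}
```

Whether the local array stays on the stack still depends on escape analysis in the real code; the sketch only shows that the struct itself gets smaller.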
Original description
Currently, one of our services has the following stack trace:
We noticed that the Handle function uses 2.5KB of stack. Although using the stack is generally preferable to using the heap, more than 2KB seems like a problem. It could hurt an application if it frequently triggers stack growth, and it requires more memory per goroutine to hold the larger stack.
The problem was with the call struct:
Calling unsafe.Sizeof(call{}) shows that the struct occupies 736 bytes. In this case the compiler decided to reserve stack space for 3 copies of call:
This looks somewhat inefficient so we decided to fix it, to have only one copy at the Handle() level. The required changes were:
After these changes we can quickly see the differences from the assembly:
From
To
Finally our application reports:
424 vs the original 2520.