yarpc / yarpc-go

A message passing platform for Go
MIT License

Improve stack usage of all the Handle functions #2220

Closed cdvr1993 closed 1 year ago

cdvr1993 commented 1 year ago

TL;DR: Reduce stack usage of the RPC handler function from 2520 bytes to 608 bytes.

UPDATE: We decided to keep the methods with value receivers, which increased stack usage a bit (~200 bytes) but still gave us ~1.9KB of savings. The only change kept was removing the [10]zap.Field from the call struct and allocating it directly in the method that used it.

Original description Currently one of our services has the following stack trace:

10920
200 google.golang.org/grpc.(*Server).serveStreams.func1.2
888 google.golang.org/grpc.(*Server).handleStream
1000 google.golang.org/grpc.(*Server).processStreamingRPC
728 go.uber.org/yarpc/transport/grpc.(*handler).handle
496 go.uber.org/yarpc/transport/grpc.(*handler).handleUnary
544 go.uber.org/yarpc/transport/grpc.(*handler).handleUnaryBeforeErrorConversion
256 go.uber.org/yarpc/transport/grpc.(*handler).callUnary
544 go.uber.org/yarpc/api/transport.InvokeUnaryHandler
104 go.uber.org/yarpc/api/middleware.unaryHandlerWithMiddleware.Handle
160 go.uber.org/yarpc/internal/inboundmiddleware.unaryChain.Handle
136 go.uber.org/yarpc/internal/inboundmiddleware.unaryChainExec.Handle
2520 go.uber.org/yarpc/internal/observability.(*Middleware).Handle                  # We're fixing this one
136 go.uber.org/yarpc/internal/inboundmiddleware.unaryChainExec.Handle
... (omitting for privacy)

We noticed that the Handle function consumes 2.5KB of stack. Although stack allocation is preferred over heap allocation, >2KB per frame seems like a problem: it can repeatedly trigger stack growth, and it requires more memory to hold.

The problem was with the call struct:

type call struct {
    edge    *edge
    extract ContextExtractor
    fields  [10]zapcore.Field

    started   time.Time
    ctx       context.Context
    req       *transport.Request
    rpcType   transport.Type
    direction directionName

    levels *levels
}

Calling unsafe.Sizeof(call{}) shows that it consumes 736 bytes. In this case the compiler reserved stack space for three copies of call:

  1. The local variable in Handle().
  2. The return value from call{}.
  3. A copy made when passing call to a defer or calling any of its value-receiver methods.

This looks somewhat inefficient, so we decided to fix it so that only one copy lives at the Handle() level. The required changes were:

  1. All methods on "call" must use pointer receivers so that method calls do not copy the struct onto the stack.
  2. The begin() function receives a pointer to the stack-allocated call{} from Handle(), eliminating the copy for the return value.
  3. Remove the [10]zap.Field from call{}. These fields were stored in a longer-lived buffer, which caused the stack-allocated call in Handle() to escape to the heap.
  4. Apply the same changes to all Handle() functions (stream, unary, etc.).

After these changes, the difference is visible in the assembly (the SUBQ immediate at function entry is the reserved stack frame size):

From

TEXT go.uber.org/yarpc/internal/observability.(*Middleware).Handle(SB) gofile../home/user/go/go-no-monorepo/src/go.uber.org/yarpc/internal/observability/middleware.go
  middleware.go:183     0x62dd4                 4881ec88090000                  SUBQ $0x988, SP
--
TEXT go.uber.org/yarpc/internal/observability.(*Middleware).HandleOneway(SB) gofile../home/user/go/go-no-monorepo/src/go.uber.org/yarpc/internal/observability/middleware.go
  middleware.go:235     0x6374a                 4881ec680c0000                  SUBQ $0xc68, SP
--
TEXT go.uber.org/yarpc/internal/observability.(*Middleware).HandleStream(SB) gofile../home/user/go/go-no-monorepo/src/go.uber.org/yarpc/internal/observability/middleware.go
  middleware.go:255     0x63f15                 4881ec38090000                  SUBQ $0x938, SP

To

TEXT go.uber.org/yarpc/internal/observability.(*Middleware).Handle(SB) gofile../home/user/go/go-no-monorepo/src/go.uber.org/yarpc/internal/observability/middleware.go
  middleware.go:183     0x5c728                 4881ec50010000                  SUBQ $0x150, SP
--
TEXT go.uber.org/yarpc/internal/observability.(*Middleware).HandleOneway(SB) gofile../home/user/go/go-no-monorepo/src/go.uber.org/yarpc/internal/observability/middleware.go
  middleware.go:241     0x5cf29                 4881ec50010000                  SUBQ $0x150, SP
--
TEXT go.uber.org/yarpc/internal/observability.(*Middleware).HandleStream(SB) gofile../home/user/go/go-no-monorepo/src/go.uber.org/yarpc/internal/observability/middleware.go
  middleware.go:263     0x5d553                 4881ec28010000                  SUBQ $0x128, SP

Finally our application reports:

8824
200 google.golang.org/grpc.(*Server).serveStreams.func1.2
888 google.golang.org/grpc.(*Server).handleStream
1000 google.golang.org/grpc.(*Server).processStreamingRPC
728 go.uber.org/yarpc/transport/grpc.(*handler).handle
496 go.uber.org/yarpc/transport/grpc.(*handler).handleUnary
544 go.uber.org/yarpc/transport/grpc.(*handler).handleUnaryBeforeErrorConversion
256 go.uber.org/yarpc/transport/grpc.(*handler).callUnary
544 go.uber.org/yarpc/api/transport.InvokeUnaryHandler
104 go.uber.org/yarpc/api/middleware.unaryHandlerWithMiddleware.Handle
160 go.uber.org/yarpc/internal/inboundmiddleware.unaryChain.Handle
136 go.uber.org/yarpc/internal/inboundmiddleware.unaryChainExec.Handle
424 go.uber.org/yarpc/internal/observability.(*Middleware).Handle
136 go.uber.org/yarpc/internal/inboundmiddleware.unaryChainExec.Handle
...

424 vs the original 2520.

rabbbit commented 1 year ago

Are there any benchmarks you could run?

cdvr1993 commented 1 year ago

Are there any benchmarks you could run?

Added the benchmark code. The result is here:

No assert

Previous

(screenshot: benchmark results, 2023-07-21)

Now

(screenshot: benchmark results, 2023-07-21)

With assert

Previous

(screenshot: benchmark results, 2023-07-21)

Now

(screenshot: benchmark results, 2023-07-21)

DheerendraRathor commented 1 year ago

@cdvr1993 I ran benchmarks with and without graph changes (keeping just call & zap changes) on my laptop.

Without graph changes:

$ go test -bench=BenchmarkMiddlewareHandle -benchmem
goos: darwin
goarch: amd64
pkg: go.uber.org/yarpc/internal/observability
cpu: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
BenchmarkMiddlewareHandle-8      2292955               517.4 ns/op           192 B/op          2 allocs/op
PASS
ok      go.uber.org/yarpc/internal/observability        2.240s

With graph changes (this PR as-is)

$ go test -bench=BenchmarkMiddlewareHandle -benchmem
goos: darwin
goarch: amd64
pkg: go.uber.org/yarpc/internal/observability
cpu: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
BenchmarkMiddlewareHandle-8      2327730               515.1 ns/op           192 B/op          2 allocs/op
PASS
ok      go.uber.org/yarpc/internal/observability        2.289s

There isn't a significant difference between the two.

cdvr1993 commented 1 year ago

@DheerendraRathor OK, I updated it. The benchmark is just as fast, as you said, although stack usage is greater:

  middleware.go:183     0x62d6f                 4881ec60020000                  SUBQ $0x260, SP

vs

  middleware.go:183     0x5c728                 4881ec50010000                  SUBQ $0x150, SP

If we want more conventional and readable code, removing only the fields array from call is the way to do it.

If we want to reduce the middleware's stack usage as much as possible, the original PR is best. I'll let you decide.

rabbbit commented 1 year ago

@cdvr1993 could you post the final results?

DheerendraRathor commented 1 year ago

@cdvr1993 I think there is some miscommunication here between comments. I have no objection to converting all call methods to pointer receivers unless there is a huge performance gain with value receivers, which I don't think is the case. This will also help with your stack usage reduction.

My objection was to the changes to the graph.begin API, which IMO reduced code readability and maintainability.

jronak commented 1 year ago

I can't seem to look at the history of the PR, but what is the difference in the benchmarking numbers between the current changes vs with pointer receivers?

Thanks for this @cdvr1993!

cdvr1993 commented 1 year ago

I can't seem to look at the history of the PR, but what is the difference in the benchmarking numbers between the current changes vs with pointer receivers?

Thanks for this @cdvr1993!

Negligible. In terms of performance they are pretty much the same. The only advantage would be further reduced stack usage, but most of the savings come from removing the zap fields.

cdvr1993 commented 1 year ago

@jronak 0x988 (2440 bytes) vs 0x260 (608 bytes), so ~1.8KB

cdvr1993 commented 1 year ago

@rabbbit last benchmark results

go test ./internal/observability/... -bench . -count=10
goos: linux
goarch: amd64
pkg: go.uber.org/yarpc/internal/observability
cpu: AMD EPYC 7B13
BenchmarkMiddlewareHandle-96             1429688               844.9 ns/op
BenchmarkMiddlewareHandle-96             1443886               828.4 ns/op
BenchmarkMiddlewareHandle-96             1442847               828.4 ns/op
BenchmarkMiddlewareHandle-96             1451000               821.1 ns/op
BenchmarkMiddlewareHandle-96             1442511               833.8 ns/op
BenchmarkMiddlewareHandle-96             1436607               834.3 ns/op
BenchmarkMiddlewareHandle-96             1462096               815.1 ns/op
BenchmarkMiddlewareHandle-96             1425666               816.2 ns/op
BenchmarkMiddlewareHandle-96             1450243               824.1 ns/op
BenchmarkMiddlewareHandle-96             1458702               815.5 ns/op
PASS
ok      go.uber.org/yarpc/internal/observability        20.577s

cdvr1993 commented 1 year ago

@rabbbit done.

rabbbit commented 1 year ago

Do you know why the allocations were reduced too?

cdvr1993 commented 1 year ago

@biosvs updated it. Thanks.