yarpc / yarpc-go

A message passing platform for Go
MIT License
404 stars 101 forks source link

Ability to set http headers without allocations #2119

Open rabbbit opened 2 years ago

rabbbit commented 2 years ago

Hello,

It seems not possible to add headers without allocations. Adding a header seems to be costing 6 (or 4) allocations.

Ideas

My initial thought was that the AddHeaders API is inefficient, but @jstroem below benchmarked this not be a problem.

First thought - No "AddHeader" API.

type ResponseWriter interface {
    io.Writer
        AddHeaders(Headers)

however, transport.Headers always allocated two maps

type Headers struct {
    items map[string]string
    originalItems map[string]string
}

yarpc canonical headers are lowercase, while http canonical headers are capitalized.

This leads to a very paradoxical situation where:

if I add lower-case http header

    resw.AddHeaders(transport.NewHeadersWithCapacity(1).With(
        "abc", "",
    ))

I get:

BenchmarkRespWriter-12           2514664               462.6 ns/op           672 B/op          4 allocs/op

This means two allocations for the maps in transport.Headers, 1 allocation for addToMetadata, and possibly 1 https://pkg.go.dev/net/http#CanonicalHeaderKey - not sure if the last one is included in my benchmark though.

But if I do correct HTTP canonical headers, which should be better

    resw.AddHeaders(transport.NewHeadersWithCapacity(1).With(
        "Abc", "",
    ))

I get

BenchmarkRespWriter-12           2390634               446.2 ns/op           678 B/op          6 allocs/op

so it's worse - yarpc will convert my correct header to lower-case first (extra alloc) and one another extra alloc that I don't understand.

Possible solutions

There are a few options - which one would you recommend?

Just add AddHeader

It would be relatively easy to add ResponseWriter.AddHeader(key, value string) - this would save me two allocations, leaving the remaining two.

AddRawHeader, or make AddSystemHeader public

In order to avoid Go triggering https://pkg.go.dev/net/http#CanonicalHeaderKey for me later on anyway, I would rather pass in HTTP canonical headers (Abc and not abc).

This means I would need something like ResponseWriter.AddRawHeader(key, value string) that wouldn't do transport.CanonicalizeHeaderKey for me. But I'm not sure about the other implications of doing this.

Anything else?

Please advise.

jstroem commented 2 years ago

To start the discussion on this I made a commit with benchmarks and a naive implementation of AddHeader for each protocol. The commit can be found here: https://github.com/yarpc/yarpc-go/commit/fc065104fed575717a14490b1e5db9f8d9306863.

Internally in yarpc the responsewriters might have a header they build up, so I decided to only benchmark the add-step and not the entire response.

Benchmark as follows:

cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
pkg: go.uber.org/yarpc/transport/grpc
Benchmark_ResponseWriter_AddHeaders
Benchmark_ResponseWriter_AddHeaders/lowercase
Benchmark_ResponseWriter_AddHeaders/lowercase-12                 7628671               156.5 ns/op            32 B/op          1 allocs/op
Benchmark_ResponseWriter_AddHeaders/titlecase
Benchmark_ResponseWriter_AddHeaders/titlecase-12                 6875102               172.7 ns/op            35 B/op          2 allocs/op
Benchmark_ResponseWriter_AddHeader
Benchmark_ResponseWriter_AddHeader/lowercase
Benchmark_ResponseWriter_AddHeader/lowercase-12                 21641186                55.02 ns/op           32 B/op          1 allocs/op
Benchmark_ResponseWriter_AddHeader/titlecase
Benchmark_ResponseWriter_AddHeader/titlecase-12                 15583413                77.54 ns/op           35 B/op          2 allocs/op
--------------------------------------------------------
pkg: go.uber.org/yarpc/transport/http
Benchmark_ResponseWriter_AddHeaders
Benchmark_ResponseWriter_AddHeaders/lowercase
Benchmark_ResponseWriter_AddHeaders/lowercase-12                 4067061               280.8 ns/op           115 B/op          2 allocs/op
Benchmark_ResponseWriter_AddHeaders/titlecase
Benchmark_ResponseWriter_AddHeaders/titlecase-12                 4016554               291.7 ns/op           119 B/op          3 allocs/op
Benchmark_ResponseWriter_AddHeader
Benchmark_ResponseWriter_AddHeader/lowercase
Benchmark_ResponseWriter_AddHeader/lowercase-12                  6927727               193.8 ns/op           127 B/op          2 allocs/op
Benchmark_ResponseWriter_AddHeader/titlecase
Benchmark_ResponseWriter_AddHeader/titlecase-12                  5692008               200.9 ns/op           128 B/op          3 allocs/op
--------------------------------------------------------
pkg: go.uber.org/yarpc/transport/tchannel
Benchmark_ResponseWriter_AddHeaders
Benchmark_ResponseWriter_AddHeaders/lowercase
Benchmark_ResponseWriter_AddHeaders/lowercase-12                 9356582               126.7 ns/op             0 B/op          0 allocs/op
Benchmark_ResponseWriter_AddHeaders/titlecase
Benchmark_ResponseWriter_AddHeaders/titlecase-12                 5833646               204.2 ns/op             9 B/op          3 allocs/op
Benchmark_ResponseWriter_AddHeader
Benchmark_ResponseWriter_AddHeader/lowercase
Benchmark_ResponseWriter_AddHeader/lowercase-12                 28751949                42.84 ns/op            0 B/op          0 allocs/op
Benchmark_ResponseWriter_AddHeader/titlecase
Benchmark_ResponseWriter_AddHeader/titlecase-12                 11325139                88.67 ns/op            6 B/op          2 allocs/op

This shows that adding a AddHeader won't help on allocations.

rabbbit commented 2 years ago

Hey,

Thanks for looking into this!

There seems to be a lot to unpack here:

So this would show a clear performance win. However, after a brief look at the code, I'm not sure if the benchmarks are valid - I think we just keep getting errors since we're adding duplicate headers (?).

Ultimately though - it should be possible to add headers with 0 allocations, right?

rabbbit commented 2 years ago

I looked at this a bit more, and tried to reproduce the "header creation doesn't allocate" situation.

And it appears:

The code below is somewhat convoluted - tried to reproduce what AddHeaders does with an intermediary struct containing two maps.

func newStore1(capacity int) *store1 {
    s := store1{
        store1: make(map[string]string, 1),
    }
    return &s
}

func (s *store1) with(k, v string) *store1 {
    s.store1[k] = v
    return s
}

func do1(out map[string]string, k, v string) {
    s := newStore1(1).with(k, v)
    out[k] = s.store1[k]
}

type store2 struct {
    store1 map[string]string
    store2 map[string]string
}

func newStore2(capacity int) *store2 {
    s := store2{
        store1: make(map[string]string, 1),
        store2: make(map[string]string, 1),
    }
    return &s
}

func (s *store2) with(k, v string) *store2 {
    s.store1[k] = v
    s.store2[k] = v
    return s
}

func do2(out map[string]string, k, v string) {
    s := newStore2(1).with(k, v)
    out[k] = s.store1[k]
}

func BenchmarkBaseline(b *testing.B) {
    out := make(map[string]string, 1)
    for i := 0; i < b.N; i++ {
        out["me"] = "hello"
    }
    assert.NotEqual(b, "", out["me"])
}

func BenchmarkStatic1(b *testing.B) {
    out := make(map[string]string, 1)
    for i := 0; i < b.N; i++ {
        do1(out, "me", "hello")
    }
    assert.NotEqual(b, "", out["me"])
}

func BenchmarkStatic2(b *testing.B) {
    out := make(map[string]string, 1)
    for i := 0; i < b.N; i++ {
        do2(out, "me", "hello")
    }
    assert.NotEqual(b, "", out["me"])
}

Results in:

BenchmarkBaseline
BenchmarkBaseline-12            127339767                9.543 ns/op           0 B/op          0 allocs/op
BenchmarkStatic1
BenchmarkStatic1-12             45420019                28.76 ns/op            0 B/op          0 allocs/op
BenchmarkStatic2
BenchmarkStatic2-12             25021603                47.36 ns/op            0 B/op          0 allocs/op

So significantly slower, and the cost grows for each internal map, as expected.

Adding //go:noinline to newStore{1,2} produced

BenchmarkBaseline
BenchmarkBaseline-12            126278928                9.278 ns/op           0 B/op          0 allocs/op
BenchmarkStatic1
BenchmarkStatic1-12              7363964               166.1 ns/op           344 B/op          3 allocs/op
BenchmarkStatic2
BenchmarkStatic2-12              3971288               289.1 ns/op           688 B/op          5 allocs/op

So, in summary:

jstroem commented 2 years ago

I found one allocation to save in transport/grpc by switching from multierr.Combine to multierr.Append also a few others by being smart on what strings to use:

These are the optimizations:

name                                                         old time/op    new time/op    delta
pkg:go.uber.org/yarpc/transport/grpc goos:darwin goarch:amd64
_ResponseWriter_AddHeaders/lowercase                           1.14µs ± 2%    1.12µs ±10%      ~     (p=0.251 n=9+10)
_ResponseWriter_AddHeaders/lowercase_no-value                   165ns ± 2%     103ns ±11%   -37.37%  (p=0.000 n=9+9)
_ResponseWriter_AddHeaders/titlecase                           1.21µs ± 3%    1.18µs ±11%      ~     (p=0.280 n=10+10)
_ResponseWriter_AddHeaders/titlecase_no-value                   192ns ± 4%     149ns ±15%   -22.46%  (p=0.000 n=10+10)
pkg:go.uber.org/yarpc/transport/http goos:darwin goarch:amd64
_ResponseWriter_AddHeaders/lowercase                            345ns ±14%     394ns ±10%   +14.29%  (p=0.000 n=10+10)
_ResponseWriter_AddHeaders/titlecase                            357ns ± 9%     326ns ±11%    -8.69%  (p=0.021 n=9+10)
_ResponseWriter_AddHeaders/titlecase_with_rpc-header_prefix     413ns ± 4%     312ns ±15%   -24.46%  (p=0.000 n=10+10)
pkg:go.uber.org/yarpc/transport/tchannel goos:darwin goarch:amd64
_ResponseWriter_AddHeaders/lowercase                            124ns ±10%     131ns ± 3%    +5.37%  (p=0.021 n=10+9)
_ResponseWriter_AddHeaders/titlecase                            209ns ± 9%     187ns ± 9%   -10.54%  (p=0.000 n=10+10)

name                                                         old alloc/op   new alloc/op   delta
pkg:go.uber.org/yarpc/transport/grpc goos:darwin goarch:amd64
_ResponseWriter_AddHeaders/lowercase                             264B ± 0%      232B ± 0%   -12.12%  (p=0.000 n=10+10)
_ResponseWriter_AddHeaders/lowercase_no-value                   32.0B ± 0%      0.0B       -100.00%  (p=0.000 n=10+10)
_ResponseWriter_AddHeaders/titlecase                             279B ± 0%      247B ± 0%   -11.47%  (p=0.000 n=10+10)
_ResponseWriter_AddHeaders/titlecase_no-value                   35.0B ± 0%      3.0B ± 0%   -91.43%  (p=0.000 n=10+10)
pkg:go.uber.org/yarpc/transport/http goos:darwin goarch:amd64
_ResponseWriter_AddHeaders/lowercase                             117B ±10%      122B ± 7%      ~     (p=0.090 n=10+10)
_ResponseWriter_AddHeaders/titlecase                             119B ± 6%      107B ± 7%   -10.17%  (p=0.000 n=9+10)
_ResponseWriter_AddHeaders/titlecase_with_rpc-header_prefix      172B ± 8%      102B ± 8%   -40.82%  (p=0.000 n=10+10)
pkg:go.uber.org/yarpc/transport/tchannel goos:darwin goarch:amd64
_ResponseWriter_AddHeaders/lowercase                            0.00B          0.00B           ~     (all equal)
_ResponseWriter_AddHeaders/titlecase                            9.00B ± 0%     3.00B ± 0%   -66.67%  (p=0.000 n=10+10)

name                                                         old allocs/op  new allocs/op  delta
pkg:go.uber.org/yarpc/transport/grpc goos:darwin goarch:amd64
_ResponseWriter_AddHeaders/lowercase                             4.00 ± 0%      3.00 ± 0%   -25.00%  (p=0.000 n=10+10)
_ResponseWriter_AddHeaders/lowercase_no-value                    1.00 ± 0%      0.00       -100.00%  (p=0.000 n=10+10)
_ResponseWriter_AddHeaders/titlecase                             5.00 ± 0%      4.00 ± 0%   -20.00%  (p=0.000 n=10+10)
_ResponseWriter_AddHeaders/titlecase_no-value                    2.00 ± 0%      1.00 ± 0%   -50.00%  (p=0.000 n=10+10)
pkg:go.uber.org/yarpc/transport/http goos:darwin goarch:amd64
_ResponseWriter_AddHeaders/lowercase                             2.00 ± 0%      2.00 ± 0%      ~     (all equal)
_ResponseWriter_AddHeaders/titlecase                             3.00 ± 0%      2.00 ± 0%   -33.33%  (p=0.000 n=10+10)
_ResponseWriter_AddHeaders/titlecase_with_rpc-header_prefix      3.00 ± 0%      1.00 ± 0%   -66.67%  (p=0.000 n=10+10)
pkg:go.uber.org/yarpc/transport/tchannel goos:darwin goarch:amd64
_ResponseWriter_AddHeaders/lowercase                             0.00           0.00           ~     (all equal)
_ResponseWriter_AddHeaders/titlecase                             3.00 ± 0%      1.00 ± 0%   -66.67%  (p=0.000 n=10+10)

And for the addHeader:

name                                                         time/op
pkg:go.uber.org/yarpc/transport/grpc goos:darwin goarch:amd64
_ResponseWriter_AddHeader/lowercase                          1.04µs ± 8%
_ResponseWriter_AddHeader/lowercase_no-value                 17.9ns ± 7%
_ResponseWriter_AddHeader/titlecase                          1.05µs ±10%
_ResponseWriter_AddHeader/titlecase_no-value                 39.3ns ±11%
pkg:go.uber.org/yarpc/transport/http goos:darwin goarch:amd64
_ResponseWriter_AddHeader/lowercase                           259ns ± 5%
_ResponseWriter_AddHeader/titlecase                           161ns ± 7%
_ResponseWriter_AddHeader/titlecase_with_rpc-header_prefix    114ns ±10%
pkg:go.uber.org/yarpc/transport/tchannel goos:darwin goarch:amd64
_ResponseWriter_AddHeader/lowercase                          47.6ns ±11%
_ResponseWriter_AddHeader/titlecase                           105ns ± 5%

name                                                         alloc/op
pkg:go.uber.org/yarpc/transport/grpc goos:darwin goarch:amd64
_ResponseWriter_AddHeader/lowercase                            232B ± 0%
_ResponseWriter_AddHeader/lowercase_no-value                  0.00B     
_ResponseWriter_AddHeader/titlecase                            247B ± 0%
_ResponseWriter_AddHeader/titlecase_no-value                  3.00B ± 0%
pkg:go.uber.org/yarpc/transport/http goos:darwin goarch:amd64
_ResponseWriter_AddHeader/lowercase                            121B ± 8%
_ResponseWriter_AddHeader/titlecase                            106B ± 8%
_ResponseWriter_AddHeader/titlecase_with_rpc-header_prefix    91.0B ±12%
pkg:go.uber.org/yarpc/transport/tchannel goos:darwin goarch:amd64
_ResponseWriter_AddHeader/lowercase                           0.00B     
_ResponseWriter_AddHeader/titlecase                           6.00B ± 0%

name                                                         allocs/op
pkg:go.uber.org/yarpc/transport/grpc goos:darwin goarch:amd64
_ResponseWriter_AddHeader/lowercase                            3.00 ± 0%
_ResponseWriter_AddHeader/lowercase_no-value                   0.00     
_ResponseWriter_AddHeader/titlecase                            4.00 ± 0%
_ResponseWriter_AddHeader/titlecase_no-value                   1.00 ± 0%
pkg:go.uber.org/yarpc/transport/http goos:darwin goarch:amd64
_ResponseWriter_AddHeader/lowercase                            2.00 ± 0%
_ResponseWriter_AddHeader/titlecase                            1.00 ± 0%
_ResponseWriter_AddHeader/titlecase_with_rpc-header_prefix     0.00     
pkg:go.uber.org/yarpc/transport/tchannel goos:darwin goarch:amd64
_ResponseWriter_AddHeader/lowercase                            0.00     
_ResponseWriter_AddHeader/titlecase                            2.00 ± 0%

So as you say, AddHeader is faster and AddHeaders also becomes a bit faster best-case because of fewer allocations.

jstroem commented 2 years ago

I've created a PR with the current optimizations.