Note that allocating buffers like `make([]byte, 8)` will end up on the stack unless they escape into the heap, e.g. by being returned or being fed to a generic `io.Reader.Read`. I imagine part of the slow-down here is caused by some of the lines you're replacing not being allocations in the first place.
Given that there are so few allocations anyways, I'm somewhat skeptical of this. Have you run any pprof profiles with `go tool pprof -alloc_objects` to verify?
> Note that allocating buffers like `make([]byte, 8)` will end up on the stack unless they escape into the heap, e.g. by being returned or being fed to a generic `io.Reader.Read`. I imagine part of the slow-down here is caused by some of the lines you're replacing not being allocations in the first place.
Some perhaps. But escape analysis shows that, in general, these scratch buffers escape:
~/wip/iand/cbor-gen/testing [master|✔]
10:16 $ go build -gcflags="-m"
... elided...
./cbor_gen.go:32:17: make([]byte, 9) escapes to heap
./cbor_gen.go:50:7: t does not escape
./cbor_gen.go:50:37: leaking param: r
./cbor_gen.go:53:21: &typegen.peeker{...} escapes to heap
./cbor_gen.go:54:17: make([]byte, 8) escapes to heap
./cbor_gen.go:60:8: func literal does not escape
./cbor_gen.go:82:20: ... argument does not escape
./cbor_gen.go:82:21: extra escapes to heap
./cbor_gen.go:90:18: make([]uint64, extra) escapes to heap
./cbor_gen.go:97:25: ... argument does not escape
./cbor_gen.go:101:25: ... argument does not escape
./cbor_gen.go:101:26: maj escapes to heap
./cbor_gen.go:112:7: leaking param content: t
./cbor_gen.go:112:37: leaking param: w
./cbor_gen.go:121:17: make([]byte, 9) escapes to heap
./cbor_gen.go:179:7: leaking param content: t
./cbor_gen.go:179:39: leaking param: r
./cbor_gen.go:182:21: &typegen.peeker{...} escapes to heap
./cbor_gen.go:183:17: make([]byte, 8) escapes to heap
... elided...
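The pattern is easy to reproduce in isolation. A minimal sketch (not taken from cbor-gen; the function names are illustrative) showing why a scratch buffer handed to an `io.Reader` escapes while a purely local one does not:

```go
package scratchdemo

import "io"

// readHeader hands its scratch buffer to an io.Reader through an interface
// call, so the compiler cannot prove the buffer stays local; `go build
// -gcflags=-m` typically reports "make([]byte, 8) escapes to heap" here.
func readHeader(r io.Reader) (byte, error) {
	scratch := make([]byte, 8)
	if _, err := io.ReadFull(r, scratch[:1]); err != nil {
		return 0, err
	}
	return scratch[0], nil
}

// sumLocal only uses its buffer locally, so escape analysis keeps it on the
// stack and reports "make([]byte, 8) does not escape".
func sumLocal() int {
	b := make([]byte, 8)
	sum := 0
	for i := range b {
		b[i] = byte(i)
		sum += int(b[i])
	}
	return sum
}
```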
> Given that there are so few allocations anyways, I'm somewhat skeptical of this. Have you run any pprof profiles with `go tool pprof -alloc_objects` to verify?
Yes. The general approach in this kind of optimization is to profile step by step, analyzing the top allocating functions, optimizing and then repeating.
I'm following this up today with profiling clients of this package to see the effects against real usage.
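The benchmarks driving that loop look roughly like this (a sketch, not code from this repo: `Foo` and `newTestFoo` are hypothetical, while `MarshalCBOR(io.Writer) error` is the shape of the generated method):

```go
package foo_test

import (
	"bytes"
	"testing"
)

// Run with:
//   go test -bench=MarshalFoo -benchmem -memprofile=mem.out
// then inspect the top allocators with:
//   go tool pprof -alloc_objects mem.out
func BenchmarkMarshalFoo(b *testing.B) {
	f := newTestFoo() // hypothetical helper that builds representative test data
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		var buf bytes.Buffer
		if err := f.MarshalCBOR(&buf); err != nil {
			b.Fatal(err)
		}
	}
}
```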
I compared this branch against the current master for a couple of downstream packages. That is, I updated each package to use the latest cbor-gen (485c4754fc51), re-ran the generation and benchmarked, then repeated for this branch. Note that for `go-hamt-ipld` I hand-edited the generated files to remove the scratch buffers and calls to `*Buf` functions, as per the comment in the generated file.
name old time/op new time/op delta
SerializeNode-8 15.1µs ±21% 13.1µs ±24% ~ (p=0.113 n=9+10)
name old alloc/op new alloc/op delta
SerializeNode-8 15.0kB ± 0% 14.1kB ± 0% -5.47% (p=0.000 n=9+10)
name old allocs/op new allocs/op delta
SerializeNode-8 112 ± 1% 61 ± 6% -45.89% (p=0.000 n=9+10)
name old time/op new time/op delta
AMTInsertBulk-8 18.5ms ±11% 19.1ms ±13% +2.98% (p=0.000 n=120+120)
AMTLoadAndInsert-8 128µs ±367% 137µs ±368% ~ (p=0.157 n=110+110)
NodesForHeight-8 0.24ns ± 1% 0.24ns ± 1% +0.06% (p=0.039 n=109+115)
name old gets/op new gets/op delta
AMTInsertBulk-8 60.8 ±452% 60.8 ±452% ~ (p=1.000 n=110+110)
AMTLoadAndInsert-8 2.98 ±97% 2.98 ±97% ~ (p=1.000 n=110+110)
name old puts/op new puts/op delta
AMTInsertBulk-8 60.8 ±452% 60.8 ±452% ~ (p=1.000 n=110+110)
AMTLoadAndInsert-8 3.00 ±100% 3.00 ±100% ~ (p=1.000 n=110+110)
name old alloc/op new alloc/op delta
AMTInsertBulk-8 6.37MB ± 5% 6.29MB ± 5% -1.33% (p=0.000 n=110+110)
AMTLoadAndInsert-8 334kB ±410% 328kB ±411% ~ (p=0.259 n=110+110)
NodesForHeight-8 0.00B 0.00B ~ (all equal)
name old allocs/op new allocs/op delta
AMTInsertBulk-8 92.0k ± 9% 81.9k ±10% -10.99% (p=0.000 n=110+110)
AMTLoadAndInsert-8 2.27k ±389% 1.91k ±386% ~ (p=0.169 n=110+110)
NodesForHeight-8 0.00 0.00 ~ (all equal)
Note: `go-amt-ipld` is on an old version of cbor-gen. I also benchmarked that version against current master and saw some performance regressions, but I didn't dig into the cause. The variability in the benchmarks looks really odd, especially the numbers of gets/puts for AMTInsertBulk, leading to a p-value of 1. The above benchmarks are based on current master.
More comparisons between current master and this pool change.
name old time/op new time/op delta
StateTreeSet-8 495ns ± 8% 512ns ±14% ~ (p=0.326 n=9+10)
StateTreeSetFlush-8 10.0ms ± 0% 11.7ms ± 1% +17.20% (p=0.008 n=5+5)
name old alloc/op new alloc/op delta
StateTreeSet-8 359B ± 8% 364B ± 4% ~ (p=0.197 n=9+9)
StateTreeSetFlush-8 7.35MB ± 0% 7.22MB ± 0% -1.74% (p=0.004 n=6+5)
name old allocs/op new allocs/op delta
StateTreeSet-8 5.00 ± 0% 5.00 ± 0% ~ (all equal)
StateTreeSetFlush-8 142k ± 0% 132k ± 0% -7.04% (p=0.000 n=6+5)
name old time/op new time/op delta
BlockHeaderMarshal-8 593ns ± 6% 834ns ± 1% +40.78% (p=0.000 n=10+10)
WinCounts-8 3.39µs ± 2% 3.36µs ± 2% ~ (p=0.068 n=10+9)
SerializeMessage-8 533ns ± 6% 590ns ± 3% +10.75% (p=0.000 n=10+8)
name old alloc/op new alloc/op delta
BlockHeaderMarshal-8 112B ± 0% 64B ± 0% -42.86% (p=0.000 n=10+10)
WinCounts-8 1.11kB ± 0% 1.11kB ± 0% ~ (all equal)
SerializeMessage-8 416B ± 0% 400B ± 0% -3.85% (p=0.000 n=10+10)
name old allocs/op new allocs/op delta
BlockHeaderMarshal-8 16.0 ± 0% 12.0 ± 0% -25.00% (p=0.000 n=10+10)
WinCounts-8 21.0 ± 0% 21.0 ± 0% ~ (all equal)
SerializeMessage-8 16.0 ± 0% 13.0 ± 0% -18.75% (p=0.000 n=10+10)
I'll take a further look to see if I can reduce the slowdown. It may be that the header buffer pool is not worth the CPU overhead, but the string buffer pool may well be.
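For reference, the header-buffer pooling under discussion has roughly the following shape (a minimal sketch, not the actual cbor-gen implementation; the function name is illustrative, and the string-buffer pool works along the same lines but has to copy the bytes out before returning the buffer):

```go
package cborpool

import (
	"encoding/binary"
	"io"
	"sync"
)

// headerPool recycles the small scratch buffers used to encode CBOR headers.
// It stores *[]byte rather than []byte so that Get/Put do not themselves
// allocate a fresh interface value for the slice header on every call.
var headerPool = sync.Pool{
	New: func() interface{} {
		b := make([]byte, 9) // 1 byte of major type/extra plus up to 8 bytes of value
		return &b
	},
}

// writeMajorTypeHeader writes a CBOR header using a pooled scratch buffer
// instead of allocating one per call. The 16- and 32-bit forms are omitted
// for brevity, so output for mid-sized values is valid but not minimal CBOR.
func writeMajorTypeHeader(w io.Writer, major byte, val uint64) error {
	bufp := headerPool.Get().(*[]byte)
	defer headerPool.Put(bufp)
	buf := *bufp

	if val < 24 {
		buf[0] = (major << 5) | byte(val)
		_, err := w.Write(buf[:1])
		return err
	}
	buf[0] = (major << 5) | 27 // additional info 27: 8-byte value follows
	binary.BigEndian.PutUint64(buf[1:9], val)
	_, err := w.Write(buf[:9])
	return err
}
```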
I'd really start by profiling the target program to see if this is a source of allocation issues.
> I'd really start by profiling the target program to see if this is a source of allocation issues.
Yes. This is what I already have from a full-history lily node. I've rebuilt with a version of `go-hamt-ipld` that uses the pools and am leaving it to run for a bit before taking a comparison profile. CBOR unmarshalling is right up there, so I believe the optimization is worth investing in.
ubuntu@lily-fullhistory:~$ go tool pprof lily/lily lily-master-allocs.pprof
File: lily
Build ID: 1e7d35986b67115e2a08957bf5e3873630e0f73b
Type: alloc_space
Time: Mar 21, 2022 at 4:21pm (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 54.24TB, 74.79% of 72.52TB total
Dropped 3633 nodes (cum <= 0.36TB)
Showing top 10 nodes out of 275
flat flat% sum% cum cum%
39.13TB 53.96% 53.96% 39.13TB 53.96% bytes.makeSlice
3.35TB 4.62% 58.58% 14.66TB 20.21% github.com/dgraph-io/badger/v2/table.(*Builder).addHelper
1.96TB 2.70% 61.28% 12.24TB 16.88% github.com/filecoin-project/go-hamt-ipld/v3.(*Node).UnmarshalCBOR
1.69TB 2.33% 63.61% 1.69TB 2.33% github.com/ipfs/go-cid.CidFromBytes
1.67TB 2.30% 65.91% 1.67TB 2.30% go.opencensus.io/stats/view.(*DistributionData).clone
1.41TB 1.95% 67.85% 5.78TB 7.97% github.com/dgraph-io/badger/v2.(*Txn).Get
1.39TB 1.92% 69.78% 1.39TB 1.92% github.com/whyrusleeping/cbor-gen.ReadByteArray
1.30TB 1.80% 71.57% 40.43TB 55.76% bytes.(*Buffer).grow
1.17TB 1.61% 73.19% 1.17TB 1.61% github.com/dgraph-io/badger/v2/table.(*Table).block
1.16TB 1.61% 74.79% 1.17TB 1.61% github.com/dgraph-io/badger/v2.(*DB).newTransaction
ubuntu@lily-fullhistory:~$ go tool pprof --alloc_objects lily/lily lily-master-allocs.pprof
File: lily
Build ID: 1e7d35986b67115e2a08957bf5e3873630e0f73b
Type: alloc_objects
Time: Mar 21, 2022 at 4:21pm (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 311306667274, 50.81% of 612656612901 total
Dropped 3596 nodes (cum <= 3063283064)
Showing top 10 nodes out of 312
flat flat% sum% cum cum%
48044314111 7.84% 7.84% 48044314111 7.84% github.com/dgraph-io/badger/v2.(*levelHandler).getTableForKey
39225411024 6.40% 14.24% 39225411024 6.40% github.com/ipfs/go-cid.CidFromBytes
37485368248 6.12% 20.36% 165872094588 27.07% github.com/filecoin-project/go-hamt-ipld/v3.(*Pointer).UnmarshalCBOR
34311629606 5.60% 25.96% 87380907218 14.26% github.com/whyrusleeping/cbor-gen.(*Deferred).UnmarshalCBOR
32029887703 5.23% 31.19% 32034798094 5.23% github.com/whyrusleeping/cbor-gen.ReadByteArray
31750037182 5.18% 36.37% 199192856458 32.51% github.com/filecoin-project/go-hamt-ipld/v3.(*Node).UnmarshalCBOR
27717356689 4.52% 40.90% 84075830794 13.72% github.com/filecoin-project/go-hamt-ipld/v3.(*KV).UnmarshalCBOR
22379296201 3.65% 44.55% 41252074314 6.73% bytes.(*Buffer).grow
19490588397 3.18% 47.73% 19490588397 3.18% bytes.NewBuffer (inline)
18872778113 3.08% 50.81% 18872778113 3.08% bytes.makeSlice
(note: this node is simply syncing the chain, not running any lily specific work)
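(Aside: heap profiles like these can be fetched from any running process that exposes the standard `net/http/pprof` endpoints; whether lily wires this up exactly as below isn't shown here, so treat the listener address as illustrative.)

```go
// Illustrative only: expose the standard pprof endpoints so heap profiles can
// be pulled from a running node, e.g. with
//   go tool pprof http://localhost:6060/debug/pprof/allocs
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on DefaultServeMux
)

func main() {
	log.Println(http.ListenAndServe("localhost:6060", nil))
}
```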
After running lily with the modified `go-hamt-ipld` for 30 minutes you can see that CBOR unmarshalling allocations have dropped by nearly half (`Deferred`, for example, is now responsible for 2.80% of all allocated objects, compared with 5.60% previously). I feel this profile is somewhat skewed because badger is still thrashing its cache, so I'll leave it overnight and resample tomorrow.
ubuntu@lily-fullhistory:~$ go tool pprof lily/lily-cbor-pool lily-pools-allocs.pprof
File: lily-cbor-pool
Build ID: f314a712be68792ce764d44a24e2361ffeeb1b08
Type: alloc_space
Time: Mar 21, 2022 at 4:47pm (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 42.93GB, 89.52% of 47.96GB total
Dropped 1437 nodes (cum <= 0.24GB)
Showing top 10 nodes out of 167
flat flat% sum% cum cum%
17.68GB 36.86% 36.86% 25.22GB 52.58% github.com/dgraph-io/badger/v2/pb.(*TableIndex).Unmarshal
7.54GB 15.71% 52.58% 7.54GB 15.71% github.com/dgraph-io/badger/v2/pb.(*BlockOffset).Unmarshal
5.95GB 12.41% 64.98% 5.95GB 12.41% encoding/json.(*decodeState).literalStore
5.93GB 12.36% 77.35% 5.93GB 12.36% github.com/dgraph-io/ristretto/z.(*Bloom).Size
3.49GB 7.29% 84.63% 3.49GB 7.29% bytes.makeSlice
0.61GB 1.27% 85.91% 0.61GB 1.27% github.com/filecoin-project/lotus/lib/backupds.(*Entry).UnmarshalCBOR
0.50GB 1.04% 86.94% 2.94GB 6.13% github.com/filecoin-project/go-hamt-ipld/v3.(*Node).UnmarshalCBOR
0.48GB 1.01% 87.95% 0.49GB 1.01% github.com/whyrusleeping/cbor-gen.ReadByteArray
0.43GB 0.9% 88.86% 0.43GB 0.9% github.com/ipfs/go-cid.CidFromBytes
0.32GB 0.67% 89.52% 1.41GB 2.94% github.com/dgraph-io/badger/v2.(*Txn).Get
ubuntu@lily-fullhistory:~$ go tool pprof --alloc_objects lily/lily-cbor-pool lily-pools-allocs.pprof
File: lily-cbor-pool
Build ID: f314a712be68792ce764d44a24e2361ffeeb1b08
Type: alloc_objects
Time: Mar 21, 2022 at 4:47pm (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 272095232, 78.76% of 345487341 total
Dropped 1413 nodes (cum <= 1727436)
Showing top 10 nodes out of 191
flat flat% sum% cum cum%
101155979 29.28% 29.28% 101155979 29.28% github.com/dgraph-io/badger/v2/pb.(*BlockOffset).Unmarshal
100562611 29.11% 58.39% 201718590 58.39% github.com/dgraph-io/badger/v2/pb.(*TableIndex).Unmarshal
11829392 3.42% 61.81% 11829392 3.42% github.com/dgraph-io/badger/v2.(*levelHandler).getTableForKey
9776194 2.83% 64.64% 9776194 2.83% github.com/ipfs/go-cid.CidFromBytes
9723821 2.81% 67.45% 9887663 2.86% github.com/whyrusleeping/cbor-gen.ReadByteArray
9688562 2.80% 70.26% 22619162 6.55% github.com/whyrusleeping/cbor-gen.(*Deferred).UnmarshalCBOR
9142424 2.65% 72.91% 37840905 10.95% github.com/filecoin-project/go-hamt-ipld/v3.(*Pointer).UnmarshalCBOR
7879392 2.28% 75.19% 46080747 13.34% github.com/filecoin-project/go-hamt-ipld/v3.(*Node).UnmarshalCBOR
7454212 2.16% 77.34% 7454212 2.16% github.com/filecoin-project/lotus/lib/backupds.(*Entry).UnmarshalCBOR
4882645 1.41% 78.76% 4882645 1.41% bytes.NewBuffer (inline)
Awesome! @whyrusleeping or @Kubuxu can you remember why we didn't do this before?
I was optimizing, though, just for minimizing alloc count and CPU cost. @iand what happens if we leave the buffers to get passed through all the `*Buf` methods, but use the pool to allocate the top-level scratch buffers? That was one change I was going to investigate making; I'm willing to bet that puts the CPU usage back in line with what we were seeing before, while still keeping overall GC load down.
Lily profile after 12 hours with the pool changes
ubuntu@lily-fullhistory:~$ go tool pprof lily/lily-cbor-pool lily-pools-allocs.pprof
File: lily-cbor-pool
Build ID: f314a712be68792ce764d44a24e2361ffeeb1b08
Type: alloc_space
Time: Mar 22, 2022 at 9:54am (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 740.40GB, 78.68% of 941.03GB total
Dropped 2669 nodes (cum <= 4.71GB)
Showing top 10 nodes out of 209
flat flat% sum% cum cum%
550.83GB 58.53% 58.53% 550.83GB 58.53% bytes.makeSlice
50.80GB 5.40% 63.93% 222.84GB 23.68% github.com/dgraph-io/badger/v2/table.(*Builder).addHelper
21.93GB 2.33% 66.26% 31.31GB 3.33% github.com/dgraph-io/badger/v2/pb.(*TableIndex).Unmarshal
19.31GB 2.05% 68.32% 19.31GB 2.05% github.com/ipfs/go-cid.CidFromBytes
18.32GB 1.95% 70.26% 115.30GB 12.25% github.com/filecoin-project/go-hamt-ipld/v3.(*Node).UnmarshalCBOR
17.57GB 1.87% 72.13% 75.74GB 8.05% github.com/dgraph-io/badger/v2.(*Txn).Get
16.37GB 1.74% 73.87% 16.37GB 1.74% github.com/dgraph-io/badger/v2/table.(*Table).block
15.44GB 1.64% 75.51% 15.45GB 1.64% github.com/whyrusleeping/cbor-gen.ReadByteArray
15.20GB 1.62% 77.13% 566.03GB 60.15% bytes.(*Buffer).grow
14.63GB 1.55% 78.68% 14.68GB 1.56% github.com/dgraph-io/badger/v2.(*DB).newTransaction
ubuntu@lily-fullhistory:~$ go tool pprof --alloc_objects lily/lily-cbor-pool lily-pools-allocs.pprof
File: lily-cbor-pool
Build ID: f314a712be68792ce764d44a24e2361ffeeb1b08
Type: alloc_objects
Time: Mar 22, 2022 at 9:54am (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 3417812671, 52.86% of 6466082967 total
Dropped 2646 nodes (cum <= 32330414)
Showing top 10 nodes out of 232
flat flat% sum% cum cum%
605703250 9.37% 9.37% 605703250 9.37% github.com/dgraph-io/badger/v2.(*levelHandler).getTableForKey
464724635 7.19% 16.55% 1123818522 17.38% github.com/whyrusleeping/cbor-gen.(*Deferred).UnmarshalCBOR
435888632 6.74% 23.30% 435888632 6.74% github.com/ipfs/go-cid.CidFromBytes
345108583 5.34% 28.63% 345446512 5.34% github.com/whyrusleeping/cbor-gen.ReadByteArray
337844730 5.22% 33.86% 1439777385 22.27% github.com/filecoin-project/go-hamt-ipld/v3.(*Pointer).UnmarshalCBOR
286784303 4.44% 38.29% 1740062413 26.91% github.com/filecoin-project/go-hamt-ipld/v3.(*Node).UnmarshalCBOR
255089795 3.95% 42.24% 457805603 7.08% bytes.(*Buffer).grow
244209190 3.78% 46.01% 244209190 3.78% bytes.NewBuffer (inline)
224220629 3.47% 49.48% 224220629 3.47% github.com/dgraph-io/badger/v2/table.(*Table).block
218238924 3.38% 52.86% 218238924 3.38% github.com/dgraph-io/badger/v2.(*DB).getMemTables
`(*Node).UnmarshalCBOR` is the root of the allocation tree for most CBOR operations. Without pools, it (or the functions it calls) is responsible for 32.51% of all allocations; with the pools that is down to 26.91%. That's a pretty good reduction considering that unmarshalling always has to allocate.
Added SVG versions of the alloc_objects profiles, which are surprisingly readable (attached: lily master, lily with pools).
> I was optimizing, though, just for minimizing alloc count and CPU cost. @iand what happens if we leave the buffers to get passed through all the `*Buf` methods, but use the pool to allocate the top-level scratch buffers? That was one change I was going to investigate making; I'm willing to bet that puts the CPU usage back in line with what we were seeing before, while still keeping overall GC load down.
This is what I tried originally: I had the generated files include a pool that the marshaling functions would use to create a buffer to pass to the cbor-gen library code.
I didn't profile it much because I felt it was quite messy. For example, you have to deal with the client generating more than one file in the same package, which means either holding some state so you only generate one pool, or generating a separate pool for each file. Then you have to make the generation functions aware of the scope so they know which pool to use.
I had that working, but then realised I could push the pools into the library and remove a load of duplicated code at the same time, leaving a cleaner CBOR API to work with. So I didn't spend much time profiling the first version. I'll explore it again though.
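For comparison, the variant being asked about would look roughly like this: the generated marshaler takes its top-level scratch buffer from a pool and threads it through `*Buf`-style helpers. This is a sketch only; everything here except `sync.Pool` is an illustrative name rather than cbor-gen's actual API.

```go
package cborsketch

import (
	"encoding/binary"
	"io"
	"sync"
)

// MyType stands in for a struct with generated CBOR methods.
type MyType struct {
	Value uint64
}

// scratchPool is owned by the generated code in this variant; the library's
// *Buf-style helpers simply encode into whatever buffer they are given.
var scratchPool = sync.Pool{
	New: func() interface{} { b := make([]byte, 9); return &b },
}

// writeMajorTypeHeaderBuf stands in for a *Buf helper that encodes a CBOR
// header into the caller-supplied scratch buffer. It always uses the 8-byte
// form for brevity: valid CBOR, just not the minimal encoding.
func writeMajorTypeHeaderBuf(scratch []byte, w io.Writer, major byte, val uint64) error {
	scratch[0] = (major << 5) | 27
	binary.BigEndian.PutUint64(scratch[1:9], val)
	_, err := w.Write(scratch[:9])
	return err
}

// MarshalCBOR acquires one scratch buffer per call and reuses it for every
// field, so allocations stay amortised by the pool while the helpers keep
// their existing buffer-passing shape.
func (t *MyType) MarshalCBOR(w io.Writer) error {
	bufp := scratchPool.Get().(*[]byte)
	defer scratchPool.Put(bufp)
	scratch := *bufp

	// A single-element tuple header (major type 4 = array) followed by the
	// value (major type 0 = unsigned int).
	if err := writeMajorTypeHeaderBuf(scratch, w, 4, 1); err != nil {
		return err
	}
	return writeMajorTypeHeaderBuf(scratch, w, 0, t.Value)
}
```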
Closing in favour of #67
Adds pools for the small buffers used when reading CBOR headers and for the larger buffers used when reading strings. This trades some CPU for less pressure on the garbage collector: the benchmarks show a notable increase in CPU time, but allocations are amortized to near zero in many cases.
Internalising the management of scratch buffers simplifies the code and allows removal/deprecation of duplicate implementations for several functions.
Users will need to re-generate marshaling methods to benefit from the removal of scratch buffers from those methods.
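For downstream packages, regeneration is the usual small gen program run against the updated library; a sketch using cbor-gen's `WriteTupleEncodersToFile` entry point, with the import path and type names as placeholders:

```go
package main

import (
	"log"

	cborgen "github.com/whyrusleeping/cbor-gen"

	"example.com/myproject/types" // placeholder for the package that owns the structs
)

func main() {
	// Rewrites cbor_gen.go for the listed types; after regenerating against
	// this branch the methods no longer declare their own scratch buffers.
	if err := cborgen.WriteTupleEncodersToFile(
		"./types/cbor_gen.go",
		"types",
		types.Foo{},
		types.Bar{},
	); err != nil {
		log.Fatal(err)
	}
}
```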
Benchstat comparison with master: