Note that allocating buffers like `make([]byte, 8)` will end up on the stack unless they escape into the heap, e.g. by being returned or being fed to a generic `io.Reader.Read`. I imagine part of the slow-down here is caused by some of the lines you're replacing not being allocations in the first place.
Given that there are so few allocations anyways, I'm somewhat skeptical of this. Have you run any pprof profiles with `go tool pprof -alloc_objects` to verify?
> Note that allocating buffers like `make([]byte, 8)` will end up on the stack unless they escape into the heap, e.g. by being returned or being fed to a generic `io.Reader.Read`. I imagine part of the slow-down here is caused by some of the lines you're replacing not being allocations in the first place.
Some perhaps. But escape analysis shows that, in general, these scratch buffers escape:
~/wip/iand/cbor-gen/testing [master|✔]
10:16 $ go build -gcflags="-m"
... elided...
./cbor_gen.go:32:17: make([]byte, 9) escapes to heap
./cbor_gen.go:50:7: t does not escape
./cbor_gen.go:50:37: leaking param: r
./cbor_gen.go:53:21: &typegen.peeker{...} escapes to heap
./cbor_gen.go:54:17: make([]byte, 8) escapes to heap
./cbor_gen.go:60:8: func literal does not escape
./cbor_gen.go:82:20: ... argument does not escape
./cbor_gen.go:82:21: extra escapes to heap
./cbor_gen.go:90:18: make([]uint64, extra) escapes to heap
./cbor_gen.go:97:25: ... argument does not escape
./cbor_gen.go:101:25: ... argument does not escape
./cbor_gen.go:101:26: maj escapes to heap
./cbor_gen.go:112:7: leaking param content: t
./cbor_gen.go:112:37: leaking param: w
./cbor_gen.go:121:17: make([]byte, 9) escapes to heap
./cbor_gen.go:179:7: leaking param content: t
./cbor_gen.go:179:39: leaking param: r
./cbor_gen.go:182:21: &typegen.peeker{...} escapes to heap
./cbor_gen.go:183:17: make([]byte, 8) escapes to heap
... elided...
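The pattern is easy to reproduce in isolation. A minimal sketch (not taken from cbor-gen; the function names are illustrative) showing why a scratch buffer handed to an `io.Reader` escapes while a purely local one does not:

```go
package scratchdemo

import "io"

// readHeader hands its scratch buffer to an io.Reader through an interface
// call, so the compiler cannot prove the buffer stays local; `go build
// -gcflags=-m` typically reports "make([]byte, 8) escapes to heap" here.
func readHeader(r io.Reader) (byte, error) {
	scratch := make([]byte, 8)
	if _, err := io.ReadFull(r, scratch[:1]); err != nil {
		return 0, err
	}
	return scratch[0], nil
}

// sumLocal only uses its buffer locally, so escape analysis keeps it on the
// stack and reports "make([]byte, 8) does not escape".
func sumLocal() int {
	b := make([]byte, 8)
	sum := 0
	for i := range b {
		b[i] = byte(i)
		sum += int(b[i])
	}
	return sum
}
```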
> Given that there are so few allocations anyways, I'm somewhat skeptical of this. Have you run any pprof profiles with `go tool pprof -alloc_objects` to verify?
Yes. The general approach in this kind of optimization is to profile step by step, analyzing the top allocating functions, optimizing and then repeating.
I'm following this up today with profiling clients of this package to see the effects against real usage.
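The benchmarks driving that loop look roughly like this (a sketch, not code from this repo: `Foo` and `newTestFoo` are hypothetical, while `MarshalCBOR(io.Writer) error` is the shape of the generated method):

```go
package foo_test

import (
	"bytes"
	"testing"
)

// Run with:
//   go test -bench=MarshalFoo -benchmem -memprofile=mem.out
// then inspect the top allocators with:
//   go tool pprof -alloc_objects mem.out
func BenchmarkMarshalFoo(b *testing.B) {
	f := newTestFoo() // hypothetical helper that builds representative test data
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		var buf bytes.Buffer
		if err := f.MarshalCBOR(&buf); err != nil {
			b.Fatal(err)
		}
	}
}
```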
I compared this branch against the current master for a couple of downstream packages. That is, I updated each package to use the latest cbor-gen (485c4754fc51), re-ran the generation and benchmarked, then repeated for this branch. Note that for `go-hamt-ipld` I hand-edited the generated files to remove the scratch buffers and calls to `*Buf` functions, as per the comment in the generated file.
name old time/op new time/op delta
SerializeNode-8 15.1µs ±21% 13.1µs ±24% ~ (p=0.113 n=9+10)
name old alloc/op new alloc/op delta
SerializeNode-8 15.0kB ± 0% 14.1kB ± 0% -5.47% (p=0.000 n=9+10)
name old allocs/op new allocs/op delta
SerializeNode-8 112 ± 1% 61 ± 6% -45.89% (p=0.000 n=9+10)
name old time/op new time/op delta
AMTInsertBulk-8 18.5ms ±11% 19.1ms ±13% +2.98% (p=0.000 n=120+120)
AMTLoadAndInsert-8 128µs ±367% 137µs ±368% ~ (p=0.157 n=110+110)
NodesForHeight-8 0.24ns ± 1% 0.24ns ± 1% +0.06% (p=0.039 n=109+115)
name old gets/op new gets/op delta
AMTInsertBulk-8 60.8 ±452% 60.8 ±452% ~ (p=1.000 n=110+110)
AMTLoadAndInsert-8 2.98 ±97% 2.98 ±97% ~ (p=1.000 n=110+110)
name old puts/op new puts/op delta
AMTInsertBulk-8 60.8 ±452% 60.8 ±452% ~ (p=1.000 n=110+110)
AMTLoadAndInsert-8 3.00 ±100% 3.00 ±100% ~ (p=1.000 n=110+110)
name old alloc/op new alloc/op delta
AMTInsertBulk-8 6.37MB ± 5% 6.29MB ± 5% -1.33% (p=0.000 n=110+110)
AMTLoadAndInsert-8 334kB ±410% 328kB ±411% ~ (p=0.259 n=110+110)
NodesForHeight-8 0.00B 0.00B ~ (all equal)
name old allocs/op new allocs/op delta
AMTInsertBulk-8 92.0k ± 9% 81.9k ±10% -10.99% (p=0.000 n=110+110)
AMTLoadAndInsert-8 2.27k ±389% 1.91k ±386% ~ (p=0.169 n=110+110)
NodesForHeight-8 0.00 0.00 ~ (all equal)
Note: `go-amt-ipld` is on an old version of cbor-gen. I also benchmarked that version against current master and saw some performance regressions, but I didn't dig into the cause. The variability in the benchmarks looks really odd, especially the numbers of gets/puts for AMTInsertBulk, leading to a p-value of 1. The above benchmarks are based on current master.
More comparisons between current master and this pool change.
name old time/op new time/op delta
StateTreeSet-8 495ns ± 8% 512ns ±14% ~ (p=0.326 n=9+10)
StateTreeSetFlush-8 10.0ms ± 0% 11.7ms ± 1% +17.20% (p=0.008 n=5+5)
name old alloc/op new alloc/op delta
StateTreeSet-8 359B ± 8% 364B ± 4% ~ (p=0.197 n=9+9)
StateTreeSetFlush-8 7.35MB ± 0% 7.22MB ± 0% -1.74% (p=0.004 n=6+5)
name old allocs/op new allocs/op delta
StateTreeSet-8 5.00 ± 0% 5.00 ± 0% ~ (all equal)
StateTreeSetFlush-8 142k ± 0% 132k ± 0% -7.04% (p=0.000 n=6+5)
name old time/op new time/op delta
BlockHeaderMarshal-8 593ns ± 6% 834ns ± 1% +40.78% (p=0.000 n=10+10)
WinCounts-8 3.39µs ± 2% 3.36µs ± 2% ~ (p=0.068 n=10+9)
SerializeMessage-8 533ns ± 6% 590ns ± 3% +10.75% (p=0.000 n=10+8)
name old alloc/op new alloc/op delta
BlockHeaderMarshal-8 112B ± 0% 64B ± 0% -42.86% (p=0.000 n=10+10)
WinCounts-8 1.11kB ± 0% 1.11kB ± 0% ~ (all equal)
SerializeMessage-8 416B ± 0% 400B ± 0% -3.85% (p=0.000 n=10+10)
name old allocs/op new allocs/op delta
BlockHeaderMarshal-8 16.0 ± 0% 12.0 ± 0% -25.00% (p=0.000 n=10+10)
WinCounts-8 21.0 ± 0% 21.0 ± 0% ~ (all equal)
SerializeMessage-8 16.0 ± 0% 13.0 ± 0% -18.75% (p=0.000 n=10+10)
I'll take a further look to see if I can reduce the slowdown. It may be that the header buffer pool is not worth the CPU overhead, but the string buffer pool may well be.
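For reference, the header-buffer pooling under discussion has roughly the following shape (a minimal sketch, not the actual cbor-gen implementation; the function name is illustrative, and the string-buffer pool works along the same lines but has to copy the bytes out before returning the buffer):

```go
package cborpool

import (
	"encoding/binary"
	"io"
	"sync"
)

// headerPool recycles the small scratch buffers used to encode CBOR headers.
// It stores *[]byte rather than []byte so that Get/Put do not themselves
// allocate a fresh interface value for the slice header on every call.
var headerPool = sync.Pool{
	New: func() interface{} {
		b := make([]byte, 9) // 1 byte of major type/extra plus up to 8 bytes of value
		return &b
	},
}

// writeMajorTypeHeader writes a CBOR header using a pooled scratch buffer
// instead of allocating one per call. The 16- and 32-bit forms are omitted
// for brevity, so output for mid-sized values is valid but not minimal CBOR.
func writeMajorTypeHeader(w io.Writer, major byte, val uint64) error {
	bufp := headerPool.Get().(*[]byte)
	defer headerPool.Put(bufp)
	buf := *bufp

	if val < 24 {
		buf[0] = (major << 5) | byte(val)
		_, err := w.Write(buf[:1])
		return err
	}
	buf[0] = (major << 5) | 27 // additional info 27: 8-byte value follows
	binary.BigEndian.PutUint64(buf[1:9], val)
	_, err := w.Write(buf[:9])
	return err
}
```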
I'd really start by profiling the target program to see if this is a source of allocation issues.
> I'd really start by profiling the target program to see if this is a source of allocation issues.
Yes. This is what I already have from a full-history lily node. I've rebuilt with a version of `go-hamt-ipld` that uses the pools and am leaving it to run for a bit before taking a comparison profile. CBOR unmarshalling is right up there, so I believe the optimization is worth investing in.
ubuntu@lily-fullhistory:~$ go tool pprof lily/lily lily-master-allocs.pprof
File: lily
Build ID: 1e7d35986b67115e2a08957bf5e3873630e0f73b
Type: alloc_space
Time: Mar 21, 2022 at 4:21pm (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 54.24TB, 74.79% of 72.52TB total
Dropped 3633 nodes (cum <= 0.36TB)
Showing top 10 nodes out of 275
flat flat% sum% cum cum%
39.13TB 53.96% 53.96% 39.13TB 53.96% bytes.makeSlice
3.35TB 4.62% 58.58% 14.66TB 20.21% github.com/dgraph-io/badger/v2/table.(*Builder).addHelper
1.96TB 2.70% 61.28% 12.24TB 16.88% github.com/filecoin-project/go-hamt-ipld/v3.(*Node).UnmarshalCBOR
1.69TB 2.33% 63.61% 1.69TB 2.33% github.com/ipfs/go-cid.CidFromBytes
1.67TB 2.30% 65.91% 1.67TB 2.30% go.opencensus.io/stats/view.(*DistributionData).clone
1.41TB 1.95% 67.85% 5.78TB 7.97% github.com/dgraph-io/badger/v2.(*Txn).Get
1.39TB 1.92% 69.78% 1.39TB 1.92% github.com/whyrusleeping/cbor-gen.ReadByteArray
1.30TB 1.80% 71.57% 40.43TB 55.76% bytes.(*Buffer).grow
1.17TB 1.61% 73.19% 1.17TB 1.61% github.com/dgraph-io/badger/v2/table.(*Table).block
1.16TB 1.61% 74.79% 1.17TB 1.61% github.com/dgraph-io/badger/v2.(*DB).newTransaction
ubuntu@lily-fullhistory:~$ go tool pprof --alloc_objects lily/lily lily-master-allocs.pprof
File: lily
Build ID: 1e7d35986b67115e2a08957bf5e3873630e0f73b
Type: alloc_objects
Time: Mar 21, 2022 at 4:21pm (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 311306667274, 50.81% of 612656612901 total
Dropped 3596 nodes (cum <= 3063283064)
Showing top 10 nodes out of 312
flat flat% sum% cum cum%
48044314111 7.84% 7.84% 48044314111 7.84% github.com/dgraph-io/badger/v2.(*levelHandler).getTableForKey
39225411024 6.40% 14.24% 39225411024 6.40% github.com/ipfs/go-cid.CidFromBytes
37485368248 6.12% 20.36% 165872094588 27.07% github.com/filecoin-project/go-hamt-ipld/v3.(*Pointer).UnmarshalCBOR
34311629606 5.60% 25.96% 87380907218 14.26% github.com/whyrusleeping/cbor-gen.(*Deferred).UnmarshalCBOR
32029887703 5.23% 31.19% 32034798094 5.23% github.com/whyrusleeping/cbor-gen.ReadByteArray
31750037182 5.18% 36.37% 199192856458 32.51% github.com/filecoin-project/go-hamt-ipld/v3.(*Node).UnmarshalCBOR
27717356689 4.52% 40.90% 84075830794 13.72% github.com/filecoin-project/go-hamt-ipld/v3.(*KV).UnmarshalCBOR
22379296201 3.65% 44.55% 41252074314 6.73% bytes.(*Buffer).grow
19490588397 3.18% 47.73% 19490588397 3.18% bytes.NewBuffer (inline)
18872778113 3.08% 50.81% 18872778113 3.08% bytes.makeSlice
(note: this node is simply syncing the chain, not running any lily specific work)
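(Aside: heap profiles like these can be fetched from any running process that exposes the standard `net/http/pprof` endpoints; whether lily wires this up exactly as below isn't shown here, so treat the listener address as illustrative.)

```go
// Illustrative only: expose the standard pprof endpoints so heap profiles can
// be pulled from a running node, e.g. with
//   go tool pprof http://localhost:6060/debug/pprof/allocs
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on DefaultServeMux
)

func main() {
	log.Println(http.ListenAndServe("localhost:6060", nil))
}
```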
After running lily with the modified `go-hamt-ipld` for 30 minutes you can see that CBOR unmarshalling allocations have dropped by nearly half (`Deferred`, for example, is now responsible for 2.80% of all allocated objects, compared with 5.60% previously). I feel this profile is somewhat skewed because badger is still thrashing its cache, so I'll leave it overnight and resample tomorrow.
ubuntu@lily-fullhistory:~$ go tool pprof lily/lily-cbor-pool lily-pools-allocs.pprof
File: lily-cbor-pool
Build ID: f314a712be68792ce764d44a24e2361ffeeb1b08
Type: alloc_space
Time: Mar 21, 2022 at 4:47pm (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 42.93GB, 89.52% of 47.96GB total
Dropped 1437 nodes (cum <= 0.24GB)
Showing top 10 nodes out of 167
flat flat% sum% cum cum%
17.68GB 36.86% 36.86% 25.22GB 52.58% github.com/dgraph-io/badger/v2/pb.(*TableIndex).Unmarshal
7.54GB 15.71% 52.58% 7.54GB 15.71% github.com/dgraph-io/badger/v2/pb.(*BlockOffset).Unmarshal
5.95GB 12.41% 64.98% 5.95GB 12.41% encoding/json.(*decodeState).literalStore
5.93GB 12.36% 77.35% 5.93GB 12.36% github.com/dgraph-io/ristretto/z.(*Bloom).Size
3.49GB 7.29% 84.63% 3.49GB 7.29% bytes.makeSlice
0.61GB 1.27% 85.91% 0.61GB 1.27% github.com/filecoin-project/lotus/lib/backupds.(*Entry).UnmarshalCBOR
0.50GB 1.04% 86.94% 2.94GB 6.13% github.com/filecoin-project/go-hamt-ipld/v3.(*Node).UnmarshalCBOR
0.48GB 1.01% 87.95% 0.49GB 1.01% github.com/whyrusleeping/cbor-gen.ReadByteArray
0.43GB 0.9% 88.86% 0.43GB 0.9% github.com/ipfs/go-cid.CidFromBytes
0.32GB 0.67% 89.52% 1.41GB 2.94% github.com/dgraph-io/badger/v2.(*Txn).Get
ubuntu@lily-fullhistory:~$ go tool pprof --alloc_objects lily/lily-cbor-pool lily-pools-allocs.pprof
File: lily-cbor-pool
Build ID: f314a712be68792ce764d44a24e2361ffeeb1b08
Type: alloc_objects
Time: Mar 21, 2022 at 4:47pm (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 272095232, 78.76% of 345487341 total
Dropped 1413 nodes (cum <= 1727436)
Showing top 10 nodes out of 191
flat flat% sum% cum cum%
101155979 29.28% 29.28% 101155979 29.28% github.com/dgraph-io/badger/v2/pb.(*BlockOffset).Unmarshal
100562611 29.11% 58.39% 201718590 58.39% github.com/dgraph-io/badger/v2/pb.(*TableIndex).Unmarshal
11829392 3.42% 61.81% 11829392 3.42% github.com/dgraph-io/badger/v2.(*levelHandler).getTableForKey
9776194 2.83% 64.64% 9776194 2.83% github.com/ipfs/go-cid.CidFromBytes
9723821 2.81% 67.45% 9887663 2.86% github.com/whyrusleeping/cbor-gen.ReadByteArray
9688562 2.80% 70.26% 22619162 6.55% github.com/whyrusleeping/cbor-gen.(*Deferred).UnmarshalCBOR
9142424 2.65% 72.91% 37840905 10.95% github.com/filecoin-project/go-hamt-ipld/v3.(*Pointer).UnmarshalCBOR
7879392 2.28% 75.19% 46080747 13.34% github.com/filecoin-project/go-hamt-ipld/v3.(*Node).UnmarshalCBOR
7454212 2.16% 77.34% 7454212 2.16% github.com/filecoin-project/lotus/lib/backupds.(*Entry).UnmarshalCBOR
4882645 1.41% 78.76% 4882645 1.41% bytes.NewBuffer (inline)
Awesome! @whyrusleeping or @Kubuxu can you remember why we didn't do this before?
I was optimizing, though, just for minimizing alloc count and CPU cost. @iand what happens if we leave the buffers to get passed through all the `*Buf` methods, but use the pool to allocate the top-level scratch buffers? That was one change I was going to investigate making; I'm willing to bet that puts the CPU usage back in line with what we were seeing before, while still keeping overall GC load down.
Lily profile after 12 hours with the pool changes
ubuntu@lily-fullhistory:~$ go tool pprof lily/lily-cbor-pool lily-pools-allocs.pprof
File: lily-cbor-pool
Build ID: f314a712be68792ce764d44a24e2361ffeeb1b08
Type: alloc_space
Time: Mar 22, 2022 at 9:54am (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 740.40GB, 78.68% of 941.03GB total
Dropped 2669 nodes (cum <= 4.71GB)
Showing top 10 nodes out of 209
flat flat% sum% cum cum%
550.83GB 58.53% 58.53% 550.83GB 58.53% bytes.makeSlice
50.80GB 5.40% 63.93% 222.84GB 23.68% github.com/dgraph-io/badger/v2/table.(*Builder).addHelper
21.93GB 2.33% 66.26% 31.31GB 3.33% github.com/dgraph-io/badger/v2/pb.(*TableIndex).Unmarshal
19.31GB 2.05% 68.32% 19.31GB 2.05% github.com/ipfs/go-cid.CidFromBytes
18.32GB 1.95% 70.26% 115.30GB 12.25% github.com/filecoin-project/go-hamt-ipld/v3.(*Node).UnmarshalCBOR
17.57GB 1.87% 72.13% 75.74GB 8.05% github.com/dgraph-io/badger/v2.(*Txn).Get
16.37GB 1.74% 73.87% 16.37GB 1.74% github.com/dgraph-io/badger/v2/table.(*Table).block
15.44GB 1.64% 75.51% 15.45GB 1.64% github.com/whyrusleeping/cbor-gen.ReadByteArray
15.20GB 1.62% 77.13% 566.03GB 60.15% bytes.(*Buffer).grow
14.63GB 1.55% 78.68% 14.68GB 1.56% github.com/dgraph-io/badger/v2.(*DB).newTransaction
ubuntu@lily-fullhistory:~$ go tool pprof --alloc_objects lily/lily-cbor-pool lily-pools-allocs.pprof
File: lily-cbor-pool
Build ID: f314a712be68792ce764d44a24e2361ffeeb1b08
Type: alloc_objects
Time: Mar 22, 2022 at 9:54am (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 3417812671, 52.86% of 6466082967 total
Dropped 2646 nodes (cum <= 32330414)
Showing top 10 nodes out of 232
flat flat% sum% cum cum%
605703250 9.37% 9.37% 605703250 9.37% github.com/dgraph-io/badger/v2.(*levelHandler).getTableForKey
464724635 7.19% 16.55% 1123818522 17.38% github.com/whyrusleeping/cbor-gen.(*Deferred).UnmarshalCBOR
435888632 6.74% 23.30% 435888632 6.74% github.com/ipfs/go-cid.CidFromBytes
345108583 5.34% 28.63% 345446512 5.34% github.com/whyrusleeping/cbor-gen.ReadByteArray
337844730 5.22% 33.86% 1439777385 22.27% github.com/filecoin-project/go-hamt-ipld/v3.(*Pointer).UnmarshalCBOR
286784303 4.44% 38.29% 1740062413 26.91% github.com/filecoin-project/go-hamt-ipld/v3.(*Node).UnmarshalCBOR
255089795 3.95% 42.24% 457805603 7.08% bytes.(*Buffer).grow
244209190 3.78% 46.01% 244209190 3.78% bytes.NewBuffer (inline)
224220629 3.47% 49.48% 224220629 3.47% github.com/dgraph-io/badger/v2/table.(*Table).block
218238924 3.38% 52.86% 218238924 3.38% github.com/dgraph-io/badger/v2.(*DB).getMemTables
`(*Node).UnmarshalCBOR` is the root of the allocation tree for most CBOR operations. Without pools, it (or the functions it calls) is responsible for 32.51% of all allocations; with the pools that is down to 26.91%. That's a pretty good reduction considering that unmarshalling always has to allocate.
Added SVG versions of the alloc_objects profiles, which are surprisingly readable (attached: lily master, lily with pools).
> I was optimizing, though, just for minimizing alloc count and CPU cost. @iand what happens if we leave the buffers to get passed through all the `*Buf` methods, but use the pool to allocate the top-level scratch buffers? That was one change I was going to investigate making; I'm willing to bet that puts the CPU usage back in line with what we were seeing before, while still keeping overall GC load down.
This is what I tried originally: I had the generated files include a pool that the marshaling functions would use to create a buffer to pass to the cbor-gen library code.
I didn't profile it much because I felt it was quite messy. For example, you have to deal with the client generating more than one file in the same package, which means either holding some state so you only generate one pool, or generating a separate pool for each file. Then you have to make the generation functions aware of the scope so they know which pool to use.
I had that working, but then realised I could push the pools into the library and remove a load of duplicated code at the same time, leaving a cleaner CBOR API to work with. So I didn't spend much time profiling the first version. I'll explore it again though.
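For comparison, the variant being asked about would look roughly like this: the generated marshaler takes its top-level scratch buffer from a pool and threads it through `*Buf`-style helpers. This is a sketch only; everything here except `sync.Pool` is an illustrative name rather than cbor-gen's actual API.

```go
package cborsketch

import (
	"encoding/binary"
	"io"
	"sync"
)

// MyType stands in for a struct with generated CBOR methods.
type MyType struct {
	Value uint64
}

// scratchPool is owned by the generated code in this variant; the library's
// *Buf-style helpers simply encode into whatever buffer they are given.
var scratchPool = sync.Pool{
	New: func() interface{} { b := make([]byte, 9); return &b },
}

// writeMajorTypeHeaderBuf stands in for a *Buf helper that encodes a CBOR
// header into the caller-supplied scratch buffer. It always uses the 8-byte
// form for brevity: valid CBOR, just not the minimal encoding.
func writeMajorTypeHeaderBuf(scratch []byte, w io.Writer, major byte, val uint64) error {
	scratch[0] = (major << 5) | 27
	binary.BigEndian.PutUint64(scratch[1:9], val)
	_, err := w.Write(scratch[:9])
	return err
}

// MarshalCBOR acquires one scratch buffer per call and reuses it for every
// field, so allocations stay amortised by the pool while the helpers keep
// their existing buffer-passing shape.
func (t *MyType) MarshalCBOR(w io.Writer) error {
	bufp := scratchPool.Get().(*[]byte)
	defer scratchPool.Put(bufp)
	scratch := *bufp

	// A single-element tuple header (major type 4 = array) followed by the
	// value (major type 0 = unsigned int).
	if err := writeMajorTypeHeaderBuf(scratch, w, 4, 1); err != nil {
		return err
	}
	return writeMajorTypeHeaderBuf(scratch, w, 0, t.Value)
}
```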
Closing in favour of #67
Adds pools for the small buffers used when reading CBOR headers and for the larger buffers used when reading strings. This trades some CPU for less pressure on the garbage collector: the benchmarks show a notable increase in CPU time, but allocations are amortized to near zero in many cases.
Internalising the management of scratch buffers simplifies the code and allows removal/deprecation of duplicate implementations for several functions.
Users will need to re-generate marshaling methods to benefit from the removal of scratch buffers from those methods.
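For downstream packages, regeneration is the usual small gen program run against the updated library; a sketch using cbor-gen's `WriteTupleEncodersToFile` entry point, with the import path and type names as placeholders:

```go
package main

import (
	"log"

	cborgen "github.com/whyrusleeping/cbor-gen"

	"example.com/myproject/types" // placeholder for the package that owns the structs
)

func main() {
	// Rewrites cbor_gen.go for the listed types; after regenerating against
	// this branch the methods no longer declare their own scratch buffers.
	if err := cborgen.WriteTupleEncodersToFile(
		"./types/cbor_gen.go",
		"types",
		types.Foo{},
		types.Bar{},
	); err != nil {
		log.Fatal(err)
	}
}
```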
Benchstat comparison with master: