weaveworks / mesh

A tool for building distributed applications.
Apache License 2.0
889 stars 107 forks source link

high rate of peer.routes() calculation resuting from gossip broadcasts #114

Closed murali-reddy closed 5 years ago

murali-reddy commented 5 years ago

When mesh router receives ProtocolGossipBroadcast traffic it attempts to relay broadcast to peers. In order to broadcast gossip recived from a source, routes.BroadcastAll() is performed on each received gossip. routes.BroadcastAll() calls peer.routes which is known to be O(n^2) operation.

As an optimization to prevent peer.routes calls, routes.lookupOrCalculate caches the calculated broadcast routes. However on topology changes routes are flushed. While this should be fine on a stable cluster, when there is constant toplogy changes cache misses can be expensive.

Following metrics (gathered as calls per second) were gathered with instrumented mesh on 150 node kubernetes cluster with weave-net using the mesh. Its not uncommon for some one to apply daemon set which results each node connecting to rest of the peers (so n^2 connection) resulting in significant topology changes. Hence misses in routes.lookupOrCalculate() resulting in calculating peer.routes on every call.

It would be desirable to prevent excessive peer.routes() calls resulting from the gossip broadcast

===================================================================
2019-09-17 7:23:0 Peers.garbageCollect(): 198
2019-09-17 7:23:0 routes.calculate()         -> routes.calculateBroadcast(): 59
2019-09-17 7:23:0 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 201
2019-09-17 7:23:0 routes.calculateUnicast(): 118
2019-09-17 7:23:0 connectionMaker.refresh(): 64
2019-09-17 7:23:0 rx gossip unicast: 0
2019-09-17 7:23:0 rx gossip broadcast: 183
2019-09-17 7:23:0 gossip broadcast - relay broadcasts: 192
2019-09-17 7:23:0 gossip broadcast - topology updates: 12
===================================================================
2019-09-17 7:23:1 Peers.garbageCollect(): 247
2019-09-17 7:23:1 routes.calculate()         -> routes.calculateBroadcast(): 77
2019-09-17 7:23:1 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 246
2019-09-17 7:23:1 routes.calculateUnicast(): 155
2019-09-17 7:23:1 connectionMaker.refresh(): 88
2019-09-17 7:23:1 rx gossip unicast: 0
2019-09-17 7:23:1 rx gossip broadcast: 226
2019-09-17 7:23:1 gossip broadcast - relay broadcasts: 234
2019-09-17 7:23:1 gossip broadcast - topology updates: 4
===================================================================
2019-09-17 7:23:2 Peers.garbageCollect(): 216
2019-09-17 7:23:2 routes.calculate()         -> routes.calculateBroadcast(): 80
2019-09-17 7:23:2 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 220
2019-09-17 7:23:2 routes.calculateUnicast(): 158
2019-09-17 7:23:2 connectionMaker.refresh(): 75
2019-09-17 7:23:2 rx gossip unicast: 0
2019-09-17 7:23:2 rx gossip broadcast: 209
2019-09-17 7:23:2 gossip broadcast - relay broadcasts: 206
2019-09-17 7:23:2 gossip broadcast - topology updates: 11
===================================================================
2019-09-17 7:23:3 Peers.garbageCollect(): 233
2019-09-17 7:23:3 routes.calculate()         -> routes.calculateBroadcast(): 87
2019-09-17 7:23:3 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 256
2019-09-17 7:23:3 routes.calculateUnicast(): 174
2019-09-17 7:23:3 connectionMaker.refresh(): 88
2019-09-17 7:23:3 rx gossip unicast: 0
2019-09-17 7:23:3 rx gossip broadcast: 240
2019-09-17 7:23:3 gossip broadcast - relay broadcasts: 226
2019-09-17 7:23:3 gossip broadcast - topology updates: 6
===================================================================
2019-09-17 7:23:4 Peers.garbageCollect(): 289
2019-09-17 7:23:4 routes.calculate()         -> routes.calculateBroadcast(): 88
2019-09-17 7:23:4 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 217
2019-09-17 7:23:4 routes.calculateUnicast(): 176
2019-09-17 7:23:4 connectionMaker.refresh(): 109
2019-09-17 7:23:4 rx gossip unicast: 0
2019-09-17 7:23:4 rx gossip broadcast: 204
2019-09-17 7:23:4 gossip broadcast - relay broadcasts: 278
2019-09-17 7:23:4 gossip broadcast - topology updates: 1
===================================================================
2019-09-17 7:23:5 Peers.garbageCollect(): 253
2019-09-17 7:23:5 routes.calculate()         -> routes.calculateBroadcast(): 82
2019-09-17 7:23:5 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 246
2019-09-17 7:23:5 routes.calculateUnicast(): 164
2019-09-17 7:23:5 connectionMaker.refresh(): 96
2019-09-17 7:23:5 rx gossip unicast: 0
2019-09-17 7:23:5 rx gossip broadcast: 244
2019-09-17 7:23:5 gossip broadcast - relay broadcasts: 244
2019-09-17 7:23:5 gossip broadcast - topology updates: 4
===================================================================
2019-09-17 7:23:6 Peers.garbageCollect(): 234
2019-09-17 7:23:6 routes.calculate()         -> routes.calculateBroadcast(): 71
2019-09-17 7:23:6 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 225
2019-09-17 7:23:6 routes.calculateUnicast(): 144
2019-09-17 7:23:6 connectionMaker.refresh(): 93
2019-09-17 7:23:6 rx gossip unicast: 0
2019-09-17 7:23:6 rx gossip broadcast: 228
2019-09-17 7:23:6 gossip broadcast - relay broadcasts: 226
2019-09-17 7:23:6 gossip broadcast - topology updates: 2
===================================================================
2019-09-17 7:23:7 Peers.garbageCollect(): 262
2019-09-17 7:23:7 routes.calculate()         -> routes.calculateBroadcast(): 80
2019-09-17 7:23:7 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 250
2019-09-17 7:23:7 routes.calculateUnicast(): 158
2019-09-17 7:23:7 connectionMaker.refresh(): 96
2019-09-17 7:23:7 rx gossip unicast: 0
2019-09-17 7:23:7 rx gossip broadcast: 254
2019-09-17 7:23:7 gossip broadcast - relay broadcasts: 253
2019-09-17 7:23:7 gossip broadcast - topology updates: 1
===================================================================
2019-09-17 7:23:8 Peers.garbageCollect(): 264
2019-09-17 7:23:8 routes.calculate()         -> routes.calculateBroadcast(): 55
2019-09-17 7:23:8 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 259
2019-09-17 7:23:8 routes.calculateUnicast(): 109
2019-09-17 7:23:8 connectionMaker.refresh(): 60
2019-09-17 7:23:8 rx gossip unicast: 0
2019-09-17 7:23:8 rx gossip broadcast: 259
2019-09-17 7:23:8 gossip broadcast - relay broadcasts: 246
2019-09-17 7:23:8 gossip broadcast - topology updates: 3
===================================================================
2019-09-17 7:23:9 Peers.garbageCollect(): 266
2019-09-17 7:23:9 routes.calculate()         -> routes.calculateBroadcast(): 60
2019-09-17 7:23:9 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 318
2019-09-17 7:23:9 routes.calculateUnicast(): 120
2019-09-17 7:23:9 connectionMaker.refresh(): 71
2019-09-17 7:23:9 rx gossip unicast: 0
2019-09-17 7:23:9 rx gossip broadcast: 319
2019-09-17 7:23:9 gossip broadcast - relay broadcasts: 255
2019-09-17 7:23:9 gossip broadcast - topology updates: 0
===================================================================
2019-09-17 7:23:10 Peers.garbageCollect(): 308
2019-09-17 7:23:10 routes.calculate()         -> routes.calculateBroadcast(): 58
2019-09-17 7:23:10 routes.lookupOrCalculate() -> routes.calculateBroadcast(): 255
2019-09-17 7:23:10 routes.calculateUnicast(): 116
2019-09-17 7:23:10 connectionMaker.refresh(): 73
2019-09-17 7:23:10 rx gossip unicast: 0
2019-09-17 7:23:10 rx gossip broadcast: 258
2019-09-17 7:23:10 gossip broadcast - relay broadcasts: 302
2019-09-17 7:23:10 gossip broadcast - topology updates: 2
===================================================================
bboreham commented 5 years ago

Fixed by #120