weaveworks / weave

Simple, resilient multi-host containers networking and more.
https://www.weave.works
Apache License 2.0
6.62k stars 671 forks source link

weave-npc should reconcile ipsets/rules on restart #3771

Open Quentin-M opened 4 years ago

Quentin-M commented 4 years ago

Forking off this issue: https://github.com/weaveworks/weave/issues/3764 as per @bboreham's request.

TL;DR >

As of right now, when one of weave-npc's controller/go-routine panics, weave-npc will simply log the panic rather than propagating it in order to restart the go-routine, or in order to restart weave-npc as a whole (thus potentially saving it from a panic loop if the memory structures are in an unexpected state. This will leave weave-npc running in a non-functioning state.

Furthermore, when weave-npc restarts, it incurs a 10s+ downtime as weave-npc resets every IPSets/Rules, then re-creates them, instead of gracefully reconciling the host / desired / current states. The trouble is that, when a bad informer sends unexpected data (as per issue above), all weave-npc containers will crash at once, hence creating a full cluster downtime - potentially lengthened by the slowdown of the API due to sheer amount of requests.

/cc @murali-reddy

bboreham commented 4 years ago

Note I said in #3764 that it is written not to "simply log".

"it appears to me that the intention is to re-raise the panic and hence exit the whole program. I am mystified why it keeps on running"

Quentin-M commented 4 years ago

The code snippet explicitly intercepts the panic from the crashed routine with recover(), log and move on. Only if the "ReallyCrash" global variable is set to true, or if an additional handler is globally registered with a new panic or with an exit syscall, will it do anything besides logging. Code works as expected, behavior is however wrong.

Best Regards, Quentin Machu, Head of DevOps, BitMEX +1 (415) 720 1243 On Feb 19, 2020, 03:46 -0800, Bryan Boreham notifications@github.com, wrote:

Note I said in #3764 that it is written not to "simply log". "it appears to me that the intention is to re-raise the panic and hence exit the whole program. I am mystified why it keeps on running" — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

bboreham commented 4 years ago

Only if the "ReallyCrash" global variable is set to true,

ReallyCrash defaults to true. Why would it change, in this context?

gobomb commented 4 years ago

I found ReallyCrash is true but the process didn't crash. I am confused.

gobomb commented 4 years ago

I think it will block at this line: https://github.com/kubernetes/client-go/blob/319dbfd0ed290ad5dbbbe252d27cb5bc9181e6be/tools/cache/controller.go#L147 , when the panic was called in wait.Until(c.processLoop, time.Second, stopCh) .

bboreham commented 4 years ago

@gobomb yes, very good.

I recreated the problem and here is a stack trace from it waiting:

goroutine 107 [semacquire]:
sync.runtime_Semacquire(0xc0003e6568)
        /usr/local/go/src/runtime/sema.go:56 +0x42
sync.(*WaitGroup).Wait(0xc0003e6560)
        /usr/local/go/src/sync/waitgroup.go:130 +0x64
k8s.io/apimachinery/pkg/util/wait.(*Group).Wait(0xc0003e6560)
        /go/src/github.com/weaveworks/weave/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:49 +0x2d
panic(0x13d9360, 0x17cdc10)
        /usr/local/go/src/runtime/panic.go:969 +0x166
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/src/github.com/weaveworks/weave/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:57 +0x1c8
panic(0x13d9360, 0x17cdc10)
        /usr/local/go/src/runtime/panic.go:969 +0x166
github.com/weaveworks/weave/npc.(*controller).AddPod(0xc00044b260, 0x181ad20, 0xc000120000, 0xc00002c000, 0x0, 0x0)
        /go/src/github.com/weaveworks/weave/npc/controller.go:125 +0x136

Do you know if there is a Kubernetes issue filed?

bboreham commented 4 years ago

Found your Kubernetes issue: https://github.com/kubernetes/kubernetes/issues/93641