Quentin-M opened this issue 4 years ago
Note I said in #3764 that it is written not to "simply log".
"it appears to me that the intention is to re-raise the panic and hence exit the whole program. I am mystified why it keeps on running"
The code snippet explicitly intercepts the panic from the crashed goroutine with recover(), logs it, and moves on. Only if the "ReallyCrash" global variable is set to true, or if an additional handler is globally registered that panics again or calls an exit syscall, will it do anything besides logging. The code works as written; the behavior, however, is wrong.
> Only if the "ReallyCrash" global variable is set to true,

ReallyCrash defaults to true. Why would it change, in this context?
I found ReallyCrash is true, but the process didn't crash. I am confused.
I think it will block at this line: https://github.com/kubernetes/client-go/blob/319dbfd0ed290ad5dbbbe252d27cb5bc9181e6be/tools/cache/controller.go#L147 when the panic is raised inside wait.Until(c.processLoop, time.Second, stopCh).
@gobomb yes, very good.
I recreated the problem and here is a stack trace from it waiting:
goroutine 107 [semacquire]:
sync.runtime_Semacquire(0xc0003e6568)
/usr/local/go/src/runtime/sema.go:56 +0x42
sync.(*WaitGroup).Wait(0xc0003e6560)
/usr/local/go/src/sync/waitgroup.go:130 +0x64
k8s.io/apimachinery/pkg/util/wait.(*Group).Wait(0xc0003e6560)
/go/src/github.com/weaveworks/weave/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:49 +0x2d
panic(0x13d9360, 0x17cdc10)
/usr/local/go/src/runtime/panic.go:969 +0x166
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
/go/src/github.com/weaveworks/weave/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:57 +0x1c8
panic(0x13d9360, 0x17cdc10)
/usr/local/go/src/runtime/panic.go:969 +0x166
github.com/weaveworks/weave/npc.(*controller).AddPod(0xc00044b260, 0x181ad20, 0xc000120000, 0xc00002c000, 0x0, 0x0)
/go/src/github.com/weaveworks/weave/npc/controller.go:125 +0x136
Do you know if there is a Kubernetes issue filed?
Found your Kubernetes issue: https://github.com/kubernetes/kubernetes/issues/93641
Forking off this issue: https://github.com/weaveworks/weave/issues/3764 as per @bboreham's request.
TL;DR:

As of right now, when one of weave-npc's controller goroutines panics, weave-npc simply logs the panic rather than propagating it in order to restart the goroutine, or to restart weave-npc as a whole (thus potentially saving it from a panic loop if the in-memory structures are in an unexpected state). This leaves weave-npc running in a non-functioning state.

Furthermore, when weave-npc restarts, it incurs 10s+ of downtime as weave-npc deletes every IPSet/rule and then re-creates them, instead of gracefully reconciling the host's current state with the desired state. The trouble is that when a bad informer sends unexpected data (as per the issue above), all weave-npc containers crash at once, creating a full-cluster downtime, potentially lengthened by the slowdown of the API server due to the sheer volume of requests.
/cc @murali-reddy