Running the cluster without doing changes causes unclean shutdown

perbu commented 1 year ago

Hi there Wind.

Excellent work on the tests. 🥇 I'm very happy about this. We're 99% of the way.

I struggle a bit with that leak test I did. So what I do is run a cluster node with the following arguments: -node-name testnode -members localhost:7946 -raft-bootstrap true -storage-way memory

This runs cleanly, but if I shut it down (Ctrl-C), the following happens:

12:39PM INF comqtt server started
^C12:39PM WRN caught signal, stopping...
12:39PM INF stopping raft...
12:39PM ??? 2023-08-01T10:39:42.648+0200 [ERROR] raft: failed to take snapshot: error="nothing new to snapshot"
12:39PM WRN failed to create snapshot!
12:39PM INF raft stopped
12:39PM INF stopping node...
12:39PM ??? 2023/08/01 10:39:42 [INFO] serf: EventMemberLeave: testnode 127.0.0.1
12:39PM ERR not graceful stop
12:39PM DBG system event loop halted

I suspect this causes the leaktest, which basically does this, does not to complete correctly, leaving a couple of goroutines hanging. Is the error "nothing new to snapshot" - really an error?

All the best, Per.

perbu commented 1 year ago

Even though I'm sending a few messages through, I still get some errors on Ctrl-C shutdown:

1:21PM INF apply raft log filter=testtopic/test from=testnode leader=testnode type=8
^C1:21PM WRN caught signal, stopping...
1:21PM INF stopping raft...
1:21PM ??? 2023-08-01T11:21:42.659+0200 [INFO]  snapshot: creating new snapshot: path=data/testnode/snapshots/6-10-1690881702659.tmp
1:21PM ??? 2023-08-01T11:21:42.694+0200 [INFO]  snapshot: reaping snapshot: path=data/testnode/snapshots/4-6-1690881574192
1:21PM ERR not graceful stop

Do you know what is causing the "not graceful stop"?

perbu commented 1 year ago

Okey, I think I got it.

In cluster/agent.go, in the Stop() function there is a recover, which masks a panic. The shutdown tries to shutdown the GRPC service, which isn't defined, perhaps because it was never started.

The recover() masks the stacktrace. I have a fix for this, but I'm wondering if you agree that it is a good idea to remove the recover(). Perhaps it makes sense if there are other activities that must finish during shutdown, but otherwise I think it is a bad practice to mask panics.

Let me know and I'll have a PR for you.

wind-c commented 1 year ago

ok, you try to fix it.

wind-c / comqtt

Running the cluster without doing changes causes unclean shutdown #60