zeromq / gyre

Golang port of Zyre
GNU Lesser General Public License v3.0

When one node leaves a group, it causes a panic in other nodes on the network. #40

Closed: afshin closed this issue 8 years ago

afshin commented 8 years ago

Please don't expose panics to users of the library, even if they are used internally. They can be captured with recover inside the library, but they are difficult to capture in clients of the library (in this case, I can't seem to recover at all).
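To illustrate the point about recover: in Go, a deferred recover only catches a panic raised on the same goroutine, so a client cannot catch a panic raised inside a goroutine that the library started. Below is a minimal sketch of the library-side alternative, using only the standard library. The name `runActor` and the error channel are hypothetical stand-ins, not part of gyre's API.

```go
package main

import "fmt"

// runActor is a hypothetical stand-in for a library's internal goroutine.
// It recovers any panic raised by work on that goroutine and forwards it
// to the caller as an error; it sends nil if work returns normally.
func runActor(work func(), errs chan<- error) {
	go func() {
		defer func() {
			if r := recover(); r != nil {
				errs <- fmt.Errorf("actor panicked: %v", r)
				return
			}
			errs <- nil
		}()
		work()
	}()
}

func main() {
	errs := make(chan error, 1)

	// Simulate the failure from this report inside the library goroutine.
	runActor(func() {
		panic("message status isn't equal to peer status, 2 != 3")
	}, errs)

	// The client sees an ordinary error value instead of a crash.
	if err := <-errs; err != nil {
		fmt.Println("recovered:", err)
	}
}
```

With this shape the client decides what to do with the failure, for example logging it and restarting the node, instead of the whole process exiting.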

I have two applications running on the same network. When one calls Leave(), it causes the other app to panic. This may be a bug, but even if it isn't, the panic cannot be recovered:

panic: [136FF149F072A30744B54B23B831541C] message status isn't equal to peer status, 2 != 3

goroutine 8 [running]:
panic(0x7d5600, 0xc82021e6d0)
    /usr/local/go/src/runtime/panic.go:464 +0x3e6
github.com/zeromq/gyre.(*node).recvFromPeer(0xc8201a84b0, 0x7f7804729338, 0xc8201f4f00)
    /srv/go/src/github.com/zeromq/gyre/node.go:708 +0x9d2
github.com/zeromq/gyre.(*node).actor.func4(0x1, 0x0, 0x0)
    /srv/go/src/github.com/zeromq/gyre/node.go:842 +0x214
github.com/pebbe/zmq4.(*Reactor).Run(0xc820015b00, 0x989680, 0x0, 0x0)
    /srv/go/src/github.com/pebbe/zmq4/reactor.go:187 +0x870
github.com/zeromq/gyre.(*node).actor(0xc8201a84b0)
    /srv/go/src/github.com/zeromq/gyre/node.go:847 +0x312
created by github.com/zeromq/gyre.newGyre
    /srv/go/src/github.com/zeromq/gyre/gyre.go:94 +0x201

https://github.com/zeromq/gyre/blob/master/node.go#L702

https://github.com/zeromq/gyre/blob/master/node.go#L708

armen commented 8 years ago

I do agree that panic shouldn't be used in a library, but in this rare case crashing fast helps to identify bugs. Having said that, a pull request is always welcome.
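For anyone considering such a pull request, here is a rough sketch of reporting the mismatch as an error value rather than a panic. The `peer` type, its fields, and `checkStatus` are hypothetical and only mirror the wording of the panic message; the actual check in node.go may be structured differently, and the order of the two numbers in the message is assumed to be message status first, peer status second.

```go
package main

import "fmt"

// peer is a hypothetical stand-in for the library's per-peer state.
type peer struct {
	id     string
	status byte // status value the node currently expects from this peer
}

// checkStatus returns an error when the status carried by a message does
// not match the status tracked for the peer, instead of panicking.
func (p *peer) checkStatus(msgStatus byte) error {
	if msgStatus != p.status {
		return fmt.Errorf("[%s] message status isn't equal to peer status, %d != %d",
			p.id, msgStatus, p.status)
	}
	return nil
}

func main() {
	p := &peer{id: "EXAMPLE", status: 3}

	// The caller can log the error and drop the peer rather than crash.
	if err := p.checkStatus(2); err != nil {
		fmt.Println("status mismatch:", err)
	}
}
```

The trade-off is the one described above: an error return keeps the process alive, but the mismatch has to be logged loudly or it will no longer surface bugs as quickly as a crash does.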

In terms of the bug itself, I can't seem to reproduce the bug. These are the scenarios I tested:

1) Two nodes joined to a group then after a while one of them leaves the group
2) Two nodes in the cluster but one of them leaves an arbitrary group

These two scenarios worked fine for me. Maybe I'm missing something here. It would be very helpful if you could come up with a test or a snippet that reproduces the bug, if it can be reproduced at all.

Cheers

afshin commented 8 years ago

Hi! Thanks for the response. Here is a small sample program that crashes every time for me:

https://gist.github.com/afshin/0242be8726c37144407a713f8d941815

To test it, compile it and run two instances:

./gyre-leave -seconds 10

and

./gyre-leave

The first one will leave the group after 10 seconds (or whatever number you pick) and the second one will remain running unless something goes wrong. In my case, the second one always crashes with this error:

panic: [BBD6198AABE20BD2F3FFED3652D5DB7F] message status isn't equal to peer status, 2 != 3

goroutine 5 [running]:
panic(0x606000, 0xc8200b9810)
    /usr/local/go/src/runtime/panic.go:464 +0x3e6
github.com/zeromq/gyre.(*node).recvFromPeer(0xc82009e000, 0x7f33330b9740, 0xc82004ce80)
    /srv/go/src/github.com/zeromq/gyre/node.go:708 +0x9d2
github.com/zeromq/gyre.(*node).actor.func4(0x1, 0x0, 0x0)
    /srv/go/src/github.com/zeromq/gyre/node.go:842 +0x214
github.com/pebbe/zmq4.(*Reactor).Run(0xc82004c440, 0x989680, 0x0, 0x0)
    /srv/go/src/github.com/pebbe/zmq4/reactor.go:187 +0x870
github.com/zeromq/gyre.(*node).actor(0xc82009e000)
    /srv/go/src/github.com/zeromq/gyre/node.go:847 +0x312
created by github.com/zeromq/gyre.newGyre
    /srv/go/src/github.com/zeromq/gyre/gyre.go:94 +0x201

armen commented 8 years ago

Hey @afshin, thanks for the bug report and the test case. The issue has been fixed. Could you please double-check that the bug is gone and, if so, close the issue?

Thanks.

afshin commented 8 years ago

@armen thanks very much! I can confirm that this fixes the issue. Thank you for resolving it 👍