zk1931 / jzab

ZooKeeper Atomic Broadcast in Java
http://zk1931.github.io/jzab/master/
Apache License 2.0

Bug about synchronization. #205

Closed EasonLiao closed 9 years ago

EasonLiao commented 9 years ago

I found this bug in the pulsed integration test with multiple servers.

I'll try to explain the bug here:

Synchronization has two phases: the first message from the follower is SYNC_HISTORY and the second is JOIN. For each message we construct a SyncPeerTask and pass it the last synced zxid and the last seen configuration. The invariant is that the last seen configuration is always <= the last synced zxid. This matters because, at the end of synchronization, the follower cleans up all configuration files whose zxid is > the last zxid in its log, and the cleanup function has a sanity check that raises an exception if no configuration files remain after the cleanup.
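The cleanup step and its sanity check can be sketched roughly as follows. This is only an illustration: `ConfigCleanupSketch` and `cleanup` are hypothetical names, configuration files are modelled as a set of zxids, and zxids are simplified to plain `long` values. It is not the actual jzab code.

```java
import java.util.TreeSet;

// Hypothetical sketch of the follower-side cleanup; not jzab's real API.
final class ConfigCleanupSketch {
  /**
   * Removes every configuration whose zxid is greater than the last zxid
   * in the log, then checks that at least one configuration is left.
   */
  static void cleanup(TreeSet<Long> configZxids, long lastZxidInLog) {
    // tailSet(x, false) is a live view of all elements strictly greater than x,
    // so clearing the view removes them from the backing set.
    configZxids.tailSet(lastZxidInLog, false).clear();
    if (configZxids.isEmpty()) {
      // The sanity check: a server must always have some configuration.
      throw new IllegalStateException(
          "no configuration file left at or before zxid " + lastZxidInLog);
    }
  }
}
```

With configurations at zxids 3 and 6 and a log ending at 5, only the configuration at 3 survives. If the only configuration is at 6, nothing survives and the exception fires, which is the failure seen in the test.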

During the test this exception was raised occasionally. I logged the output and found the reason: the invariant

the last seen configuration is always <= last sync zxid

is violated. The reason is tricky. The leader maintains a lastAck zxid, which is not necessarily up to date. When the leader receives a SYNC_HISTORY message, it constructs and launches a SyncPeerTask with last synced zxid = lastAck and last seen config = persistence.getLastSeenConfig. Every time the leader's SyncProposalProcessor fsyncs data to disk, it also puts a message in the leader's queue so that the leader can update its lastAck field. So if another follower is in the middle of the joining process, it's possible that the leader has fsynced the COP to its log but receives SYNC_HISTORY before lastAck gets updated. In that case the last seen configuration is newer than lastAck, and the invariant that the last seen configuration is always <= the last synced zxid is violated.
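The interleaving can be reproduced deterministically with a small model. This is a sketch under simplifying assumptions: `LeaderSnapshotSketch` and its methods are hypothetical names, zxids are plain `long` values, and the leader's queue is collapsed into an explicit `processAck` call. It only shows how a stale lastAck combined with a fresh persisted configuration breaks the invariant.

```java
// Hypothetical, simplified model of the leader state; not jzab's real classes.
final class LeaderSnapshotSketch {
  private long lastAck;             // updated only when the leader drains its queue
  private long persistedLastConfig; // updated as soon as a COP is fsynced

  LeaderSnapshotSketch(long lastAck, long persistedLastConfig) {
    this.lastAck = lastAck;
    this.persistedLastConfig = persistedLastConfig;
  }

  /** The COP reaches disk; the matching ack is still waiting in the queue. */
  void fsyncCop(long copZxid) {
    persistedLastConfig = copZxid;
  }

  /** The leader later dequeues the ack and catches up. */
  void processAck(long zxid) {
    lastAck = Math.max(lastAck, zxid);
  }

  /** Values handed to the sync task: {lastSyncedZxid, lastSeenConfig}. */
  long[] onSyncHistory() {
    return new long[] {lastAck, persistedLastConfig};
  }

  /** Invariant: last seen configuration <= last synced zxid. */
  static boolean invariantHolds(long[] snapshot) {
    return snapshot[1] <= snapshot[0];
  }

  public static void main(String[] args) {
    LeaderSnapshotSketch leader = new LeaderSnapshotSketch(5, 3);
    System.out.println(invariantHolds(leader.onSyncHistory())); // true: 3 <= 5
    leader.fsyncCop(6); // COP on disk, lastAck still 5
    System.out.println(invariantHolds(leader.onSyncHistory())); // false: 6 > 5
    leader.processAck(6); // ack finally processed
    System.out.println(invariantHolds(leader.onSyncHistory())); // true: 6 <= 6
  }
}
```

The window is the gap between `fsyncCop` and `processAck`: a SYNC_HISTORY that arrives inside it sees the two values out of step.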

This bug should not be hard to fix, and I'll add a patch today.