I found this bug in the pulsed integration test with multiple servers.
I'll try to explain it here:
First, synchronization has two phases: the first message from the follower is SYNC_HISTORY and the second is JOIN. For each message we construct a SyncPeerTask and pass it the last synced zxid and the last seen configuration. The invariant is that the last seen configuration is always <= the last synced zxid. It matters because at the end of synchronization the follower deletes all configuration files whose zxid is greater than the last zxid in its log, and the cleanup function has a sanity check that raises an exception if no configuration files remain after the cleanup.
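To make the cleanup step concrete, here is a minimal sketch of the check described above. It is not the project's actual code: the class and method names are hypothetical, and zxids are modelled as plain longs rather than the real zxid type.

```java
import java.util.List;
import java.util.TreeSet;

/** Hypothetical model of the follower-side cleanup; zxids are plain longs here. */
final class ConfigCleanupSketch {

  /**
   * Keeps only the configurations whose zxid is <= lastZxidInLog and returns them.
   * Throws if nothing survives, mirroring the sanity check in the cleanup function.
   */
  static TreeSet<Long> cleanup(List<Long> configZxids, long lastZxidInLog) {
    TreeSet<Long> kept = new TreeSet<>();
    for (long zxid : configZxids) {
      if (zxid <= lastZxidInLog) {
        kept.add(zxid);
      }
    }
    if (kept.isEmpty()) {
      throw new IllegalStateException(
          "no configuration left after cleanup, last zxid in log = " + lastZxidInLog);
    }
    return kept;
  }
}
```

In this model, a follower whose only configuration has a zxid greater than the last zxid in its log ends up with no configuration at all, which is the case where the exception is raised.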
During the test this exception was raised occasionally. I logged the output and found the reason: the invariant
the last seen configuration is always <= last sync zxid
is violated. The cause is subtle. The leader maintains a lastAck zxid, which is not necessarily up to date. When the leader receives a SYNC_HISTORY message, it constructs and launches the SyncPeerTask with last synced zxid = lastAck and last seen config = persistence.getLastSeenConfig. Every time the leader's SyncProposalProcessor fsyncs data to disk, it also puts a message in the leader's queue so that the leader can update its lastAck field. So if another follower is in the middle of joining, the leader may have fsynced the COP to its log but receive SYNC_HISTORY before lastAck has been updated. In that case the last seen configuration is already the new COP while lastAck still points to an earlier zxid, and the invariant is violated.
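The interleaving can be replayed deterministically in a small model. This is only a sketch under assumptions: the class, fields, and methods below are hypothetical stand-ins for the leader state, zxids are plain longs, and the two threads are simulated by calling their steps in the problematic order.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical single-threaded replay of the race; not the real leader code. */
final class StaleAckSketch {
  // Updated as soon as a COP is fsynced (stands in for persistence.getLastSeenConfig).
  long lastSeenConfigOnDisk = 0;
  // Updated only when the leader drains its message queue.
  long lastAck = 0;
  final Queue<Long> leaderQueue = new ArrayDeque<>();

  /** The sync processor's step: persist the zxid, then notify the leader via its queue. */
  void fsync(long zxid, boolean isCop) {
    if (isCop) {
      lastSeenConfigOnDisk = zxid;
    }
    leaderQueue.add(zxid);
  }

  /** The leader's step: handle one queued notification and advance lastAck. */
  void processAck() {
    Long zxid = leaderQueue.poll();
    if (zxid != null) {
      lastAck = zxid;
    }
  }

  /** Arguments the leader would hand to the sync task: {lastSyncedZxid, lastSeenConfig}. */
  long[] onSyncHistory() {
    return new long[] {lastAck, lastSeenConfigOnDisk};
  }

  static boolean invariantHolds(long[] taskArgs) {
    return taskArgs[1] <= taskArgs[0];
  }

  public static void main(String[] args) {
    StaleAckSketch leader = new StaleAckSketch();
    leader.fsync(4, false);
    leader.processAck();          // lastAck catches up to 4
    leader.fsync(5, true);        // COP is on disk, notification still queued
    System.out.println(invariantHolds(leader.onSyncHistory())); // SYNC_HISTORY arrives now
    leader.processAck();          // lastAck catches up to 5
    System.out.println(invariantHolds(leader.onSyncHistory()));
  }
}
```

In this replay the first SYNC_HISTORY sees lastAck = 4 with a last seen config of 5, so the check fails. Once the queued notification has been processed, both values are 5 and the check holds again.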
This bug should not be hard to fix, and I'll add a patch today.