Consider this situation, with only two servers:
Server A is in control, waiting on a timeOut for "longJob" due in 4 hours.
Server B adds a new "quickJob" due in 1 hour. Server B takes control and sets a timeOut for 1 hour.
Say that, for some unrelated reason, Server B crashes after 10 minutes.
What happens?
"quickJob" does not get executed until Server A's original timeOut expires, about 3 hours after it was due.
This happens because a server that thinks it is in control does not regularly check the dominator collection, so it doesn't know that another server has updated the job queue and taken control. It just sits waiting for its own timeOut to expire.
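The timing can be checked with a tiny model. This is only an illustrative sketch in Python, not the package's code: the function name and its parameters are hypothetical, all times are in minutes from the moment Server B adds "quickJob", and it assumes a server acts only when its own timeOut fires.

```python
def quick_job_run_minute(a_fires_at=240, b_fires_at=60, b_crashes_at=10, quick_due=60):
    """Minute at which "quickJob" finally runs when no server rechecks the dominator collection."""
    # Each server only acts when its own timeOut fires; a crashed server's timeOut never fires.
    surviving_timeouts = [a_fires_at]
    if b_crashes_at > b_fires_at:
        surviving_timeouts.append(b_fires_at)
    # The job can only run at the first surviving timeOut at or after its due time.
    return min(t for t in surviving_timeouts if t >= quick_due)


run_at = quick_job_run_minute()
print(run_at, run_at - 60)  # prints: 240 180 (minute of execution, minutes late)
```

With the numbers from the scenario, "quickJob" runs at minute 240 instead of minute 60. If Server B survives past its own timeOut, the same model gives minute 60.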
What should happen?
Server A should see that Server B has crashed, take back control of the job queue and set a timeOut for "quickJob".
If there were a third server, Server C would be checking regularly and would take control when it sees that no other server is in control. So this bug can only happen with two servers.
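One way Server A could take back control is to keep checking the dominator record even while it believes it is in control. The sketch below is a simulation in Python under stated assumptions, not the package's implementation: the `Dominator` fields, the `check` method, the 5-minute check interval and the 10-minute staleness threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Dominator:
    """Stand-in for the single record in the dominator collection (field names are assumed)."""
    server_id: str
    last_ping: int  # minute of the controlling server's most recent heartbeat


@dataclass
class Server:
    server_id: str
    wake_at: Optional[int] = None  # minute at which this server's timeOut fires

    def check(self, now, dominator, jobs, stale_after):
        """Periodic check that runs even when this server believes it is in control."""
        if dominator.server_id == self.server_id:
            dominator.last_ping = now  # still in control: refresh the heartbeat
        elif now - dominator.last_ping >= stale_after:
            # The recorded controller has gone quiet: take control back and
            # set a new timeOut for the earliest job in the queue.
            dominator.server_id = self.server_id
            dominator.last_ping = now
            self.wake_at = min(jobs.values())


def simulate(check_every=5, stale_after=10):
    jobs = {"longJob": 240, "quickJob": 60}   # minute at which each job is due
    a = Server("A", wake_at=240)              # A is still waiting on its old timeOut
    dominator = Dominator("B", last_ping=10)  # B took control, then crashed at minute 10
    for now in range(10, 241):
        if now % check_every == 0:
            a.check(now, dominator, jobs, stale_after)
        if a.wake_at is not None and now >= a.wake_at:
            due = sorted(name for name, t in jobs.items() if t <= now)
            return now, due                   # first minute at which A's timeOut fires
    return None


print(simulate())  # prints: (60, ['quickJob'])
```

In this model Server A notices at minute 20 that Server B's heartbeat is stale, takes control and sets a timeOut for minute 60, so "quickJob" runs on time. The trade-off is one extra read of the dominator record per check interval on the controlling server.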