wildhart / meteor.jobs

A simple job scheduler for Meteor.js
MIT License

Potential for no server to be in control #1

Closed: wildhart closed this issue 5 years ago

wildhart commented 5 years ago

Consider this situation, with only two servers:

  1. Server A is in control, waiting on a timeOut for "longJob", which is due in 4 hours.
  2. Server B adds a new "quickJob" due in 1 hour. Server B takes control and sets a timeOut for 1 hour.
  3. After 10 minutes, Server B crashes for some unrelated reason.

What happens? "quickJob" is not executed until Server A's original timeOut expires, long after it was due.

This happens because a server that thinks it is in control does not regularly check the dominator collection. It therefore never learns that another server has updated the job queue and taken control; it just sits waiting for its own timeOut to expire.

What should happen? Server A should see that Server B has crashed, take back control of the job queue and set a timeOut for "quickJob".
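
The recovery described above could be sketched roughly as follows. This is only an illustration, not the actual meteor.jobs code: the function names, the `{ serverId, lastPing }` shape of the dominator document, and the 5-minute `MAX_WAIT_MS` threshold are all assumptions, and an in-memory object stands in for the dominator collection. It also assumes the controlling server refreshes `lastPing` while it is alive.

```javascript
// Hypothetical sketch: none of these names are taken from meteor.jobs.
// `state.dominator` stands in for the single document in the dominator
// collection: { serverId, lastPing }, with lastPing a millisecond timestamp.

const MAX_WAIT_MS = 5 * 60 * 1000; // assumed: a controller silent this long is treated as dead

// Decide whether server `myId` should take (or take back) control at time `now`.
function shouldTakeControl(dominator, myId, now, maxWaitMs = MAX_WAIT_MS) {
  if (!dominator) return true;                   // nobody is in control
  if (dominator.serverId === myId) return false; // this server is already in control
  return now - dominator.lastPing > maxWaitMs;   // the controller has gone quiet
}

// Delay until the earliest pending job is due, or null if the queue is empty.
function nextTimeoutMs(jobs, now) {
  if (jobs.length === 0) return null;
  const earliestDue = Math.min(...jobs.map((job) => job.due));
  return Math.max(0, earliestDue - now);
}

// One pass of the periodic check. Every server runs it, including one that
// set a long timeOut earlier and has since lost control.
function checkControl(state, myId, now) {
  if (!shouldTakeControl(state.dominator, myId, now)) {
    return { tookControl: false, timeoutMs: null };
  }
  state.dominator = { serverId: myId, lastPing: now };
  return { tookControl: true, timeoutMs: nextTimeoutMs(state.jobs, now) };
}

// The scenario from this issue, with times in minutes since Server B added "quickJob".
const MINUTE = 60 * 1000;
const state = {
  dominator: { serverId: 'B', lastPing: 10 * MINUTE }, // B last seen just before crashing
  jobs: [
    { name: 'longJob', due: 240 * MINUTE },
    { name: 'quickJob', due: 60 * MINUTE },
  ],
};

console.log(checkControl(state, 'A', 12 * MINUTE)); // B was seen recently, so A stays passive
console.log(checkControl(state, 'A', 16 * MINUTE)); // B has been silent too long, so A takes over
```

With a check like this running on an interval, Server A would notice the stale `lastPing`, take back control, and schedule its next timeOut for "quickJob" rather than "longJob". In a real multi-server deployment the take-over write would need to be atomic in the database so that two servers cannot both claim control; this sketch ignores that.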

If there were a third server, Server C would be checking regularly and would take control as soon as it saw that no other server was in control. So this bug can only happen with 2 servers.