xtuml / munin

JM / AEO race condition on worker registration #224

Closed cortlandstarrett closed 3 months ago

cortlandstarrett commented 3 months ago

As part of #219 (Application Messaging Overrun), we run a scenario where audit events are already waiting in the message broker when the application launches. This leads to the following race condition.

The worker then deregisters and re-registers and sometimes syncs back up.

cortlandstarrett commented 3 months ago

I note that JobManager.selectWorkerForJob does not check the 'working' attribute. Checking it would add a further condition that a worker must satisfy before a job is assigned to it.
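
A minimal sketch of that extra check, in Python rather than the project's own modeling language. Everything here is hypothetical: the `Worker` fields, the least-loaded tie-break, and the reading of 'working' as a boolean meaning "ready to receive events" are assumptions for illustration, not the actual JobManager model.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Worker:
    # Hypothetical stand-in for a registered worker; field names are assumed.
    worker_id: str
    working: bool = False   # assumed meaning: worker is ready to receive events
    assigned_jobs: int = 0


def select_worker_for_job(workers: Iterable[Worker]) -> Optional[Worker]:
    """Pick the least-loaded worker, skipping any that is not 'working'."""
    # The added condition: only workers flagged as working are candidates.
    candidates = [w for w in workers if w.working]
    if not candidates:
        # No eligible worker yet; the caller would have to defer the job.
        return None
    return min(candidates, key=lambda w: w.assigned_jobs)
```

With this guard, a worker that has registered but is not yet flagged as working is never handed a job, so the caller has to handle the `None` case by holding the job until a worker becomes eligible.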

cortlandstarrett commented 3 months ago

In worker state 'Registered', the heartbeat timer could be set to 0 seconds rather than the full duration, so that the very first heartbeat occurs almost immediately. JM could then use that heartbeat to move the EmployedWorker state machine along.
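
To illustrate the timing difference, here is a small Python sketch that only computes when heartbeats would fire. The function name, the 5-second period, and the 12-second window are made up for the example and do not come from the worker model.

```python
from typing import List


def heartbeat_times(period: float, horizon: float, first_delay: float) -> List[float]:
    """Return the times (in seconds) at which heartbeats fire up to 'horizon'."""
    times = []
    t = first_delay
    while t <= horizon:
        times.append(t)
        t += period  # every heartbeat after the first uses the full period
    return times


# First timer set to the full duration: nothing is sent until t = 5.
print(heartbeat_times(period=5.0, horizon=12.0, first_delay=5.0))  # [5.0, 10.0]

# First timer set to 0: a heartbeat goes out as soon as the worker registers.
print(heartbeat_times(period=5.0, horizon=12.0, first_delay=0.0))  # [0.0, 5.0, 10.0]
```

The only change is the initial delay; the steady-state heartbeat rate stays the same.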

cortlandstarrett commented 3 months ago

{"timestamp":"2024-06-11T18:00:44.339Z","payload":{"eventId":"","eventName":"","jobId":"","jobName":"","message":"received event for unregistered worker","tag":"aeordering_rcvd_unregistered","workerId":"1ab1cf0b-7a4e-4466-a9eb-d07e8bac13f7"}}

cortlandstarrett commented 3 months ago

Fixed with #225.