xtuml / munin

Apache License 2.0

quicker unhappy jobs #233

Open cortlandstarrett opened 1 week ago

cortlandstarrett commented 1 week ago

At present, an unhappy job finishes only when the hanging job timer expires. The rationale is to allow sufficient time for a critical event to arrive, which triggers an alarm condition. This makes sense, but it has a few problems:

  1. It reuses a timer that is intended for other purposes.
  2. It is not configurable separately from the hanging job.
  3. It is a long timer, which means that unhappy jobs are held in memory for a long time. This increases the number of concurrent jobs and could become a memory risk. At 50 jobs per second and a 30 second hanging job timer, as many as 1500 jobs could be waiting to end. This impacts our max jobs per worker setting.
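To make the arithmetic in item 3 concrete, here is a minimal sketch. It assumes the worst case where every arriving job is unhappy and is held for the full timer duration; the function name is made up for illustration, and the 2 second value is a hypothetical shorter timer, not a proposed setting.

```python
def jobs_waiting_to_end(jobs_per_second: int, unhappy_timeout_seconds: int) -> int:
    """Steady-state number of unhappy jobs held in memory.

    Worst-case assumption: every job is unhappy, so the count is simply
    arrival rate multiplied by how long each job is held.
    """
    return jobs_per_second * unhappy_timeout_seconds


# Figures from the issue: 50 jobs/s held for the 30 s hanging job timer.
print(jobs_waiting_to_end(50, 30))  # 1500

# Hypothetical 2 s timer, for comparison only.
print(jobs_waiting_to_end(50, 2))  # 100
```

The count scales linearly with the timer, so shortening the timer is the direct lever on how many jobs are held.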

It might be good to add a separate configuration value for this timer. Another option is to use the intra-event timer, which can be very short.
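One way a separate configuration value could behave is sketched below. This is not munin's actual configuration schema: the class and field names (`TimerConfig`, `hanging_job_timeout_s`, `unhappy_job_timeout_s`) are hypothetical, and falling back to the hanging job timer when the new value is unset is an assumption made to preserve current behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TimerConfig:
    # Hypothetical field names; 30 s matches the example figure in this issue.
    hanging_job_timeout_s: float = 30.0
    # None means "not configured": fall back to the hanging job timer.
    unhappy_job_timeout_s: Optional[float] = None

    def effective_unhappy_timeout(self) -> float:
        """Return the timeout used to end an unhappy job."""
        if self.unhappy_job_timeout_s is None:
            return self.hanging_job_timeout_s
        return self.unhappy_job_timeout_s


default_cfg = TimerConfig()
tuned_cfg = TimerConfig(unhappy_job_timeout_s=2.0)
print(default_cfg.effective_unhappy_timeout())  # 30.0
print(tuned_cfg.effective_unhappy_timeout())    # 2.0
```

With a fallback like this, existing deployments would keep their behavior until the new value is set explicitly.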

Another thought is to let the unhappy job finish quickly but still detect critical events in the Job Gone Horribly Wrong state, which is entered if a "stray event" from a previous job arrives.

cortlandstarrett commented 1 week ago

This implies that a transition from JobFailed to HorriblyWrong is needed, so that a failed job can be upgraded to alarm when a critical event arrives after failure.
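The proposed transition can be sketched as a small state function. This is an illustration only, not the MASL state model in munin: the enum, the function name, and the `is_critical` flag are hypothetical, and leaving a failed job in place when a non-critical late event arrives is an assumption.

```python
from enum import Enum, auto


class JobState(Enum):
    IN_PROGRESS = auto()
    JOB_FAILED = auto()
    HORRIBLY_WRONG = auto()


def on_late_event(state: JobState, is_critical: bool) -> JobState:
    """Return the next state when an event arrives for a job.

    Only the transition discussed in this issue is modelled: a job that has
    already failed is upgraded to the alarm state by a critical event.
    """
    if state is JobState.JOB_FAILED and is_critical:
        return JobState.HORRIBLY_WRONG
    # Assumption: any other combination leaves the state unchanged.
    return state


print(on_late_event(JobState.JOB_FAILED, is_critical=True).name)   # HORRIBLY_WRONG
print(on_late_event(JobState.JOB_FAILED, is_critical=False).name)  # JOB_FAILED
```

With this transition in place, the unhappy job can fail quickly while a later critical event still raises the alarm.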

cortlandstarrett commented 1 day ago

For 1.4.0, the timer to end an unhappy job shifted from the JobHanging timer to the InvariantLoad timer.

In the future, a separate, purpose-specific timer should be added.