Using the "hold" scheme on both machines may cause deadlock. Two ways to prevent it:
Spatial way (restrict system utilization)
Allow only x% of the total nodes to be held. A job whose hold would push the held nodes over this limit becomes a yielding job instead.
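A minimal sketch of the spatial check, assuming a job is described by a node count and a mode flag. The names Job, apply_hold_cap, and max_hold_percent are hypothetical and used only for illustration; they are not taken from any real scheduler.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int          # nodes the job wants to hold
    mode: str = "hold"  # "hold" or "yield"


def apply_hold_cap(job, held_nodes, total_nodes, max_hold_percent):
    """Downgrade a holding job to yielding if its hold would exceed the cap.

    held_nodes is the number of nodes already held by other jobs.
    max_hold_percent is the x in "only x% of total nodes may be held".
    """
    # Integer comparison avoids floating-point rounding at the boundary.
    would_hold = held_nodes + job.nodes
    if job.mode == "hold" and would_hold * 100 > max_hold_percent * total_nodes:
        job.mode = "yield"
    return job.mode
```

For example, with 100 nodes, a 20% cap, and 15 nodes already held, a 10-node job becomes a yielding job, while a 5-node job may still hold because it lands exactly on the limit.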
Temporal way (temporary yielding)
Add a hold threshold: if a job has been holding for more than THRESH hours, it gives up its resources for one scheduling iteration so that other jobs can use them. (If no other job takes the resources, it holds them again.) This also helps in the scenario where holding a sub-partition blocks a large job for a long time, even without any backfilling.
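A rough sketch of one scheduling iteration under the temporal rule. The names HeldJob, run_iteration, and waiting_requests are hypothetical. Two points are assumptions rather than part of the notes: another job "takes" the resources if its node request fits in the released nodes, and the hold timer restarts when the job holds again.

```python
from dataclasses import dataclass


@dataclass
class HeldJob:
    name: str
    nodes: int             # nodes currently held
    hold_start: float      # time the current hold began, in hours
    holding: bool = True


def run_iteration(job, now, thresh_hours, waiting_requests):
    """Apply the hold threshold for one scheduling iteration.

    waiting_requests lists the node counts requested by other waiting jobs.
    Returns "holding", "yielded", or "re-held".
    """
    if now - job.hold_start <= thresh_hours:
        return "holding"  # threshold not exceeded, keep the hold

    # Threshold exceeded: release the nodes for this iteration only.
    for request in waiting_requests:
        if request <= job.nodes:  # assumption: a fitting job takes the nodes
            job.holding = False
            return "yielded"

    # Nobody took the nodes, so the job holds them again.
    job.hold_start = now  # assumption: the timer restarts on re-hold
    return "re-held"
```

Calling run_iteration once per scheduling pass means a long hold is interrupted at most once every THRESH hours, which bounds how long a held sub-partition can block a larger job.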