world-federation-of-advertisers / cross-media-measurement

Mills cannot claim tasks due to failed Computation that is accidentally enqueued #1626

Closed renjiezh closed 1 month ago

renjiezh commented 1 month ago

Describe the bug During a stress test in halo-cmm-dev, this issue appeared when multiple mills were running and the mill lock duration was short. According to the logs, all mills were trying to claim a task but none succeeded. As a result, the entire duchy halted and no progress could be made.

Steps to reproduce

  1. In the k8s configuration, set the mill lock duration to a small value, e.g., 30 seconds.
  2. Create more than one mill replica in a duchy.
  3. Deploy in a cloud environment and create an LLv2 R/F measurement.

Component(s) affected Duchy

Version HEAD, v0.5.x, v0.4.x

Environment halo-cmm-dev

Additional context This bug can stop all mills from claiming tasks, halting the entire system.

Root cause:

  1. When the number of stage attempts exceeds the limit, the mill should return immediately after failComputation() instead of processing the stage one more time.
  2. enqueueComputation should not enqueue Computations that are in an ending stage.

Explanation: Without the return, the mill keeps processing the stage even though failComputation has been called. The processing eventually fails with a token version mismatch, because failComputation has already updated the token to the ending stage. While handling the mismatch in handleException, the mill fetches the latest token and uses it to handle the exception. (Note: the logic here is a bit off, because the exception was caused by the previous token, and the latest token might be in a different state.) handleException checks the number of attempts. However, the latest token is in the ending stage that failComputation moved it to, so handleException sees an attempt count of 1 rather than one exceeding the limit. It then enqueues the Computation in its ending stage, and all mills try to claim it. Claiming the failed Computation never succeeds, so the system cannot recover without manual intervention.
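The sequence above can be reproduced with a small in-memory model. This is only a sketch in Python, not the actual mill code: `FakeDuchyDb`, `Token`, `MAX_ATTEMPTS`, the stage names, and the `return_after_fail` / `guard_enqueue` switches are all hypothetical stand-ins, and the real attempt limit and token fields may differ. It assumes, as described above, that failComputation resets the attempt count to 1 and bumps the token version.

```python
from dataclasses import dataclass

MAX_ATTEMPTS = 3            # hypothetical attempt limit
ENDING_STAGE = "COMPLETE"   # stand-in for the protocol's ending stage


@dataclass
class Token:
    stage: str
    attempt: int
    version: int


class FakeDuchyDb:
    """In-memory stand-in for the duchy's computation storage."""

    def __init__(self, guard_enqueue: bool) -> None:
        # Start with a computation whose stage attempts already exceed the limit.
        self.token = Token("SETUP", attempt=MAX_ATTEMPTS + 1, version=1)
        self.enqueued = False
        self.guard_enqueue = guard_enqueue

    def fail_computation(self) -> None:
        # Moves the computation to the ending stage; attempts restart at 1.
        self.token = Token(ENDING_STAGE, attempt=1, version=self.token.version + 1)
        self.enqueued = False

    def enqueue(self, token: Token) -> None:
        # Root cause 2: an ending-stage computation must never be enqueued.
        if self.guard_enqueue and token.stage == ENDING_STAGE:
            return
        self.enqueued = True

    def write(self, token: Token) -> None:
        if token.version != self.token.version:
            raise RuntimeError("token version mismatch")


def handle_exception(db: FakeDuchyDb) -> None:
    latest = db.token  # the latest token, not the one that caused the error
    if latest.attempt > MAX_ATTEMPTS:
        db.fail_computation()
    else:
        db.enqueue(latest)  # attempt is 1 here, so the computation is retried


def process(db: FakeDuchyDb, return_after_fail: bool) -> None:
    token = db.token
    if token.attempt > MAX_ATTEMPTS:
        db.fail_computation()
        if return_after_fail:
            return  # Root cause 1: stop here instead of processing again.
    try:
        db.write(token)  # uses the stale token after fail_computation
    except RuntimeError:
        handle_exception(db)


def run(return_after_fail: bool, guard_enqueue: bool) -> bool:
    """Returns True if the computation ends up enqueued in the ending stage."""
    db = FakeDuchyDb(guard_enqueue)
    process(db, return_after_fail)
    return db.enqueued and db.token.stage == ENDING_STAGE


if __name__ == "__main__":
    print("buggy flow stuck:       ", run(False, False))
    print("early return stuck:     ", run(True, False))
    print("enqueue guard only stuck:", run(False, True))
```

In this model either change alone avoids the stuck state, which suggests the two fixes are complementary: the early return removes the path that reaches handleException, and the enqueue guard protects against any other path that might hand an ending-stage token to enqueueComputation.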

Diagnosis: Run the following SQL query against the duchy's database:

SELECT *
FROM Computations
WHERE ComputationStage=13 AND LockExpirationTime IS NOT NULL

If the query returns any rows, the issue has occurred.

Manual solution: Run the following SQL query against the duchy's database:

UPDATE Computations
SET LockExpirationTime = NULL
WHERE (Protocol=1 AND ComputationStage=13 AND LockExpirationTime IS NOT NULL)
   OR (Protocol=2 AND ComputationStage=9 AND LockExpirationTime IS NOT NULL)

This query clears the lock on the affected Computations, which resolves the abnormal state.
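To try the diagnosis and the manual fix safely before touching a real duchy database, they can be exercised against a throwaway table. The sketch below uses Python's built-in sqlite3 module purely as a stand-in. The table is reduced to the columns used in the queries above, and `ComputationId`, the sample rows, and the timestamps are made up for illustration. The stage and protocol numbers are taken from the UPDATE statement above.

```python
import sqlite3

# Same condition as the manual-fix UPDATE above: ending-stage rows that still hold a lock.
STUCK = (
    "(Protocol=1 AND ComputationStage=13 AND LockExpirationTime IS NOT NULL) "
    "OR (Protocol=2 AND ComputationStage=9 AND LockExpirationTime IS NOT NULL)"
)


def find_stuck(conn: sqlite3.Connection) -> list[int]:
    """Returns ids of ending-stage computations that are still enqueued."""
    cur = conn.execute(
        f"SELECT ComputationId FROM Computations WHERE {STUCK} ORDER BY ComputationId"
    )
    return [row[0] for row in cur.fetchall()]


def release_stuck(conn: sqlite3.Connection) -> int:
    """Clears the lock on stuck computations and returns how many rows changed."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            f"UPDATE Computations SET LockExpirationTime = NULL WHERE {STUCK}"
        )
        return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Computations ("
    "ComputationId INTEGER PRIMARY KEY, Protocol INTEGER, "
    "ComputationStage INTEGER, LockExpirationTime TEXT)"
)
conn.executemany(
    "INSERT INTO Computations VALUES (?, ?, ?, ?)",
    [
        (1, 1, 13, "2024-01-01T00:00:30Z"),  # ending stage but enqueued: stuck
        (2, 1, 13, None),                    # ending stage, not enqueued: fine
        (3, 2, 9, "2024-01-01T00:00:30Z"),   # ending stage but enqueued: stuck
        (4, 1, 5, "2024-01-01T00:00:30Z"),   # still in progress: lock is legitimate
    ],
)

before = find_stuck(conn)
released = release_stuck(conn)
after = find_stuck(conn)
print(before, released, after)
```

Checking the affected row count against the diagnosis result before and after the update is a cheap way to confirm that only ending-stage rows were touched and that in-progress computations kept their locks.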

SanjayVas commented 1 month ago

Fixed by #1626