Open renjiezh opened 2 months ago
This is caused by the Spanner implementation of claimTask. The read (querying unclaimed tasks) and the write (claiming the task) are not bound in one transaction, so when multiple entities call claimTask concurrently, two of them can read the same task as unclaimed and both claim it.
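To illustrate the idea of binding the read and the write together, here is a minimal in-memory sketch in Python. It is not the Duchy's actual code and does not use the Spanner client: `TaskStore`, `claim_task`, and `run_claimers` are hypothetical names, and a `threading.Lock` stands in for a single read-write transaction.

```python
import threading
from typing import Dict, List, Optional


class TaskStore:
    """Hypothetical in-memory stand-in for a table of claimable tasks."""

    def __init__(self, task_ids: List[str]) -> None:
        # Maps task id -> owner; None means the task is unclaimed.
        self._owners: Dict[str, Optional[str]] = {t: None for t in task_ids}
        # The lock models the boundary of one read-write transaction.
        self._txn = threading.Lock()

    def claim_task(self, owner: str) -> Optional[str]:
        """Find an unclaimed task and assign it to owner in one atomic step."""
        with self._txn:
            for task_id, current in self._owners.items():
                if current is None:
                    # The read above and this write happen under the same
                    # lock, so no other caller can claim the task in between.
                    self._owners[task_id] = owner
                    return task_id
        return None


def run_claimers(store: TaskStore, n_workers: int) -> List[str]:
    """Run n_workers threads that claim tasks until none remain."""
    claimed: List[str] = []
    claimed_lock = threading.Lock()

    def worker(name: str) -> None:
        while True:
            task = store.claim_task(name)
            if task is None:
                return
            with claimed_lock:
                claimed.append(task)

    threads = [
        threading.Thread(target=worker, args=(f"mill-{i}",))
        for i in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claimed
```

If the lock were dropped between the scan and the assignment, two workers could both see the same task as unclaimed, which is the race described above. A single coarse lock like this one serializes every claimer, so a real database transaction would likely need narrower scoping to limit contention.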
Fixed by #1726
Reopening this as #1726 may have introduced a lock contention issue.
Describe the bug Two mill jobs claimed the same Computation. One was newly spawned by the mill scheduler; the other was a continuing mill job. As a result, the latter mill job failed the Computation after finishing its stage, due to a stage mismatch.
Steps to reproduce Run a stress test with multiple data services. The issue reproduces intermittently.
Component(s) affected Duchy
Version v0.5.7-rc2
Environment QA env
Additional context Happened on worker 1 with global ComputationID: DaTIZfrdJI4