Recently we have seen processes that were stuck in status CREATED or RESUMED. We have also seen tasks in Flower that were hanging in status STARTED, but we could not find the related process id in the orchestrator.
Presumably this happens because the process is first updated in the db (to the CREATED/RESUMED state) and the Celery task is only triggered afterwards. If triggering the task fails for some reason, the process stays stuck because no worker will ever pick it up. (This cause is still conjecture at the time of writing this report.)
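The suspected ordering problem can be reproduced in isolation. The sketch below is not the orchestrator's code: it uses an in-memory SQLite table, a hypothetical `trigger_task` stand-in that always fails, and an invented initial status `NEW`, purely to show how a commit-then-trigger sequence can leave a row in CREATED with no task behind it.

```python
import sqlite3


class BrokerUnavailable(Exception):
    """Hypothetical stand-in for a failure to publish the Celery task."""


def trigger_task(process_id):
    # Simulates the broker being unreachable at publish time.
    raise BrokerUnavailable(f"could not publish task for process {process_id}")


def start_process_commit_first(conn, process_id):
    # Step 1: the status update is committed on its own.
    conn.execute(
        "UPDATE processes SET status = 'CREATED' WHERE id = ?", (process_id,)
    )
    conn.commit()
    # Step 2: the task is triggered afterwards. If this raises, the
    # committed status is left behind and no worker knows about the process.
    trigger_task(process_id)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processes (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO processes (id, status) VALUES (1, 'NEW')")
conn.commit()

try:
    start_process_commit_first(conn, 1)
except BrokerUnavailable:
    pass

stuck_status = conn.execute(
    "SELECT status FROM processes WHERE id = 1"
).fetchone()[0]
print(stuck_status)
```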
Possible solution(s)
There are multiple improvements we can make to the current flow.
Don't commit the process status update to the database if the task has not been acknowledged by Celery. Refactor celery._start_process (and similar functions) to trigger the Celery task inside a transaction and roll back the transaction if this fails.
Tweak the retry options used when starting a Celery task.
Make the Redis connection more resilient in general. The redis-py library has options to automatically reconnect/retry, but these don't seem to be enabled by default.
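A minimal sketch of the first two ideas, under stated assumptions: the database is an in-memory SQLite table, `publish` is a hypothetical callable standing in for the Celery task trigger, and `BrokerUnavailable`, `publish_with_retry`, `start_process` and the initial status `NEW` are illustrative names, not the orchestrator's API. In a real setup the retry would more likely come from Celery's own `retry` and `retry_policy` options on `apply_async` than from a hand-written loop.

```python
import sqlite3
import time


class BrokerUnavailable(Exception):
    """Hypothetical stand-in for a failure to publish the Celery task."""


def publish_with_retry(publish, process_id, attempts=3, delay=0.0):
    # Retry the publish a few times with exponential backoff before giving up.
    for attempt in range(1, attempts + 1):
        try:
            return publish(process_id)
        except BrokerUnavailable:
            if attempt == attempts:
                raise
            time.sleep(delay * 2 ** (attempt - 1))


def start_process(conn, process_id, publish):
    # The status update and the publish share one transaction:
    # the update is only committed once the publish has succeeded.
    try:
        conn.execute(
            "UPDATE processes SET status = 'CREATED' WHERE id = ?", (process_id,)
        )
        publish_with_retry(publish, process_id)
        conn.commit()
    except Exception:
        conn.rollback()
        raise


def read_status(conn, process_id):
    row = conn.execute(
        "SELECT status FROM processes WHERE id = ?", (process_id,)
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processes (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO processes (id, status) VALUES (1, 'NEW')")
conn.execute("INSERT INTO processes (id, status) VALUES (2, 'NEW')")
conn.commit()


def always_down(process_id):
    raise BrokerUnavailable("broker unreachable")


calls = {"count": 0}


def flaky(process_id):
    # Fails twice, then succeeds on the third attempt.
    calls["count"] += 1
    if calls["count"] < 3:
        raise BrokerUnavailable("broker unreachable")
    return "task-id"


try:
    start_process(conn, 1, always_down)
except BrokerUnavailable:
    pass
status_after_failure = read_status(conn, 1)

start_process(conn, 2, flaky)
status_after_retry = read_status(conn, 2)
```

One trade-off of this ordering: because the task is published before the commit, a fast worker could pick it up before the status change is visible in the database, so the worker side would need to tolerate that.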
Version
1.3.0
What python version are you seeing the problem on?
All