Open alexef opened 3 years ago
The issue that I usually see is the agent never timing out, and the build stuck in a running state.
You see something different, which I think is actually more desirable, since Drone's job is to run the job even if agents die: it just assigns the job to a new agent.
This breaks flows where we assume the build number to be unique
I think the build number is unique in a different sense: unique to incoming events that trigger the build. If the agent dies, the job is still a valid work item to be done.
I would say this is by design, and probably should be like this.
What issue is it causing for you?
I see your point, makes sense.
In our case, the issue is with the order of builds and deployments:
This is why we would prefer build1 to time out AND die. This was the behaviour in Drone 0.8.
Another way around this would be (if the agent sends a heartbeat) to detect that the agent died, and re-schedule the build1 on a new agent before the 60 minutes of timeout pass.
Oops, that is a real issue.
This was the behaviour in Drone 0.8.
I actually didn't remember how Drone behaved in this situation. Or to be more precise, I could never pin it down, as sometimes it was flaky in case of agent restarts.
Another way around this would be (if the agent sends a heartbeat) to detect that the agent died, and re-schedule the build1 on a new agent before the 60 minutes of timeout pass.
This could work, and I remember code trying to achieve this. I think the issue could be closed if restarts are confirmed/fixed to be handled as you proposed.
Currently the agent reports to the server whether a running workflow is still alive: https://github.com/woodpecker-ci/woodpecker/blob/41b2127e042cb7b61c95ac57852be40a3f7d97f5/agent/runner.go#L121-L125 https://github.com/woodpecker-ci/woodpecker/blob/41b2127e042cb7b61c95ac57852be40a3f7d97f5/server/queue/fifo.go#L190-L200
If the agent misses this report, the queue reschedules the workflow: https://github.com/woodpecker-ci/woodpecker/blob/41b2127e042cb7b61c95ac57852be40a3f7d97f5/server/queue/fifo.go#L329-L337
Also, this check currently happens too slowly: the agent reports to the queue every minute, and the workflow is only rescheduled after 10 minutes. This will be changed by #4114.
Side note: under normal circumstances a running workflow should never expire. This can only happen if something went horribly wrong on the agent side, and in that case the agent bug should be fixed.
So I'm not sure whether we should add a flag that prevents the resubmit, or instead just make the agent more robust.
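To make the mechanism above easier to follow, here is a minimal sketch of a queue where the agent's report extends a deadline and a periodic sweep reschedules anything whose deadline has passed. This is not the actual Woodpecker code: `leaseQueue`, `Extend`, `Resubmit` and `scenario` are hypothetical names, and only the 10-minute window is taken from the description above.

```go
package main

import (
	"fmt"
	"time"
)

// leaseQueue is a toy in-memory queue (hypothetical, not Woodpecker's fifo).
// pending holds workflow IDs waiting for an agent; running maps each
// leased workflow ID to the deadline by which the agent must report again.
type leaseQueue struct {
	extension time.Duration
	pending   []string
	running   map[string]time.Time
}

func newLeaseQueue(extension time.Duration) *leaseQueue {
	return &leaseQueue{extension: extension, running: map[string]time.Time{}}
}

// Push adds a workflow to the pending list.
func (q *leaseQueue) Push(id string) { q.pending = append(q.pending, id) }

// Poll hands the oldest pending workflow to an agent and starts its lease.
func (q *leaseQueue) Poll(now time.Time) (string, bool) {
	if len(q.pending) == 0 {
		return "", false
	}
	id := q.pending[0]
	q.pending = q.pending[1:]
	q.running[id] = now.Add(q.extension)
	return id, true
}

// Extend is the heartbeat: the agent calls it to push the deadline out.
func (q *leaseQueue) Extend(id string, now time.Time) error {
	if _, ok := q.running[id]; !ok {
		return fmt.Errorf("workflow %q is not running", id)
	}
	q.running[id] = now.Add(q.extension)
	return nil
}

// Resubmit moves every workflow whose deadline has passed back to pending
// and returns the IDs it rescheduled. The ID itself is not changed.
func (q *leaseQueue) Resubmit(now time.Time) []string {
	var expired []string
	for id, deadline := range q.running {
		if now.After(deadline) {
			expired = append(expired, id)
		}
	}
	for _, id := range expired {
		delete(q.running, id)
		q.pending = append(q.pending, id)
	}
	return expired
}

// scenario simulates an agent that reports every minute for five minutes
// and then dies. It returns what the sweep reschedules at minute 12 and 16.
func scenario() (early, late []string) {
	start := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	q := newLeaseQueue(10 * time.Minute)
	q.Push("workflow-1")
	q.Poll(start)
	for m := 1; m <= 5; m++ {
		if err := q.Extend("workflow-1", start.Add(time.Duration(m)*time.Minute)); err != nil {
			panic(err)
		}
	}
	early = q.Resubmit(start.Add(12 * time.Minute)) // deadline is minute 15: nothing expires
	late = q.Resubmit(start.Add(16 * time.Minute))  // no report since minute 5: expired
	return early, late
}

func main() {
	early, late := scenario()
	fmt.Println(len(early), late) // prints "0 [workflow-1]"
}
```

Note that in this sketch the rescheduled workflow goes back into the queue under the same ID, which mirrors what is reported in this issue: the second run reuses the original build number.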
Related issues: https://github.com/woodpecker-ci/woodpecker/issues/178 https://github.com/woodpecker-ci/woodpecker/issues/33
Our problem:
This breaks flows where we assume the build number to be unique (as Restarting from the UI normally generates a new build number).
Any chance this can be configured?