Agent interrupted builds restarted - is this desired/configurable?

woodpecker-ci / woodpecker

Agent interrupted builds restarted - is this desired/configurable? #220

Open alexef opened 3 years ago

alexef commented 3 years ago

Our problem:

- a build can take 10-15 minutes
- agents run as Kubernetes pods
- if an agent dies abruptly (no graceful termination, for instance because the node instantly goes away) while a build is running, the build stays in a "running" state until the timeout kicks in (60 minutes by default)
- instead of finishing as a failure (timeout), the build is rescheduled on a different agent and starts over; it keeps the same build number, and its steps run again as the same "original" build

This breaks flows where we assume the build number to be unique (restarting from the UI normally generates a new build number).

Any chance this can be configured?

laszlocph commented 3 years ago

The issue that I usually see is the agent never timing out, and the build stuck in a running state.

You see something different, which I think is actually more desirable, since Drone's job is to run the job even if agents die. It just assigns the job to a new agent.

> This breaks flows where we assume the build number to be unique

I think the build number is unique in a different sense: unique to incoming events that trigger the build. If the agent dies, the job is still a valid work item to be done.

I would say this is by design, and probably should be like this.

What issue is it causing for you?

alexef commented 3 years ago

I see your point, makes sense.

In our case, the issue is with the order of builds and deployments:

- build1 starts at t0 and is affected by this issue; after t0+60 minutes it is restarted, and this time it finishes
- build2 (containing newer code than build1) starts at t0+1, is not affected, finishes and deploys at t0+2
- build1 now finishes and deploys again, essentially rolling back build2, which was already deployed while build1 was hanging

This is why we would prefer build1 to time out AND die. This was the behaviour in drone 0.8.
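The ordering problem above can be replayed in a small simulation. This is only a sketch: the `run` type, `lastDeployed` and `scenario` are hypothetical names made up for illustration, and the 15-minute duration of the restarted build is an assumption taken from the "10-15 minutes" figure earlier in the thread. It compares which build ends up live when a timed-out build is restarted versus when it simply dies.

```go
package main

import (
	"fmt"
	"sort"
)

// run is one execution of a build that ends with a deploy.
// Hypothetical type for illustration, not part of Woodpecker.
type run struct {
	Build  int // build number
	Finish int // minutes after t0 at which the run finishes and deploys
}

// lastDeployed returns the build number that is live after every run has
// deployed, in order of finish time (the last deploy wins).
func lastDeployed(runs []run) int {
	sorted := append([]run(nil), runs...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Finish < sorted[j].Finish })
	live := 0
	for _, r := range sorted {
		live = r.Build
	}
	return live
}

// scenario models the timeline from the thread: build2 deploys at t0+2 while
// build1 hangs until the 60 minute timeout.
func scenario(restartOnTimeout bool) int {
	runs := []run{{Build: 2, Finish: 2}}
	if restartOnTimeout {
		// build1 is rescheduled at t0+60 under the same build number and
		// finishes an assumed 15 minutes later.
		runs = append(runs, run{Build: 1, Finish: 75})
	}
	// If build1 times out and dies instead, it never deploys.
	return lastDeployed(runs)
}

func main() {
	fmt.Println(scenario(true))  // 1: the older build1 ends up live
	fmt.Println(scenario(false)) // 2: the newer build2 stays live
}
```

With the restart, the older code from build1 is what ends up deployed; with "time out and die", build2 stays live.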

Another way around this would be (if the agent sends a heartbeat) to detect that the agent died, and re-schedule the build1 on a new agent before the 60 minutes of timeout pass.

laszlocph commented 3 years ago

Oops, that is a real issue.

> This was the behaviour in drone 0.8.

I actually didn't remember how Drone behaved in this situation. Or to be more precise, I could never pin it down, as sometimes it was flaky in case of agent restarts.

> Another way around this would be (if the agent sends a heartbeat) to detect that the agent died, and re-schedule the build1 on a new agent before the 60 minutes of timeout pass.

This could work, and I remember code trying to achieve this. I think the issue could be closed if restarts are confirmed/fixed to be handled as you proposed.

6543 commented 1 month ago

Currently the agent reports to the server whether a running workflow is still alive: https://github.com/woodpecker-ci/woodpecker/blob/41b2127e042cb7b61c95ac57852be40a3f7d97f5/agent/runner.go#L121-L125 https://github.com/woodpecker-ci/woodpecker/blob/41b2127e042cb7b61c95ac57852be40a3f7d97f5/server/queue/fifo.go#L190-L200

If the agent misses this, the queue will reschedule the workflow: https://github.com/woodpecker-ci/woodpecker/blob/41b2127e042cb7b61c95ac57852be40a3f7d97f5/server/queue/fifo.go#L329-L337

Also, this check happens too slowly: the agent reports to the queue every minute, and the workflow is rescheduled after 10 minutes. This will be changed by #4114.

6543 commented 1 month ago

Side note: under normal circumstances a running workflow should never expire. This can only happen if something went horribly wrong on the agent side, and in that case the agent bug should be fixed.

So I'm not sure whether we should add a flag that prevents resubmission, or just make the agent more robust.