woodpecker-ci / woodpecker

Woodpecker is a simple, yet powerful CI/CD engine with great extensibility.
https://woodpecker-ci.org
Apache License 2.0

Agent interrupted builds restarted - is this desired/configurable? #220

Open alexef opened 3 years ago

alexef commented 3 years ago

Related issues: https://github.com/woodpecker-ci/woodpecker/issues/178 https://github.com/woodpecker-ci/woodpecker/issues/33

Our problem: when an agent is interrupted mid-build, the build is restarted under the same build number.

This breaks flows where we assume the build number is unique (restarting from the UI normally generates a new build number).

Any chance this can be configured?

laszlocph commented 3 years ago

The issue that I usually see is the agent never timing out, and the build stuck in a running state.

You see something different, which is actually more desirable I think, since Drone's job is to run the build even if agents die: it just assigns the job to a new agent.

This breaks flows where we assume the build number to be unique

I think the build number is unique in a different sense: unique to incoming events that trigger the build. If the agent dies, the job is still a valid work item to be done.

I would say this is by design, and probably should be like this.

What issue is it causing for you?

alexef commented 3 years ago

I see your point, makes sense.

In our case, the issue is with the order of builds and deployments:

This is why we would prefer build1 to time out AND die. This was the behaviour in Drone 0.8.

Another way around this would be, if the agent sends a heartbeat, to detect that the agent died and reschedule build1 on a new agent before the 60-minute timeout passes.

laszlocph commented 3 years ago

Oops, that is a real issue.

This was the behaviour in drone 0.8.

I actually didn't remember how Drone behaved in this situation. Or to be more precise, I could never pin it down, as sometimes it was flaky in case of agent restarts.

Another way around this would be, if the agent sends a heartbeat, to detect that the agent died and reschedule build1 on a new agent before the 60-minute timeout passes.

This could work, and I remember code trying to achieve this. I think the issue could be closed if restarts are confirmed/fixed to be handled as you proposed.

6543 commented 1 month ago

currently the agent reports to the server that a running workflow is still alive: https://github.com/woodpecker-ci/woodpecker/blob/41b2127e042cb7b61c95ac57852be40a3f7d97f5/agent/runner.go#L121-L125 https://github.com/woodpecker-ci/woodpecker/blob/41b2127e042cb7b61c95ac57852be40a3f7d97f5/server/queue/fifo.go#L190-L200

if the agent misses this, the queue will reschedule it: https://github.com/woodpecker-ci/woodpecker/blob/41b2127e042cb7b61c95ac57852be40a3f7d97f5/server/queue/fifo.go#L329-L337

also, this check currently happens too slowly: the agent tells the queue every minute, and a task is only rescheduled after 10 minutes. this will be changed by #4114

6543 commented 1 month ago

sidenote: in normal circumstances a running workflow should never expire ... this can only happen if something went horribly wrong on the agent side, and in that case the agent bug should be fixed.

so i'm not sure we should add a flag that prevents resubmission; instead we should just make the agent more robust