woodpecker-ci / woodpecker

Woodpecker is a simple yet powerful CI/CD engine with great extensibility.
https://woodpecker-ci.org
Apache License 2.0

Fix pipeline cancelling #2875

Open · qwerty287 opened 8 months ago

qwerty287 commented 8 months ago

Component

server, agent

Describe the bug

This is mainly a summary issue of https://github.com/woodpecker-ci/woodpecker/issues/833, https://github.com/woodpecker-ci/woodpecker/issues/2062, and https://github.com/woodpecker-ci/woodpecker/issues/2911.

I've been trying to debug this without real success.

I've been using the local backend and can make the following observations:

On ci.woodpecker-ci.org (which uses the docker backend), I can see:

System Info

next

Additional context

No response


zc-devs commented 6 months ago

Woodpecker 2.1.1, Kubernetes.

zc-devs commented 3 months ago

https://github.com/woodpecker-ci/woodpecker/issues/2253#issuecomment-2076542998

fernandrone commented 2 months ago

I've got a related issue, which is somewhat worrisome.

I was able to reproduce the original bug on a 2.3.0 installation with the Kubernetes backend. I've observed it's inconsistent: sometimes cancelling will correctly show the running step as killed/cancelled and mark the pipeline as cancelled, with the last step to run showing "Oh no, we got some errors! Canceled" (the remaining steps in the same workflow show as grey, with the message "This step has been canceled."). Other times, it will show the last step to run as successful instead (and the remaining steps in the same workflow will also show as grey, with the message "This step has been canceled.").

However, if you have a second workflow that depends on the first (i.e. a multi-workflow pipeline, for example ./.woodpecker/a.yml and ./.woodpecker/b.yml where "b" depends_on "a"), and workflow "a" is cancelled and we hit the bug where it is considered successful, then "b" will start running and we will have no way to cancel "b", because the cancel button will have been replaced by a Restart button ❗ This could lead to situations where an erroneous deployment is triggered and a developer is unable to stop it, for example.
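For illustration, here is a minimal sketch of such a two-workflow setup. The step names, images, and commands are placeholders (not our real config); it only shows the depends_on relation between the two workflow files:

```yaml
# .woodpecker/a.yml -- first workflow (placeholder step)
steps:
  build:
    image: alpine
    commands:
      - echo "workflow a"
```

```yaml
# .woodpecker/b.yml -- runs only after workflow "a" has finished successfully
depends_on:
  - a

steps:
  deploy:
    image: alpine
    commands:
      - echo "workflow b, e.g. a deployment"
```

With a layout like this, when "a" is cancelled but wrongly recorded as successful, "b" (the deployment) starts anyway and, as described above, can no longer be cancelled from the UI.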

⚠️ One thing I noticed is that, consistently, if I cancelled the pipeline between steps, that is, while a pod was in the Pending state (in other words, after a step had finished but before the logs of the next step started to stream), the bug would occur and the pipeline would be marked as successful. However, if I cancelled while a step was mid-execution (so I'm certain the pod was in the Running state), the step would always cancel properly, marking the step and the whole workflow as failed. Of course, this only applies to, and has only been tested on, the Kubernetes backend.

I'd share links/screenshots, but this all happened on our internal servers.