woodpecker-ci / woodpecker

Woodpecker is a simple yet powerful CI/CD engine with great extensibility.
https://woodpecker-ci.org
Apache License 2.0

Fix pipeline cancelling #2875

Open · qwerty287 opened 8 months ago

qwerty287 commented 8 months ago

Component

server, agent

Describe the bug

This is mainly a summary issue of https://github.com/woodpecker-ci/woodpecker/issues/833, https://github.com/woodpecker-ci/woodpecker/issues/2062, and https://github.com/woodpecker-ci/woodpecker/issues/2911.

I've been trying to debug this without real success.

I've been using the local backend and can make the following observations:

On ci.woodpecker-ci.org (which uses the docker backend), I can see:

System Info

next

Additional context

No response


zc-devs commented 6 months ago

Woodpecker 2.1.1, Kubernetes.

zc-devs commented 3 months ago

https://github.com/woodpecker-ci/woodpecker/issues/2253#issuecomment-2076542998

fernandrone commented 2 months ago

I've got a related issue, which is somewhat worrisome.

I was able to reproduce the original bug on a 2.3.0 installation with the Kubernetes backend. I've observed it's inconsistent: sometimes cancelling will correctly show the running step as killed/cancelled and mark the pipeline as cancelled, with the last step to run showing "Oh no, we got some errors! Canceled" (the remaining steps in the same workflow show as grey, with the message "This step has been canceled."). Other times, it will show the last step to run as successful instead (and the remaining steps in the same workflow will also show as grey, with the message "This step has been canceled.").

However, if you have a second workflow that depends on the first (i.e. a multi-workflow pipeline, for example ./.woodpecker/a.yml and ./.woodpecker/b.yml where "b" depends_on "a"), and workflow "a" is cancelled and we hit the bug where it is considered successful, then "b" will start running and we will have no way to cancel "b", because the cancel button will have been replaced by a Restart button ❗ This could lead to situations where an erroneous deployment is triggered and a developer is unable to stop it, for example.
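For illustration, here is a minimal sketch of such a two-workflow setup. The step names, images, and commands are placeholders (not our real config); it only shows the depends_on relation between the two workflow files:

```yaml
# .woodpecker/a.yml -- first workflow (placeholder step)
steps:
  build:
    image: alpine
    commands:
      - echo "workflow a"
```

```yaml
# .woodpecker/b.yml -- runs only after workflow "a" has finished successfully
depends_on:
  - a

steps:
  deploy:
    image: alpine
    commands:
      - echo "workflow b, e.g. a deployment"
```

With a layout like this, when "a" is cancelled but wrongly recorded as successful, "b" (the deployment) starts anyway and, as described above, can no longer be cancelled from the UI.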

⚠️ One thing I noticed is that, consistently, if I cancelled the pipeline between steps, that is, while a pod was in the Pending state (in other words, after a step had finished but before the logs of the next step started to stream), the bug would occur and the pipeline would be marked as successful. However, if I cancelled while a step was mid-execution (so I'm certain the pod was in the Running state), the step would always cancel properly, marking the step and the whole workflow as failed. Of course, this only applies to, and has only been tested on, the Kubernetes backend.

I'd share links/screenshots, but this all happened on our internal servers.