Open ChrisMcD1 opened 6 days ago
Set up Woodpecker with a Kubernetes backend: WP 2.6.0, K3s.
Run a multi-step workflow
skip_clone: true
steps:
  one:
    image: alpine
    commands:
      - echo One
  two:
    image: alpine
    commands:
      - echo Two
  three:
    image: alpine
    commands:
      - echo Three
Observe event and successful retry
> kubectl get events --watch-only -n woodpecker
LAST SEEN TYPE REASON OBJECT MESSAGE
0s Normal WaitForFirstConsumer persistentvolumeclaim/wp-01j1yjd0n61qy1g7d0npcb9ba9-0-default waiting for first consumer to be created before binding
0s Normal ExternalProvisioning persistentvolumeclaim/wp-01j1yjd0n61qy1g7d0npcb9ba9-0-default Waiting for a volume to be created either by the external provisioner 'rancher.io/local-path' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
0s Normal ExternalProvisioning persistentvolumeclaim/wp-01j1yjd0n61qy1g7d0npcb9ba9-0-default Waiting for a volume to be created either by the external provisioner 'rancher.io/local-path' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
0s Normal Provisioning persistentvolumeclaim/wp-01j1yjd0n61qy1g7d0npcb9ba9-0-default External provisioner is provisioning volume for claim "woodpecker/wp-01j1yjd0n61qy1g7d0npcb9ba9-0-default"
0s Normal ExternalProvisioning persistentvolumeclaim/wp-01j1yjd0n61qy1g7d0npcb9ba9-0-default Waiting for a volume to be created either by the external provisioner 'rancher.io/local-path' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
0s Normal ProvisioningSucceeded persistentvolumeclaim/wp-01j1yjd0n61qy1g7d0npcb9ba9-0-default Successfully provisioned volume pvc-e7a1178d-a9a5-4b50-af13-33bed0c84d7e
0s Normal Scheduled pod/wp-01j1yjd0n70n6n4z13aqwzkyrx Successfully assigned woodpecker/wp-01j1yjd0n70n6n4z13aqwzkyrx to node-1
0s Warning FailedToRetrieveImagePullSecret pod/wp-01j1yjd0n70n6n4z13aqwzkyrx Unable to retrieve some image pullsecrets (regcred); attempting to pull the image may not succeed.
0s Normal Pulling pod/wp-01j1yjd0n70n6n4z13aqwzkyrx Pulling image "alpine"
0s Normal Pulled pod/wp-01j1yjd0n70n6n4z13aqwzkyrx Successfully pulled image "alpine" in 712ms (712ms including waiting)
0s Normal Created pod/wp-01j1yjd0n70n6n4z13aqwzkyrx Created container wp-01j1yjd0n70n6n4z13aqwzkyrx
0s Normal Started pod/wp-01j1yjd0n70n6n4z13aqwzkyrx Started container wp-01j1yjd0n70n6n4z13aqwzkyrx
0s Normal Scheduled pod/wp-01j1yjd0n70n6n4z13at6a20qr Successfully assigned woodpecker/wp-01j1yjd0n70n6n4z13at6a20qr to node-1
0s Warning FailedToRetrieveImagePullSecret pod/wp-01j1yjd0n70n6n4z13at6a20qr Unable to retrieve some image pullsecrets (regcred); attempting to pull the image may not succeed.
0s Normal Pulling pod/wp-01j1yjd0n70n6n4z13at6a20qr Pulling image "alpine"
0s Normal Pulled pod/wp-01j1yjd0n70n6n4z13at6a20qr Successfully pulled image "alpine" in 709ms (710ms including waiting)
0s Normal Created pod/wp-01j1yjd0n70n6n4z13at6a20qr Created container wp-01j1yjd0n70n6n4z13at6a20qr
0s Normal Started pod/wp-01j1yjd0n70n6n4z13at6a20qr Started container wp-01j1yjd0n70n6n4z13at6a20qr
0s Normal Scheduled pod/wp-01j1yjd0n70n6n4z13axwf8948 Successfully assigned woodpecker/wp-01j1yjd0n70n6n4z13axwf8948 to node-1
0s Warning FailedToRetrieveImagePullSecret pod/wp-01j1yjd0n70n6n4z13axwf8948 Unable to retrieve some image pullsecrets (regcred); attempting to pull the image may not succeed.
0s Normal Pulling pod/wp-01j1yjd0n70n6n4z13axwf8948 Pulling image "alpine"
0s Normal Pulled pod/wp-01j1yjd0n70n6n4z13axwf8948 Successfully pulled image "alpine" in 695ms (695ms including waiting)
0s Normal Created pod/wp-01j1yjd0n70n6n4z13axwf8948 Created container wp-01j1yjd0n70n6n4z13axwf8948
0s Normal Started pod/wp-01j1yjd0n70n6n4z13axwf8948 Started container wp-01j1yjd0n70n6n4z13axwf8948
0s Warning FailedToRetrieveImagePullSecret pod/wp-01j1yjd0n70n6n4z13axwf8948 Unable to retrieve some image pullsecrets (regcred); attempting to pull the image may not succeed.
^ There is no FailedAttachVolume event, cannot reproduce.
Is the "bug" gone if you schedule the pods on the same Node?
skip_clone: true
steps:
  one:
    image: alpine
    commands:
      - echo One
    backend_options:
      kubernetes:
        nodeSelector:
          kubernetes.io/hostname: <name of node>
  two:
    image: alpine
    commands:
      - echo Two
    backend_options:
      kubernetes:
        nodeSelector:
          kubernetes.io/hostname: <name of node, equals to value in first step>
FailedAttachVolume error

Type      Reason
Warning   FailedAttachVolume

BTW, this is a warning, not an error.
... to check that the PV has been detached before continuing?
Current behavior
Proposed behavior
What is the difference between the two examples in terms of timing?
Thank you for the suggestions!
I was able to confirm that the FailedAttachVolume warning only shows up when the two steps move between different nodes.
I can replicate it using this workflow:
steps:
  one:
    image: alpine
    commands:
      - echo One
    backend_options:
      kubernetes:
        nodeSelector:
          kubernetes.io/hostname: i-0281e9aa573e5996b
  two:
    image: alpine
    commands:
      - echo Two
    backend_options:
      kubernetes:
        nodeSelector:
          kubernetes.io/hostname: i-051578473c45c10e9
I observe:
❯ kubectl get events --watch-only -n woodpecker
LAST SEEN TYPE REASON OBJECT MESSAGE
0s Normal SuccessfulAttachVolume pod/wp-01j29w1fjbwxnwhgf29r2ekw42 AttachVolume.Attach succeeded for volume "pvc-8a481915-e800-42a2-b716-b5e402c06b9f"
0s Normal Pulling pod/wp-01j29w1fjbwxnwhgf29r2ekw42 Pulling image "alpine"
0s Normal Pulled pod/wp-01j29w1fjbwxnwhgf29r2ekw42 Successfully pulled image "alpine" in 765ms (765ms including waiting)
0s Normal Created pod/wp-01j29w1fjbwxnwhgf29r2ekw42 Created container wp-01j29w1fjbwxnwhgf29r2ekw42
0s Normal Started pod/wp-01j29w1fjbwxnwhgf29r2ekw42 Started container wp-01j29w1fjbwxnwhgf29r2ekw42
0s Normal Scheduled pod/wp-01j29w1fjbwxnwhgf29vbz9pzh Successfully assigned woodpecker/wp-01j29w1fjbwxnwhgf29vbz9pzh to i-051578473c45c10e9
0s Warning FailedAttachVolume pod/wp-01j29w1fjbwxnwhgf29vbz9pzh Multi-Attach error for volume "pvc-8a481915-e800-42a2-b716-b5e402c06b9f" Volume is already exclusively attached to one node and can't be attached to another
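As far as I understand it, the Multi-Attach warning appears because the per-pipeline volume can only be attached to one node at a time. For context, here is a rough sketch of the kind of PVC involved, with the name taken from the events earlier in the thread; the access mode and size are my assumptions, not something I pulled from the agent:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: wp-01j1yjd0n61qy1g7d0npcb9ba9-0-default  # per-pipeline claim, name pattern from the events above
  namespace: woodpecker
spec:
  accessModes:
    - ReadWriteOnce  # assumption: single-node attach, which is what produces the Multi-Attach warning
  resources:
    requests:
      storage: 1Gi  # illustrative size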
The bug also disappeared on my real multi-step workflow when I hard-coded the specific node id. In our cluster, though, all of the nodes are auto-scaling, so a particular node id is not guaranteed to exist at any one time. Most of our workflows have all of their steps set to target:
backend_options:
  kubernetes:
    tolerations:
      - effect: NoSchedule
        key: turbo
        value: 'true'
    resources:
      requests:
        cpu: 2000m
        memory: 4000M
Here the only requirements are that some resources are available and that the steps land on a node with the turbo taint we use for the beefier nodes in the cluster. These backend options are set identically on every step, and they almost always result in subsequent steps being scheduled on different nodes. Should we add other settings that constrain subsequent steps to the same node as the prior step?
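The only workaround I see with the options available today is to add an explicit nodeSelector on top of those settings, identically on every step, roughly like the sketch below (the hostname is illustrative and, with autoscaling, not guaranteed to exist):

backend_options:
  kubernetes:
    nodeSelector:
      kubernetes.io/hostname: i-0281e9aa573e5996b  # illustrative node; must match on every step
    tolerations:
      - effect: NoSchedule
        key: turbo
        value: 'true'
    resources:
      requests:
        cpu: 2000m
        memory: 4000M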
On the point of current versus proposed behavior, this is likely due to some ignorance on my part about how k8s volumes attach and detach, but I was assuming that the volume detach happens much sooner than 15 seconds, and that because the attach had actively failed, there is a backoff of 15 seconds before a retry is made. That would make sense for a typical k8s microservice scenario, where volumes rarely move between nodes. So my belief of current versus proposed behavior (which I am not sure how I would validate) is:
Current behavior: Kubernetes schedules a retry in 15 seconds.
Proposed behavior: Volume is available 500 ms later.
This should be validated.
Woodpecker waits for volume detaching (500ms?)
Also, I'm not sure that the 500 ms / 15 s (for example) is detaching time alone. It could be 10 ms for detaching and 490 ms for migration of data between nodes, i.e. attaching.
Should we add other settings that constrain subsequent steps to the same node as the prior step?
There was an attempt.
I've reread #3345... It was implemented for RWX, and the affinity was put on the "parallel" Pod (WP service, detached step).
However, nothing stops us from widening this approach: save the Node where the first Pod ran, then add affinity to that Node for subsequent Pods. Like https://github.com/woodpecker-ci/woodpecker/pull/3345#issuecomment-1936029957, but taking the info from the first Pod. Moreover, we do not need to query the Pod for that info separately; we are already listening to Pod updates.
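A rough sketch of the node affinity that could be injected into the subsequent Pods once the first step's Node is known (node name taken from the example above, purely illustrative):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - i-0281e9aa573e5996b  # Node where the first step's Pod ran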
Or we can stick Pods to a certain Node the way K3s' Local Path Provisioner does, via Persistent Volume Node Affinity.
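For reference, a PV pinned to a Node roughly the way the Local Path Provisioner does it (size, path and node name are illustrative, reusing the names from the K3s events above):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-e7a1178d-a9a5-4b50-af13-33bed0c84d7e
spec:
  capacity:
    storage: 1Gi  # illustrative size
  accessModes:
    - ReadWriteOnce
  storageClassName: local-path
  hostPath:
    path: /var/lib/rancher/k3s/storage/pvc-e7a1178d-a9a5-4b50-af13-33bed0c84d7e  # illustrative path
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node-1  # Pods using this PV can only be scheduled on this Node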
That said, it requires Workflow-level backend options to be implemented first, if we want it to be adjustable per-workflow.
Considering that it is not an error but a warning, I would qualify this not as a bug, but as a feature request/enhancement.
Component
agent
Describe the bug
When I observe a workflow moving from one step to another, there is always a FailedAttachVolume event. This event is resolved 15 seconds later with a SuccessfulAttachVolume.
Steps to reproduce
Expected behavior
When moving between workflow steps, the woodpecker agent should gracefully transition the persistent volume between pods. If such a graceful transition is not possible, a faster retry than 15 seconds would make sense to speed up build times. Currently, for a typical 5-minute, 5-step build, 20% of the total execution time is spent waiting for this error to resolve.
System Info
Additional context
Here are the typical event logs from the pod of a second step in a run.
I am not very experienced with Kubernetes, and it is entirely possible that this is not a problem with woodpecker itself. From this code it appears that the triggering event to move from one step to another is straightforward. But perhaps that could be altered to check that the PV has been detached before continuing?
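As far as I can tell, the attach/detach state lives in a VolumeAttachment object, so such a check would presumably mean waiting for an object like the sketch below to be removed (or for status.attached to become false) before the next step's pod is created. The attacher and names here are illustrative, not taken from my cluster:

apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  name: csi-0123456789abcdef  # generated by Kubernetes, illustrative
spec:
  attacher: ebs.csi.aws.com  # illustrative CSI driver
  nodeName: i-0281e9aa573e5996b  # node the volume is currently attached to
  source:
    persistentVolumeName: pvc-8a481915-e800-42a2-b716-b5e402c06b9f  # pipeline volume, name from the events earlier in the thread
status:
  attached: true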
Validations
Checked that the bug isn't fixed in the next version already [https://woodpecker-ci.org/faq#which-version-of-woodpecker-should-i-use]