woodpecker-ci / woodpecker

Woodpecker is a simple yet powerful CI/CD engine with great extensibility.
https://woodpecker-ci.org
Apache License 2.0
3.89k stars 345 forks source link

`FailedAttachVolume` warning when moving between workflow steps with Kubernetes backend #3869

Open ChrisMcD1 opened 6 days ago

ChrisMcD1 commented 6 days ago

Component

agent

Describe the bug

When I observe a workflow moving from one step to another, there is always a FailedAttachVolume event. This event is resolved 15 seconds later with a SuccessfulAttachVolume.

Steps to reproduce

  1. Set up woodpecker with a kubernetes backend
  2. Run a multi-step workflow
  3. Observe event and successful retry

Expected behavior

When moving between workflow steps, the woodpecker agent should gracefully transition the persistent volume between pods. If such a graceful transition is not possible, a faster retry than 15 seconds would make sense to speed up build times. Currently, for a typical 5 minute, 5 step build, 20% of the total execution time is waiting for this error to resolve.

System Info

{"source":"https://github.com/woodpecker-ci/woodpecker","version":"next-9986032e90"}

Additional context

Here are the typical event logs from the pod of a second step in a run.

│   Type     Reason                  Age   From                     Message                                                                                                                                                               │
│   ----     ------                  ----  ----                     -------                                                                                                                                                               │
│   Normal   Scheduled               18s   default-scheduler        Successfully assigned woodpecker/wp-01j1wyt0083739k7tyj3189xdr to i-061fa8dcad61a2f31                                                                                 │
│   Warning  FailedAttachVolume      18s   attachdetach-controller  Multi-Attach error for volume "pvc-8fcd69ba-f4ac-4c3e-88d1-ee81b5a7529f" Volume is already exclusively attached to one node and can't be attached to another          │
│   Normal   SuccessfulAttachVolume  4s    attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-8fcd69ba-f4ac-4c3e-88d1-ee81b5a7529f"                                                                                   │
│   Normal   Pulling                 3s    kubelet                  Pulling image "<image-name>"                                                                                                                │
│   Normal   Pulled                  3s    kubelet                  Successfully pulled image "<image-name>" in 659ms (659ms including waiting)                                                                 │
│   Normal   Created                 2s    kubelet                  Created container wp-01j1wyt0083739k7tyj3189xdr                                                                                                                       │
│   Normal   Started                 2s    kubelet                  Started container wp-01j1wyt0083739k7tyj3189xdr     

I am not very experienced with Kubernetes, and it is entirely possible that this is not a problem with woodpecker itself. From this code it appears that the triggering event to move from one step to another is straightforward. But perhaps that could be altered to check that the PV has been detached before continuing?

Validations

zc-devs commented 5 days ago

Set up woodpecker with a kubernetes backend

WP 2.6.0, K3s

Run a multi-step workflow

skip_clone: true

steps:
  one:
    image: alpine
    commands:
      - echo One
  two:
    image: alpine
    commands:
      - echo Two
  three:
    image: alpine
    commands:
      - echo Three

Observe event and successful retry

> kubectl get events --watch-only -n woodpecker
LAST SEEN   TYPE      REASON                            OBJECT                                                          MESSAGE
0s          Normal    WaitForFirstConsumer              persistentvolumeclaim/wp-01j1yjd0n61qy1g7d0npcb9ba9-0-default   waiting for first consumer to be created before binding
0s          Normal    ExternalProvisioning              persistentvolumeclaim/wp-01j1yjd0n61qy1g7d0npcb9ba9-0-default   Waiting for a volume to be created either by the external provisioner 'rancher.io/local-path' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
0s          Normal    ExternalProvisioning              persistentvolumeclaim/wp-01j1yjd0n61qy1g7d0npcb9ba9-0-default   Waiting for a volume to be created either by the external provisioner 'rancher.io/local-path' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
0s          Normal    Provisioning                      persistentvolumeclaim/wp-01j1yjd0n61qy1g7d0npcb9ba9-0-default   External provisioner is provisioning volume for claim "woodpecker/wp-01j1yjd0n61qy1g7d0npcb9ba9-0-default"
0s          Normal    ExternalProvisioning              persistentvolumeclaim/wp-01j1yjd0n61qy1g7d0npcb9ba9-0-default   Waiting for a volume to be created either by the external provisioner 'rancher.io/local-path' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
0s          Normal    ProvisioningSucceeded             persistentvolumeclaim/wp-01j1yjd0n61qy1g7d0npcb9ba9-0-default   Successfully provisioned volume pvc-e7a1178d-a9a5-4b50-af13-33bed0c84d7e
0s          Normal    Scheduled                         pod/wp-01j1yjd0n70n6n4z13aqwzkyrx                               Successfully assigned woodpecker/wp-01j1yjd0n70n6n4z13aqwzkyrx to node-1
0s          Warning   FailedToRetrieveImagePullSecret   pod/wp-01j1yjd0n70n6n4z13aqwzkyrx                               Unable to retrieve some image pullsecrets (regcred); attempting to pull the image may not succeed.
0s          Normal    Pulling                           pod/wp-01j1yjd0n70n6n4z13aqwzkyrx                               Pulling image "alpine"
0s          Normal    Pulled                            pod/wp-01j1yjd0n70n6n4z13aqwzkyrx                               Successfully pulled image "alpine" in 712ms (712ms including waiting)
0s          Normal    Created                           pod/wp-01j1yjd0n70n6n4z13aqwzkyrx                               Created container wp-01j1yjd0n70n6n4z13aqwzkyrx
0s          Normal    Started                           pod/wp-01j1yjd0n70n6n4z13aqwzkyrx                               Started container wp-01j1yjd0n70n6n4z13aqwzkyrx
0s          Normal    Scheduled                         pod/wp-01j1yjd0n70n6n4z13at6a20qr                               Successfully assigned woodpecker/wp-01j1yjd0n70n6n4z13at6a20qr to node-1
0s          Warning   FailedToRetrieveImagePullSecret   pod/wp-01j1yjd0n70n6n4z13at6a20qr                               Unable to retrieve some image pullsecrets (regcred); attempting to pull the image may not succeed.
0s          Normal    Pulling                           pod/wp-01j1yjd0n70n6n4z13at6a20qr                               Pulling image "alpine"
0s          Normal    Pulled                            pod/wp-01j1yjd0n70n6n4z13at6a20qr                               Successfully pulled image "alpine" in 709ms (710ms including waiting)
0s          Normal    Created                           pod/wp-01j1yjd0n70n6n4z13at6a20qr                               Created container wp-01j1yjd0n70n6n4z13at6a20qr
0s          Normal    Started                           pod/wp-01j1yjd0n70n6n4z13at6a20qr                               Started container wp-01j1yjd0n70n6n4z13at6a20qr
0s          Normal    Scheduled                         pod/wp-01j1yjd0n70n6n4z13axwf8948                               Successfully assigned woodpecker/wp-01j1yjd0n70n6n4z13axwf8948 to node-1
0s          Warning   FailedToRetrieveImagePullSecret   pod/wp-01j1yjd0n70n6n4z13axwf8948                               Unable to retrieve some image pullsecrets (regcred); attempting to pull the image may not succeed.
0s          Normal    Pulling                           pod/wp-01j1yjd0n70n6n4z13axwf8948                               Pulling image "alpine"
0s          Normal    Pulled                            pod/wp-01j1yjd0n70n6n4z13axwf8948                               Successfully pulled image "alpine" in 695ms (695ms including waiting)
0s          Normal    Created                           pod/wp-01j1yjd0n70n6n4z13axwf8948                               Created container wp-01j1yjd0n70n6n4z13axwf8948
0s          Normal    Started                           pod/wp-01j1yjd0n70n6n4z13axwf8948                               Started container wp-01j1yjd0n70n6n4z13axwf8948
0s          Warning   FailedToRetrieveImagePullSecret   pod/wp-01j1yjd0n70n6n4z13axwf8948                               Unable to retrieve some image pullsecrets (regcred); attempting to pull the image may not succeed.

^ There is no FailedAttachVolume event, cannot reproduce.


Is the "bug" gone, if you schedule pods on the same Node?

skip_clone: true

steps:
  one:
    image: alpine
    commands:
      - echo One
    backend_options:
      kubernetes:
        nodeSelector:
          kubernetes.io/hostname: <name of node>
  two:
    image: alpine
    commands:
      - echo Two
    backend_options:
      kubernetes:
        nodeSelector:
          kubernetes.io/hostname: <name of node, equals to value in first step>

FailedAttachVolume error

Type       Reason
Warning    FailedAttachVolume

BTW, warning, not error.


... to check that the PV has been detached before continuing?

https://github.com/woodpecker-ci/woodpecker/blob/e5f3e67bf282bd938b5ab39295f738e2fefa7d4a/pipeline/pipeline.go#L228-L285

Current behavior

  1. Woodpecker waits for step is finished
  2. Woodpecker deletes the Pod
  3. Woodpecker creates next step Pod
  4. Kubernetes waits for volume detaching (15s?)
  5. Kubernetes attaches volume to new Pod

Proposed behavior

  1. Woodpecker waits for step is finished
  2. Woodpecker deletes the Pod
  3. Woodpecker waits for volume detaching (15s?)
  4. Woodpecker creates next step Pod
  5. Kubernetes waits for volume detaching (0s?)
  6. Kubernetes attaches volume to new Pod

What is the difference between examples in time aspect?

ChrisMcD1 commented 1 day ago

Thank you for the suggestions!

I was able to confirm that the FailedAttachVolume warning only shows up when the two steps move between different nodes.

I can replicate it using this workflow:


steps:
  one:
    image: alpine
    commands:
      - echo One
    backend_options:
      kubernetes:
        nodeSelector:
          kubernetes.io/hostname: i-0281e9aa573e5996b
  two:
    image: alpine
    commands:
      - echo Two
    backend_options:
      kubernetes:
        nodeSelector:
          kubernetes.io/hostname: i-051578473c45c10e9

I observe:

❯ kubectl get events --watch-only -n woodpecker
LAST SEEN   TYPE     REASON                   OBJECT                              MESSAGE
0s          Normal   SuccessfulAttachVolume   pod/wp-01j29w1fjbwxnwhgf29r2ekw42   AttachVolume.Attach succeeded for volume "pvc-8a481915-e800-42a2-b716-b5e402c06b9f"
0s          Normal   Pulling                  pod/wp-01j29w1fjbwxnwhgf29r2ekw42   Pulling image "alpine"
0s          Normal   Pulled                   pod/wp-01j29w1fjbwxnwhgf29r2ekw42   Successfully pulled image "alpine" in 765ms (765ms including waiting)
0s          Normal   Created                  pod/wp-01j29w1fjbwxnwhgf29r2ekw42   Created container wp-01j29w1fjbwxnwhgf29r2ekw42
0s          Normal   Started                  pod/wp-01j29w1fjbwxnwhgf29r2ekw42   Started container wp-01j29w1fjbwxnwhgf29r2ekw42
0s          Normal   Scheduled                pod/wp-01j29w1fjbwxnwhgf29vbz9pzh   Successfully assigned woodpecker/wp-01j29w1fjbwxnwhgf29vbz9pzh to i-051578473c45c10e9
0s          Warning   FailedAttachVolume       pod/wp-01j29w1fjbwxnwhgf29vbz9pzh   Multi-Attach error for volume "pvc-8a481915-e800-42a2-b716-b5e402c06b9f" Volume is already exclusively attached to one node and can't be attached to another

The bug also disappeared on my real multi-step workflow when I hard-coded the specific node id. In our cluster though, all of the nodes are auto-scaling, so a particular node id is not guaranteed to exist at any one time. Most of our workflows have all of their steps all set to target:

backend_options: 
  kubernetes:
    tolerations: 
    - effect: NoSchedule
      key: turbo
      value: 'true'
    resources:
      requests:
        cpu: 2000m
        memory: 4000M

Where the only requirement is that some resources are available, and that they go on a node with a turbo taint that is used for our beefier nodes in the cluster. These backend options are set identically on every step, and almost always frequently result in subsequent steps being set on different nodes. Should we add other settings that constrain subsequent steps to the same node as the prior step?


On the point of current and proposed behavior, this is likely due to some ignorance on my part on how k8s volumes attach and detach, but I was assuming that the volume detaching was happening much sooner than 15 seconds. And that because it had actively failed to attach, there was a reasonable backoff of 15 seconds before a retry was made. That would make sense for a typical k8s microservice backend scenario where volumes being moved between nodes would not happen very frequently. So my belief of current versus proposed behavior (that I am not sure how I would validate) is:

Current behavior

  1. Woodpecker waits for step is finished
  2. Woodpecker deletes the Pod
  3. Woodpecker creates next step Pod
  4. Kubernetes attempts to attach volume and fails
  5. Kubernetes schedules a retry in 15 seconds
  6. Volume is available 500 ms later
  7. 15 seconds later Kubernetes attempts to attach the volume and succeeds.

Proposed behavior

  1. Woodpecker waits for step is finished
  2. Woodpecker deletes the Pod
  3. Woodpecker waits for volume detaching (500ms?)
  4. Woodpecker creates next step Pod
  5. Kubernetes attaches volume to new Pod
zc-devs commented 1 day ago

Kubernetes schedules a retry in 15 seconds Volume is available 500 ms later

This should be validated.

Woodpecker waits for volume detaching (500ms?)

Also, I'm not sure, that 500ms/15s (for example) is detaching time alone. It could be 10ms for detaching and 490ms for migration of data between nodes - attaching.

Should we add other settings that constrain subsequent steps to the same node as the prior step?

There was an attempt.

zc-devs commented 23 hours ago

I've reread #3345... It was implemented for RWX and the affinity was put to the "parallel" Pod (WP service, detached step).

However, nothing stops us to widen up this approach and save the Node, where the first Pod was run, then add affinity to that Node for subsequent Pods. Like https://github.com/woodpecker-ci/woodpecker/pull/3345#issuecomment-1936029957, but taking the info from the first Pod. Moreover, we do not need to grep Pod for the info separately, we already listening to Pod updates.

Or we can stick Pods to the certain Node like K3s' Local path provisioner does. Persistent Volumes Node Affinity.

That said, it requires Workflow-level backend options to be implemented first, if we want it to be adjustable per-workflow.


Considering that it is not an error, but warning, I would qualify this not as bug, but feature request/enhancement.