woodpecker-ci / helm

This repo contains the helm charts of the Woodpecker project.
Apache License 2.0

clone step using RWX volume takes long time to complete #142

Closed pi3ch closed 6 months ago

pi3ch commented 7 months ago

I used the helm chart to deploy woodpecker server and agent.

With `WOODPECKER_BACKEND_K8S_STORAGE_CLASS: "nfs-rwx-storage"`, which is an RWX storage class, the clone step takes an unusually long time (~2m) to complete. When I disable RWX, it takes less than 20s to complete.
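For context, this is roughly how the setting looks in the chart values; this is a sketch assuming the chart passes agent environment variables through an `agent.env` map (check your chart version's `values.yaml`):

```yaml
# values.yaml (sketch; key names under `agent` are assumptions)
agent:
  env:
    WOODPECKER_BACKEND: kubernetes
    # Storage class used for the per-pipeline workspace PVC
    WOODPECKER_BACKEND_K8S_STORAGE_CLASS: "nfs-rwx-storage"
    # Request ReadWriteMany access mode for that PVC
    WOODPECKER_BACKEND_K8S_STORAGE_RWX: "true"
```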


From the logs, I can see the volume mount is immediately available to the build pod:

[screenshot]

Has anyone experienced this? Is there a way to disable volume mounting?

pat-s commented 7 months ago

I also use an RWX volume (EFS on AWS) and I don't see such long clone times. In the end, this is likely an issue with the respective RWX driver or storage backend you're using.

Here is the description of the storage class:

```
Name:                  efs-sc-delete
IsDefaultClass:        No
Annotations:           meta.helm.sh/release-name=aws-efs-csi-driver,meta.helm.sh/release-namespace=aws-efs-csi-driver
Provisioner:           efs.csi.aws.com
Parameters:            basePath=/dynamic_provisioning,directoryPerms=700,fileSystemId=fs-05b518491300978f2,provisioningMode=efs-ap
AllowVolumeExpansion:  <unset>
MountOptions:          <none>
ReclaimPolicy:         Delete
VolumeBindingMode:     Immediate
Events:                <none>
```
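For reference, the `kubectl describe` output above corresponds to a manifest along these lines (reconstructed from the fields shown, not copied from the actual cluster):

```yaml
# StorageClass manifest equivalent to the describe output above (sketch)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc-delete
provisioner: efs.csi.aws.com
parameters:
  basePath: /dynamic_provisioning
  directoryPerms: "700"
  fileSystemId: fs-05b518491300978f2
  provisioningMode: efs-ap
reclaimPolicy: Delete
volumeBindingMode: Immediate  # PVC binds at creation, not at first pod scheduling
```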
pi3ch commented 7 months ago

I used the OpenEBS NFS provisioner. From the logs I posted earlier, I can see the volume is immediately available to the build pod.

```
Provisioner:           openebs.io/nfsrwx
Parameters:            <none>
AllowVolumeExpansion:  <unset>
MountOptions:          <none>
ReclaimPolicy:         Delete
VolumeBindingMode:     Immediate
Events:                <none>
```

I did a further benchmark using an RWO volume. I used the following pipeline definition and tested both Woodpecker 2.0.0 and Drone 2.21.0 on the same cluster. Drone completes the tasks in 23s, but it takes over a minute for Woodpecker.

```yaml
steps:
  - name: Init
    image: alpine
    commands:
      - echo "hello init"
      - sleep 5

  - name: Fail test
    image: alpine
    commands:
      - id
      - sleep 5
      - env
      - sleep 5
      - exit 1
```

Woodpecker timing: [screenshot]

Drone timing: [screenshot]

So two things I noticed:

  1. Drone doesn't mount a volume to the build pod. Is there a way to disable the default volume mounting in Woodpecker's build pod?
  2. Woodpecker waits longer to find the pod in a ready state before running the commands. Is there a way to change the frequency of this check?
pat-s commented 7 months ago

Thanks for the additional tests.

To make the timings meaningful, the image pull times are needed; they have to be subtracted from the overall time.

> Drone doesn't mount a volume to the build pod. Is there a way to disable default volume mounting in the woodpecker's build pod?

It doesn't seem possible at the moment. The volume mount is hard-coded here and not managed by the helm chart: https://github.com/woodpecker-ci/woodpecker/blob/f5d9c92824713f548cf5555177bf4d6b816d81bf/pipeline/backend/kubernetes/pod.go#L162-L173. I think it's required for sharing output between steps, i.e. passing the cloned repo from the "clone" step to all further steps. I am not sure it can be turned off. If Drone does without a PV, it would be interesting to know how they do it (i.e. how they port the cloned assets over to the other steps).

> Woodpecker waits longer to find the pod is in ready state before running the command. Is there a way to change the frequency of this check?

The chart defines the readiness probe as follows:

```yaml
readinessProbe:
  httpGet:
    path: /healthz
    port: 8000
```
By default the checks are executed every 10s, but these options are not exposed in the chart right now. However, I would assume this is not the issue; some process during pod startup (maybe the mounting process) might be causing the delay.
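If one wanted to tighten those timings, the rendered Deployment would need something like the following; these fields are standard Kubernetes probe options, but since the chart doesn't expose them yet, this is only a sketch of the target shape:

```yaml
# Probe with tighter timings (sketch; not currently configurable via the chart)
readinessProbe:
  httpGet:
    path: /healthz
    port: 8000
  initialDelaySeconds: 2   # Kubernetes default: 0
  periodSeconds: 2         # Kubernetes default: 10
  failureThreshold: 3      # Kubernetes default: 3
```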

pat-s commented 7 months ago

FWIW, a "normal" clone step in most of my repos takes ~9s, with an RWX volume. I assume the issue might be the RWX driver you're using rather than the general use of RWX (otherwise my timings would be higher as well).

Yet of course the differences to Drone are interesting. But before going further, we need to verify the timings with the image pull times subtracted.

pi3ch commented 6 months ago

Here are the RWO storage class timings. From my reading, it usually takes 10-15s for the volume to attach. So the additional delay in getting the commands running, compared with Drone, seems to be related to the volume binding delay.

```
LAST SEEN   TYPE      REASON                   OBJECT                                                          MESSAGE
15s         Warning   FailedScheduling         pod/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-clone                       0/4 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..
12s         Warning   FailedScheduling         pod/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-clone                       0/4 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..
10s         Normal    Scheduled                pod/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-clone                       Successfully assigned woodpecker/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-clone to bn-amd-8gbram-4vcpu-x4nk2
5s          Normal    SuccessfulAttachVolume   pod/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-clone                       AttachVolume.Attach succeeded for volume "pvc-4f8bd953-1b04-4fcc-bea3-5434490cb90d"
4s          Normal    Pulled                   pod/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-clone                       Container image "docker.io/woodpeckerci/plugin-git:2.4.0" already present on machine
4s          Normal    Created                  pod/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-clone                       Created container wp-01hk8s1qapv3vj2bn13y5dw2fg-0-clone
4s          Normal    Started                  pod/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-clone                       Started container wp-01hk8s1qapv3vj2bn13y5dw2fg-0-clone
15s         Normal    ExternalProvisioning     persistentvolumeclaim/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-default   waiting for a volume to be created, either by external provisioner or manually created by system administrator
15s         Normal    Provisioning             persistentvolumeclaim/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-default   External provisioner is provisioning volume for claim "woodpecker/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-default"
13s         Normal    ProvisioningSucceeded    persistentvolumeclaim/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-default   Successfully provisioned volume pvc-4f8bd953-1b04-4fcc-bea3-5434490cb90d
```


And the RWX storage class timings: it takes approximately 2m 2s for the volume to attach.

```
LAST SEEN   TYPE      REASON                   OBJECT                                                          MESSAGE
2m1s        Warning   FailedScheduling         pod/wp-01hk8sx15w474tq9feet3dbv9b-0-clone                       0/4 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..
2m          Warning   FailedScheduling         pod/wp-01hk8sx15w474tq9feet3dbv9b-0-clone                       0/4 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..
118s        Normal    Scheduled                pod/wp-01hk8sx15w474tq9feet3dbv9b-0-clone                       Successfully assigned woodpecker/wp-01hk8sx15w474tq9feet3dbv9b-0-clone to bn-amd-8gbram-4vcpu-x4nkp
106s        Normal    Pulled                   pod/wp-01hk8sx15w474tq9feet3dbv9b-0-clone                       Container image "docker.io/woodpeckerci/plugin-git:2.4.0" already present on machine
106s        Normal    Created                  pod/wp-01hk8sx15w474tq9feet3dbv9b-0-clone                       Created container wp-01hk8sx15w474tq9feet3dbv9b-0-clone
106s        Normal    Started                  pod/wp-01hk8sx15w474tq9feet3dbv9b-0-clone                       Started container wp-01hk8sx15w474tq9feet3dbv9b-0-clone
2m2s        Normal    Provisioning             persistentvolumeclaim/wp-01hk8sx15w474tq9feet3dbv9b-0-default   External provisioner is provisioning volume for claim "woodpecker/wp-01hk8sx15w474tq9feet3dbv9b-0-default"
2m2s        Normal    ExternalProvisioning     persistentvolumeclaim/wp-01hk8sx15w474tq9feet3dbv9b-0-default   waiting for a volume to be created, either by external provisioner "openebs.io/nfsrwx" or manually created by system administrator
2m          Normal    ProvisioningSucceeded    persistentvolumeclaim/wp-01hk8sx15w474tq9feet3dbv9b-0-default   Successfully provisioned volume pvc-ad9f928a-3571-4588-a046-138f976c456d
```


Once the volume is attached to the pod, the other steps, running as containers in the same pod, use the already attached volume. Hence their timings are lower than the clone step's.

I don't know why a PVC is necessary for sharing data between containers of the same pod; couldn't we just use a shared temporary/ephemeral in-memory volume? That would probably be faster and cheaper.
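The suggestion above can be sketched as a plain pod spec; the pod and volume names here are illustrative, not what Woodpecker actually generates:

```yaml
# Sketch of a shared in-memory workspace via emptyDir (names are hypothetical)
apiVersion: v1
kind: Pod
metadata:
  name: wp-step-example
spec:
  volumes:
    - name: workspace
      emptyDir:
        medium: Memory   # tmpfs-backed; usage counts against container memory limits
  containers:
    - name: clone
      image: docker.io/woodpeckerci/plugin-git:2.4.0
      volumeMounts:
        - name: workspace
          mountPath: /woodpecker
```

The caveat is that an emptyDir only lives as long as its pod, so it can only share data between containers of one pod, not across separate pods.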

pi3ch commented 6 months ago

FWIW, Drone creates an EmptyDir/memory work volume: https://github.com/drone-runners/drone-runner-kube/blob/b57d0a9c80b0df00cfb92d59e3f965e8a29ac987/engine/compiler/compiler.go#L184

```go
workVolume := &engine.Volume{
    EmptyDir: &engine.VolumeEmptyDir{
        ID:   random(),
        Name: workMount.Name,
    },
}
```
pat-s commented 6 months ago

Thanks for the stats.

As mentioned in https://github.com/woodpecker-ci/woodpecker/issues/1594#issuecomment-1876760945, every step uses its own pod, which is why a volume is required.

Both your RWO and RWX attach times are really high 😮 My RWX attach time is lower than your RWO 🤯 I haven't researched yet what might cause this, but it seems to me that something is off WRT your volume drivers or other related components. Indeed, when it takes that long, WP or any other app is annoying to use. Leaving the emptyDir/one-pod/many-pods discussion aside, I still believe you are facing a local issue in your cluster which is responsible for the high attach times. Otherwise we would have seen other reports already, and I also haven't faced such high timings in the different clusters I have WP running on (all AWS/Azure).

I can help you debug it, but I think we can't do anything in the chart here. If we go for the one-pod/multiple-containers approach, this needs changes in the WP k8s backend code, as mentioned in https://github.com/woodpecker-ci/woodpecker/issues/1594#issuecomment-1876760945.

pi3ch commented 6 months ago

Yeah, makes total sense. I am asking DO support to understand whether this is normal or something is wrong with our cluster. I did further testing using basic deployments, and it took 2m for the RWX volume to get attached.