I also use an RWX volume (EFS on AWS) and I don't see such long clone times. In the end, this is likely an issue with the RWX driver or storage backend you're using.
Here is the description of the storage class:

```
Name:                  efs-sc-delete
IsDefaultClass:        No
Annotations:           meta.helm.sh/release-name=aws-efs-csi-driver,meta.helm.sh/release-namespace=aws-efs-csi-driver
Provisioner:           efs.csi.aws.com
Parameters:            basePath=/dynamic_provisioning,directoryPerms=700,fileSystemId=fs-05b518491300978f2,provisioningMode=efs-ap
AllowVolumeExpansion:  <unset>
MountOptions:          <none>
ReclaimPolicy:         Delete
VolumeBindingMode:     Immediate
Events:                <none>
```
I used the OpenEBS NFS provisioner. From the logs I posted earlier, I can see the volume is immediately available to the build pod.
```
Provisioner:           openebs.io/nfsrwx
Parameters:            <none>
AllowVolumeExpansion:  <unset>
MountOptions:          <none>
ReclaimPolicy:         Delete
VolumeBindingMode:     Immediate
Events:                <none>
```
I did a further benchmark using an RWO volume. I used the following pipeline definition and tested both Woodpecker 2.0.0 and Drone 2.21.0 on the same cluster. Drone completes the tasks in 23s, but it takes over a minute for Woodpecker.
```yaml
steps:
  - name: Init
    image: alpine
    commands:
      - echo "hello init"
      - sleep 5
  - name: Fail test
    image: alpine
    commands:
      - id
      - sleep 5
      - env
      - sleep 5
      - exit 1
```
Woodpecker timing:
Drone timing:
So two things I noticed:
Thanks for the additional tests.
To make the timings meaningful, the image pull times are needed; they have to be subtracted from the overall time.
Drone doesn't mount a volume to the build pod. Is there a way to disable the default volume mounting in Woodpecker's build pod?
It doesn't seem possible at the moment. The volume mount is hard-coded here and not managed within the Helm chart: https://github.com/woodpecker-ci/woodpecker/blob/f5d9c92824713f548cf5555177bf4d6b816d81bf/pipeline/backend/kubernetes/pod.go#L162-L173 I think it's required for sharing output between steps, i.e. passing the cloned repo from the "clone" step to all further steps. I am not sure one can turn it off. If Drone does without a PV, it would be interesting to know how they do it (i.e. how the cloned assets are carried over to the other steps).
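For illustration, the relevant part of a generated step pod looks roughly like this (a sketch based on the pod and PVC names visible in the events further down; the mount path is illustrative, not taken from the backend code):

```yaml
# Sketch of a generated step pod; only the volume-related parts shown.
apiVersion: v1
kind: Pod
metadata:
  name: wp-01hk8s1qapv3vj2bn13y5dw2fg-0-clone
spec:
  containers:
    - name: wp-01hk8s1qapv3vj2bn13y5dw2fg-0-clone
      image: docker.io/woodpeckerci/plugin-git:2.4.0
      volumeMounts:
        - name: wp-01hk8s1qapv3vj2bn13y5dw2fg-0-default
          mountPath: /woodpecker   # illustrative workspace path
  volumes:
    - name: wp-01hk8s1qapv3vj2bn13y5dw2fg-0-default
      persistentVolumeClaim:
        claimName: wp-01hk8s1qapv3vj2bn13y5dw2fg-0-default
```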
Woodpecker waits longer for the pod to reach the ready state before running the commands. Is there a way to change the frequency of this check?
The chart defines the readiness probe as follows:
```yaml
readinessProbe:
  httpGet:
    path: /healthz
    port: 8000
```
By default the checks are executed every 10s, but these options are not exposed within the chart right now. However, I would assume that this is not the issue; rather, some process during pod startup (maybe the mounting process) might be causing the delay.
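If the chart did expose them, the check frequency could be tuned via the standard Kubernetes probe fields (a sketch using stock Kubernetes settings, not existing chart values):

```yaml
readinessProbe:
  httpGet:
    path: /healthz
    port: 8000
  initialDelaySeconds: 1  # start probing sooner after container start
  periodSeconds: 2        # check every 2s instead of the 10s default
```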
FWIW, a "normal" clone step in most of my repos takes ~9s, with an RWX volume. I assume the issue might be the RWX driver you're using and not the general use of RWX (otherwise my timings would be higher as well).
Still, the differences compared to Drone are of course interesting. But before going further, we need to verify the timings with the image pull times subtracted.
Here are the RWO storage class timings. From my reading, it usually takes 10-15s for the volume to attach. So the additional delay in getting the commands running, compared with Drone, seems to be related to the volume binding delay.
```
LAST SEEN  TYPE     REASON                   OBJECT                                                              MESSAGE
15s        Warning  FailedScheduling         pod/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-clone                           0/4 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..
12s        Warning  FailedScheduling         pod/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-clone                           0/4 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..
10s        Normal   Scheduled                pod/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-clone                           Successfully assigned woodpecker/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-clone to bn-amd-8gbram-4vcpu-x4nk2
5s         Normal   SuccessfulAttachVolume   pod/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-clone                           AttachVolume.Attach succeeded for volume "pvc-4f8bd953-1b04-4fcc-bea3-5434490cb90d"
4s         Normal   Pulled                   pod/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-clone                           Container image "docker.io/woodpeckerci/plugin-git:2.4.0" already present on machine
4s         Normal   Created                  pod/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-clone                           Created container wp-01hk8s1qapv3vj2bn13y5dw2fg-0-clone
4s         Normal   Started                  pod/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-clone                           Started container wp-01hk8s1qapv3vj2bn13y5dw2fg-0-clone
15s        Normal   ExternalProvisioning     persistentvolumeclaim/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-default       waiting for a volume to be created, either by external provisioner or manually created by system administrator
15s        Normal   Provisioning             persistentvolumeclaim/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-default       External provisioner is provisioning volume for claim "woodpecker/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-default"
13s        Normal   ProvisioningSucceeded    persistentvolumeclaim/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-default       Successfully provisioned volume pvc-4f8bd953-1b04-4fcc-bea3-5434490cb90d
```
And here are the RWX storage class timings; it took approximately 2m 2s for the volume to attach.
```
LAST SEEN  TYPE     REASON                  OBJECT                                                              MESSAGE
2m1s       Warning  FailedScheduling        pod/wp-01hk8sx15w474tq9feet3dbv9b-0-clone                           0/4 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..
2m         Warning  FailedScheduling        pod/wp-01hk8sx15w474tq9feet3dbv9b-0-clone                           0/4 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..
118s       Normal   Scheduled               pod/wp-01hk8sx15w474tq9feet3dbv9b-0-clone                           Successfully assigned woodpecker/wp-01hk8sx15w474tq9feet3dbv9b-0-clone to bn-amd-8gbram-4vcpu-x4nkp
106s       Normal   Pulled                  pod/wp-01hk8sx15w474tq9feet3dbv9b-0-clone                           Container image "docker.io/woodpeckerci/plugin-git:2.4.0" already present on machine
106s       Normal   Created                 pod/wp-01hk8sx15w474tq9feet3dbv9b-0-clone                           Created container wp-01hk8sx15w474tq9feet3dbv9b-0-clone
106s       Normal   Started                 pod/wp-01hk8sx15w474tq9feet3dbv9b-0-clone                           Started container wp-01hk8sx15w474tq9feet3dbv9b-0-clone
2m2s       Normal   Provisioning            persistentvolumeclaim/wp-01hk8sx15w474tq9feet3dbv9b-0-default       External provisioner is provisioning volume for claim "woodpecker/wp-01hk8sx15w474tq9feet3dbv9b-0-default"
2m2s       Normal   ExternalProvisioning    persistentvolumeclaim/wp-01hk8sx15w474tq9feet3dbv9b-0-default       waiting for a volume to be created, either by external provisioner "openebs.io/nfsrwx" or manually created by system administrator
2m         Normal   ProvisioningSucceeded   persistentvolumeclaim/wp-01hk8sx15w474tq9feet3dbv9b-0-default       Successfully provisioned volume pvc-ad9f928a-3571-4588-a046-138f976c456d
```
Once the volume is attached to the pod, the other steps, running as containers in the same pod, use the already attached volume. Hence the timing is lower than for the clone step.
I don't know why it is necessary to have a PVC for sharing data between containers of the same pod. Couldn't we just use a shared temporary/ephemeral memory volume? That would probably be faster and less costly.
FWIW, Drone creates an EmptyDir/memory work volume: https://github.com/drone-runners/drone-runner-kube/blob/b57d0a9c80b0df00cfb92d59e3f965e8a29ac987/engine/compiler/compiler.go#L184
```go
workVolume := &engine.Volume{
	EmptyDir: &engine.VolumeEmptyDir{
		ID:   random(),
		Name: workMount.Name,
	},
}
```
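In a pod spec, such an emptyDir work volume translates to something like this (a sketch; the volume name is illustrative, and `medium: Memory` is the tmpfs variant mentioned above):

```yaml
volumes:
  - name: drone-work-volume   # illustrative name
    emptyDir:
      medium: Memory          # tmpfs-backed; omit for node-local disk
```

Either way, no PVC is provisioned and no attach step is needed, which avoids the delays seen in the events above.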
Thanks for the stats.
As mentioned in https://github.com/woodpecker-ci/woodpecker/issues/1594#issuecomment-1876760945, every step uses its own pod, which is why a volume is required.
Both your RWO and RWX attach times are really high 😮 My RWX attach time is lower than your RWO 🤯
I haven't researched yet what might cause this, but it seems to me that something is off with your volume drivers or other related components. Indeed, when attaching takes that long, WP or any other app is annoying to use.
Leaving the emptyDir/one-pod/many-pods discussion aside, I still believe you are facing a local issue in your cluster which is responsible for the high attach times. Otherwise we would have seen similar reports already, and I also haven't faced such high timings in the different clusters I have WP running on (all AWS/Azure).
I can help you debug it, but I think we can't do anything in the chart here. If we go for the one-pod/multiple-containers approach, this needs changes in the WP k8s backend code, as mentioned in https://github.com/woodpecker-ci/woodpecker/issues/1594#issuecomment-1876760945.
Yeah, makes total sense. I am asking DO support to find out whether this is normal or something is wrong with our cluster. I did further testing using basic deployments, and it took 2m for the RWX volume to attach.
I used the Helm chart to deploy the Woodpecker server and agent.
With `WOODPECKER_BACKEND_K8S_STORAGE_CLASS: "nfs-rwx-storage"`, which is an RWX storage class, the clone step takes an unusually long time (~2m) to complete. When I disable RWX, it completes in less than 20s.
I can see in the logs that the volume mount is immediately available to the build pod.
Has anyone experienced this? Is there a way to disable volume mounting?
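For context, the relevant agent settings look roughly like this in the chart values (a sketch, assuming env vars are passed through `agent.env`; treating `WOODPECKER_BACKEND_K8S_STORAGE_RWX` as the backend's toggle for disabling RWX is my assumption):

```yaml
agent:
  env:
    WOODPECKER_BACKEND: kubernetes
    WOODPECKER_BACKEND_K8S_STORAGE_CLASS: "nfs-rwx-storage"
    # Assumption: setting this to "false" is what "disabling RWX" means here,
    # falling back to an RWO volume for the pipeline workspace.
    WOODPECKER_BACKEND_K8S_STORAGE_RWX: "false"
```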