yandex-cloud / k8s-csi-s3

GeeseFS-based CSI for mounting S3 buckets as PersistentVolumes

CSI-S3 fails after a few hours of inactivity #7

Open uzhinskiy opened 2 years ago

uzhinskiy commented 2 years ago

Hello. We are trying to use CSI-S3 with geesefs as storage backend for elasticsearch. We are using this elasticsearch as a snapshot checker. Most of the time it is idle and not processing any data. We noticed that after a few hours of inactivity all IO operations in elasticsearch's pod failed with following log lines in kube-system/csi-s3-XXX:

E0329 12:21:59.786708      1 utils.go:101] GRPC error: rpc error: code = Internal desc = Unmount failed: exit status 32
Unmounting arguments: /var/lib/kubelet/pods/44d8a275-2b1d-4236-8d6a-ba6f4d709b60/volumes/kubernetes.io~csi/pvc-69e61d54-8b2a-420d-b1b3-0260b790d33e/mount
Output: umount: /var/lib/kubelet/pods/44d8a275-2b1d-4236-8d6a-ba6f4d709b60/volumes/kubernetes.io~csi/pvc-69e61d54-8b2a-420d-b1b3-0260b790d33e/mount: not mounted
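For anyone hitting the same symptom: before restarting anything, the stale-mount state can usually be confirmed from the node itself. This is a minimal sketch, not part of the CSI driver; the path argument is a placeholder, and the real kubelet mount path looks like the one in the log above:

```shell
#!/bin/sh
# Sketch: check whether a FUSE mountpoint is still reachable.
# When the geesefs process has died or lost its endpoint, stat on the
# mountpoint typically fails (e.g. "Transport endpoint is not connected").
check_mount() {
    path="$1"
    if stat "$path" >/dev/null 2>&1; then
        echo "OK: $path is reachable"
        return 0
    else
        echo "STALE: $path is not reachable; a lazy unmount may clear it:"
        echo "  umount -l \"$path\""
        return 1
    fi
}

# Default to /tmp purely for demonstration; pass the real kubelet
# .../volumes/kubernetes.io~csi/<pvc>/mount path on an affected node.
check_mount "${1:-/tmp}"
```

If the check reports the mount as unreachable while the pod is still running, that points at the FUSE process dying rather than at kubelet unmounting the volume.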

After we manually restarted the pod, everything was fine again. We suspect the problem is caused by a network disruption that terminates the TCP connection, which is then not re-established once the network problem is gone.

How do we prevent this behavior of CSI-S3?

Thank you.

vitalif commented 2 years ago

Hi, sorry for the long wait. I'm not sure what it means in your case, but geesefs itself definitely handles network failures well. The CSI driver unmounting the volume after a period of inactivity also looks rather strange. It should at least be logged in the pod log - maybe the pod was stopped? Check it with kubectl describe pod ... Or maybe you already found the answer yourself? =)