Hi! There's a bug handling systemd units which breaks storage access with the following error: "Transport endpoint is not connected".
Steps to reproduce
Working with two pods using the same PVC on the same Node:
Create first pod and wait unit it's available
Delete created pod
Create second pod before the systemd unit (which was created for the first pod) is terminated
4.The pod attaches to the same unit and fails to access storage when it's finally stopped
Root cause
NodeUnstageVolume initiates the unmount by stopping the systemd unit and returns without waiting for the operation to complete. So when a new pod is started, NodeStageVolume and NodePublishVolume check that the stage mount point already exists (but it's under unmouning at the same time) and attach to it. After a while, the systemd unit dies, and the new pod loses access to the FS.
We often see the behavior because systemd unit is stuck on termination (for some reason) until timeout and is being killed.
Possible solutions
Wait until mount point is unmounted before NodeUnstageVolume exits:
resCh := make(chan string)
_, err = conn.StopUnit(unitName, "replace", resCh)
if err != nil {
return false, err
}
res := <-resCh // wait until unit is stopped
glog.Infof("Unit is stopped with result: %s")
Hi! There's a bug handling systemd units which breaks storage access with the following error: "Transport endpoint is not connected".
Steps to reproduce
Working with two pods using the same PVC on the same Node:
Root cause
NodeUnstageVolume
initiates the unmount by stopping the systemd unit and returns without waiting for the operation to complete. So when a new pod is started,NodeStageVolume
andNodePublishVolume
check that the stage mount point already exists (but it's under unmouning at the same time) and attach to it. After a while, the systemd unit dies, and the new pod loses access to the FS.We often see the behavior because systemd unit is stuck on termination (for some reason) until timeout and is being killed.
Possible solutions
Wait until mount point is unmounted before
NodeUnstageVolume
exits:StopUnit
only starts operation, but doesn't wait for completion: https://github.com/yandex-cloud/k8s-csi-s3/blob/80dcfa4c021f726d67fca3e4df78e2a49a5c3e6b/pkg/mounter/mounter.go#L99Do unmount explicitly after stopping systemd unit: Replace this code fragment
ExecStopPost
for systemd unit (https://github.com/yandex-cloud/k8s-csi-s3/blob/80dcfa4c021f726d67fca3e4df78e2a49a5c3e6b/pkg/mounter/geesefs.go#L203-L216) with explicitUnmount
call:Not sure which one is better, so let's discuss)