Closed ctappy closed 1 year ago
Hi,
Such broken mounts can be unmounted with a regular umount
(after all processes using the mount have exited); they mean that GeeseFS crashed or was killed with SIGKILL. I've tried to fix this issue on the csi-s3 side several times, but it seems the fixes still aren't enough because k8s performs some additional checks on the mountpoint first... There also seems to be a timeout-related issue, but as I understand it, it reproduces about 1 time in 50, so diagnosing it is hard :)
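As a rough illustration of that cleanup, here is a minimal sketch of detecting and lazily unmounting such a dead mount. `clean_stale_mount` is a hypothetical helper name, not part of csi-s3 or GeeseFS:

```shell
#!/bin/sh
# clean_stale_mount: hypothetical helper; checks whether a path is still
# reachable and, if not, detaches the dead FUSE mount.
clean_stale_mount() {
    mnt="$1"
    if stat "$mnt" >/dev/null 2>&1; then
        echo "healthy: $mnt"
        return 0
    fi
    # stat() on a mount whose FUSE daemon died fails with ENOTCONN
    # ("Transport endpoint is not connected").
    # Lazy unmount (-l) detaches it even if some paths are still open;
    # fall back to fusermount for unprivileged mounts.
    umount -l "$mnt" 2>/dev/null || fusermount -u "$mnt"
}

clean_stale_mount /tmp
```

On a healthy directory the helper does nothing; only a path whose stat() fails gets detached.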
I also tried to add -o auto_unmount
support to the FUSE binding of GeeseFS (the same way it's implemented in libfuse), but I haven't succeeded yet. I'll make another attempt at some point, I think :)
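For comparison, this is how the option already behaves with libfuse-based filesystems: with auto_unmount, fusermount drops the mount automatically when the FUSE daemon exits, even after SIGKILL. The example below uses sshfs as a stand-in; the host and paths are hypothetical:

```shell
# auto_unmount asks fusermount to remove the mount as soon as the
# daemon's /dev/fuse descriptor closes, so a killed daemon never leaves
# a "Transport endpoint is not connected" mountpoint behind.
sshfs -o auto_unmount user@example.com:/data /mnt/data
```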
And I can of course add an option to disable mount sharing, but I implemented it on purpose; I thought it was a feature, not a bug :))
I really like the share feature; with GeeseFS's memory usage and what I am trying to do, it's helpful. 1 in 50 sounds right. When I was using another OS (I don't recall which), it happened more often, which could have been an older libfuse or k8s setup.
I was thinking about a symlink workaround in csi-s3: create a randomly/incrementally named symlink/directory on mount, and on unmount remove it, ignoring errors (this issue).
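A minimal sketch of that workaround, assuming hypothetical helper names (`next_mount_link`, `drop_mount_link`) and a `mnt-N` naming scheme that csi-s3 does not actually use:

```shell
#!/bin/sh
# On mount: point a fresh, incrementally numbered symlink at the real
# FUSE mountpoint and print the link path.
next_mount_link() {
    base="$1"   # directory holding the numbered links
    real="$2"   # actual FUSE mountpoint
    n=0
    # -L catches dangling symlinks too, i.e. links whose mount died.
    while [ -e "$base/mnt-$n" ] || [ -L "$base/mnt-$n" ]; do
        n=$((n + 1))
    done
    ln -s "$real" "$base/mnt-$n"
    echo "$base/mnt-$n"
}

# On unmount: remove the link, ignoring errors (including the
# "Transport endpoint is not connected" case).
drop_mount_link() {
    rm -f "$1" 2>/dev/null || true
}
```

Because each mount gets a never-reused name, a broken leftover link never blocks the next mount; cleanup can happen lazily.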
@vitalif if I were to start working on an incremental directory setup with mount-issue checks, creating a new incremental directory each time, where would be a good place to start? I may start working on this in a month or two, maybe more, but any advice would be great. Thanks, and awesome project!
I updated from 0.34 to the latest csi, and saw the systemd service changes that could potentially resolve this. For now I am closing. Thanks!
After an unmount in k8s with GeeseFS, the globalmount directory shows:

```
Transport endpoint is not connected
```

Once I see this issue, the node/server cannot be used and needs to be removed. I'm not sure if this is something that can be resolved in GeeseFS, because it could be a bad unmount by k8s and a system problem.
Another issue is that some containers share the mount, so we can't use pod lifecycle hooks to unmount, because the other containers would be disrupted. Only when the mount is no longer used by any container is the geesefs process killed, and that occasionally causes this issue.
Any thoughts?