yandex-cloud / geesefs

Finally, a good FUSE FS implementation over S3
Other
693 stars 45 forks source link

geesefs 0.40.0 random crash during create and deleting files #101

Closed ctappy closed 9 months ago

ctappy commented 10 months ago

While converting csv files into parquet files using scala/spark, I started to delete all of the files instead of stopping the process and deleting the files. geesefs in k8s crashed without a stackdump/error message in journalctl

The files were being created on one geesefs spark worker, and on another server with geesefs i deleted them

ctappy commented 10 months ago
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.405368 fuse.DEBUG Inode.LookUp 3 dataset/1.0 [dataset/1.0/raw_data]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.407494 fuse.DEBUG Ref 4 dataset/1.0/raw_data [680]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.407508 fuse.DEBUG <-- LookUpInode 3 raw_data <nil>
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.407530 fuse.DEBUG Op 0x0040c92e        connection.go:517] -> OK (inode 4)
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.407604 fuse.DEBUG Op 0x0040c930        connection.go:428] <- OpenDir (inode 4, PID 1811785)
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.407649 fuse.DEBUG Op 0x0040c930        connection.go:517] -> OK ()
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.407705 fuse.DEBUG Op 0x0040c932        connection.go:428] <- ReadDir (inode 4, PID 1811785)
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.407798 fuse.DEBUG ReadDir 4 dataset/1.0/raw_data [0]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.407806 fuse.DEBUG Ref 4 dataset/1.0/raw_data [681]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.407812 fuse.DEBUG <-- ReadDir 4 dataset/1.0/raw_data [raw_data 0]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.407817 fuse.DEBUG Ref 3 dataset/1.0 [576]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.407821 fuse.DEBUG <-- ReadDir 4 dataset/1.0/raw_data [1.0 1]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.441992 fuse.DEBUG Ref 21638815 dataset/1.0/raw_data/2017 [27]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.442011 fuse.DEBUG <-- ReadDir 4 dataset/1.0/raw_data [2017 2]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.442017 fuse.DEBUG Ref 21639821 dataset/1.0/raw_data/2018 [61]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.442023 fuse.DEBUG <-- ReadDir 4 dataset/1.0/raw_data [2018 3]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.442030 fuse.DEBUG Ref 21639822 dataset/1.0/raw_data/2019 [24]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.442036 fuse.DEBUG <-- ReadDir 4 dataset/1.0/raw_data [2019 4]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.442041 fuse.DEBUG Ref 21639823 dataset/1.0/raw_data/2020 [24]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.442046 fuse.DEBUG <-- ReadDir 4 dataset/1.0/raw_data [2020 5]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.442051 fuse.DEBUG Ref 21639824 dataset/1.0/raw_data/2021 [24]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.442054 fuse.DEBUG <-- ReadDir 4 dataset/1.0/raw_data [2021 6]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.442059 fuse.DEBUG Ref 21639825 dataset/1.0/raw_data/2022 [24]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.442063 fuse.DEBUG <-- ReadDir 4 dataset/1.0/raw_data [2022 7]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.442067 fuse.DEBUG Ref 21590565 dataset/1.0/raw_data/2023 [30]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.442071 fuse.DEBUG <-- ReadDir 4 dataset/1.0/raw_data [2023 8]

Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.443934 fuse.DEBUG ReadDir 21638815 dataset/1.0/raw_data/2017 [0]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.443945 fuse.DEBUG Ref 21638815 dataset/1.0/raw_data/2017 [28]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.443951 fuse.DEBUG <-- ReadDir 21638815 dataset/1.0/raw_data/2017 [2017 0]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.443959 fuse.DEBUG Ref 4 dataset/1.0/raw_data [682]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.443965 fuse.DEBUG <-- ReadDir 21638815 dataset/1.0/raw_data/2017 [raw_data 1]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.476737 fuse.DEBUG Ref 21638816 dataset/1.0/raw_data/2017/2017-11 [34]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.476754 fuse.DEBUG <-- ReadDir 21638815 dataset/1.0/raw_data/2017 [2017-11 2]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.476761 fuse.DEBUG Ref 21641874 dataset/1.0/raw_data/2017/2017-12 [19]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.476767 fuse.DEBUG <-- ReadDir 21638815 dataset/1.0/raw_data/2017 [2017-12 3]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.476784 fuse.DEBUG Op 0x0040c93a        connection.go:517] -> OK ()
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.476872 fuse.DEBUG Op 0x0040c93c        connection.go:428] <- ReadDir (inode 21638815, PID 1811785)
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.476964 fuse.DEBUG ReadDir 21638815 dataset/1.0/raw_data/2017 [4]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.476980 fuse.DEBUG Op 0x0040c93c        connection.go:517] -> OK ()
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.477029 fuse.DEBUG Op 0x0040c93e        connection.go:428] <- ReleaseDirHandle (PID 0)
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.477048 fuse.DEBUG ReleaseDirHandle dataset/1.0/raw_data/2017
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.477065 fuse.DEBUG Op 0x0040c93e        connection.go:517] -> OK ()
Jan 15 11:55:56 nodekw7 geesefs[1093863]: 2024/01/15 11:55:56.477099 fuse.DEBUG Op 0x025cf746        connection.go:428] <- ReadSymlink (inode 3158, PID 1811785)
Jan 15 11:55:56 nodekw7 geesefs[1093863]: 2024/01/15 11:55:56.477145 fuse.DEBUG Op 0x025cf746        connection.go:517] -> OK ()
Jan 15 11:55:56 nodekw7 geesefs[1093863]: 2024/01/15 11:55:56.477218 fuse.DEBUG Op 0x025cf748        connection.go:428] <- ReadSymlink (inode 3158, PID 1811785)
Jan 15 11:55:56 nodekw7 geesefs[1093863]: 2024/01/15 11:55:56.477288 fuse.DEBUG Op 0x025cf748        connection.go:517] -> OK ()
Jan 15 11:55:56 nodekw7 geesefs[1093863]: 2024/01/15 11:55:56.477354 fuse.DEBUG Op 0x025cf74a        connection.go:428] <- ReadSymlink (inode 3158, PID 1811785)
Jan 15 11:55:56 nodekw7 geesefs[1093863]: 2024/01/15 11:55:56.477379 fuse.DEBUG Op 0x025cf74a        connection.go:517] -> OK ()
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.477423 fuse.DEBUG Op 0x0040c940        connection.go:428] <- OpenDir (inode 21638816, PID 1811785)
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.477445 fuse.DEBUG Op 0x0040c940        connection.go:517] -> OK ()
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.477493 fuse.DEBUG Op 0x0040c942        connection.go:428] <- ReadDir (inode 21638816, PID 1811785)
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.477544 fuse.DEBUG ReadDir 21638816 dataset/1.0/raw_data/2017/2017-11 [0]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.477559 fuse.DEBUG Ref 21638816 dataset/1.0/raw_data/2017/2017-11 [35]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.477566 fuse.DEBUG <-- ReadDir 21638816 dataset/1.0/raw_data/2017/2017-11 [2017-11 0]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.477571 fuse.DEBUG Ref 21638815 dataset/1.0/raw_data/2017 [29]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.477576 fuse.DEBUG <-- ReadDir 21638816 dataset/1.0/raw_data/2017/2017-11 [2017 1]
Jan 15 11:55:56 nodekw7 geesefs[1093817]: 2024/01/15 11:55:56.511710 fuse.DEBUG Ref 21638817 dataset/1.0/raw_data/2017/2017-11/2017-11-01 [5]

Jan 15 11:55:57 nodekw7 geesefs[1093817]: 2024/01/15 11:55:57.754193 fuse.DEBUG Ref 21649909 dataset/1.0/raw_data/2017/2017-11/2017-11-15/sample_data_#DT-F@_2017-11-15_180000-181000.txt.gz [3]
Jan 15 11:55:57 nodekw7 geesefs[1093817]: 2024/01/15 11:55:57.754198 fuse.DEBUG <-- ReadDir 21641887 dataset/1.0/raw_data/2017/2017-11/2017-11-15 [sample_data_#DT-F@_2017-11-15_180000-181000.txt.gz 240]
Jan 15 11:55:57 nodekw7 geesefs[1093817]: 2024/01/15 11:55:57.754203 fuse.DEBUG Ref 21649910 dataset/1.0/raw_data/2017/2017-11/2017-11-15/sample_data_#DT-F@_2017-11-15_181000-182000.txt.gz [3]
Jan 15 11:55:57 nodekw7 geesefs[1093817]: 2024/01/15 11:55:57.754207 fuse.DEBUG <-- ReadDir 21641887 dataset/1.0/raw_data/2017/2017-11/2017-11-15 [sample_data_#DT-F@_2017-11-15_181000-182000.txt.gz 241]
Jan 15 11:55:57 nodekw7 geesefs[1093817]: 2024/01/15 11:55:57.754220 fuse.DEBUG Op 0x0040cdd8        connection.go:517] -> OK ()
Jan 15 11:55:57 nodekw7 geesefs[1093817]: 2024/01/15 11:55:57.754282 fuse.DEBUG Op 0x0040cdda        connection.go:428] <- ReadDir (inode 21641887, PID 1811785)
Jan 15 11:55:57 nodekw7 geesefs[1093817]: 2024/01/15 11:55:57.754317 fuse.DEBUG ReadDir 21641887 dataset/1.0/raw_data/2017/2017-11/2017-11-15 [242]
Jan 15 11:55:57 nodekw7 geesefs[1093817]: 2024/01/15 11:55:57.754324 fuse.DEBUG Ref 21649911 dataset/1.0/raw_data/2017/2017-11/2017-11-15/sample_data_#DT-F@_2017-11-15_182000-183000.txt.gz [3]
Jan 15 11:55:57 nodekw7 geesefs[1093817]: 2024/01/15 11:55:57.754329 fuse.DEBUG <-- ReadDir 21641887 dataset/1.0/raw_data/2017/2017-11/2017-11-15 [sample_data_#DT-F@_2017-11-15_182000-183000.txt.gz 242]
Jan 15 11:55:57 nodekw7 geesefs[1093817]: 2024/01/15 11:55:57.754335 fuse.DEBUG Ref 21649912 dataset/1.0/raw_data/2017/2017-11/2017-11-15/sample_data_#DT-F@_2017-11-15_183000-184000.txt.gz [3]
Jan 15 11:55:57 nodekw7 geesefs[1093817]: 2024/01/15 11:55:57.754339 fuse.DEBUG <-- ReadDir 21641887 dataset/1.0/raw_data/2017/2017-11/2017-11-15 [sample_data_#DT-F@_2017-11-15_183000-184000.txt.gz 243]
Jan 15 11:55:57 nodekw7 geesefs[1093817]: 2024/01/15 11:55:57.754345 fuse.DEBUG Ref 21649913 dataset/1.0/raw_data/2017/2017-11/2017-11-15/sample_data_#DT-F@_2017-11-15_184000-185000.txt.gz [3]
Jan 15 11:55:57 nodekw7 geesefs[1093817]: 2024/01/15 11:55:57.754350 fuse.DEBUG <-- ReadDir 21641887 dataset/1.0/raw_data/2017/2017-11/2017-11-15 [sample_data_#DT-F@_2017-11-15_184000-185000.txt.gz 244]
Jan 15 11:55:57 nodekw7 geesefs[1093817]: 2024/01/15 11:55:57.754356 fuse.DEBUG Ref 21649914 dataset/1.0/raw_data/2017/2017-11/2017-11-15/sample_data_#DT-F@_2017-11-15_185000-190000.txt.gz [3]
Jan 15 11:55:57 nodekw7 geesefs[1093817]: 2024/01/15 11:55:57.754360 fuse.DEBUG <-- ReadDir 21641887 dataset/1.0/raw_data/2017/2017-11/2017-11-15 [sample_data_#DT-F@_2017-11-15_185000-190000.txt.gz 245]
Jan 15 11:55:57 nodekw7 geesefs[1093817]: 2024/01/15 11:55:57.754367 fuse.DEBUG Ref 21649915 dataset/1.0/raw_data/2017/2017-11/2017-11-15/sample_data_#DT-F@_2017-11-15_190000-191000.txt.gz [3]
Jan 15 11:55:57 nodekw7 geesefs[1093817]: 2024/01/15 11:55:57.754371 fuse.DEBUG <-- ReadDir 21641887 dataset/1.0/raw_data/2017/2017-11/2017-11-15 [sample_data_#DT-F@_2017-11-15_190000-191000.txt.gz 246]
Jan 15 11:55:57 nodekw7 geesefs[1093817]: 2024/01/15 11:55:57.754377 fuse.DEBUG Ref 21649916 dataset/1.0/raw_data/2017/2017-11/2017-11-15/sample_data_#DT-F@_2017-11-15_191000-192000.txt.gz [3]
Jan 15 11:55:57 nodekw7 geesefs[1093817]: 2024/01/15 11:55:57.754382 fuse.DEBUG <-- ReadDir 21641887 dataset/1.0/raw_data/2017/2017-11/2017-11-15 [sample_data_#DT-F@_2017-11-15_191000-192000.txt.gz 247]
Jan 15 11:55:57 nodekw7 geesefs[1093817]: 2024/01/15 11:55:57.754386 fuse.DEBUG Ref 21649917 dataset/1.0/raw_data/2017/2017-11/2017-11-15/sample_data_#DT-F@_2017-11-15_192000-193000.txt.gz [3]
Jan 15 11:56:15 nodekw7 systemd[1]: geesefs-node_2dusr_2dd1_2d00000005.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jan 15 11:56:15 nodekw7 systemd[1]: var-lib-kubelet-plugins-kubernetes.io-csi-ru.yandex.s3.csi-7dc3bb9adbb8835421cb7f43f7d7d6b10d0a838aa50b616e14ccc1d2f6108f21-globalmount.mount: Deactivated successfully.
Jan 15 11:56:15 nodekw7 systemd[1]: geesefs-node_2dusr_2dd1_2d00000005.service: Failed with result 'exit-code'.
ctappy commented 10 months ago

the above journalctl message are snippets over the last few seconds before the crash, separated by spaces

vitalif commented 10 months ago

Hi, maybe this was the OOM killer? Check dmesg please. Panics should generate a backtrace...

ctappy commented 10 months ago

nothing in dmesg, I have a lot of memory in the server it is running on. I have tried recreating the issue but I was not successful. I'll close for now

ctappy commented 10 months ago

this happened again. I believe the issue is related to the nic and not geeseFS

ctappy commented 10 months ago

Jan 30 15:40:07 dapu1kw2 geesefs[3181398]: 2024/01/30 15:40:07.787447 main.INFO I/O: 2.00 read/s, 86.67 % hits, 0.10 write/s; metadata: 29.13 read/s, 0.17 write/s, 7.20 noop/s, 6457 alive, 0.00 evict/s; 0.07 flush/s Jan 30 15:40:07 dapu1kw2 geesefs[3181412]: 2024/01/30 15:40:07.978089 main.INFO I/O: 0.00 read/s, 0.00 % hits, 0.00 write/s; metadata: 0.00 read/s, 0.00 write/s, 0.00 noop/s, 1 alive, 0.00 evict/s; 0.00 flush/s

Jan 30 15:40:18 dapu1kw2 geesefs[3181398]: panic: Unknown inode: 1554 Jan 30 15:40:18 dapu1kw2 geesefs[3181398]: goroutine 78 [running]: Jan 30 15:40:18 dapu1kw2 geesefs[3181398]: github.com/yandex-cloud/geesefs/internal.(Goofys).getInodeOrDie(0xc00015a6c0, 0x0?) Jan 30 15:40:18 dapu1kw2 geesefs[3181398]: /home/runner/work/geesefs/geesefs/internal/goofys.go:437 +0x105 Jan 30 15:40:18 dapu1kw2 geesefs[3181398]: github.com/yandex-cloud/geesefs/internal.(Goofys).Flusher(0xc00015a6c0) Jan 30 15:40:18 dapu1kw2 geesefs[3181398]: /home/runner/work/geesefs/geesefs/internal/goofys.go:638 +0x16a Jan 30 15:40:18 dapu1kw2 geesefs[3181398]: created by github.com/yandex-cloud/geesefs/internal.newGoofys Jan 30 15:40:18 dapu1kw2 geesefs[3181398]: /home/runner/work/geesefs/geesefs/internal/goofys.go:368 +0xa4a

hey @vitalif I can almost always recreate this issue, one in four attempts. I create about 1500 files and then remove them all with rm -rf path/to/dir

ctappy commented 10 months ago

let me know if you need debug on or anything related, thanks so much! I may put some time into this in a few weeks

vitalif commented 9 months ago

Hi, can you please try to build a version from master and retest with it?

ctappy commented 9 months ago

I should be able to get a some cycles in tomorrow on this, thanks!

ctappy commented 9 months ago

@vitalif looks really good, thanks!!