xtreemfs / xtreemfs

Distributed Fault-Tolerant File System
https://www.xtreemfs.org

mount.xtreemfs SIGSEGV #476

Open sysrqio opened 6 years ago

sysrqio commented 6 years ago

Hi,

I am currently testing whether XtreemFS matches my needs. I am using the current unstable version 1.6.0.5 on Ubuntu 16.04. While rsyncing files onto the share, I consistently see segfaults in mount.xtreemfs after some time.

[473216.194277] mount.xtreemfs[13795]: segfault at 0 ip 00000000009c7c0a sp 00007f416e8ed390 error 4 in mount.xtreemfs[400000+6ff000]
[473957.625582] mount.xtreemfs[15981]: segfault at 4 ip 00000000009c7c0a sp 00007fd373dca390 error 4 in mount.xtreemfs[400000+6ff000]
[1154776.682950] mount.xtreemfs[30417]: segfault at 2c ip 00000000008331bd sp 00007f5d3302d420 error 4 in mount.xtreemfs[400000+72b000]

Without debug logging enabled there are no interesting errors in the logs. Attached is the backtrace from one of the core dumps.

~# gdb /usr/bin/mount.xtreemfs /core 
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/mount.xtreemfs...(no debugging symbols found)...done.
[New LWP 30417]
[New LWP 30418]
[New LWP 30419]
[New LWP 30420]
[New LWP 30416]
[New LWP 30421]
[New LWP 30415]
[New LWP 30422]
[New LWP 30423]
[New LWP 30424]
[New LWP 30425]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `mount.xtreemfs -o allow_other,direct_io --enable-async-writes --vivaldi-enable'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000008331bd in xtreemfs::pbrpc::VivaldiCoordinates::MergeFrom(xtreemfs::pbrpc::VivaldiCoordinates const&) ()
[Current thread is 1 (Thread 0x7f5d3302e700 (LWP 30417))]
(gdb) bt
#0  0x00000000008331bd in xtreemfs::pbrpc::VivaldiCoordinates::MergeFrom(xtreemfs::pbrpc::VivaldiCoordinates const&) ()
#1  0x0000000000833342 in xtreemfs::pbrpc::VivaldiCoordinates::CopyFrom(xtreemfs::pbrpc::VivaldiCoordinates const&) ()
#2  0x00000000009314b7 in xtreemfs::pbrpc::VivaldiCoordinates::operator=(xtreemfs::pbrpc::VivaldiCoordinates const&) ()
#3  0x000000000093167b in xtreemfs::KnownOSD::SetCoordinates(xtreemfs::pbrpc::VivaldiCoordinates const&) ()
#4  0x000000000092fa26 in xtreemfs::Vivaldi::Run() ()
#5  0x000000000086ebbf in boost::_mfi::mf0<void, xtreemfs::Vivaldi>::operator()(xtreemfs::Vivaldi*) const ()
#6  0x000000000086eaa0 in void boost::_bi::list1<boost::_bi::value<xtreemfs::Vivaldi*> >::operator()<boost::_mfi::mf0<void, xtreemfs::Vivaldi>, boost::_bi::list0>(boost::_bi::type<void>, boost::_mfi::mf0<void, xtreemfs::Vivaldi>&, boost::_bi::list0&, int) ()
#7  0x000000000086e984 in boost::_bi::bind_t<void, boost::_mfi::mf0<void, xtreemfs::Vivaldi>, boost::_bi::list1<boost::_bi::value<xtreemfs::Vivaldi*> > >::operator()() ()
#8  0x000000000086e8ac in boost::detail::thread_data<boost::_bi::bind_t<void, boost::_mfi::mf0<void, xtreemfs::Vivaldi>, boost::_bi::list1<boost::_bi::value<xtreemfs::Vivaldi*> > > >::run() ()
#9  0x00007f5d35c405d5 in ?? () from /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.58.0
#10 0x00007f5d3579b6ba in start_thread (arg=0x7f5d3302e700) at pthread_create.c:333
#11 0x00007f5d345833dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:74
#12 0x0000000000000000 in ?? ()
robert-schmidtke commented 6 years ago

Hi, thanks for bringing this up.

It's quite interesting, as the segfault originates from Protobuf-generated code. Seeing that you're using the Vivaldi feature, have you tried the same scenario without Vivaldi? And have you followed the user guide when setting up Vivaldi?

Cheers Robert

sysrqio commented 6 years ago

Hi Robert, I think I am using the Vivaldi feature as designed. The volume has vivaldi set as its replica selection policy and mount.xtreemfs is run with --vivaldi-enable; I have also tried combinations with --vivaldi-enable-dir-updates. In my opinion the Vivaldi feature fits my setup best: it uses multiple OSDs in different locations, interconnected by a WAN link. One side is a datacenter, the other is reached over a fiber link.
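
Roughly, the setup looks like this (volume name and mount point are placeholders, and I am quoting the xtfsutil switch from memory, so the exact option name may differ):

~# xtfsutil --set-rsp vivaldi /mnt/xtreemfs/<volume>
~# mount.xtreemfs -o allow_other,direct_io --enable-async-writes --vivaldi-enable pbrpc://xtfs_head.example.com:32638/<volume> /mnt/xtreemfs/<volume>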

Using it without Vivaldi on both the client and the volume (by setting the volume's policy back to default), I do not see it core dumping. After some time the rsync stalls, and when I send a SIGABRT some file transfers report the error:

rsync: mkstemp "/path/filename" failed: Transport endpoint is not connected (107)

This is visible both with and without Vivaldi.

My configuration is mostly the defaults. The main difference is that I am using grid_ssl with IPv6 only. I have seen a similar error on the stable version, which is one of the reasons I tried the unstable version.

robert-schmidtke commented 6 years ago

Hi,

I recall we've had some issues with the transport endpoint not connected error in the past. It was usually due to some exception during read/write, which caused the FUSE client to hang and eventually die. We tried mitigating this via rigorous exception handling, but it seems there are still problems. Have you enabled the debug option in the mounting client? You can also increase the logging level at the MRC and OSDs and attach the logs to this ticket, along with the time when the issue appeared, since debug logs can grow quite large quite quickly. Thanks for helping us debug the issue!
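
For example, something along these lines should do (from memory, so the exact option and property names may differ slightly):

~# mount.xtreemfs -d DEBUG -o allow_other pbrpc://<dir-host>:32638/<volume> /mnt/<volume>

and in the MRC/OSD configuration files (e.g. mrcconfig.properties / osdconfig.properties):

debug.level = 7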

Cheers Robert

sysrqio commented 6 years ago

Hi,

When not using Vivaldi, I found these errors in the MRC log around the timestamp of the stall:

[ D | MRCRequestDispatcher | PBRPCSrv@32636  |  39 | Feb 22 00:59:41 ] enqueueing request: org.xtreemfs.foundation.pbrpc.server.RPCServerRequest: callid=1955778, type=RPC_REQUEST,proc=15,interf=20001
[ D | ProcessingStage      | ProcSt          |  43 | Feb 22 00:59:41 ] org.xtreemfs.mrc.UserException: No open voucher for global fileId c7ee8a65-ae8c-424a-bbcf-0512cb513e18:568879 (errno=POSIX_ERROR_EINVAL)
 ...                                           org.xtreemfs.mrc.quota.VoucherManager.addRenewedTimestamp(VoucherManager.java:512)
 ...                                           org.xtreemfs.mrc.operations.RenewCapabilityAndVoucherOperation.startRequest(RenewCapabilityAndVoucherOperation.java:87)
 ...                                           org.xtreemfs.mrc.stages.ProcessingStage.execute(ProcessingStage.java:310)
 ...                                           org.xtreemfs.mrc.stages.ProcessingStage.parseAndExecute(ProcessingStage.java:273)
 ...                                           org.xtreemfs.mrc.stages.ProcessingStage.processMethod(ProcessingStage.java:176)
 ...                                           org.xtreemfs.mrc.stages.MRCStage.run(MRCStage.java:129)
[ D | MRCRequestDispatcher | ProcSt          |  43 | Feb 22 00:59:41 ] org.xtreemfs.mrc.UserException: No open voucher for global fileId c7ee8a65-ae8c-424a-bbcf-0512cb513e18:568879 (errno=POSIX_ERROR_EINVAL)
 ...                                           org.xtreemfs.mrc.quota.VoucherManager.addRenewedTimestamp(VoucherManager.java:512)
 ...                                           org.xtreemfs.mrc.operations.RenewCapabilityAndVoucherOperation.startRequest(RenewCapabilityAndVoucherOperation.java:87)
 ...                                           org.xtreemfs.mrc.stages.ProcessingStage.execute(ProcessingStage.java:310)
 ...                                           org.xtreemfs.mrc.stages.ProcessingStage.parseAndExecute(ProcessingStage.java:273)
 ...                                           org.xtreemfs.mrc.stages.ProcessingStage.processMethod(ProcessingStage.java:176)
 ...                                           org.xtreemfs.mrc.stages.MRCStage.run(MRCStage.java:129)
[ D | RPCNIOSocketServer   | ProcSt          |  43 | Feb 22 00:59:41 ] response sent

and

[ D | RPCNIOSocketServer   | PBRPCSrv@32636  |  39 | Feb 22 00:59:41 ] sent response for org.xtreemfs.foundation.pbrpc.server.RPCServerResponse: callid=1955778
[ D | RPCNIOSocketServer   | PBRPCSrv@32636  |  39 | Feb 22 00:59:41 ] java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(java.base@9-internal/Native Method)
        at sun.nio.ch.SocketDispatcher.read(java.base@9-internal/SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(java.base@9-internal/IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(java.base@9-internal/IOUtil.java:197)
        at sun.nio.ch.SocketChannelImpl.read(java.base@9-internal/SocketChannelImpl.java:383)
        at org.xtreemfs.foundation.pbrpc.channels.SSLHandshakeOnlyChannelIO.read(SSLHandshakeOnlyChannelIO.java:155)
        at org.xtreemfs.foundation.pbrpc.server.RPCNIOSocketServer.readData(RPCNIOSocketServer.java:655)
        at org.xtreemfs.foundation.pbrpc.server.RPCNIOSocketServer.readConnection(RPCNIOSocketServer.java:387)
        at org.xtreemfs.foundation.pbrpc.server.RPCNIOSocketServer.run(RPCNIOSocketServer.java:306)

[ D | RPCNIOSocketServer   | PBRPCSrv@32636  |  39 | Feb 22 00:59:41 ] closing connection to /xxxx:xxx:xxxx:xxxx:0:0:fe7d:xxxx:56314
[ D | OSDStatusManager     | OSDStatusMan... |  41 | Feb 22 00:59:46 ] sending request for OSD list to DIR.

After these errors, the MRC log contains only:

[ D | RPCNIOSocketServer   | PBRPCSrv@32636  |  39 | Feb 22 00:59:41 ] closing connection to /xxxx:xxxx:xxxx:xxxx:0:0:fe7d:xxxx:56314
[ D | OSDStatusManager     | OSDStatusMan... |  41 | Feb 22 00:59:46 ] sending request for OSD list to DIR...
[ D | RPCNIOSocketClient   | OSDStatusMan... |  41 | Feb 22 00:59:46 ] sending request org.xtreemfs.foundation.pbrpc.client.RPCClientRequest@7a0e1188 no 790498
[ D | RPCClientRequest     | MRCRequestDi... |  38 | Feb 22 00:59:46 ] sending record marker: 40/2/0
[ D | RPCClientRequest     | MRCRequestDi... |  38 | Feb 22 00:59:46 ] send buffer #1: ReusableBuffer( capacity=8192 limit=42 position=0)
[ D | RPCNIOSocketClient   | MRCRequestDi... |  38 | Feb 22 00:59:46 ] sent request 790498 to xxxxx/xxxx:xxxx:xxxx:xxxx:0:0:0:xxxx:32638
[ D | RPCResponse          | MRCRequestDi... |  38 | Feb 22 00:59:46 ] response received
[ D | OSDStatusManager     | OSDStatusMan... |  41 | Feb 22 00:59:46 ] ... received OSD list from DIR
[ D | OSDStatusManager     | OSDStatusMan... |  41 | Feb 22 00:59:46 ] response...
[ D | OSDStatusManager     | OSDStatusMan... |  41 | Feb 22 00:59:46 ] registered OSDs
[ D | OSDStatusManager     | OSDStatusMan... |  41 | Feb 22 00:59:46 ] a88d1832-484e-48f2-81c6-430fc71401d4
[ D | OSDStatusManager     | OSDStatusMan... |  41 | Feb 22 00:59:46 ] dbc52792-e3d7-416f-a21b-1740f44a6e9c
[ D | OSDStatusManager     | OSDStatusMan... |  41 | Feb 22 00:59:46 ] e114ae07-8249-47d9-aec7-745e302e08ac
[ D | RPCNIOSocketClient   | MRC Heartbea... |  48 | Feb 22 00:59:52 ] sending request org.xtreemfs.foundation.pbrpc.client.RPCClientRequest@7a87e34d no 790499
[ D | RPCClientRequest     | MRCRequestDi... |  38 | Feb 22 00:59:52 ] sending record marker: 55/38/0
[ D | RPCClientRequest     | MRCRequestDi... |  38 | Feb 22 00:59:52 ] send buffer #1: ReusableBuffer( capacity=8192 limit=93 position=0)
[ D | RPCNIOSocketClient   | MRCRequestDi... |  38 | Feb 22 00:59:52 ] sent request 790499 to xxxxxxx/xxxx:xxxx:xxxx:5a57:0:0:xxxx:xxxx:32638
[ D | RPCResponse          | MRCRequestDi... |  38 | Feb 22 00:59:52 ] response received

I still need to clean sensitive data out of the logs and enable debug on the client.

The data copy stopped at 00:59:56:

2018/02/22 00:59:56 c9d1f803-8bcd-3075-8e27-0a516ab6cfd8/2018/01/14/1515934930842_1515934932842_1515489024983_445935180.mp4 1453020
c9d1f803-8bcd-3075-8e27-0a516ab6cfd8/2018/01/14/1515934932797_1515934934797_1515489024983_445937180.mp4
robert-schmidtke commented 6 years ago

Hi,

thanks for the extra information. It seems the problem is indeed in the client, judging from the logs you posted. Which SSL versions are you using? Just as a test, have you tried using the full SSL mode? I'm hoping to get some more insight from the client debug logs. Thanks for bearing with us and helping isolate the issue!

Cheers Robert

sysrqio commented 6 years ago

Hi Robert,

thank you for your patience. It takes some time to reproduce this issue. Attached is the client log.

[ D |  2/23 02:18:10.661 | 7f73b17fa700   ] xtreemfs_fuse_write /video/c9d1f803-8bcd-3075-8e27-0a516ab6cfd8/2018/01/19/.1516326588832_1516326590832_1515489024983_837607680.mp4.grXPU3 s: 129408 o:1310720
[ D |  2/23 02:18:10.665 | 7f73b17fa700   ] xtreemfs_fuse_write finished /video/c9d1f803-8bcd-3075-8e27-0a516ab6cfd8/2018/01/19/.1516326588832_1516326590832_1515489024983_837607680.mp4.grXPU3 s:129408 o:1310720 w:129408
[ D |  2/23 02:18:10.666 | 7f73b1ffb700   ] xtreemfs_fuse_write /video/c9d1f803-8bcd-3075-8e27-0a516ab6cfd8/2018/01/19/.1516326588832_1516326590832_1515489024983_837607680.mp4.grXPU3 s: 46917 o:1440128
[ D |  2/23 02:18:10.671 | 7f73b1ffb700   ] xtreemfs_fuse_write finished /video/c9d1f803-8bcd-3075-8e27-0a516ab6cfd8/2018/01/19/.1516326588832_1516326590832_1515489024983_837607680.mp4.grXPU3 s:46917 o:1440128 w:46917
[ D |  2/23 02:18:10.671 | 7f73b27fc700   ] xtreemfs_fuse_flush /video/c9d1f803-8bcd-3075-8e27-0a516ab6cfd8/2018/01/19/.1516326588832_1516326590832_1515489024983_837607680.mp4.grXPU3
[ D |  2/23 02:18:10.671 | 7f73b27fc700   ] WriteBackFileSize: file_id: 606791 # bytes: 1487045 close file? 0
[ D |  2/23 02:18:10.673 | 7f73b27fc700   ] xtreemfs_fuse_lock on path /video/c9d1f803-8bcd-3075-8e27-0a516ab6cfd8/2018/01/19/.1516326588832_1516326590832_1515489024983_837607680.mp4.grXPU3 command: set lock type: unlock start: 0 length: 0 pid: 0
[ D |  2/23 02:18:10.673 | 7f73b27fc700   ] FileHandleImplementation::ReleaseLock: Skipping unlock request as there is no lock known for the PID: 0 (Lock description: 0, 0, 0)
[ D |  2/23 02:18:10.673 | 7f73b3fff700   ] xtreemfs_fuse_release /video/c9d1f803-8bcd-3075-8e27-0a516ab6cfd8/2018/01/19/.1516326588832_1516326590832_1515489024983_837607680.mp4.grXPU3
[ D |  2/23 02:18:10.673 | 7f73b3fff700   ] VoucherManager will finalize and clear all voucher information for file: 606791
[ D |  2/23 02:18:10.673 | 7f73b3fff700   ] Sending finalizeVouchersRequest to 1 OSD(s) containing 0 + 1 XCap(s)
[ D |  2/23 02:18:10.673 | 7f73b2ffd700   ] getattr on path /video/c9d1f803-8bcd-3075-8e27-0a516ab6cfd8/2018/01/19/.1516326588832_1516326590832_1515489024983_837607680.mp4.grXPU3
[ D |  2/23 02:18:10.673 | 7f73b2ffd700   ] getattr: serving from stat-cache /video/c9d1f803-8bcd-3075-8e27-0a516ab6cfd8/2018/01/19/.1516326588832_1516326590832_1515489024983_837607680.mp4.grXPU3 0
[ D |  2/23 02:18:10.674 | 7f73b2ffd700   ] getattr: merged infos from osd_write_response, size: 1487045
[ D |  2/23 02:18:10.674 | 7f738ffff700   ] xtreemfs_fuse_getxattr  /video/c9d1f803-8bcd-3075-8e27-0a516ab6cfd8/2018/01/19/.1516326588832_1516326590832_1515489024983_837607680.mp4.grXPU3 security.capability 0
[ D |  2/23 02:18:10.674 | 7f73b37fe700   ] getattr on path /video/c9d1f803-8bcd-3075-8e27-0a516ab6cfd8/2018/01/19/.1516326588832_1516326590832_1515489024983_837607680.mp4.grXPU3
[ D |  2/23 02:18:10.674 | 7f73b37fe700   ] getattr: serving from stat-cache /video/c9d1f803-8bcd-3075-8e27-0a516ab6cfd8/2018/01/19/.1516326588832_1516326590832_1515489024983_837607680.mp4.grXPU3 0
[ D |  2/23 02:18:10.674 | 7f73b37fe700   ] getattr: merged infos from osd_write_response, size: 1487045
[ D |  2/23 02:18:10.674 | 7f73c251f700   ] Got all expected responses!
[ D |  2/23 02:18:10.674 | 7f73b3fff700   ] Skipped checkResponseConsistency: only one OSD involved. 
[ D |  2/23 02:18:10.674 | 7f73b3fff700   ] Got consistent responses on try 1.
[ D |  2/23 02:18:10.675 | 7f73b3fff700   ] Sending clearVoucherRequest to the MRC.
[ D |  2/23 02:18:10.675 | 7f73b17fa700   ] xtreemfs_fuse_chown on path /video/c9d1f803-8bcd-3075-8e27-0a516ab6cfd8/2018/01/19/.1516326588832_1516326590832_1515489024983_837607680.mp4.grXPU3
[ D |  2/23 02:18:10.677 | 7f73c151d700   ] START open_file_table: Periodic filesize update for 1 open files.
[ D |  2/23 02:18:10.677 | 7f73c151d700   ] END open_file_table: Periodic filesize update for 1 open files.
[ D |  2/23 02:18:10.677 | 7f73c1d1e700   ] START open_file_table: Periodic XCap renewal for 1 open files.
[ D |  2/23 02:18:10.677 | 7f73c1d1e700   ] Renew XCap for file_id: 606791 Expiration in: 600
[ D |  2/23 02:18:10.677 | 7f73c1d1e700   ] END open_file_table: Periodic XCap renewal for 1 open files.
[ D |  2/23 02:18:10.679 | 7f73b17fa700   ] MetadataCache UpdateStatAttributes: /video/c9d1f803-8bcd-3075-8e27-0a516ab6cfd8/2018/01/19/.1516326588832_1516326590832_1515489024983_837607680.mp4.grXPU3 to_set: 70
[ D |  2/23 02:18:10.679 | 7f73b17fa700   ] getattr on path /video/c9d1f803-8bcd-3075-8e27-0a516ab6cfd8/2018/01/19/.1516326588832_1516326590832_1515489024983_837607680.mp4.grXPU3
[ D |  2/23 02:18:10.679 | 7f73b17fa700   ] getattr: serving from stat-cache /video/c9d1f803-8bcd-3075-8e27-0a516ab6cfd8/2018/01/19/.1516326588832_1516326590832_1515489024983_837607680.mp4.grXPU3 0
[ D |  2/23 02:18:10.679 | 7f73b17fa700   ] getattr: merged infos from osd_write_response, size: 1487045
[ D |  2/23 02:18:10.680 | 7f73b1ffb700   ] xtreemfs_fuse_utimens on path /video/c9d1f803-8bcd-3075-8e27-0a516ab6cfd8/2018/01/19/.1516326588832_1516326590832_1515489024983_837607680.mp4.grXPU3
[ E |  2/23 02:18:10.682 | 7f73c251f700   ] Renewing XCap of file: 606791 failed. Error: error_type: ERRNO
posix_errno: POSIX_ERROR_EINVAL
error_message: "No open voucher for global fileId c7ee8a65-ae8c-424a-bbcf-0512cb513e18:606791 (errno=POSIX_ERROR_EINVAL)"
debug_info: ""

[ D |  2/23 02:18:10.682 | 7f73b1ffb700   ] MetadataCache UpdateStatAttributes: /video/c9d1f803-8bcd-3075-8e27-0a516ab6cfd8/2018/01/19/.1516326588832_1516326590832_1515489024983_837607680.mp4.grXPU3 to_set: 48
[ D |  2/23 02:18:10.682 | 7f73b1ffb700   ] getattr on path /video/c9d1f803-8bcd-3075-8e27-0a516ab6cfd8/2018/01/19/.1516326588832_1516326590832_1515489024983_837607680.mp4.grXPU3
[ D |  2/23 02:18:10.683 | 7f73b1ffb700   ] getattr: serving from stat-cache /video/c9d1f803-8bcd-3075-8e27-0a516ab6cfd8/2018/01/19/.1516326588832_1516326590832_1515489024983_837607680.mp4.grXPU3 1487045
[ D |  2/23 02:18:10.683 | 7f73b27fc700   ] getattr on path /video/c9d1f803-8bcd-3075-8e27-0a516ab6cfd8/2018/01/19/1516326588832_1516326590832_1515489024983_837607680.mp4
[ D |  2/23 02:18:10.683 | 7f73b27fc700   ] MetadataCache GetStat miss: /video/c9d1f803-8bcd-3075-8e27-0a516ab6cfd8/2018/01/19/1516326588832_1516326590832_1515489024983_837607680.mp4 [100000]
[ D |  2/23 02:18:10.685 | 7f73b27fc700   ] The server xt.sysrq.io:32636 (a756c159-ee01-4543-887f-bddd803ce538) denied the requested operation. Error Value: POSIX_ERROR_ENOENT Error message: file or directory 'unifi/video/c9d1f803-8bcd-3075-8e27-0a516ab6cfd8/2018/01/19/1516326588832_1516326590832_1515489024983_837607680.mp4' does not exist (errno=POSIX_ERROR_ENOENT)

We have seen the voucher error in the MRC log, too. I have not tried the full SSL mode yet, but I can enable it to give it a try. I am using OpenSSL 1.0.2g-1ubuntu4.10, which is the default in Ubuntu 16.04.

robert-schmidtke commented 6 years ago

Hi, thanks for your help.

Let me try to get your use case straight: you're copying a video file into XtreemFS, and after the copy is done you rename the file .1516326588832_1516326590832_1515489024983_837607680.mp4.grXPU3 --> 1516326588832_1516326590832_1515489024983_837607680.mp4? It appears some operation is being attempted on the file before it has been successfully created.

The voucher should be renewed; I don't think that error has anything to do with the transport endpoint is not connected error. Or are you observing the error each time the client crashes? Are you using the quota feature of XtreemFS?

Cheers Robert

sysrqio commented 6 years ago

Yes, the files are copied onto the share using rsync's default behaviour (i.e. not using --inplace). Rsync stats the file to compare timestamps and file sizes, copies the content into a temporary file, and afterwards renames the temporary file into place. From the rsync manpage:

-c, --checksum This changes the way rsync checks if the files have been changed and are in need of a transfer. Without this option, rsync uses a "quick check" that (by default) checks if each file's size and time of last modification match between the sender and receiver. [..]

I am using the default rsync behaviour.
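
For reference, the transfers are started roughly like this (paths are placeholders):

~# rsync -a --progress /local/video/ /mnt/xtreemfs/video/
~# rsync -a --inplace /local/video/ /mnt/xtreemfs/video/    # variant that writes in place, without the temp file + rename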

I need to verify whether the voucher error I posted before from the MRC log is also visible when Vivaldi is not used. I have seen the transport endpoint is not connected error every time after sending SIGABRT to rsync. The difference between Vivaldi and the default policy was that with Vivaldi, mount.xtreemfs core dumps in addition to the transport endpoint is not connected errors.

I am not sure whether this share currently has a quota set, or whether I removed the quota setting for the tests. I am going to verify this.
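
Something like this should show whether a quota is set on the volume (mount point is a placeholder):

~# xtfsutil /mnt/xtreemfs/<volume> | grep -i quota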

Lukas

sysrqio commented 6 years ago

After a few days of testing and investigation I switched to full SSL. With full SSL I was, in my opinion, no longer able to reproduce this issue as it was initially described here. There were a few issues on the way back down to unencrypted connections, which may be related to the initial topic of this issue.

Since the setup has drifted quite far from the original one, I am not sure whether these issues belong here, but I will give it a try.

sysrqio commented 6 years ago

OSD becoming unresponsive: with an unencrypted connection and Vivaldi enabled, the OSDs sometimes visibly become unresponsive. The affected OSDs are in a datacenter, interconnected via Ethernet. When this occurs, no new OSD is selected and the transport endpoint is no longer connected.

[ D |  2/24 21:41:00.142 | 7f6321ffb700   ] xtreemfs_fuse_read finished /images/control.qcow2 s:131072 o:2142370240 r:131072
[ D |  2/24 21:41:00.143 | 7f6320ff9700   ] xtreemfs_fuse_read /images/control.qcow2 s:1600 o:2142501312
[ D |  2/24 21:41:00.144 | 7f6320ff9700   ] xtreemfs_fuse_read finished /images/control.qcow2 s:1600 o:2142501312 r:1600
[ D |  2/24 21:41:00.145 | 7f63227fc700   ] xtreemfs_fuse_read /images/control.qcow2 s:129472 o:2142502912
[ D |  2/24 21:41:00.149 | 7f63227fc700   ] xtreemfs_fuse_read finished /images/control.qcow2 s:129472 o:2142502912 r:129472
[ D |  2/24 21:41:00.149 | 7f6321ffb700   ] xtreemfs_fuse_read /images/control.qcow2 s:131072 o:2142632384
[ D |  2/24 21:41:00.154 | 7f6321ffb700   ] xtreemfs_fuse_read finished /images/control.qcow2 s:131072 o:2142632384 r:131072
[ D |  2/24 21:41:00.155 | 7f6320ff9700   ] xtreemfs_fuse_read /images/control.qcow2 s:1600 o:2142763456
[ D |  2/24 21:41:00.156 | 7f6320ff9700   ] xtreemfs_fuse_read finished /images/control.qcow2 s:1600 o:2142763456 r:1600
[ D |  2/24 21:41:06.953 | 7f63237fe700   ] START open_file_table: Periodic filesize update for 2 open files.
[ D |  2/24 21:41:06.953 | 7f63237fe700   ] END open_file_table: Periodic filesize update for 2 open files.
[ D |  2/24 21:41:06.960 | 7f6323fff700   ] START open_file_table: Periodic XCap renewal for 2 open files.
[ D |  2/24 21:41:06.960 | 7f6323fff700   ] Renew XCap for file_id: 39 Expiration in: 540
[ D |  2/24 21:41:06.960 | 7f6323fff700   ] Renew XCap for file_id: 39 Expiration in: 540
[ D |  2/24 21:41:06.960 | 7f6323fff700   ] Renew XCap for file_id: 40 Expiration in: 540
[ D |  2/24 21:41:06.960 | 7f6323fff700   ] END open_file_table: Periodic XCap renewal for 2 open files.
[ D |  2/24 21:41:06.962 | 7f633080b700   ] XCap renewed for file_id: 39
[ D |  2/24 21:41:06.963 | 7f633080b700   ] XCap renewed for file_id: 39
[ D |  2/24 21:41:06.964 | 7f633080b700   ] XCap renewed for file_id: 40
[ D |  2/24 21:41:09.294 | 7f6331c2f700   ] AsyncWriteHandler::(Re)Write for file_id: f5d971a3-f327-42bc-95cf-f1a4da623cfd:39, XCap Expiration in: 597
[ D |  2/24 21:41:09.294 | 7f6331c2f700   ] AsyncWriteHandler::(Re)Write for file_id: f5d971a3-f327-42bc-95cf-f1a4da623cfd:39, XCap Expiration in: 597
[ E |  2/24 21:41:09.320 | 7f6331c2f700   ] Got no response from server osd5.example.com:32641 (e114ae07-8249-47d9-aec7-745e302e08ac), retrying (infinite attempts left) (Possible reason: The server is using SSL, and the client is not.)
[ I |  2/24 21:41:09.321 | 7f6331c2f700   ] Retrying. Waiting 14 more seconds till next retry.
[ D |  2/24 21:41:24.294 | 7f6331c2f700   ] AsyncWriteHandler::(Re)Write for file_id: f5d971a3-f327-42bc-95cf-f1a4da623cfd:39, XCap Expiration in: 582
[ D |  2/24 21:41:24.294 | 7f6331c2f700   ] AsyncWriteHandler::(Re)Write for file_id: f5d971a3-f327-42bc-95cf-f1a4da623cfd:39, XCap Expiration in: 582
[ I |  2/24 21:41:24.298 | 7f6331c2f700   ] Retrying. Waiting 14 more seconds till next retry.
[ D |  2/24 21:41:39.296 | 7f6331c2f700   ] AsyncWriteHandler::(Re)Write for file_id: f5d971a3-f327-42bc-95cf-f1a4da623cfd:39, XCap Expiration in: 567
[ D |  2/24 21:41:39.296 | 7f6331c2f700   ] AsyncWriteHandler::(Re)Write for file_id: f5d971a3-f327-42bc-95cf-f1a4da623cfd:39, XCap Expiration in: 567
[ E |  2/24 21:41:39.323 | 7f6331c2f700   ] Got no response from server osd5.example.com:32641 (e114ae07-8249-47d9-aec7-745e302e08ac), retrying (infinite attempts left) (Possible reason: The server is using SSL, and the client is not.)
[ I |  2/24 21:41:39.323 | 7f6331c2f700   ] Retrying. Waiting 14 more seconds till next retry.
[ D |  2/24 21:41:40.950 | 7f6332430700   ] Vivaldi: recalculating against: 4315274c-7abb-4241-a337-e4fa0fd9bb43
[ D |  2/24 21:41:40.961 | 7f6332430700   ] Vivaldi: ping response received. Measured time: 10 ms
[ D |  2/24 21:41:40.961 | 7f6332430700   ] Vivaldi: storing coordinates to file: (3.52833, -4.74291)
[ D |  2/24 21:41:40.988 | 7f6332430700   ] Vivaldi: Sending coordinates to DIR.
[ D |  2/24 21:41:40.989 | 7f6332430700   ] Vivaldi: sleeping during 280 s.

The latest setup contains a few OSD settings mentioned on a mailing list. These seem to improve things a bit.

max_client_queue = 1500
flease.retries = 5
flease.message_to_ms = 10000
flease.lease_timeout_ms = 300000
capability_timeout = 5000

It seems to happen regularly that the OSD logs an overloaded stage: [ W | StorageThread | OSD PreProcSt | 42 | Feb 18 23:28:37 ] stage is overloaded, request 2395121 for c7ee8a65-ae8c-424a-bbcf-0512cb513e18:140576 dropped. This happens even though the node does not appear to be saturated in terms of network, disk I/O, RAM or CPU.

It seems to be a general problem in several places that no new OSD is selected. I am also wondering why there are "connection" problems between the OSDs/MRC and the clients at all; there are no visible problems in the network connectivity. The WAN interconnect is at least 250 Mbit/s (synchronous) with a low latency of 5-10 ms.

Some examples of this:

[ W | RPCNIOSocketServer   | PBRPCSrv@32640  |  15 | Feb 25 14:15:36 ] client sent too many requests... not accepting new requests from /xxxx:xxxx:1efd:1701:225:xxxx:xxxx:xxxx:58250, q=101
[ W | RPCNIOSocketServer   | PBRPCSrv@32640  |  15 | Feb 25 14:15:37 ] client allowed to send data again: /xxxx:xxxx:1efd:1701:225:xxxx:xxxx:xxxx:58250, q=49
[ W | RPCNIOSocketServer   | PBRPCSrv@32640  |  15 | Feb 25 14:16:10 ] client sent too many requests... not accepting new requests from /xxxx:xxxx:1efd:1701:225:xxxx:xxxx:xxxx:58250, q=101
[ W | RPCNIOSocketServer   | PBRPCSrv@32640  |  15 | Feb 25 14:16:10 ] client allowed to send data again: /xxxx:xxxx:1efd:1701:225:xxxx:xxxx:xxxx:58250, q=49
[ W | RWReplicationStage   | RWReplSt        |  26 | Feb 25 03:24:21 ] (R:a88d1832-484e-48f2-81c6-430fc71401d4) replica for file f5d971a3-f327-42bc-95cf-f1a4da623cfd:39 failed (in method: processSetAuthoritativeState): INTERNAL_SERVER_ERROR/POSIX_ERROR_EIO: (R:a88d1832-484e-48f2-81c6-430fc71401d4) read status FAILED for f5d971a3-f327-42bc-95cf-f1a4da623cfd:39 on dbc52792-e3d7-416f-a21b-1740f44a6e9c (this is request #2 out of 2 which failed); java.io.IOException: sending RPC failed: server closed connection (java.io.IOException: Connection reset by peer);
java.io.IOException: sending RPC failed: server closed connection (java.io.IOException: Connection reset by peer)
        at org.xtreemfs.foundation.pbrpc.client.RPCResponse.get(RPCResponse.java:69)
        at org.xtreemfs.osd.rwre.CoordinatedReplicaUpdatePolicy$1.responseAvailable(CoordinatedReplicaUpdatePolicy.java:123)
        at org.xtreemfs.foundation.pbrpc.client.RPCResponse.requestFailed(RPCResponse.java:147)
        at org.xtreemfs.foundation.pbrpc.client.RPCNIOSocketClient.closeConnection(RPCNIOSocketClient.java:716)
        at org.xtreemfs.foundation.pbrpc.client.RPCNIOSocketClient.readConnection(RPCNIOSocketClient.java:532)
        at org.xtreemfs.foundation.pbrpc.client.RPCNIOSocketClient.run(RPCNIOSocketClient.java:271)
[ W | RWReplicationStage   | RWReplSt        |  22 | Feb. 26 00:36:54 ] (R:dbc52792-e3d7-416f-a21b-1740f44a6e9c) replica for file f5d971a3-f327-42bc-95cf-f1a4da623cfd:38 failed (in method: processObjectFetched): ERRNO/POSIX_ERROR_EACCES: org.xtreemfs.foundation.pbrpc.client.PBRPCException: ERRNO/POSIX_ERROR_EACCES: capability is not valid (timed out) / ;
org.xtreemfs.foundation.pbrpc.client.PBRPCException: ERRNO/POSIX_ERROR_EACCES: capability is not valid (timed out) / 
        at org.xtreemfs.foundation.pbrpc.client.RPCResponse.get(RPCResponse.java:90)
        at org.xtreemfs.osd.rwre.RWReplicationStage$4.responseAvailable(RWReplicationStage.java:564)
        at org.xtreemfs.foundation.pbrpc.client.RPCResponse.responseAvailable(RPCResponse.java:136)
        at org.xtreemfs.foundation.pbrpc.client.RPCNIOSocketClient.assembleResponse(RPCNIOSocketClient.java:573)
        at org.xtreemfs.foundation.pbrpc.client.RPCNIOSocketClient.readConnection(RPCNIOSocketClient.java:522)
        at org.xtreemfs.foundation.pbrpc.client.RPCNIOSocketClient.run(RPCNIOSocketClient.java:271)

It looks like the OSDs are kept quite busy by interruptions during these operations:

[ W | ReusableBuffer       | Finalizer       |   3 | Feb 25 23:46:27 ] buffer was finalized but not freed before! buffer = ReusableBuffer( capacity=8192 limit=8192 position=8192), refCount=1
[ W | ReusableBuffer       | Finalizer       |   3 | Feb 25 23:46:27 ] buffer was finalized but not freed before! buffer = ReusableBuffer( capacity=8192 limit=8192 position=8192), refCount=1
[ W | ReusableBuffer       | Finalizer       |   3 | Feb 25 23:46:27 ] buffer was finalized but not freed before! buffer = ReusableBuffer( capacity=8192 limit=8192 position=8192), refCount=1
[ W | ReusableBuffer       | Finalizer       |   3 | Feb 25 23:46:27 ] buffer was finalized but not freed before! buffer = ReusableBuffer( capacity=8192 limit=8192 position=8192), refCount=1
[ W | ReusableBuffer       | Finalizer       |   3 | Feb 25 23:46:27 ] buffer was finalized but not freed before! buffer = ReusableBuffer( capacity=8192 limit=4136 position=4136), refCount=1
[ E | RPCNIOSocketServer   | PBRPCSrv@32640  |  15 | Feb 25 23:55:43 ] invalid record marker size (40/39321498/0) received, closing connection to client /xxxx:xxxx:xxxx:5a57:0:0:0:xxxx:47850
[ E | RPCNIOSocketServer   | PBRPCSrv@32640  |  15 | Feb 25 23:56:10 ] invalid record marker size (828517681/926167142/875847990) received, closing connection to client /xxxx:xxxx:xxxx:5a57:0:0:0:xxxx:47866
[ E | RWReplicationStage   | RWReplSt        |  26 | Feb 25 23:56:35 ] received maxObjAvail event for unknow file: f5d971a3-f327-42bc-95cf-f1a4da623cfd:38
[ E | RPCNIOSocketServer   | PBRPCSrv@32640  |  15 | Feb 25 23:56:35 ] invalid record marker size (842479922/761607012/925709361) received, closing connection to client /xxxx:xxxx:xxxx:5a57:0:0:0:xxxx:47872
[ E | RPCNIOSocketServer   | PBRPCSrv@32640  |  15 | Feb 25 23:56:50 ] invalid record marker size (1577686586/0/4398) received, closing connection to client /xxxx:xxxx:xxxx:5a57:0:0:0:xxxx:47876
[ E | RPCNIOSocketServer   | PBRPCSrv@32640  |  15 | Feb 25 23:57:05 ] invalid record marker size (962796068/1697722676/1634021431) received, closing connection to client /xxxx:xxxx:xxxx:5a57:0:0:0:xxxx:47878

I noticed that the whole XtreemFS setup becomes unstable and very slow when one of the OSDs crashes, until it comes back. Based on the network traffic, it did not start any re-replication.

I am going to open a few more issues for other things I found during the investigation.

robert-schmidtke commented 6 years ago

Hi, could you share an anonymized version of your DIR, MRC and OSD configurations? What is the block size you're using, and the default replication factor? Are you primarily working with (few) large files and/or (many) small files? Are your clients running in parallel and generating the workload concurrently? How many OSDs do you have in your setup? Are they all in the same DC? I'm trying to figure out how the OSD manages to get overloaded (in particular the Preprocessing Stage). You can increase the queue capacity using the max_requests_queue_length parameter (defaults to 1000).
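
As a sketch, this would go into the OSD configuration file (the file name depends on your setup, e.g. osdconfig.properties):

max_requests_queue_length = 2000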

sysrqio commented 6 years ago

Attached are the configuration files of the DIR, the MRC and one OSD. The only differences between the OSDs are the listening port, the listen IP and the data path.

For the stripe size I am using the default value of 128 kB, on an XFS filesystem with the default block size.

~# xtfsutil /mnt/xtreemfs/libvirt
Path (on volume)     /
XtreemFS file Id     f5d971a3-f327-42bc-95cf-f1a4da623cfd:1
XtreemFS URL         pbrpc://xtfs_head.example.com:32638/libvirt
Owner                root
Group                root
Type                 volume
Available Space      6.64651918 TB
Quota / Used Space   unlimited / 280 GB
Voucher Size         10 MB
Default User Quota   unlimited
Default GroupQuota   unlimited
Num. Files/Dirs      19 / 28
Access Control p.    POSIX (permissions & ACLs)
OSD Selection p.     1000,3003
Replica Selection p. 3003
Default Striping p.  STRIPING_POLICY_RAID0 / 1 / 128kB
Default Repl. p.     WaR1 with 3 replicas
Snapshots enabled    no
Tracing enabled      no
Selectable OSDs      12580201-8174-42c0-a3ef-e3f2b58eaaa0 (o3.rbx.xxxx:32640)
                     4315274c-7abb-4241-a337-e4fa0fd9bb43 (o3.ne.xxxx:32640)
                     8b1328af-85df-4d30-81e5-81f7d71396ce (o1.ne.xxxx:32640)
                     a88d1832-484e-48f2-81c6-430fc71401d4 (o1.rbx.xxxx:32640)
                     e114ae07-8249-47d9-aec7-745e302e08ac (o2.rbx.xxxx:32640)
# xfs_info /data/osd1
meta-data=/dev/mapper/data1-xtreemfs isize=512    agcount=4, agsize=118829056 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1 spinodes=0
data     =                       bsize=4096   blocks=475316224, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=232088, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

I tried multiple combinations of WqRq and WaR1 with a replication factor of 2 and 3, which seemed to make no real difference. During my tests I rsynced different types of data onto the share: about 500 GB of smaller files, mostly around 1-3 MB each, and a few bigger files onto a separate share, each between 50 GB and 500 GB per file. All of them showed issues. I started by rsyncing these files in parallel against one MRC (2 parallel rsync jobs) and reduced this during the tests to one rsync job at a time. Both approaches showed these issues, but if I remember correctly the parallel approach showed them faster. I started the tests with 4 OSDs on one node at home and two additional OSDs on one node in the remote DC. Now I have 5 OSDs at home on 2 nodes and an additional 3 OSDs in the remote DC. During the copy of the files most of the traffic stayed within the DC; my first tests caused massive usage of the WAN connection, which has since been reduced considerably.

[image attachment]

I am going to tune the max_requests_queue_length parameter to test this.

config.tar.gz

robert-schmidtke commented 6 years ago

Thanks for the very detailed description.

We've had some issues with rsync in the past, if I remember correctly, and it might also relate to #340. My hunch is that it has something to do with the many metadata operations performed. I see you're using no_atime=true at the MRC, which is the recommended setting. However, maybe you want to specify --fuse-option noatime explicitly. You could check the output of mount to verify that XtreemFS is mounted with noatime, and do the same for your XFS partitions. This might greatly reduce the load incurred by rsync. Seeing that parallel rsync jobs produce the issue faster also hints at this.
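
As a sketch (DIR address, volume and mount point are placeholders):

~# mount.xtreemfs --fuse-option noatime -o allow_other pbrpc://<dir-host>:32638/<volume> /mnt/<volume>
~# mount | grep xtreemfs    # noatime should show up among the mount options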

Where is your MRC and your DIR located in the above picture? I assume you only have one DIR and one MRC, and you're not using the replication feature of these two services?

You could also try using bigger block sizes, e.g. 1 MiB for the smaller files and 256 MiB for the larger files.
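
Changing the default striping policy would look something along these lines (quoting from memory; please check xtfsutil --help for the exact syntax):

~# xtfsutil --set-dsp -p RAID0 -s 1024 -w 1 /mnt/<volume>    # 1 MiB stripe size, striping width 1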

I'm starting to think that this could be an issue that will be difficult for us to diagnose and fix. It might have implications for many parts of the system, and I'm not sure we'll have the resources to address all of them. Also, FUSE itself has some quirks that we cannot fully work around.

Reading the rsync manpage shows the following option: --bwlimit=RATE (limit socket I/O bandwidth). Maybe with an artificially limited rate the XtreemFS services will be able to keep up with the number of requests, and you could then increase --bwlimit until the issue occurs again. This way we might be able to control the occurrence of the issue a little better. This is only an idea, however, and I'm not exactly sure it will help isolate the cause.
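
For example (the rate is illustrative; rsync interprets it in KiB/s by default):

~# rsync -a --bwlimit=10240 /local/video/ /mnt/xtreemfs/video/    # roughly 10 MiB/s, raise step by step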

Thanks again for being so willing to help us out here!