xrootd / xrootd

The XRootD central repository https://my.cdash.org/index.php?project=XRootD
http://xrootd.org

server drops some new connections and logs "attempt to reuse active link" #1928

Closed smithdh closed 1 year ago

smithdh commented 1 year ago

Hello. It was noticed that for a time (about 24 hours) one of the EOS disk servers at CERN would drop some new incoming connections and, in apparent correlation, log entries like:

230217 01:12:17 281970 XrdLink: attempt to reuse active link

The situation seemed to return to normal a little more than 24 hours after the first such message. (I've done some investigation and believe I have a potential fix; I'll describe it below in the comments.)

smithdh commented 1 year ago

Trying to summarise some (perhaps) relevant extra information:

Over several hours before the server began dropping connections, there were a number of log entries mentioning 'bad file descriptor'. For example, grepping for ":1387@" gave this sequence:

[...]
230216 22:36:41 time=1676583401.392132 func=_close                   level=INFO  logid=unknown unit=fst@st-096-gg500f11.cern.ch:1095 tid=00007f14d91fe700 source=XrdFstOfsFile:2251             tident=daemon.75437:3834@st-096-gg500b5c sec=      uid=1 gid=1 name=nobody geo="" msg="done close" rc=0 errc=0
230216 22:36:41 282023 XrootdXeq: daemon.75437:3834@st-096-gg500b5c disc 0:00:02
230216 22:36:48 25875 XrootdXeq: 3.22020:3834@p06636710m46495 pub IP64 login as daemon
230216 22:36:48 25875 XrootdXeq: 3.22020:3834@p06636710m46495 disc 0:00:07
230216 22:36:48 178174 XrdLink: Unable to close 3.22020:3834@p06636710m46495; bad file descriptor
230216 22:36:58 292432 XrootdXeq: daemon.189977:3834@st-096-ff809e4a pub IP64 login as daemon
[...]

The first line is EOS-specific and relates to closing a file; the next line is the close of the connection by xrootd. Then a new connection on fd 3834 is logged (22:36:48), which is closed very shortly afterwards (also logged at 22:36:48). The connection time indicates that the connection started 7 seconds earlier; it is possible the machine had some loading or other issue that delayed everything for a few seconds at this point (indeed this may be a contributing factor). However, a short time later, still at 22:36:48, there is a close error logged from a different thread than the one that logged the "disc" message. The final line is a new connection on this descriptor 10 seconds later, which is unremarkable.

About 24 hours after the issue began, there are a few log messages such as:

230218 02:25:22 57131 XrdLinkScan: Link cmsplt01.15591:2640@b7g29p5493 is disabled and idle.

(however this message is not unique to this period).

My interpretation is that there is the possibility of two XrdLinkXeq::Close() calls on a link in succession. The first call of Close() calls Recycle() on the Protocol (generating the 'disc' log line) and then closes the network FD (without a message). A second XrdLinkXeq::Close() would then not Recycle() the Protocol (as it has been set to null or to the XrdPoll_End protocol), but the close(fd) returns an error (which is logged as the "bad file descriptor" line).
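As a standalone illustration (not XRootD code), the 'bad file descriptor' symptom is exactly what a second close() on an already-closed descriptor number produces; a minimal sketch:

```cpp
// Minimal standalone sketch (not XRootD code) of the double-close effect:
// the second close() on an already-closed descriptor fails with EBADF.
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return 1;

    // First close succeeds, as in the first XrdLinkXeq::Close().
    if (close(sv[0]) == 0) std::printf("first close(%d): ok\n", sv[0]);

    // Second close on the same number fails with EBADF -- the counterpart
    // of the "Unable to close ...; bad file descriptor" log line.
    if (close(sv[0]) != 0)
        std::printf("second close(%d): %s\n", sv[0], std::strerror(errno));

    close(sv[1]);
    return 0;
}
```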

Such a possibility seems to exist (and I have approximately reproduced it by adding a couple of sleep() calls), e.g. in the following situation:

(1) A connection is Accepted() in "thread 1", assigned to a Link, and the Link is successfully matched to a protocol in XrdProtLoad::Process, consuming 20 bytes from the connection. The link has Activate() called successfully (i.e. the FD is added to a poller's epoll set, but with 0 for the selected events).

(2) The connection is closed (e.g. the client process is terminated).

(3) Execution of "thread 1" continues at the bottom of XrdProtLoad::Process, where it directly calls the protocol's Process() method, which returns -1 (since it read the end of stream during Recv), and execution returns to XrdLinkXeq::DoIt(). DoIt() calls XrdLinkXeq::Close().

(4) The poller's poll loop does an epoll_wait (let's say in "thread 2"). The connection FD is already in the poll set, with 0 events selected. But epoll can still return the error events EPOLLERR and EPOLLHUP (a small standalone sketch of this behaviour follows the list). It returns one (EPOLLHUP in this case, I believe).

(5) "Thread 1" continues in Close(). It calls Recycle() on the Protocol; detaches the poller (removes the FD from the poll set), sets PollInfo.Poller = 0 and PollInfo.FD = -1, and the Link is Unhook()ed and available to be Allocated again; and closes the fd.

(6) "Thread 2" resumes in XrdPollE::Start() and evaluates the condition if (!(pInfo->isEnabled) && pInfo->FD >= 0), which is false since pInfo->FD = -1. The false branch causes it to Schedule the Link.

(7) A "thread 3" then executes XrdLinkXeq::DoIt() of the Link. The Protocol is XrdPoll_End, and rc will be -1. Close() will be executed again.
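The epoll detail in step (4) can be checked in isolation. A minimal standalone sketch (not XRootD code): even with an interest mask of 0, epoll_wait() still reports EPOLLHUP/EPOLLERR when the peer goes away, so the poller can be woken for an FD whose Close() is already in flight:

```cpp
// Minimal standalone sketch (not XRootD code): a socket registered with an
// interest mask of 0 still produces EPOLLHUP/EPOLLERR from epoll_wait()
// when the peer goes away -- these events cannot be suppressed.
#include <cstdio>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return 1;

    int ep = epoll_create1(0);
    struct epoll_event ev = {};   // events == 0: nothing explicitly requested
    ev.data.fd = sv[0];
    epoll_ctl(ep, EPOLL_CTL_ADD, sv[0], &ev);

    close(sv[1]);                 // peer "terminates": hang-up on sv[0]

    struct epoll_event out;
    int n = epoll_wait(ep, &out, 1, 1000);
    if (n == 1)
        std::printf("got event 0x%x on fd %d (EPOLLHUP=0x%x)\n",
                    (unsigned)out.events, out.data.fd, (unsigned)EPOLLHUP);

    close(sv[0]);
    close(ep);
    return 0;
}
```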

The idea is that between (5) and (6) a new connection can come in and be assigned the same FD, and, assuming its Link is assigned to a different poller, it can go on and could even become fully established and idle. Step (7) would then close its FD, but leave the Link busy (i.e. LinkBat[fd] = XRDLINK_USED), leading to the situation described in this issue. The Link may be cleaned up by the idleScan if configured. (Indeed on the EOS node it is set with 'xrd.timeout idle 86400'.)
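The FD-number reuse that makes step (7) dangerous is just POSIX behaviour: the kernel hands out the lowest free descriptor number, so a connection accepted right after a close() typically gets the same number. A minimal standalone sketch (not XRootD code), using open() on /dev/null as a stand-in for accept():

```cpp
// Minimal standalone sketch (not XRootD code): POSIX hands out the lowest
// available descriptor number, so a descriptor opened right after a close()
// typically gets the same number -- which is why a late close "by number"
// in step (7) can hit a brand-new, unrelated connection.
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

int main() {
    int fd1 = open("/dev/null", O_RDONLY);
    std::printf("first open : fd %d\n", fd1);

    close(fd1);                             // old link closes its descriptor

    int fd2 = open("/dev/null", O_RDONLY);  // "new connection" arrives
    std::printf("second open: fd %d (same number reused: %s)\n",
                fd2, fd2 == fd1 ? "yes" : "no");

    close(fd2);
    return 0;
}
```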

smithdh commented 1 year ago

I'll attach a possible fix now ("issue-1928-fix-v1.txt"), but I think I'll wait until later in the week before making a PR, to give some more time to check this again and maybe reverify the steps in the reproduced cases, etc.

@abh3 perhaps you'd also like to consider whether the flow and/or the fix makes sense to you. (Comment here or by email if you like, or wait for a PR to comment, if that's better.) Thanks very much.

smithdh commented 1 year ago

issue-1928-fix-v1.txt

abh3 commented 1 year ago

This looks like a plausible scenario. As you peruse the code you will find various comments that allude to trying to fix these kinds of problems. The fixes are not straightforward, largely because epoll can generate error events even when you don't want them (i.e., there is no way to completely suppress all epoll events). So, you can get into this kind of thread race without even knowing about it. That's why we have the fd firewall, and when we get into this situation the connections in the race are dropped to avoid more serious consequences. The decision to drop a connection is based on the fact that an fd was recycled by the OS but we still think it's in use and there are live objects associated with it. That's a last-ditch effort to keep the server from crashing, and it works, but obviously has side-effects. On the other hand, since the client will simply try to reconnect, those side-effects are rarely visible other than adding latency and logging serious messages. I'll be very interested to see what your solution will be.
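To make the described "fd firewall" idea concrete, here is a purely hypothetical sketch (not the actual XRootD code; the adoptNewConnection() helper and table size are made up, and only the LinkBat/XRDLINK_USED names are borrowed from the discussion above) of the kind of check that drops a new connection whose descriptor number still appears to belong to a live link:

```cpp
// Hypothetical illustration (not the actual XRootD code) of an "fd firewall":
// if a freshly accepted descriptor's slot is still marked as belonging to a
// live link, refuse the new connection rather than share state with the
// stale object.
#include <array>
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

enum LinkState { XRDLINK_FREE = 0, XRDLINK_USED = 1 };  // names borrowed from the issue text

static std::array<LinkState, 1024> linkBat{};  // one slot per descriptor number

// Called with the descriptor returned by accept(); returns false if the slot
// is unexpectedly still in use (the "attempt to reuse active link" case).
bool adoptNewConnection(int fd) {
    if (fd < 0 || fd >= (int)linkBat.size()) return false;
    if (linkBat[fd] == XRDLINK_USED) {
        std::fprintf(stderr, "attempt to reuse active link (fd %d); dropping\n", fd);
        close(fd);          // drop the new connection as the safer option
        return false;
    }
    linkBat[fd] = XRDLINK_USED;
    return true;
}

int main() {
    int fd = open("/dev/null", O_RDONLY);  // stands in for an accepted socket
    linkBat[fd] = XRDLINK_USED;            // a stale link never released this slot
    adoptNewConnection(fd);                // refused: prints the firewall message
    return 0;
}
```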

smithdh commented 1 year ago

Hi. Thanks for the information, Andy. I realised that my proposed mechanism for why our server stopped showing errors [the idleScan clearing the situation] doesn't quite fit with the logs: I had better understand and/or reproduce this aspect too, since if there's something not understood about what happened, I think it's risky to propose a solution. The server is still running; a core image would be ideal, but this is part of the production service, so things are delicate [i.e. we must not impact anyone], of course. I'll update this issue when I can, though probably not before next week.