xrootd / xrootd

The XRootD central repository https://my.cdash.org/index.php?project=XRootD
http://xrootd.org
Other
163 stars 151 forks source link

XCache has high error rate under load #2360

Open bbockelm opened 4 days ago

bbockelm commented 4 days ago

Test setup:

Results:

Interestingly, the failures:

Based on the fact it seems to come from the bridge, it might be a bridge problem that is just easier to trigger when running XCache. There's no logging suggesting failures at the XrdPfc/XrdPss level.

abh3 commented 4 days ago

If at all possible, could you turn up the logging level and run a failing test to get better focus on the issue?

bbockelm commented 4 days ago

xcache-failing.tar.gz

Sure -- I attach the logs I have (~450MB unpacked; it's quite verbose). You can grep on xrootd.cache to get the cache log (but include also the origin log if you want to look for the interplay). Look for the 404 responses:

time="2024-10-24T10:24:16Z" level=info msg="241024 10:24:16 44637 unknown.170:225@[::172.17.0.2] http_Protocol: Sending resp: 404 header len:88" daemon=xrootd.cache

That gives you the HTTP session for a request that failed (unknown.170:225) which you can then filter further. Here's that sample session:

time="2024-10-24T10:24:11Z" level=info msg="241024 10:24:11 44659 XrootdBridge: unknown.170:225@[::172.17.0.2] login as nobody" daemon=xrootd.cache
time="2024-10-24T10:24:11Z" level=info msg="241024 10:24:11 44659 unknown.170:225@[::172.17.0.2] http_Protocol:  Process. lp:0xfffe6c01e398 reqstate: 0" daemon=xrootd.cache
time="2024-10-24T10:24:11Z" level=info msg="241024 10:24:11 44659 unknown.170:225@[::172.17.0.2] http_Protocol: Setting monitor info go-wrk" daemon=xrootd.cache
time="2024-10-24T10:24:11Z" level=info msg="241024 10:24:11 44659 unknown.170:225@[::172.17.0.2] http_Req: PostProcessHTTPReq req: 2 reqstate: 0 final_:True" daemon=xrootd.cache
time="2024-10-24T10:24:11Z" level=info msg="241024 10:24:11 44659 unknown.170:225@[::172.17.0.2] http_Protocol:  Process. lp:(nil) reqstate: 0" daemon=xrootd.cache
time="2024-10-24T10:24:11Z" level=info msg="241024 10:24:11 44659 unknown.170:225@[::172.17.0.2] http_Protocol: Process is exiting rc:0" daemon=xrootd.cache
time="2024-10-24T10:24:11Z" level=info msg="241024 10:24:11 44659 acc_Audit: unknown.170:225@[::172.17.0.2] grant https *@[::ffff:172.17.0.2] read /test-public/hello_world-17.txt" daemon=xrootd.cache
time="2024-10-24T10:24:15Z" level=info msg="241024 10:24:15 44637 acc_Audit: unknown.170:225@[::172.17.0.2] grant https *@[::ffff:172.17.0.2] read /test-public/hello_world-17.txt" daemon=xrootd.cache
time="2024-10-24T10:24:15Z" level=info msg="241024 10:24:15 44637 unknown.170:225@[::172.17.0.2] Pss_Open: url=pelican://u225@e1f9844cd7d9:35697//test-public/hello_world-17.txt?" daemon=xrootd.cache
time="2024-10-24T10:24:15Z" level=info msg="241024 10:24:15 44637 unknown.170:225@[::172.17.0.2] http_Req: PostProcessHTTPReq req: 2 reqstate: 0 final_:True" daemon=xrootd.cache
time="2024-10-24T10:24:15Z" level=info msg="241024 10:24:15 44637 unknown.170:225@[::172.17.0.2] http_Req: fhandle:0:0:0:0" daemon=xrootd.cache
time="2024-10-24T10:24:15Z" level=info msg="241024 10:24:15 44637 unknown.170:225@[::172.17.0.2] http_Req: Stat for GET /test-public/hello_world-17.txt stat=1318146937978955 4194304 48 1729765453 1729765455 1729765455 0600 xrootd xrootd" daemon=xrootd.cache
time="2024-10-24T10:24:15Z" level=info msg="241024 10:24:15 44637 unknown.170:225@[::172.17.0.2] http_Protocol:  Process. lp:(nil) reqstate: 0" daemon=xrootd.cache
time="2024-10-24T10:24:15Z" level=info msg="241024 10:24:15 44637 unknown.170:225@[::172.17.0.2] http_Req: No checksum requested; skipping to request state 2" daemon=xrootd.cache
time="2024-10-24T10:24:15Z" level=info msg="241024 10:24:15 44637 unknown.170:225@[::172.17.0.2] http_Req: Sending full file: 4194304" daemon=xrootd.cache
time="2024-10-24T10:24:15Z" level=info msg="241024 10:24:15 44637 unknown.170:225@[::172.17.0.2] http_Protocol: Sending resp: 200 header len:99" daemon=xrootd.cache
time="2024-10-24T10:24:15Z" level=info msg="241024 10:24:15 44637 unknown.170:225@[::172.17.0.2] http_Protocol: Process is exiting rc:0" daemon=xrootd.cache
time="2024-10-24T10:24:16Z" level=info msg="241024 10:24:16 44637 unknown.170:225@[::172.17.0.2] http_Req: PostProcessHTTPReq req: 2 reqstate: 4 final_:False" daemon=xrootd.cache
time="2024-10-24T10:24:16Z" level=info msg="241024 10:24:16 44637 unknown.170:225@[::172.17.0.2] http_Req: Got data vectors to send:1" daemon=xrootd.cache
time="2024-10-24T10:24:16Z" level=info msg="241024 10:24:16 44637 unknown.170:225@[::172.17.0.2] http_Req: PostProcessHTTPReq req: 2 reqstate: 4 final_:True" daemon=xrootd.cache
time="2024-10-24T10:24:16Z" level=info msg="241024 10:24:16 44637 unknown.170:225@[::172.17.0.2] http_Req: Got data vectors to send:1" daemon=xrootd.cache
time="2024-10-24T10:24:16Z" level=info msg="241024 10:24:16 44637 unknown.170:225@[::172.17.0.2] http_Protocol:  Process. lp:(nil) reqstate: 4" daemon=xrootd.cache
time="2024-10-24T10:24:16Z" level=info msg="241024 10:24:16 44637 unknown.170:225@[::172.17.0.2] http_Protocol: Sending resp: 404 header len:88" daemon=xrootd.cache
time="2024-10-24T10:24:16Z" level=info msg="241024 10:24:16 44637 unknown.170:225@[::172.17.0.2] http_Protocol: Process is exiting rc:-1" daemon=xrootd.cache
time="2024-10-24T10:24:16Z" level=info msg="241024 10:24:16 44637 XrootdXeq: unknown.170:225@[::172.17.0.2] disc 0:00:05" daemon=xrootd.cache

The important ones here are:

241024 10:24:16 44637 unknown.170:225@[::172.17.0.2] http_Req: PostProcessHTTPReq req: 2 reqstate: 4 final_:False
241024 10:24:16 44637 unknown.170:225@[::172.17.0.2] http_Req: Got data vectors to send:1
241024 10:24:16 44637 unknown.170:225@[::172.17.0.2] http_Req: PostProcessHTTPReq req: 2 reqstate: 4 final_:True
241024 10:24:16 44637 unknown.170:225@[::172.17.0.2] http_Req: Got data vectors to send:1
241024 10:24:16 44637 unknown.170:225@[::172.17.0.2] http_Protocol:  Process. lp:(nil) reqstate: 4
241024 10:24:16 44637 unknown.170:225@[::172.17.0.2] http_Protocol: Sending resp: 404 header len:88

Just noticed that final_:True is set -- that might indicate that particular error is in the close submission.

nkissebe commented 4 days ago

FWIW. Thats the same spot my tests were failing in. I tracked it down to XrdXrootdTransit.cc line 570. The runStatus is check aborts the call in XrdXrootdTransitRun(). But that is as far as I could go with my limited understanding of xrootd internals.

bool XrdXrootdTransit::Run(const char *xreqP, char *xdataP, int xdataL)
{
   int movLen, rc;

// We do not allow re-entry if we are curently processing a request.
// It will be reset, as need, when a response is effected.
//
   AtomicBeg(runMutex);
   rc = AtomicInc(runStatus);
   AtomicEnd(runMutex);
   if (rc) return false;
bbockelm commented 4 days ago

@abh3 - I noticed that the failures only occur when the file open is delayed. That is, this line appears:

241024 14:59:20 64064 unknown.154:234@e1f9844cd7d9 Xrootd_Protocol: 0000 Bridge delaying request 3 sec (File hello_world-12.txt is being staged; estimated time to completion 3 seconds)

Adding some print statements, after the delay the second open increments the runStatus variable but never decrements it. The read then increments it again and never decreases it.

Finally, things blow up on the close because of the if (rc) return false; check for re-entrant calls.