xrootd / xrootd

The XRootD central repository https://my.cdash.org/index.php?project=XRootD
http://xrootd.org

TPC issue via davs:// to redirector with multiple independent servers behind it #2241

Closed MarcusEbert closed 17 hours ago

MarcusEbert commented 2 months ago

Using davs:// in a TPC via gfal-copy/davix-cp results in a high failure rate when the system consists of a redirector and multiple servers, each with its own local file system. The issue described below occurs when the xrootd installation is the destination and the TPC is done in pull mode (xrootd is the active site).

Test system to replicate the issue we saw in the production system:

command used: gfal-copy -p -f --copy-mode pull https://belle2-webdav-analysis-data.cc.kek.jp:8443/disk/belle/TMP/belle/Raw/e0012_8GBTest/physics/r04420/sub00/physics.0012.4420.8GBTest.f00000.root https://elephant36.heprc.uvic.ca:1094/TMP/belle3/some_test_file4

Error: TRANSFER ERROR: Copy failed (3rd pull). Last attempt: copy HTTP 400 : Server Error
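
To put a number on the failure rate, the command above can be repeated in a loop. The following is only a sketch: the helper names, the per-attempt `run<i>` subdirectory layout, and the example URLs in the usage comment are hypothetical, and it assumes gfal-copy exits with a non-zero status when the transfer fails. Only the gfal-copy flags are taken from the command quoted above.

```python
import subprocess
from typing import Callable, Sequence


def build_command(source: str, destination: str) -> list[str]:
    # Same flags as the gfal-copy command quoted above.
    return ["gfal-copy", "-p", "-f", "--copy-mode", "pull", source, destination]


def run_gfal_copy(command: Sequence[str]) -> int:
    # Assumption: a failed transfer is reported through a non-zero exit status.
    return subprocess.run(command, check=False).returncode


def count_failures(
    source: str,
    destination_prefix: str,
    attempts: int,
    runner: Callable[[Sequence[str]], int] = run_gfal_copy,
) -> int:
    """Run the transfer `attempts` times and return how many attempts failed."""
    failures = 0
    for i in range(attempts):
        # Hypothetical layout: a fresh directory per attempt, so every attempt
        # targets a path that does not exist on any server beforehand.
        destination = f"{destination_prefix}/run{i}/some_test_file"
        if runner(build_command(source, destination)) != 0:
            failures += 1
    return failures


# Example call (placeholder URLs, replace with real endpoints):
# count_failures("https://source.example/path/file.root",
#                "https://redirector.example:1094/TMP/test", attempts=20)
```

The `runner` argument exists only so the loop can be exercised without a grid environment; in normal use the default runs the real gfal-copy.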

From the redirector log: the copy request fails because the destination directory does not exist yet on server2.

If the directory does not exist on any server, there is an additional step in between to create it, but that still does not prevent the final copy redirect from going to a server that does not have the directory yet. The failure rate is ~50% with 2 equal servers.
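
The ~50% figure is what a simple model would predict. The sketch below is a toy model, not derived from the attached logs: it assumes the redirector picks the server for the directory creation and the server for the final copy independently and with equal probability, and that the copy fails whenever the two choices differ. The function name is hypothetical.

```python
from fractions import Fraction
from itertools import product


def failure_rate(num_servers: int) -> Fraction:
    """Exact failure probability under the toy model described above."""
    servers = range(num_servers)
    # Every equally likely pair (server that created the directory,
    # server that receives the final copy request).
    outcomes = list(product(servers, servers))
    # The copy fails when it lands on a server without the directory.
    failures = sum(1 for mkdir_target, copy_target in outcomes
                   if mkdir_target != copy_target)
    return Fraction(failures, len(outcomes))


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(n, failure_rate(n))
```

Under these assumptions the rate is (N-1)/N for N servers, so adding more independent servers would make the failures more frequent, not less.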

Logs from the redirector and the servers are attached.

The server that finally receives the copy command has already checked the permissions for the copy and fails only because the full path doesn't exist yet. Shouldn't the server just create the directory itself instead of returning a failure to the client?

cmsd_redirector.log xrd_server2.log xrd_server1.log xrd_redirector.log (server1 had a bit more debug enabled than server2; let me know if you want me to redo it with different debug output)

MarcusEbert commented 1 month ago

Is there any progress on this issue? Currently, it prevents using davs:// on a system with multiple independent (i.e. non-shared file system) servers behind a redirector when TPC transfers via davs:// pull are needed. (We may run out of space on our single server by mid-June, and it would be great if we could extend the space with additional servers by then; push unfortunately doesn't work in the experiment's setup.)

abh3 commented 1 month ago

We have devised a fix. The patch will appear in 5.7.0. The patch will require that you deploy the new version on all of your data servers.

MarcusEbert commented 1 month ago

Thanks for the good news! We'll try it out when 5.7.0 comes out.

amadio commented 1 week ago

Fixed by f564c61990e6beb73d3d2a3e042e4ed0471a4c2d in devel.

amadio commented 1 week ago

@MarcusEbert Could you please confirm whether the commit above indeed fixes the issue you observe? Thank you!

MarcusEbert commented 1 week ago

@amadio Do I need to download devel and compile it to test?

amadio commented 1 week ago

No, I just launched a build of the RPMs for the devel branch. You can download the pre-built RPMs from here: https://github.com/xrootd/xrootd/actions/runs/9618617477 and install them with dnf. Just go into the right subdirectory and use dnf install *.rpm.

MarcusEbert commented 1 week ago

Thanks! Will do so; probably early next week.

MarcusEbert commented 17 hours ago

I could not reproduce the issue when testing with the new RPMs!

amadio commented 17 hours ago

Very good, thank you for confirming it. Closing!