xrootd / xrootd

The XRootD central repository https://my.cdash.org/index.php?project=XRootD
http://xrootd.org
Other
151 stars 150 forks source link

XrdHttp-TPC: Issue with TPC multi-stream PULL #2108

Open ccaffy opened 10 months ago

ccaffy commented 10 months ago

While implementing the packet marking for HTTP TPC transfers, I saw that there is an issue with multistream HTTP TPC PULL transfers.

The request:

export DST=http://xrootd-ccaffy-dev01.cern.ch:2001/tmp/BIGFILE_5GB_COPY
export SRC=http://xrootd-ccaffy-dev01.cern.ch:1096/tmp/BIGFILE_5GB

curl -v --capath /etc/grid-security/certificates -L -X COPY  -H "Source: $DST" -H "SciTag: 144" -H 'X-Number-Of-Streams: 3' "$SRC"

On the client-side:

* About to connect() to xrootd-ccaffy-dev01.cern.ch port 1096 (#0)
*   Trying 137.138.123.14...
* Connected to xrootd-ccaffy-dev01.cern.ch (137.138.123.14) port 1096 (#0)
> COPY /tmp/BIGFILE_5GB HTTP/1.1
> User-Agent: curl/7.29.0
> Host: xrootd-ccaffy-dev01.cern.ch:1096
> Accept: */*
> Source: http://xrootd-ccaffy-dev01.cern.ch:2001/tmp/BIGFILE_5GB_COPY
> SciTag: 144
> X-Number-Of-Streams: 3
> 
< HTTP/1.1 201 Created
< Connection: Keep-Alive
< Server: XrootD/v5.5.5-361-g1c80707
< Content-Type: text/plain
< Transfer-Encoding: chunked
< 
Perf Marker
Timestamp: 1697721653
Stripe Index: 0
Stripe Bytes Transferred: 50331648
Total Stripe Count: 1
RemoteConnections: tcp:137.138.123.14:2001
End
* Connection #0 to host xrootd-ccaffy-dev01.cern.ch left intact
failure: Internal logic error led to early abort; current offset is 50331648 while full size is 5368709120

The logs on the pulling machine:

231019 15:00:27 31857 anon.0:34@xrootd-ccaffy-dev01 http_Req: Appended header fields to opaque info: 'scitag.flow=144'
231019 15:00:27 31857 TPC_PullRequest: event=PULL_START, local=/tmp/BIGFILE_5GB, remote=http://xrootd-ccaffy-dev01.cern.ch:2001/tmp/BIGFILE_5GB_COPY, user=(anonymous); Starting a pull request
[New Thread 0x7fffdffff700 (LWP 31914)]
[Thread 0x7fffdffff700 (LWP 31914) exited]
231019 15:00:27 31857 TPC_PullRequest: event=SIZE_SUCCESS, local=/tmp/BIGFILE_5GB, remote=http://xrootd-ccaffy-dev01.cern.ch:2001/tmp/BIGFILE_5GB_COPY, user=(anonymous), streams=3; Successfully determined remote size for pull request: 5368709120
231019 15:00:28 31857 TPC_PullRequest: event=SIZE_SUCCESS, local=/tmp/BIGFILE_5GB, remote=http://xrootd-ccaffy-dev01.cern.ch:2001/tmp/BIGFILE_5GB_COPY, user=(anonymous), streams=3; Successfully determined remote size for pull request: 5368709120
231019 15:00:28 31857 anon.0:34@xrootd-ccaffy-dev01 http_Protocol: Starting chunked response
231019 15:00:28 31857 anon.0:34@xrootd-ccaffy-dev01 http_Protocol: Sending resp: 201 header len:138
231019 15:00:28 31857 http_Protocol: Sending 138 bytes
231019 15:00:28 31857 TPC_PullRequest: event=RESPONSE_START, local=/tmp/BIGFILE_5GB, remote=http://xrootd-ccaffy-dev01.cern.ch:2001/tmp/BIGFILE_5GB_COPY, user=(anonymous), streams=3; Initial transfer response sent to the TPC client
231019 15:00:28 31857 TPC_PullRequest: event=PERF_MARKER, local=/tmp/BIGFILE_5GB, remote=http://xrootd-ccaffy-dev01.cern.ch:2001/tmp/BIGFILE_5GB_COPY, user=(anonymous), streams=3, bytes_transferred=50331648
231019 15:00:28 31857 anon.0:34@xrootd-ccaffy-dev01 http_Protocol: Sending encoded chunk of size 154
231019 15:00:28 31857 http_Protocol: Sending 4 bytes
231019 15:00:28 31857 http_Protocol: Sending 154 bytes
231019 15:00:28 31857 http_Protocol: Sending 2 bytes
[New Thread 0x7fffdffff700 (LWP 31915)]
[Thread 0x7fffdffff700 (LWP 31915) exited]
231019 15:00:28 31857 TPC_CanStartTransfer: Unable to start transfers as no buffers are available.  Available buffers: 0, Active curl handles: 0, Transfers in progress: 0
231019 15:00:28 31857 TPC_Stream::DumpBuffers: Beginning dump of stream buffers.
231019 15:00:28 31857 TPC_Stream::DumpBuffers: Stream offset: 0
231019 15:00:28 31857 TPC_Stream::DumpBuffers: Buffer 0: Offset=16777216, Size=16777216, Capacity=16777216
231019 15:00:28 31857 TPC_Stream::DumpBuffers: Buffer 1: Offset=33554432, Size=16777216, Capacity=16777216
231019 15:00:28 31857 TPC_Stream::DumpBuffers: Buffer 2: Offset=0, Size=16777216, Capacity=16777216
231019 15:00:28 31857 TPC_Stream::DumpBuffers: Finish dump of stream buffers.
231019 15:00:28 31857 TPC_StartTransfers: Unable to start transfers.
231019 15:00:28 31857 TPC_PullRequest: event=MULTISTREAM_IDLE, local=/tmp/BIGFILE_5GB, remote=http://xrootd-ccaffy-dev01.cern.ch:2001/tmp/BIGFILE_5GB_COPY, user=(anonymous), streams=3, bytes_transferred=50331648; No handles are able to run.  Streams=3, concurrency=3
231019 15:00:28 31857 TPC_PullRequest: event=MULTISTREAM_FAIL, local=/tmp/BIGFILE_5GB, remote=http://xrootd-ccaffy-dev01.cern.ch:2001/tmp/BIGFILE_5GB_COPY, user=(anonymous), streams=3, bytes_transferred=50331648, tpc_status=0; failure: Internal logic error led to early abort; current offset is 50331648 while full size is 5368709120
231019 15:00:28 31857 anon.0:34@xrootd-ccaffy-dev01 http_Protocol: Sending encoded chunk of size 106
231019 15:00:28 31857 http_Protocol: Sending 4 bytes
231019 15:00:28 31857 http_Protocol: Sending 106 bytes
231019 15:00:28 31857 http_Protocol: Sending 2 bytes
231019 15:00:28 31857 anon.0:34@xrootd-ccaffy-dev01 http_Protocol: Sending encoded chunk of size 0
231019 15:00:28 31857 http_Protocol: Sending 3 bytes
231019 15:00:28 31857 http_Protocol: Sending 2 bytes
231019 15:00:28 31857 http_Req:  XrdHttpReq request ended.
231019 15:00:28 31857 anon.0:34@xrootd-ccaffy-dev01 http_Protocol: Process is exiting rc:1
231019 15:00:28 31857 Xrd_Poll: Poller 0 enabled anon.0:34@xrootd-ccaffy-dev01
231019 15:00:28 31859 anon.0:34@xrootd-ccaffy-dev01 Xrd_Poll: Link 34 terminating: hangup
231019 15:00:28 31877 Xrd_Sched: running anon.0:34@xrootd-ccaffy-dev01 inq=0
231019 15:00:28 31877 http_Protocol:  Cleanup
231019 15:00:28 31877 http_Protocol:  Reset
231019 15:00:28 31877 http_Req:  XrdHttpReq request ended.
231019 15:00:28 31877 anon.0:34@xrootd-ccaffy-dev01 Xrd_Poll: Poller 0 removing FD 34
231019 15:00:28 31877 anon.0:34@xrootd-ccaffy-dev01 Xrd_Poll: FD 34 detached from poller 0; num=0
231019 15:00:59 31862 TLS_Refresh: CRL refresh will happen in 60 seconds.
231019 15:00:59 31869 TLS_Refresh: CRL refresh created new x509 store.
231019 15:00:59 31869 TLS_Refresh: CRL refresh will happen in 60 seconds.
amadio commented 9 months ago

This may be another case in which libcurl is to blame. The same command works fine on my machine, where I have two servers on different ports setup for TPC transfers:

$ curl -v --capath /etc/grid-security/certificates -L -X COPY  -H "Source: $DST" -H "SciTag: 144" -H 'X-Number-Of-Streams: 3' "$SRC"
*   Trying [2001:1458:202:228::100:82]:8081...
* Connected to gentoo.cern.ch (2001:1458:202:228::100:82) port 8081
* ALPN: curl offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/grid-security/certificates
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN: server did not agree on a protocol. Uses default.
* Server certificate:
*  subject: DC=ch; DC=cern; OU=computers; CN=gentoo.cern.ch
*  start date: Jun  6 14:47:51 2023 GMT
*  expire date: Jul 10 14:47:51 2024 GMT
*  subjectAltName: host "gentoo.cern.ch" matched cert's "gentoo.cern.ch"
*  issuer: DC=ch; DC=cern; CN=CERN Grid Certification Authority
*  SSL certificate verify ok.
* using HTTP/1.x
> COPY //file.raw HTTP/1.1
> Host: gentoo.cern.ch:8081
> User-Agent: curl/8.4.0
> Accept: */*
> Source: https://gentoo.cern.ch:8082//file.raw
> SciTag: 144
> X-Number-Of-Streams: 3
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
< HTTP/1.1 201 Created
< Connection: Keep-Alive
< Server: XrootD/v5.6.3-88-g9d2627fda
< Content-Type: text/plain
< Transfer-Encoding: chunked
< 
Perf Marker
Timestamp: 1701859536
Stripe Index: 0
Stripe Bytes Transferred: 805306368
Total Stripe Count: 1
RemoteConnections: tcp:[2001:1458:202:228::100:82]:8082
End
* Connection #0 to host gentoo.cern.ch left intact
success: Created
ccaffy commented 3 months ago

I've activated more logs and here's what I see:

240517 17:44:01 28179 TPC_CanStartTransfer: Unable to start transfers as no buffers are available.  Available buffers:
 0, Active curl handles: 0, Transfers in progress: 0
240517 17:44:01 28179 TPC_Stream::DumpBuffers: Beginning dump of stream buffers.
240517 17:44:01 28179 TPC_Stream::DumpBuffers: Stream offset: 0
240517 17:44:01 28179 TPC_Stream::DumpBuffers: Buffer 0: Offset=117440512, Size=16777216, Capacity=16777216
240517 17:44:01 28179 TPC_Stream::DumpBuffers: Buffer 1: Offset=16777216, Size=16777216, Capacity=16777216
240517 17:44:01 28179 TPC_Stream::DumpBuffers: Buffer 2: Offset=67108864, Size=16777216, Capacity=16777216
240517 17:44:01 28179 TPC_Stream::DumpBuffers: Buffer 3: Offset=100663296, Size=16777216, Capacity=16777216
240517 17:44:01 28179 TPC_Stream::DumpBuffers: Buffer 4: Offset=134217728, Size=16777216, Capacity=16777216
240517 17:44:01 28179 TPC_Stream::DumpBuffers: Buffer 5: Offset=150994944, Size=16777216, Capacity=16777216
240517 17:44:01 28179 TPC_Stream::DumpBuffers: Buffer 6: Offset=33554432, Size=16777216, Capacity=16777216
240517 17:44:01 28179 TPC_Stream::DumpBuffers: Buffer 7: Offset=50331648, Size=16777216, Capacity=16777216
240517 17:44:01 28179 TPC_Stream::DumpBuffers: Buffer 8: Offset=83886080, Size=16777216, Capacity=16777216
240517 17:44:01 28179 TPC_Stream::DumpBuffers: Buffer 9: Offset=0, Size=16777216, Capacity=16777216
240517 17:44:01 28179 TPC_Stream::DumpBuffers: Finish dump of stream buffers.
240517 17:44:01 28179 TPC_StartTransfers: Unable to start transfers.
240517 17:44:01 28179 TPC_PullRequest: event=MULTISTREAM_IDLE, local=/tmp/BIGFILE_5GB, remote=https://xrootd-ccaffy-dev01.cern.ch:2001/tmp/BIGFILE_5GB_COPY, user=(anonymous), streams=10, bytes_transferred=167772160; No handles are able to run.  Streams=10, concurrency=10
240517 17:44:01 28179 TPC_PullRequest: event=MULTISTREAM_FAIL, local=/tmp/BIGFILE_5GB, remote=https://xrootd-ccaffy-dev01.cern.ch:2001/tmp/BIGFILE_5GB_COPY, user=(anonymous), streams=10, bytes_transferred=167772160, tpc_status=0; failure: Internal logic error led to early abort; current offset is 167772160 while full size is 467664896

We can see that all in-memory buffers are full and none of them were flushed to the disk to liberate them in order to allow new streams to come in... The reason for that is unknown to me. This error happens randomly while trying to run a TPC transfer. So there might be a bug somewhere in the way we manage the buffers internally...