zenodo / zenodo-rdm

Zenodo, powered by InvenioRDM
https://zenodo.org
GNU General Public License v2.0

Uploading multiple files in parallel causes some to fail #975

Open · max-moser opened this issue 2 weeks ago

max-moser commented 2 weeks ago

When uploading several large files to Zenodo at the same time, the upload form usually starts the upload of three files in parallel. Two of those three uploads seem to make progress only very slowly, and after ~60s they fail. Since the upload form tries to keep three uploads running in parallel, the next two files then start uploading. These likely time out again after little progress (unless the first upload has finished in the meantime, in which case another file starts uploading successfully).


max-moser commented 2 weeks ago

A local development setup (invenio-cli run with v12.0.5) does not seem to have this issue. I limited the network speed to "Wi-Fi" in Firefox for the duration of the file uploads. After a good while, two large files were uploaded and a third upload was going on. I took the following screenshot after manually cancelling the current file upload, after finishing my last drink:

[screenshot]

max-moser commented 2 weeks ago

Using the docker compose-based setup including nginx and gunicorn (pretty much the same as with uwsgi in our prod instance):

[screenshot]

max-moser commented 1 week ago

Update

It looks like the nginx configuration directive proxy_send_timeout (or uwsgi_send_timeout, when proxying to uwsgi) is involved; its default value is 60s. Increasing this value to e.g. 666 (in the location handling file uploads) causes the files to upload successfully. I assume this should be set to the maximum time that a file upload is expected to last.

The setting's description reads:

Sets a timeout for transmitting a request to the proxied server. The timeout is set only between two successive write operations, not for the transmission of the whole request. If the proxied server does not receive anything within this time, the connection is closed.

The way I read it, this variable is only relevant if there are several uploads going on at the same time. Since in my local testing I generally see 0% progress on the "non-primary" file uploads, that somewhat makes sense to me. :warning: However, my Zenodo tests did show ~3% progress before getting aborted – could this be another problem?

Tests

Tested on a local docker compose-based setup with nginx proxying gunicorn. Uploaded ~400MB files with "Wi-Fi" connection throttling enabled in Firefox. This caused the file uploads to run for several minutes, but they completed successfully.
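
For reference, a minimal sketch of the nginx change described above (the location regex matches the fuller config posted further down; the timeout value is just the example from above, and the proxy_* equivalents would apply when proxying to gunicorn instead of uwsgi):

  location ~ /api/records/.+/draft/files(/.+(/(content|commit))?)? {
    # raise the send/read timeouts well above the expected duration of a file upload;
    # the 60s default is what aborts the "stalling" parallel uploads
    uwsgi_send_timeout 666s;
    uwsgi_read_timeout 666s;
    # ... rest of the location block unchanged ...
  }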

max-moser commented 1 week ago

Another update: the above-mentioned fix helps for a deployment on my local machine, but not on our test deployment :woozy_face:

Firefox shows NS_ERROR_NET_INTERRUPT: [screenshot]

Gunicorn logs a 400 response:

128.131.230.92 - - [02/Sep/2024:10:04:07 +0000] "PUT /records/vggeh-pc204/draft/files/GE-Proton9-7.tar.gz/content HTTP/1.1" 400 60 "https://test.researchdata.tuwien.ac.at/uploads/vggeh-pc204" "Mozilla/5.0 (X11; Linux x86_64; rv:129.0) Gecko/20100101 Firefox/129.0"

Nginx logs a 408 response:

128.131.230.92 - - [02/Sep/2024:10:01:35 +0000] "PUT /api/records/vggeh-pc204/draft/files/GE-Proton9-7.tar.gz/content HTTP/2.0" 408 0 "https://test.researchdata.tuwien.ac.at/uploads/vggeh-pc204" "Mozilla/5.0 (X11; Linux x86_64; rv:129.0) Gecko/20100101 Firefox/129.0" "-" c7166876ef6b1dc2fae6681ed0a9a000 1725271295.000 60.612 - -

Perhaps the different timestamps are interesting? Executing date on either container shows that, as expected, they report roughly the same current timestamp.

Edit: The nginx logs pop up as soon as the upload is aborted, but the gunicorn logs for all the files pop up at the same time (in my case, only once the initial upload succeeds, since that one takes long).

slint commented 1 week ago

Thanks for debugging this. I remember we've done some nginx fine-tuning from time to time when switching between infras, but after the migration to InvenioRDM we sort of ended up in a "stable-enough" setup and left it untouched.

I see that right now, we don't set the uwsgi_send_timeout configuration for file uploads at all (which confirms the 60-second timeout behavior). On other endpoints (not files-related), I see we do set it though:

      uwsgi_read_timeout 180s;
      uwsgi_send_timeout 180s;

I think the overall issue, though, is that the parallel HTTP requests are not really handled correctly on the client side, and configuring the server to wait longer just gives some extra time for one of the other uploads to finish.

We have this 3-step request flow for each file upload, but I feel that some of the HTTP requests are initiated prematurely, which ends up parallelizing the actual sending of bytes on all the open connections instead of having some sort of queueing in place... This has to be verified against the client-side logic though.
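
For context, the 3-step flow per file roughly looks like this, sketched here with Python requests against the draft files REST API (base URL, token, record id and file name are placeholders):

import requests

API = "https://zenodo.example/api"              # placeholder base URL
HEADERS = {"Authorization": "Bearer <token>"}   # placeholder access token
record_id = "abcde-12345"                       # placeholder draft id
key = "data.tar.gz"                             # placeholder file name

# 1) register the file key on the draft
requests.post(f"{API}/records/{record_id}/draft/files",
              json=[{"key": key}], headers=HEADERS)

# 2) send the actual bytes; this is the long-running PUT that hits the timeouts,
#    and the step that ends up running in parallel on all open connections
with open(key, "rb") as fp:
    requests.put(f"{API}/records/{record_id}/draft/files/{key}/content",
                 data=fp,
                 headers={**HEADERS, "Content-Type": "application/octet-stream"})

# 3) commit the upload so the file is marked as complete on the draft
requests.post(f"{API}/records/{record_id}/draft/files/{key}/commit", headers=HEADERS)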

max-moser commented 1 week ago

Another update

Good news: I made a little nginx config change, and am currently uploading in 7 tabs (each for a separate new draft) in parallel. ~None of the uploads (neither the 7 in progress, nor any of the 14 stalling ones) have failed yet, over the course of several minutes.~ Some uploads did fail eventually, but there is some improvement.

Config

The relevant part of the config is:


  ## API files
  # Another location is defined in order to allow large file uploads in the files
  # API without exposing the other parts of the application to receive huge request bodies.
  location ~ /api/records/.+/draft/files(/.+(/(content|commit))?)? {
    gzip off;
    uwsgi_pass api_files_server;
    include uwsgi_params;
    uwsgi_buffering off;
    uwsgi_request_buffering off;

    # <new stuff>
    # Allow file uploads to take 24h - this should only be available to trusted users
    # After this timeout, parallel file uploads start failing
    uwsgi_read_timeout 24h;
    uwsgi_send_timeout 24h;
    client_body_timeout 24h;

    # The docs for the `uwsgi_request_buffering` directive say that if HTTP/1.1 chunked transfer encoding
    # is enabled, the request will always be buffered - we really don't want that!
    proxy_http_version 1.0;
    chunked_transfer_encoding off;
    # </new stuff>

    # ... the rest is the headers & client_max_body_size 0 ...
}

Rationale

The timeouts helped somewhat with the parallel uploads on a single deposit page. But with several deposit forms (i.e. browser tabs) for different drafts uploading in parallel, the symptoms still showed up for some of the later uploads.

I looked into the documentation for the uwsgi module of nginx, and found something interesting: The uwsgi_request_buffering directive says [1]:

When HTTP/1.1 chunked transfer encoding is used to send the original request body, the request body will be buffered regardless of the directive value.

After setting the proxy_http_version to 1.0 and disabling the chunked_transfer_encoding for good measure, that seems to have improved matters significantly!

Perhaps there was buffering going on, and that caused hiccups between nginx and uwsgi?

Current status

At the time of writing, I have 7 deposit forms for our test instance open, uploading 5 files with ~400MB each in parallel, with "Wi-Fi" throttling in Firefox. Each deposit form has 3 PUT requests going on in parallel (21 ongoing PUT requests total). So far (roughly 45 mins in), each deposit form has successfully uploaded one file (7 successful uploads total), and the second upload is at ~40%. Not a single failure.

First failures after ~63 mins

For each deposit form (i.e. tab) there was one failure now (on the third PUT request, after two succeeded). [screenshot]

The uWSGI logs state that some workers were "cheaped" (i.e. killed), and some had to be killed more brutally:

[...]
web-api-files-1  | 2024-09-03T19:52:47.230873679Z worker 5 killed successfully (pid: 45)
web-api-files-1  | 2024-09-03T19:52:47.230908290Z uWSGI worker 5 cheaped.
web-api-files-1  | 2024-09-03T19:52:49.246274972Z Tue Sep  3 19:52:49 2024 - worker 7 (pid: 49) is taking too much time to die...NO MERCY !!!
web-api-files-1  | 2024-09-03T19:52:50.247435017Z worker 7 killed successfully (pid: 49)
web-api-files-1  | 2024-09-03T19:52:50.247460827Z uWSGI worker 7 cheaped.
web-api-files-1  | 2024-09-03T19:52:52.260832980Z Tue Sep  3 19:52:52 2024 - worker 10 (pid: 55) is taking too much time to die...NO MERCY !!!
web-api-files-1  | 2024-09-03T19:52:53.266363799Z worker 10 killed successfully (pid: 55)
web-api-files-1  | 2024-09-03T19:52:53.266387419Z uWSGI worker 10 cheaped.
[...]

I'm not entirely sure why, as those workers were still busy. If you squint a little, the timing could match some default timeout value. Some people also report that this could be an issue when using threading (which we do), e.g. https://github.com/unbit/uwsgi/issues/1804#issuecomment-1567780256. The new (4th) PUT request per deposit page is going strong, however.

Nginx logged the failed uploads as a 400 reply:

frontend-1 | 128.131.230.92 - - [03/Sep/2024:19:51:36 +0000] "PUT /api/records/g9wgc-pxe81/draft/files/GE-Proton9-10.tar.gz/content HTTP/2.0" 400 0 "https://test.researchdata.tuwien.ac.at/uploads/g9wgc-pxe81" "Mozilla/5.0 (X11; Linux x86_64; rv:130.0) Gecko/20100101 Firefox/130.0" "-" 1b4eccdb87cc1b4d633f7a787e3162a0 1725393096.868 3784.944 - -

Some more failures ~24 mins further in

The 4th PUT request per deposit page failed now (at 75% done), about 55 mins after they were started according to the "Timings" tab in the browser's dev tools.

Further infos

In a previous run of the same experiment an hour ago, the (first) uploads for each deposit form got stuck at ~40%. At some point during that upload, I activated the TUW VPN connection though. Probably that's what killed the uploads. Mea culpa. ⚔️

Footnotes

[1] As a side note, this seems to contradict a statement in the nginx config from the InvenioRDM cookiecutter.

max-moser commented 1 week ago

Yet another update

After disabling threading in uwsgi (since it has been reported by others to cause issues, see the previous comment), all uploads worked. All 35 files were uploaded successfully, a 100% success rate.

[uwsgi]

# reasonable defaults
strict = true
master = true
vacuum = true
enable-threads = false
single-interpreter = true
die-on-term = true
need-app = true
auto-procname = true

# misc
wsgi-disable-file-wrapper = true
buffer-size = 8192

# worker management
# start with 4 worker processes; keep a minimum of 6 and grow in steps of 4 up to 64
# spawn more workers when all running workers have been busy for 3s (spare algorithm)
# (note: this feature works better with threading disabled!)
processes = 64
threads = 1
cheaper = 6
cheaper-initial = 4
cheaper-algo = spare
cheaper-overload = 3
cheaper-step = 4

# specific for API (Files)
socket = 0.0.0.0:5000
stats = 0.0.0.0:9002
module = invenio_app.wsgi_rest:application
mount = /api=invenio_app.wsgi_rest:application
manage-script-name = true
procname-prefix = "api-files "

Edit: 10x5 files. Success. :heavy_check_mark: PUT request timings ("Sending" under "Timings" for the requests in the dev tools) ranged from 40min to 150min, no timeouts.

slint commented 1 week ago

Nginx

I think the contradicting nginx setting is from this SO answer, which mentions the proxy_request_buffering setting, whose docs indeed say that if HTTP/1.1 is enabled for proxying there will be no buffering...

uWSGI

Oof, that's bad news for uWSGI... Switching from threads to processes also means more resources (mainly memory-wise) are needed to serve the same traffic (though with the small benefit of not randomly failing requests 😅).

Initializing the app and then forking can somewhat help with the memory usage because of CoW magic, though it's also a famous cause of mysterious bugs when e.g. DB connection pools are forked. But anyway, that's a resources/scaling issue for each instance. If switching to gunicorn doesn't solve it either, I guess that's unfortunately the only way to go...


For reference, here's our uWSGI setup on Zenodo production, where each node can theoretically serve 5 * 8 = 40 concurrent connections:

[uwsgi]
socket = 0.0.0.0:5000
module = invenio_app.wsgi:application
stats = /tmp/stats.socket

master = true
vacuum = true
enable-threads = true
processes = 8
threads = 5
thunder-lock = true # https://marc.info/?l=uwsgi&m=140473636200986&w=2
                    # https://uwsgi-docs.readthedocs.io/en/latest/articles/SerializingAccept.html

# disable-logging = true
log-4xx = true
log-5xx = true
log-ioerror = true
log-x-forwarded-for = true

# Fork, then initialize application. This is to avoid issues with shared
# DB connection pools.
lazy = true
lazy-apps = true
single-interpreter = true
need-app = true

# Silence write errors for misbehaving clients
# https://github.com/getsentry/raven-python/issues/732
ignore-sigpipe = true
ignore-write-errors = true
disable-write-exception = true

# post-buffering = true
buffer-size = 65535
socket-timeout = 120
socket-write-timeout = 120
so-write-timeout = 120
so-send-timeout = 120
socket-send-timeout = 120

# NOTE: Disabled since we now use PgBouncer for connection pooling
# Automatically respawn processes after serving
# max-requests = 3000
# max-requests-delta = 30

# fix up signal handling
die-on-term = true

max-moser commented 1 week ago

To be fair, I haven't tried gunicorn with the above-mentioned nginx changes – my experiments with gunicorn were performed before I added those. From what I can tell, the threading only causes trouble with uWSGI in combination with the cheaper subsystem – without that combination, we haven't seen problems yet. That said, requests other than file uploads are also pretty quick (especially since we've offloaded file downloads to nginx).
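
If keeping threads is preferred, a variant that might be worth trying is a fixed worker count without the cheaper-* options (just a sketch based on the observation above; I haven't tested this variant):

[uwsgi]
# sketch: keep threads enabled, but use a fixed number of workers instead of
# the cheaper subsystem, since the failures only showed up with cheaper + threads
master = true
enable-threads = true
processes = 16
threads = 4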