webrecorder / replayweb.page

Serverless replay of web archives directly in the browser
https://replayweb.page
GNU Affero General Public License v3.0

[Bug]: Firefox Won't Open WACZ from Remote Server because the Size of the File is Not Accessible #318

Open markpbaggett opened 2 months ago

markpbaggett commented 2 months ago

ReplayWeb.page Version

v2.0.0

What did you expect to happen? What happened instead?

When trying to replay a WACZ in Firefox, I get this error message:

Sorry, this URL could not be loaded because the size of the file is not accessible.
Make sure this is a valid URL and you have access to this file.

Interestingly, this works fine in both Chrome and Safari. I'm not sure if this is related to how I captured the WACZ or something else.

In case it's helpful, I used Browsertrix-Crawler and this command:

docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url https://abolition-now.github.io/an --generateWACZ --text --collection abolition_now_test

This opens fine in replayweb.page locally, but won't open in Firefox remotely:

https://digital.lib.utk.edu/demo/abolition_now_test.wacz

Is the web server not returning a header or something that replayweb.page is expecting?

Step-by-step reproduction instructions

  1. Open Firefox
  2. Navigate to replayweb.page
  3. Attempt to open this externally hosted resource: https://digital.lib.utk.edu/demo/abolition_now_test.wacz

Additional details

No response

Shrinks99 commented 2 months ago

Is the web server not returning a header or something that replayweb.page is expecting?

@edsu recently wrote a good forum post on exactly this problem for another user experiencing the same thing. I can partially replicate this behavior by trying to download the file in the browser: the browser cannot display a progress bar or a time-remaining estimate (in either Firefox or Chrome)!

However, curl --head returns the following, which does include Content-Length, so that's curious.

➜  ~ curl --head https://digital.lib.utk.edu/demo/abolition_now_test.wacz
HTTP/1.1 200 OK
Date: Mon, 29 Apr 2024 20:13:18 GMT
Server: Apache
Last-Modified: Wed, 24 Apr 2024 13:39:28 GMT
ETag: "88e8f59-616d7ccb8d565"
Accept-Ranges: bytes
Content-Length: 143560537
Vary: Accept-Encoding
Access-Control-Allow-Origin: *

Either way, if the file works fine locally, it's probably not that; most likely it's a server config issue. I can replicate it working in Chrome, though.

ikreymer commented 3 weeks ago

The issue is unfortunately due to a bug (incorrect handling) in Firefox. Chrome and Safari correctly add an Accept-Encoding: identity (the default) header when a range request is sent, allowing the server to return Content-Length. Unfortunately, Firefox always sends Accept-Encoding: gzip, deflate, br, zstd, which causes the server to return a compressed version of the WACZ, which is not what we want.

Barring a fix in Firefox, I think the best option is to ensure your server ignores the Accept-Encoding header for WACZ files.
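
As a rough illustration of that suggestion: the curl output earlier in the thread shows the server is Apache, so one possible approach is a fragment like the one below. This is only a sketch and is not taken from the affected server's configuration. It assumes compression is applied on the fly by mod_deflate (or mod_brotli) and that mod_setenvif is loaded; it would need adapting if compression happens elsewhere, for example in a proxy or CDN.

```apache
# Sketch only: assumes Apache with mod_setenvif, and on-the-fly compression
# done by mod_deflate and/or mod_brotli.
<IfModule mod_setenvif.c>
    # Skip compression for .wacz files regardless of the client's
    # Accept-Encoding, so responses can carry Content-Length.
    SetEnvIfNoCase Request_URI "\.wacz$" no-gzip no-brotli dont-vary
</IfModule>
```

After reloading the server, the curl --head check with the Firefox-style Accept-Encoding header should show whether Content-Length is returned.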

You can see that if you do curl --head -H "Accept-Encoding: gzip, deflate, br, zstd" https://digital.lib.utk.edu/demo/abolition_now_test.wacz, there will be no Content-Length. Unfortunately, it seems there's no way to prevent Firefox from sending this header. (Will open a bug in Firefox.)
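
To make that comparison repeatable, here is a minimal shell sketch. The helper names has_content_length and check_url are hypothetical and not part of replayweb.page or curl. The first reads response headers on stdin and reports whether Content-Length is present; the second sends a HEAD request with a chosen Accept-Encoding value, which needs curl and network access.

```shell
#!/bin/sh
# Hypothetical helper: reads HTTP response headers on stdin and reports
# whether a Content-Length header is present (header names are
# case-insensitive, so the match ignores case).
has_content_length() {
  if grep -qi '^content-length:'; then
    echo "content-length present"
  else
    echo "content-length missing"
  fi
}

# Hypothetical helper: sends a HEAD request with the given Accept-Encoding
# value and checks the response headers. Requires curl and network access.
#   $1 = URL, $2 = Accept-Encoding value
check_url() {
  curl --silent --head -H "Accept-Encoding: $2" "$1" | has_content_length
}
```

For example, check_url https://digital.lib.utk.edu/demo/abolition_now_test.wacz identity mimics the Chrome/Safari range-request behavior described above, while passing "gzip, deflate, br, zstd" as the second argument mimics Firefox. If the two calls give different results, the server is varying its response on Accept-Encoding.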