Open rgaudin opened 1 week ago
Hmm, some of the other messages are just warnings. It seems to be encountering a bunch of large files, which are not loaded in the browser (as expected), and the WARC is rolled over. That should all be OK, but the browser crash is what's causing the interrupt.
If you load that particular page in Chrome, it appears to load the video content in an endless loop due to some bug in the player (presumably it was tested more in FF than in Chrome). Here's what my devtools looks like on https://fsfe.org/news/2024/news-20240920-01.en.html:
Since this is all going through the crawler (though it's not saving these partial range requests), I'm not too surprised that it causes the browser to crash eventually... Can see if there's a way we can ignore these from even being tried, but it's definitely an issue with this site...
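For readers unfamiliar with the partial range requests mentioned above, here is a minimal sketch of how such requests can be recognized from the HTTP `Range` header. The function name `is_partial_range_request` and the plain-dict header format are assumptions made for illustration; this is not Browsertrix Crawler's actual logic.

```python
import re

# Hypothetical helper for illustration only; not part of Browsertrix Crawler.
# Matches a single byte range such as "bytes=0-", "bytes=100-199" or "bytes=-500".
_SINGLE_RANGE = re.compile(r"^bytes=(\d*)-(\d*)$")


def is_partial_range_request(headers):
    """Return True if the request asks for only part of a resource."""
    value = None
    for name, header_value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "range":
            value = header_value.strip()
            break
    if value is None:
        # No Range header: an ordinary full request.
        return False

    match = _SINGLE_RANGE.match(value)
    if match is None:
        # Multiple ranges or an unexpected format: treat as partial.
        return True

    start, end = match.groups()
    # "bytes=0-" asks for everything from the first byte onward,
    # so it is effectively a request for the full resource.
    is_full = start != "" and int(start) == 0 and end == ""
    return not is_full
```

A buggy player that keeps re-requesting chunks would produce a long run of requests for which this returns `True`, which is roughly what the devtools network panel shows for that page.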
Indeed, I get the same results on Chrome here. The player does seem buggy. FF doesn't work either, but for different reasons: there's no autoplay there and most videos don't start when clicked.
How's the code handling this? Is this firing a direct download request for each of those attempts we see here?
I bet @benoit74 will have new use cases tomorrow and will maybe be able to share another link exhibiting the issue.
How's the code handling this? Is this firing a direct download request for each of those attempts we see here?
No, it shouldn't be. It should already be ignoring these, but I made some more optimizations and clean-up. Some videos were being skipped for other reasons, but it's possible the repeated requests could result in a browser crash (though I haven't reproed that). Try this branch: https://github.com/webrecorder/browsertrix-crawler/tree/range-load-optimizations
New occurrence last week (we are not responsible for the content our users are trying to ZIM, not sure they are all very aligned with our mission, didn't check tbh):
I will probably test #709 only once released, unless you need help to test this before merge, pretty busy with other topics atm and testing a branch is not that straightforward on my end ^^ Thank you for these enhancements anyway
Found a major issue, it appears there was a status code check and only 200 responses were being streamed, but all the videos are 206, and that was excluded from streaming 🤦 . This likely resulted in the browser crash since it tried to load the whole thing into memory 🤦 . Will be in the next fix!
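To make the status-code problem concrete, here is a minimal sketch of the kind of check described above. All names (`should_stream`, `STREAMABLE_STATUSES`, `MAX_IN_MEMORY_BYTES`) and the size threshold are hypothetical and chosen for illustration; the crawler's real code is different and is not written in Python.

```python
# Illustrative sketch only: names and threshold are assumptions,
# not taken from the Browsertrix Crawler source.

# 206 (Partial Content) is what servers return for range requests,
# which is how video players typically fetch media.
STREAMABLE_STATUSES = {200, 206}

# Hypothetical cutoff above which a body should not be buffered in memory.
MAX_IN_MEMORY_BYTES = 2_000_000


def should_stream(status, content_length, streamable=STREAMABLE_STATUSES):
    """Decide whether a response body should be streamed instead of buffered."""
    if status not in streamable:
        return False
    if content_length is None:
        # Unknown size: stream to avoid buffering something huge.
        return True
    return content_length > MAX_IN_MEMORY_BYTES


# With a check that only accepts 200, a large 206 video response
# falls through to the in-memory path:
buggy = should_stream(206, 50_000_000, streamable={200})
# Accepting 206 as well sends it down the streaming path:
fixed = should_stream(206, 50_000_000)
```

Under these assumptions `buggy` is `False` and `fixed` is `True`, which mirrors the described failure mode: every large 206 response was loaded fully into memory.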
This week, with Browsertrix-Crawler 1.3.3 (with warcio.js 2.3.1), I am getting several cases of the following:
Browser disconnected (crashed?), interrupting crawl
Failed to load response body
Large payload written to WARC, but not returned to browser (would require rereading into memory)
Rollover size exceeded, creating new WARC
{"crawled":27,"total":690,"pending":0,"failed":0,"limit":{"max":0,"hit":false}}
Exiting, Crawl status: interrupted
I don't know if those are connected to each other, but this happened on multiple different websites, and it happens consistently.
You can try with
https://fsfe.org/