Edit: it looks like the PDF has in fact been downloaded (at least we find it in the WARC/ZIM), so the crash happens right after the file is saved to the WARC.
I managed to reproduce the problem locally by increasing the "normal" timeout; my previous attempt probably just timed out after 90s, so the file was never fully downloaded.
Proper command to be "sure" to reproduce the bug (note the `--timeout 9999`; 9999 is a bit silly but at least "sufficient"):
```sh
docker run -v $PWD/output:/output --name crawlme --rm webrecorder/browsertrix-crawler:1.3.0-beta.0 crawl --url "https://www.survivorlibrary.com/index.php/Farming_Corn" --cwd /output --depth 1 --scopeType host --timeout 9999
```
Thanks for the consistently reproducible command! I think the simplest thing to do is to not limit `fetch()` to 30 seconds; that doesn't seem right. It should probably just use the same timeout, or we can try to be more clever and apply the timeout to the first byte received rather than to completion. We definitely don't need to be using the browser to load an 800MB+ PDF! But we should also figure out why it is crashing / disconnecting here, since it's not trying to render the file, but downloading it to /dev/null in fact...
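For illustration, here is a minimal sketch of the "timeout to first byte" idea, assuming Node 18+'s global `fetch`; `fetchWithFirstByteTimeout` is a hypothetical helper, not actual crawler code:

```ts
// Minimal sketch: abort only if no response arrives within timeoutMs,
// then let the body stream to completion with no deadline.
// Assumes Node 18+ (global fetch, AbortController).
async function fetchWithFirstByteTimeout(
  url: string,
  timeoutMs: number,
): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    // fetch() resolves once the status line and headers (i.e. the first
    // bytes of the response) have been received.
    return await fetch(url, { signal: controller.signal });
  } finally {
    // The response has started (or fetch threw): stop the clock so the
    // possibly very large body download is no longer time-bounded.
    clearTimeout(timer);
  }
}
```

Under these semantics an 800MB+ PDF passes as long as the server starts responding within the window, while a dead URL still fails fast; the actual fix in #677 may of course differ.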
Have a fix in #677
We (Kiwix) are struggling to crawl https://www.survivorlibrary.com/index.php/main-library-index/.
The problem is linked to big files.
We have already set up include rules to crawl only PDFs and exclude ZIP files, which are known to be too big, but we now realize there are huge PDFs as well.
For instance, the problem occurs when crawling https://www.survivorlibrary.com/library/corn_and_corn_improvement_1955.pdf, which is a ~837MB file.
Logs:
It looks like the direct fetch times out, which is not a surprise since its timeout is 30s. The crawler then tries to download the file "normally", but a puppeteer disconnection occurs after some time.
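To make that two-step flow concrete, here is a rough sketch of the behavior as described, assuming the direct fetch is bounded by a hard 30s whole-request deadline before the crawler falls back to a full browser load (`tryDirectFetch` is an illustrative name, not the crawler's actual internals):

```ts
const FETCH_TIMEOUT_SECS = 30; // the timeout mentioned above

// Illustrative only: a whole-request deadline means a ~837MB PDF that
// cannot finish downloading in 30s always aborts, forcing the fallback.
async function tryDirectFetch(url: string): Promise<boolean> {
  try {
    const resp = await fetch(url, {
      // AbortSignal.timeout() bounds the entire request, body included.
      signal: AbortSignal.timeout(FETCH_TIMEOUT_SECS * 1000),
    });
    await resp.arrayBuffer(); // consume the full body
    return true;
  } catch {
    // Timed out (or failed): the caller falls back to loading the URL
    // in the browser, which is where the puppeteer disconnect occurs.
    return false;
  }
}
```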
Note that for other big files the direct fetch timed out but the "normal" crawl managed to complete.
I cannot reproduce the problem on my machine, so it is probably linked to a memory issue or something else specific to the machine where the bug occurs.
For the record, here is the command I used to try to reproduce the problem (the big PDF unfortunately downloads properly):
Even if we could run this crawl on another machine, I wonder whether it would make more sense to allow customizing `FETCH_TIMEOUT_SECS` with a CLI flag, so that the direct download does not fail just because of a timeout when we know it has good reasons to take long to complete. Would that have any adverse side effect (aside from the risk that the crawl takes a long time to detect a direct fetch that has genuinely timed out)?
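As a sketch of what such a flag could look like, assuming yargs-style argument parsing as the crawler's CLI uses; `--fetchTimeoutSecs` is an assumed name, not an existing option:

```ts
// Hypothetical sketch: exposing the direct-fetch timeout as a CLI flag.
// --fetchTimeoutSecs is an assumed name, not an existing crawler option.
import yargs from "yargs";
import { hideBin } from "yargs/helpers";

const argv = yargs(hideBin(process.argv))
  .option("fetchTimeoutSecs", {
    describe:
      "Timeout (seconds) for direct, non-browser fetches of large files",
    type: "number",
    default: 30, // the current hard-coded FETCH_TIMEOUT_SECS value
  })
  .parseSync();

const fetchTimeoutSecs: number = argv.fetchTimeoutSecs;
console.log(`direct fetch timeout: ${fetchTimeoutSecs}s`);
```

The only cost would indeed seem to be the one noted above: a larger value merely delays detection of a direct fetch that is genuinely stuck.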