yasserg / crawler4j

Open Source Web Crawler for Java
Apache License 2.0

Can't fetch content of random pages #471

Open · NingMorris opened this issue 2 years ago

NingMorris commented 2 years ago

For some reason, every time I run our crawl controller, some random pages fail to be crawled. The logs say the content of these pages can't be fetched, but if I open the same pages manually in a browser everything looks fine, e.g. https://www.sloans.com/inventory/john-deere-1790-173523.

Note: I tried leaving politenessDelay unset and setting it to 3 seconds and 30 seconds (our setup is sketched below). The greater the delay, the more random pages fail to be crawled; with the delay unset, usually only a few pages fail. Please let me know if this is something we can fix on our end or whether it is related to crawler4j, thanks!

The relevant log output follows.

2022-07-25 01:20:05.802 WARN 29552 --- [Crawler 2] e.uci.ics.crawler4j.crawler.WebCrawler : Can't fetch content of: https://www.sloans.com/inventory/john-deere-1790-173523
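For context, our controller is set up roughly as follows (a minimal sketch assuming crawler4j 4.x; the storage folder, seed URL, and thread count are placeholders, and MyCrawler stands for our WebCrawler subclass):

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class ControllerExample {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl"); // placeholder path
        // Delay between requests to the same host, in milliseconds.
        // We tried leaving this unset, and setting 3000 and 30000.
        config.setPolitenessDelay(3000);

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("https://www.sloans.com/"); // placeholder seed
        controller.start(MyCrawler.class, 2); // MyCrawler extends WebCrawler
    }
}
```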

Chaiavi commented 1 year ago

Search for "Can't fetch content of" in the WebCrawler class.

Override that method and dump more information: why wasn't it fetched? What was the HTTP response code? Any other information?

That will help you understand the problem.
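A minimal sketch of such an override, assuming crawler4j 4.x, where WebCrawler.onContentFetchError(WebURL) emits that warning (the extra fields logged here are just examples):

```java
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // The default implementation only logs "Can't fetch content of: <url>".
    @Override
    protected void onContentFetchError(WebURL webURL) {
        logger.warn("Content fetch failed: url={}, docid={}, parent={}, depth={}",
                webURL.getURL(), webURL.getDocid(), webURL.getParentUrl(), webURL.getDepth());
    }

    // Responses with an unexpected HTTP status are reported through this
    // separate callback, which is where the response code is visible.
    @Override
    protected void onUnexpectedStatusCode(String urlStr, int statusCode,
                                          String contentType, String description) {
        logger.warn("Unexpected status {} ({}) for {}", statusCode, description, urlStr);
        super.onUnexpectedStatusCode(urlStr, statusCode, contentType, description);
    }
}
```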

Also try fetching only that page - did it succeed?
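For that single-page test, something along these lines should work (again a sketch, assuming crawler4j 4.x and its PageFetcher API):

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.fetcher.PageFetchResult;
import edu.uci.ics.crawler4j.url.WebURL;

public class SinglePageFetch {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        PageFetcher fetcher = new PageFetcher(config);

        WebURL url = new WebURL();
        url.setURL("https://www.sloans.com/inventory/john-deere-1790-173523");

        // Fetch just this one page and print the raw HTTP status code.
        PageFetchResult result = fetcher.fetchPage(url);
        System.out.println("HTTP status: " + result.getStatusCode());

        result.discardContentIfNotConsumed();
        fetcher.shutDown();
    }
}
```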
