xrma / crawler4j

Automatically exported from code.google.com/p/crawler4j

Recrawl Not Fetched Links #107

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Hi,
Thanks a lot for previous fixes.

After crawling a certain number of links, the server stops responding to requests 
from the same PageFetcher. The reason is that the server tries to treat all clients 
equally and starts serving requests from other processes instead, which produces 
"ERROR [Crawler 6] Fatal transport error: Read timed out while fetching 
http://...". As a result, even links that are not broken get logged as broken by 
my logic.

I am extending crawler4j to find broken URLs in our local web pages. 
My strategy for working around the problem above is to log the URLs that could not 
be fetched to a file and then start N new crawlers sequentially, each with a new PageFetcher. 

However, the problem is that the WebCrawler.processPage(WebURL webUrl) method treats 
webUrl as a parent link. Therefore, the previously unfetched URLs are not processed 
properly: most of my logic lives in shouldVisit(), and shouldVisit() is not called 
when those links are fetched successfully on the second pass.

Do you have any suggestions for solving these problems?

Thanks in advance.

P.S. This might be useful for others who are using crawler4j to find broken links 
as well.
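
A minimal sketch of the restart strategy described above, assuming the crawler4j 3.x 
API (CrawlController, PageFetcher, RobotstxtServer); the failed-urls.txt file, the 
storage folder, and the MyCrawler placeholder are illustrative, not taken from this 
thread:

```java
import java.io.BufferedReader;
import java.io.FileReader;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class RecrawlRunner {

    // Placeholder crawler; the real broken-link logic would live in a class like this.
    public static class MyCrawler extends WebCrawler {
    }

    public static void main(String[] args) throws Exception {
        // A fresh controller and PageFetcher for each pass, so earlier
        // timeouts do not affect the new run.
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/recrawl-storage"); // hypothetical path
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        // Re-seed the controller with the URLs logged as "not fetched"
        // during the previous pass (one URL per line).
        try (BufferedReader reader = new BufferedReader(new FileReader("failed-urls.txt"))) {
            String url;
            while ((url = reader.readLine()) != null) {
                if (!url.trim().isEmpty()) {
                    controller.addSeed(url.trim());
                }
            }
        }

        controller.start(MyCrawler.class, 5);
    }
}
```

Each pass would rewrite failed-urls.txt with whatever still could not be fetched, and 
the runner would be started again until the file is empty.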

Original issue reported on code.google.com by mansur.u...@gmail.com on 16 Jan 2012 at 3:49

GoogleCodeExporter commented 9 years ago
Hello Mansur,
I am also trying to implement broken-link detection in my code. Can you give me a 
pointer on how to log the data against a WebURL? Or how do you do it?

Regards

Original comment by w3engine...@gmail.com on 19 Jan 2012 at 2:38

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Hello,

I simply save the links that could not be fetched in a text file.
However, there are some subtle issues:
1. A page being fetched may be login protected;
2. A page may be dead and its request redirected to an error page.

Or do you need my logic?
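
One possible way to "log the data against a WebURL", as asked above, is to append the 
URL to a text file; the sketch below assumes exactly that. The file name and the place 
it is called from are not specified in the thread, and the write is synchronized 
because several crawler threads may report failures at the same time.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

import edu.uci.ics.crawler4j.url.WebURL;

public class FailedUrlLog {

    private static final String LOG_FILE = "failed-urls.txt"; // hypothetical path

    // Append one failed URL per line so a later pass can re-seed from this file.
    public static synchronized void record(WebURL webUrl) {
        try (PrintWriter out = new PrintWriter(new FileWriter(LOG_FILE, true))) {
            out.println(webUrl.getURL());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```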

Regards

Original comment by mansur.u...@gmail.com on 19 Jan 2012 at 3:23

GoogleCodeExporter commented 9 years ago
I'm closing this issue because, with the new status-code handler, broken links 
can be handled properly. See the example here: 
http://code.google.com/p/crawler4j/source/browse/src/test/java/edu/uci/ics/crawler4j/examples/statushandler/

-Yasser

Original comment by ganjisaffar@gmail.com on 23 Jan 2012 at 12:12
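
For reference, a sketch in the spirit of the linked statushandler example, assuming 
the handlePageStatusCode callback and the single-argument shouldVisit signature of 
crawler4j from that period; the class name, host filter, and console logging are 
illustrative:

```java
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class BrokenLinkCrawler extends WebCrawler {

    // Called with the HTTP status the server returned for each URL, so broken
    // links can be detected without relying on shouldVisit() or visit().
    @Override
    protected void handlePageStatusCode(WebURL webUrl, int statusCode, String statusDescription) {
        if (statusCode >= 400) {
            System.out.println("Broken link: " + webUrl.getURL()
                    + " (" + statusCode + " " + statusDescription + ")");
        }
    }

    @Override
    public boolean shouldVisit(WebURL url) {
        // Restrict the crawl to the local site being checked (hypothetical host).
        return url.getURL().startsWith("http://www.example.com/");
    }

    @Override
    public void visit(Page page) {
        // Nothing to do for successfully fetched pages in this sketch.
    }
}
```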