xrma / crawler4j

Automatically exported from code.google.com/p/crawler4j

Not all domain URLs are crawled #97

Closed. GoogleCodeExporter closed this issue 9 years ago.

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

1. Take a domain URL whose home page contains a link that reproduces Issue 50

2. Prepare a list of domain URLs to crawl

3. Insert the URL from step 1 somewhere in the middle of the list from step 2

4. Launch the crawler
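
For concreteness, steps 2-4 correspond to a seed list and launch along these lines. This is only a rough sketch against the 2.x-era API (the CrawlController constructor changed in later releases); the storage folder, seed URLs, and the MyCrawler class are placeholders, with MyCrawler simply logging every visited URL as assumed in the report.

```java
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;

public class LaunchCrawl {

    // Placeholder crawler: logs every visited URL so ignored domains show up in the output.
    public static class MyCrawler extends WebCrawler {
        @Override
        public void visit(Page page) {
            System.out.println("visited: " + page.getWebURL().getURL());
        }
    }

    public static void main(String[] args) throws Exception {
        // Folder for crawler4j's intermediate data (placeholder path).
        CrawlController controller = new CrawlController("/tmp/crawler4j-storage");

        // Steps 2 and 3: a list of domain seeds with the issue-50 domain in the middle.
        controller.addSeed("http://domain-a.example/");
        controller.addSeed("http://issue50-domain.example/");
        controller.addSeed("http://domain-b.example/");

        // Step 4: launch the crawler.
        controller.start(MyCrawler.class, 2);
    }
}
```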

What is the expected output? What do you see instead?

Suppose the crawler logs every visited URL. After the crawler finishes, the log file shows that some domain URLs were completely ignored.

What version of the product are you using? On what operating system?

crawler4j-2.6.1, Ubuntu 11.04 (Natty Narwhal)

Please provide any additional information below.

I think the problem is the following: the unhandled exception from issue 50 kills the current thread, and the 50 URLs fetched by that thread (Frontier.getNextURLs(50, assignedURLs)) will never be crawled.

If I am right, then any issue that causes an uncaught exception will kill the thread, and the URLs fetched by that thread will not be crawled.
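
To make the suspected failure mode concrete, here is a small self-contained sketch. It is not crawler4j's actual code: processPage and the example URLs are made-up stand-ins for the per-page work a crawler thread does and for a batch returned by Frontier.getNextURLs.

```java
import java.util.Arrays;
import java.util.List;

public class BatchLossSketch {

    // Stand-in for the per-page work of a crawler thread; the "bad" URL plays
    // the role of the link that triggers issue 50.
    static void processPage(String url) {
        if (url.contains("bad")) {
            throw new IllegalArgumentException("cannot handle " + url);
        }
        System.out.println("visited " + url);
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for one batch assigned to a single crawler thread,
        // i.e. the result of Frontier.getNextURLs(50, assignedURLs).
        List<String> assignedURLs = Arrays.asList(
                "http://a.example/", "http://bad.example/", "http://c.example/");

        Thread worker = new Thread(() -> {
            for (String url : assignedURLs) {
                // The uncaught exception on the second URL ends this loop and the
                // thread; http://c.example/ is never visited, which mirrors the
                // "completely ignored" domains in the crawl log.
                processPage(url);
            }
        });
        worker.start();
        worker.join();
    }
}
```

A per-URL try/catch around the page processing, so the thread survives and moves on to the next URL in its batch, would confine the loss to the single offending URL instead of the whole batch.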

Original issue reported on code.google.com by vundic...@gmail.com on 29 Nov 2011 at 2:27

GoogleCodeExporter commented 9 years ago
This issue is resolved in version 3.0

-Yasser

Original comment by ganjisaffar@gmail.com on 2 Jan 2012 at 7:48