What steps will reproduce the problem?
1. Take a domain URL whose home page contains a link that reproduces Issue 50
2. Prepare a list of domain URLs to crawl
3. Insert the URL from step 1 somewhere in the middle of the list from step 2
4. Launch the crawler
What is the expected output? What do you see instead?
Assuming the crawler logs every visited URL, after the crawler finishes the log
file shows that some domain URLs were completely ignored.
What version of the product are you using? On what operating system?
crawler4j-2.6.1, Ubuntu 11.04 (Natty Narwhal)
Please provide any additional information below.
I think the problem is the following: the unhandled exception from issue 50
kills the current thread, and the 50 URLs fetched by that thread
(Frontier.getNextURLs(50, assignedURLs)) are never crawled.
If I am right, then any issue that causes an uncaught exception will kill the
thread, and the URLs fetched by that thread will not be crawled.
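If that is the cause, a defensive fix would be to catch exceptions per URL inside the worker loop, so that one bad page does not kill the thread and abandon the rest of its batch. Below is a minimal sketch of that idea; the class and method names (fetchNextBatch, processUrl) are hypothetical placeholders standing in for the crawler's internals, not actual crawler4j API.

    import java.util.Collections;
    import java.util.List;

    public class CrawlerThreadSketch implements Runnable {

        @Override
        public void run() {
            while (true) {
                // Each thread claims a batch of URLs (the report mentions
                // batches of 50 from Frontier.getNextURLs(50, assignedURLs)).
                List<String> assignedUrls = fetchNextBatch(50);
                if (assignedUrls.isEmpty()) {
                    break;
                }
                for (String url : assignedUrls) {
                    try {
                        processUrl(url);
                    } catch (Exception e) {
                        // Without this catch, one bad page (e.g. the issue-50
                        // link) propagates out of run(), the thread dies, and
                        // the rest of the batch is silently never crawled.
                        System.err.println("Failed to process " + url + ": " + e);
                    }
                }
            }
        }

        private List<String> fetchNextBatch(int max) {
            // Placeholder for fetching the next batch from the frontier.
            return Collections.emptyList();
        }

        private void processUrl(String url) throws Exception {
            // Placeholder for the page fetching/parsing that throws in issue 50.
        }
    }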
Original issue reported on code.google.com by vundic...@gmail.com on 29 Nov 2011 at 2:27