Closed · rockyhalo closed this issue 6 years ago
Have you tried using getThread().getName()? You can put this in a System.out.println or logger.debug call and it will return a string like Crawler 1. Just place it in the visit() method.
This will always give Crawler 1 even if it is the response of 5 controllers. I don't think we have an option to name a controller.
I put the getThread().getName() call in the visit() method of the BasicCrawler class, ran it with 75 crawlers, and it worked.
I just crawled wikipedia.org with 3 crawlers just now and have copied the output lines from the visit(Page page) statements:
12:41:14 INFO [Crawler 1] - [WebCrawler]- URL: https://en.wikipedia.org/wiki/Main_Page
12:41:14 DEBUG [Crawler 1] - [WebCrawler]- Domain: 'wikipedia.org'
12:41:14 DEBUG [Crawler 1] - [WebCrawler]- Sub-domain: 'en'
12:41:14 DEBUG [Crawler 1] - [WebCrawler]- Path: '/wiki/Main_Page'
12:41:14 DEBUG [Crawler 1] - [WebCrawler]- Parent page: null
12:41:14 DEBUG [Crawler 1] - [WebCrawler]- Anchor text: null
12:41:14 DEBUG [Crawler 1] - [WebCrawler]- In the visit(Page page method of:Crawler 1

12:41:14 INFO [Crawler 2] - [WebCrawler]- URL: https://en.wikipedia.org/wiki/David_Sugarbaker
12:41:14 DEBUG [Crawler 2] - [WebCrawler]- Domain: 'wikipedia.org'
12:41:14 DEBUG [Crawler 2] - [WebCrawler]- Sub-domain: 'en'
12:41:14 DEBUG [Crawler 2] - [WebCrawler]- Path: '/wiki/David_Sugarbaker'
12:41:14 DEBUG [Crawler 2] - [WebCrawler]- Parent page: https://en.wikipedia.org/wiki/Main_Page
12:41:14 DEBUG [Crawler 2] - [WebCrawler]- Anchor text: David Sugarbaker
12:41:14 DEBUG [Crawler 2] - [WebCrawler]- In the visit(Page page method of:Crawler 2

12:41:17 INFO [Crawler 3] - [WebCrawler]- URL: https://en.wikipedia.org/wiki/Beer_festival
12:41:17 DEBUG [Crawler 3] - [WebCrawler]- Domain: 'wikipedia.org'
12:41:17 DEBUG [Crawler 3] - [WebCrawler]- Sub-domain: 'en'
12:41:17 DEBUG [Crawler 3] - [WebCrawler]- Path: '/wiki/Beer_festival'
12:41:17 DEBUG [Crawler 3] - [WebCrawler]- Parent page: https://en.wikipedia.org/wiki/Main_Page
12:41:17 DEBUG [Crawler 3] - [WebCrawler]- Anchor text: beer festivals
12:41:17 DEBUG [Crawler 3] - [WebCrawler]- In the visit(Page page method of:Crawler 3
Below are the lines of code in the visit(Page page) method, with the last one the line I just added:
logger.debug("Docid: {}", docid);
logger.info("URL: {}", url);
logger.debug("Domain: '{}'", domain);
logger.debug("Sub-domain: '{}'", subDomain);
logger.debug("Path: '{}'", path);
logger.debug("Parent page: {}", parentUrl);
logger.debug("Anchor text: {}", anchor);
logger.debug("In the visit(Page page) method of: {}", getThread().getName());
I remembered that if your crawl is too shallow it can end quickly, with only the first thread grabbing the first link and the other threads left without any links to crawl. Try crawling a large site to a depth of 5 with three crawlers.
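Setting crawler4j aside, the Crawler 1 / Crawler 2 tags above are just standard Java thread names. A minimal stand-alone sketch of the same idea (the pool, the runNamed helper, and the "Crawler N" naming are stand-ins written for this example, not crawler4j's actual internals):

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class ThreadNameDemo {
    // Runs `tasks` jobs on a pool whose worker threads are named
    // "Crawler 1", "Crawler 2", ... (mirroring crawler4j's convention)
    // and returns the thread name each job observed via Thread.currentThread().getName().
    static List<String> runNamed(int threads, int tasks) throws Exception {
        AtomicInteger n = new AtomicInteger(1);
        ExecutorService pool = Executors.newFixedThreadPool(threads,
                r -> new Thread(r, "Crawler " + n.getAndIncrement()));
        List<Future<String>> futures = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            futures.add(pool.submit(() -> Thread.currentThread().getName()));
        }
        List<String> names = new ArrayList<>();
        for (Future<String> f : futures) {
            names.add(f.get());
        }
        pool.shutdown();
        return names;
    }

    public static void main(String[] args) throws Exception {
        for (String name : runNamed(3, 5)) {
            System.out.println("visit() ran on: " + name);
        }
    }
}
```

This is also why the name alone can't distinguish two controllers: each controller numbers its own threads starting from 1.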
May I ask why you are creating multiple controllers, as opposed to a single controller with multiple crawlers?
As most of us do, I'm using just one controller to create multiple crawlers, with each crawler having its own thread.
sorry, the question was for @rockyhalo
@rockyhalo, no problem.
@pgalbraith I have a different use case. Suppose I start a crawl with 2 seed URLs and 2 crawlers. Now, if I need to start another crawl with 3 seed URLs and 2 crawlers, what do I do? Either I have to use another controller or wait for the existing crawl to stop.
If I now make a new controller, its crawlers will again be counted as Crawler 1 and Crawler 2.
So my main aim here is to give the first 2 seed URLs one identity and the second 3 seed URLs a separate identity.
I may be wrong, or there might be a better way, but I feel this is an issue, as we should know which controller a response is coming from in visit(Page page).
@rockyhalo It looks to me like you can start your controller asynchronously and then continue to call controller.addSeed(String pageUrl) to add new seed URLs as you discover them after the controller has started. But that's still a single controller using a fixed number of threads.
If you really need separate controllers, possibly with differing numbers of threads, then there are also some simple ways for your crawler to identify its "owning" controller.
You could extend CrawlController and add an identifier field. Then, in your crawler class, use getMyController() and cast it to your extended class type to obtain the identifier.
You could also make use of the WebCrawlerFactory and have the factory inject a controller identifier of some sort into each WebCrawler as it is instantiated.
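The factory idea can be sketched in plain Java. The classes below are simplified stand-ins for crawler4j's WebCrawler and WebCrawlerFactory (IdAwareCrawler, controllerId, and factoryFor are names made up for this illustration): each controller would use its own factory, which stamps that controller's identifier into every crawler it creates.

```java
import java.util.function.Supplier;

public class FactoryIdDemo {
    // Stand-in for a WebCrawler subclass that carries its controller's id.
    static class IdAwareCrawler {
        final String controllerId;

        IdAwareCrawler(String controllerId) {
            this.controllerId = controllerId;
        }

        // Stand-in for visit(Page page): tags each visit with the owning controller.
        String visit(String url) {
            return controllerId + " visited " + url;
        }
    }

    // Stand-in for a WebCrawlerFactory: one factory per controller,
    // each injecting that controller's identifier at creation time.
    static Supplier<IdAwareCrawler> factoryFor(String controllerId) {
        return () -> new IdAwareCrawler(controllerId);
    }

    public static void main(String[] args) {
        IdAwareCrawler a = factoryFor("controller-A").get();
        IdAwareCrawler b = factoryFor("controller-B").get();
        System.out.println(a.visit("https://en.wikipedia.org/"));
        System.out.println(b.visit("https://example.org/"));
    }
}
```

With this shape, even if both controllers name their threads Crawler 1 and Crawler 2, every visit can still report which controller owns it.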
@pgalbraith, I think I know where I missed the point on the issue. Wouldn't it also be possible to call myController.toString() within the crawler instance? The superclass WebCrawler has protected CrawlController myController as a field.
@shalipoto If I understand correctly, you're also suggesting extending CrawlController and adding a toString() method in the extension? This works too, and has the benefit that the crawler wouldn't need to cast the controller when getting a reference to it.
@pgalbraith, no need to extend the controller; just add a myController.toString() statement to the visit(Page page) method of the crawler class. I got output something like this:
In the visit() method of the SavePageWebCrawler: Crawler 1 called by the controller: edu.uci.ics.crawler4j.crawler.CrawlController@57bc24d0
Yes, that works if the hash code is enough for the OP's needs.
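For context, that CrawlController@57bc24d0 suffix is just Object's default toString(): the class name followed by the identity hash code in hex, so each controller instance prints a distinct-looking tag. A small stand-alone illustration (the empty CrawlController below is a stand-in for this example, not crawler4j's class):

```java
public class ToStringIdDemo {
    // Stand-in class; crawler4j's CrawlController inherits the same
    // Object.toString(), which is why its instances print like ClassName@hash.
    static class CrawlController { }

    public static void main(String[] args) {
        CrawlController c1 = new CrawlController();
        CrawlController c2 = new CrawlController();
        System.out.println(c1); // prints ToStringIdDemo$CrawlController@<hex identity hash>
        System.out.println(c2); // a second instance normally prints a different hash
    }
}
```

Note that identity hash codes are not guaranteed unique, so for robust identification an explicit id field injected into the crawler is the safer choice.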
@rockyhalo, what do you think?
@shalipoto I tried toString() on the myController class and it looks like I can manage to do my work. Thanks @shalipoto @pgalbraith
@rockyhalo, if you are happy with the answers, can you close this issue please?
I have started 2 controllers; how do I determine which controller the call to the visit(Page page) function came from?