yasserg / crawler4j

Open Source Web Crawler for Java
Apache License 2.0

[Question] How to determine controller in visit(Page page) ? #328

Closed rockyhalo closed 6 years ago

rockyhalo commented 6 years ago

I have started 2 controllers. How do I determine which controller a call to the visit(Page page) function came from?

shalipoto commented 6 years ago

Have you tried using getThread().getName()? You can put this in a System.out.println or logger.debug call and it will return a string such as Crawler 1. Just place it in the visit() method.

rockyhalo commented 6 years ago

This will always give Crawler 1, even if the call comes from one of 5 controllers. I don't think we have an option to name a controller.

shalipoto commented 6 years ago

I put the getThread().getName() call in the visit() method of the BasicCrawler class and I ran it with 75 crawlers and it worked.

shalipoto commented 6 years ago

I just crawled wikipedia.org with 3 crawlers just now and I have copied the output lines with the visit(Page page) statements:

    12:41:14 INFO [Crawler 1] - [WebCrawler]- URL: https://en.wikipedia.org/wiki/Main_Page
    12:41:14 DEBUG [Crawler 1] - [WebCrawler]- Domain: 'wikipedia.org'
    12:41:14 DEBUG [Crawler 1] - [WebCrawler]- Sub-domain: 'en'
    12:41:14 DEBUG [Crawler 1] - [WebCrawler]- Path: '/wiki/Main_Page'
    12:41:14 DEBUG [Crawler 1] - [WebCrawler]- Parent page: null
    12:41:14 DEBUG [Crawler 1] - [WebCrawler]- Anchor text: null
    12:41:14 DEBUG [Crawler 1] - [WebCrawler]- In the visit(Page page method of:Crawler 1

    12:41:14 INFO [Crawler 2] - [WebCrawler]- URL: https://en.wikipedia.org/wiki/David_Sugarbaker
    12:41:14 DEBUG [Crawler 2] - [WebCrawler]- Domain: 'wikipedia.org'
    12:41:14 DEBUG [Crawler 2] - [WebCrawler]- Sub-domain: 'en'
    12:41:14 DEBUG [Crawler 2] - [WebCrawler]- Path: '/wiki/David_Sugarbaker'
    12:41:14 DEBUG [Crawler 2] - [WebCrawler]- Parent page: https://en.wikipedia.org/wiki/Main_Page
    12:41:14 DEBUG [Crawler 2] - [WebCrawler]- Anchor text: David Sugarbaker
    12:41:14 DEBUG [Crawler 2] - [WebCrawler]- In the visit(Page page method of:Crawler 2

    12:41:17 INFO [Crawler 3] - [WebCrawler]- URL: https://en.wikipedia.org/wiki/Beer_festival
    12:41:17 DEBUG [Crawler 3] - [WebCrawler]- Domain: 'wikipedia.org'
    12:41:17 DEBUG [Crawler 3] - [WebCrawler]- Sub-domain: 'en'
    12:41:17 DEBUG [Crawler 3] - [WebCrawler]- Path: '/wiki/Beer_festival'
    12:41:17 DEBUG [Crawler 3] - [WebCrawler]- Parent page: https://en.wikipedia.org/wiki/Main_Page
    12:41:17 DEBUG [Crawler 3] - [WebCrawler]- Anchor text: beer festivals
    12:41:17 DEBUG [Crawler 3] - [WebCrawler]- In the visit(Page page method of:Crawler 3

Below are the lines of code in the visit(Page page) method with the last one I just added:

    logger.debug("Docid: {}", docid);

    logger.info("URL: {}", url);

    logger.debug("Domain: '{}'", domain);

    logger.debug("Sub-domain: '{}'", subDomain);

    logger.debug("Path: '{}'", path);

    logger.debug("Parent page: {}", parentUrl);

    logger.debug("Anchor text: {}", anchor);

    logger.debug("In the visit(Page page method of:{}", getThread().getName());

shalipoto commented 6 years ago

I remembered that if your crawl is too shallow it can end quickly, with only the first thread grabbing the first link and the other threads left without any links to crawl. Try crawling a large site at a depth of 5 with three crawlers.

pgalbraith commented 6 years ago

May I ask why you are creating multiple controllers, as opposed to a single controller with multiple crawlers?

shalipoto commented 6 years ago

As most of us do, I'm using just one controller to create multiple crawlers, with each crawler having its own thread.

pgalbraith commented 6 years ago

Sorry, the question was for @rockyhalo.

shalipoto commented 6 years ago

@rockyhalo, no problem.

rockyhalo commented 6 years ago

@pgalbraith I have a different use case. Suppose I start a crawl with 2 seed URLs and 2 crawlers. Now if I need to start another crawl with 3 seed URLs and 2 crawlers, what do I do? Either I have to use another controller or wait for the existing crawl to stop.

If I now make a new controller, its crawlers will again be counted as Crawler 1 and Crawler 2.

So my main aim here is to identify the first 2 seed URLs as one identity and the second 3 seed URLs as a separate identity.

I may be wrong, or there might be a better way, but I feel this is an issue: we should know which controller a response is coming from in visit(Page page).

pgalbraith commented 6 years ago

@rockyhalo It looks to me like you can start your controller asynchronously and then continue to call controller.addSeed(String pageUrl) to add new seed URLs as you discover them after the controller is already started. But that's still a single controller using a fixed number of threads.
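The "one controller, seeds added while running" pattern could be sketched like this. Since crawler4j itself isn't assumed here, the class below is a minimal stand-in whose startNonBlocking()/addSeed()/waitUntilFinish() methods only mimic the shape of the real CrawlController API; the point is that seeds can keep arriving after the crawl has started.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Minimal stand-in for a crawl controller: the real crawler4j
// CrawlController exposes similarly named methods, but this class only
// mimics their shape for illustration.
public class SeedController {
    private final BlockingQueue<String> frontier = new LinkedBlockingQueue<>();
    private final List<String> visited = new ArrayList<>();
    private Thread worker;

    // Start "crawling" on a background thread and return immediately.
    public void startNonBlocking() {
        worker = new Thread(() -> {
            try {
                String url;
                // Drain the frontier; a real controller would fetch and
                // parse pages here. Give up after 500 ms of no new seeds.
                while ((url = frontier.poll(500, TimeUnit.MILLISECONDS)) != null) {
                    synchronized (visited) { visited.add(url); }
                }
            } catch (InterruptedException ignored) { }
        });
        worker.start();
    }

    // Seeds can be added after the crawl has already started.
    public void addSeed(String url) {
        frontier.add(url);
    }

    public List<String> waitUntilFinish() throws InterruptedException {
        worker.join();
        return visited;
    }

    public static void main(String[] args) throws InterruptedException {
        SeedController controller = new SeedController();
        controller.startNonBlocking();
        controller.addSeed("https://en.wikipedia.org/wiki/Main_Page");
        controller.addSeed("https://en.wikipedia.org/wiki/Beer_festival"); // added while running
        System.out.println(controller.waitUntilFinish().size()); // prints 2
    }
}
```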

If you really need to have separate controllers with possibly differing numbers of threads, then there are also some simple ways that you can have your crawler identify its "owning" controller.

You could extend CrawlController and add an identifier field. Then, in your crawler class, use getMyController() and cast the result to your extended class type to obtain the identifier.
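A sketch of that approach, with CrawlController and WebCrawler reduced to bare stand-ins so the example compiles on its own (the real classes live in edu.uci.ics.crawler4j.crawler); NamedCrawlController, getId(), and owningControllerId() are made-up names for illustration.

```java
class CrawlController { }                 // stand-in for the crawler4j class

class WebCrawler {                        // stand-in: the real one holds myController
    protected CrawlController myController;
    public CrawlController getMyController() { return myController; }
}

// Extended controller carrying a human-readable identifier.
class NamedCrawlController extends CrawlController {
    private final String id;
    NamedCrawlController(String id) { this.id = id; }
    public String getId() { return id; }
}

public class CastExample extends WebCrawler {
    // In visit(Page page) you can recover the owning controller's id
    // by downcasting the reference returned by getMyController().
    public String owningControllerId() {
        return ((NamedCrawlController) getMyController()).getId();
    }

    public static void main(String[] args) {
        CastExample crawler = new CastExample();
        crawler.myController = new NamedCrawlController("first-batch");
        System.out.println(crawler.owningControllerId()); // prints first-batch
    }
}
```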

You could also make use of the WebCrawlerFactory and have the factory inject a controller identifier of some sort into each WebCrawler as it is instantiated.
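And a sketch of the factory route. The WebCrawlerFactory interface is re-declared here as a stand-in for the crawler4j one, and TaggedCrawler/TaggingFactory are hypothetical names; the idea is simply that the factory stamps every crawler it builds with its controller's identifier.

```java
// Stand-in for edu.uci.ics.crawler4j.crawler.WebCrawlerFactory<T>,
// whose newInstance() is called once per crawler thread.
interface WebCrawlerFactory<T> {
    T newInstance() throws Exception;
}

class TaggedCrawler {
    private String controllerTag;            // injected by the factory
    void setControllerTag(String tag) { this.controllerTag = tag; }
    String getControllerTag() { return controllerTag; }
}

// Factory that stamps every crawler it builds with its controller's tag.
public class TaggingFactory implements WebCrawlerFactory<TaggedCrawler> {
    private final String tag;
    public TaggingFactory(String tag) { this.tag = tag; }

    @Override
    public TaggedCrawler newInstance() {
        TaggedCrawler crawler = new TaggedCrawler();
        crawler.setControllerTag(tag);       // identifier injected at creation
        return crawler;
    }

    public static void main(String[] args) {
        TaggingFactory factory = new TaggingFactory("controller-2");
        System.out.println(factory.newInstance().getControllerTag()); // prints controller-2
    }
}
```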

shalipoto commented 6 years ago

@pgalbraith, I think I know where I missed the point on the issue. Wouldn't it also be possible to call myController.toString() within the crawler instance? The superclass WebCrawler has protected CrawlController myController as a field.

pgalbraith commented 6 years ago

@shalipoto If I understand correctly you're also suggesting extending CrawlController and adding a toString() method in the extension? This works too, and has the benefit that the crawler wouldn't need to cast the controller when getting a reference to it.

shalipoto commented 6 years ago

@pgalbraith, there's no need to extend the controller; just add a myController.toString() statement to the crawler's visit(Page page) method. I got output something like this:

In the visit() method of the SavePageWebCrawler: Crawler 1 called by the controller: edu.uci.ics.crawler4j.crawler.CrawlController@57bc24d0
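That CrawlController@57bc24d0 suffix is just the default java.lang.Object.toString(): the class name plus the hex identity hash code. Two live controller instances will therefore (in practice) print distinct strings, which is what makes this trick usable as an identity. A tiny demo with a stand-in Controller class:

```java
public class ToStringDemo {
    static class Controller { }              // stand-in for CrawlController

    public static void main(String[] args) {
        Controller a = new Controller();
        Controller b = new Controller();
        // Default toString() = getClass().getName() + "@" + hex identity hash.
        System.out.println(a);               // e.g. ToStringDemo$Controller@57bc24d0
        System.out.println(b);               // same class name, different hash
        System.out.println(a.toString().equals(b.toString())); // false in practice
    }
}
```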

pgalbraith commented 6 years ago

Yes that works if the hash code is enough for the OP's needs.

shalipoto commented 6 years ago

@rockyhalo, what do you think?

rockyhalo commented 6 years ago

@shalipoto I tried toString() on the myController class, and it looks like I can manage to do my work. Thanks @shalipoto @pgalbraith

shalipoto commented 6 years ago

@rockyhalo, if you are happy with the answers, can you close this issue please?