xrma / crawler4j

Automatically exported from code.google.com/p/crawler4j

shouldVisit list of domain to crawl #94

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

Hi, in shouldVisit I need to know the list of domains I want to crawl, and I don't
want to hardcode it. I would like to write code like this:
public boolean shouldVisit(WebURL url) {
    String href = url.getURL().toLowerCase();
    if (filters.matcher(href).matches()) {
        return false;
    }
    for (String domain : this.getDomainsToCrawl()) {
        if (href.startsWith(domain)) {
            return true;
        }
    }
    return false;
}

I decide dynamically which domains I would like to crawl. I really need this
feature.

What is the expected output? What do you see instead?

What version of the product are you using? On what operating system?
crawler4j-2.6.1.jar

Please provide any additional information below.

Original issue reported on code.google.com by marek.hu...@gmail.com on 17 Nov 2011 at 8:00

GoogleCodeExporter commented 9 years ago
I am using the code from trunk, where you can do something like this:

public boolean shouldVisit(WebURL url) {
    // The domain list is whatever was passed to controller.setCustomData(...)
    List<String> domainsToCrawl = (List<String>) this.getMyController().getCustomData();

    String href = url.getURL().toLowerCase();
    if (filters.matcher(href).matches()) {
        return false;
    }

    // Iterate over the list read from the controller, not a separate getter
    for (String domain : domainsToCrawl) {
        if (href.startsWith(domain)) {
            return true;
        }
    }
    return false;
}

And set up the controller something like this:

List<String> domainsToCrawl = new ArrayList<String>();
domainsToCrawl.add("http://www.example.com");
domainsToCrawl.add("http://www.example.net");
domainsToCrawl.add("http://www.example.org");

controller.setCustomData(domainsToCrawl);
controller.start(IndexCrawler.class, 1);
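
The matching logic itself can be tried without crawler4j. The following is only a minimal sketch under stated assumptions: DomainFilter is a hypothetical class, the FILTERS pattern stands in for the "filters" field used above (its extension list is made up), and the URL and domain list are passed in as plain arguments instead of being read from the controller.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

public class DomainFilter {
    // Assumption: a stand-in for the thread's "filters" pattern; the extensions are illustrative.
    private static final Pattern FILTERS =
            Pattern.compile(".*\\.(css|js|gif|jpe?g|png|zip|gz)$");

    // Returns true when the URL is not filtered out and starts with one of the given prefixes.
    public static boolean shouldVisit(String url, List<String> domainsToCrawl) {
        String href = url.toLowerCase(Locale.ROOT);
        if (FILTERS.matcher(href).matches()) {
            return false;
        }
        for (String domain : domainsToCrawl) {
            // Lowercase the prefix too, so both sides are compared in the same case.
            if (href.startsWith(domain.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> domains = Arrays.asList("http://www.example.com", "http://www.example.net");
        System.out.println(shouldVisit("http://www.example.com/page.html", domains)); // true
        System.out.println(shouldVisit("http://www.example.com/logo.png", domains));  // false
        System.out.println(shouldVisit("http://www.other.com/", domains));            // false
    }
}
```

Inside a real crawler the list would come from getMyController().getCustomData() as shown above, so it can be decided at run time.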

Original comment by tahs...@trademango.com on 10 Jan 2012 at 7:36

GoogleCodeExporter commented 9 years ago

Original comment by ganjisaffar@gmail.com on 23 Jan 2012 at 12:15

GoogleCodeExporter commented 9 years ago
Hi, I did it like this:
String[] urls = {"http://www.url1/", "http://www.url.co.uk/home/index.html", "http://www.url3.com/index.htm", "http://www.url4.com/"};
for (String url : urls) {
    CrawlConfig config = new CrawlConfig();

    /* your configuration of config goes here */

    controller.setCustomData(url);

    controller.addSeed(url);
    controller.start(MyCrawler.class, 1);
}

And in the shouldVisit method:
public boolean shouldVisit(WebURL url) {
    String href = url.getURL().toLowerCase();
    return !FILTERS.matcher(href).matches()
            && href.startsWith(this.getMyController().getCustomData().toString());
}

I hope that helps
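
One caveat with the startsWith checks in this thread: a prefix such as http://www.example.com also matches http://www.example.com.evil.org/. A possible stricter variant, which is not what the posts above do, compares the host part of the URL against a set of allowed hosts. This is only a sketch using the standard java.net.URI class; HostFilter, hostOf and isAllowed are hypothetical names.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class HostFilter {
    // Extracts the lowercased host from a URL, or null if it cannot be parsed or has no host.
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    // True only when the URL's host is exactly one of the allowed hosts.
    public static boolean isAllowed(String url, Set<String> allowedHosts) {
        String host = hostOf(url);
        return host != null && allowedHosts.contains(host);
    }

    public static void main(String[] args) {
        Set<String> hosts = new HashSet<>(Arrays.asList("www.example.com", "www.example.net"));
        System.out.println(isAllowed("http://www.example.com/index.html", hosts)); // true
        System.out.println(isAllowed("http://www.example.com.evil.org/", hosts));  // false
    }
}
```

In a crawler the set of hosts could be passed through setCustomData in the same way as the list of prefixes.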

Original comment by ju...@gmx.net on 13 Jul 2014 at 1:50

GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 11 Aug 2014 at 12:59