yasserg / crawler4j

Open Source Web Crawler for Java
Apache License 2.0

Mechanism to catch already Seen URL #126

Open Vineeth-Mohan opened 8 years ago

Vineeth-Mohan commented 8 years ago

If the URL to be fetched has already been seen, it is discarded silently. We would like a mechanism to be notified when this happens. Please provide a function that we can override to achieve this functionality.

Chaiavi commented 8 years ago

I think this functionality is far from the purpose of crawler4j

A part of the core functionality of any crawler - crawler4j included - is to save all of the crawled URLs in a special DB (the frontier) in order NOT to crawl them again.

I think this functionality shouldn't be included in this project (you can fork it and create your own...)

@yasserg this issue can be closed

Vineeth-Mohan commented 8 years ago

@Chaiavi - I feel this is a very valid feature. For example, let's say I need to track the number of inbound links to a web page. If I can be notified when a duplicate page is discovered, then by looking at the parent link I can tell whether this inbound link has already been tracked or not. This way I can maintain the number of inbound links to a page and use it to, say, implement something like Google's PageRank algorithm.
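
As a rough illustration of what such a notification could enable (onAlreadySeen is a hypothetical callback for this sketch, not something crawler4j offers today):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical listener that maintains inbound-link counts per URL. It would be
    // invoked whenever an already-seen URL is discovered again on another page
    // (the first discovery of each URL would be counted elsewhere, e.g. in visit()).
    public class InboundLinkCounter {

        private final Map<String, Integer> inboundLinks = new ConcurrentHashMap<>();

        public void onAlreadySeen(String url, String parentUrl) {
            inboundLinks.merge(url, 1, Integer::sum);
        }

        public int inboundLinkCount(String url) {
            return inboundLinks.getOrDefault(url, 0);
        }
    }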

rzo1 commented 8 years ago

@Vineeth-Mohan I spent a few months extending crawler4j to run across multiple physical machines and ran into similar problems, e.g. you crawl the entire Web and would like to update the web pages stored in your document repositories.

The class responsible for checking for duplicate URLs is the DocIDServer, which is assigned to the CrawlController in its constructor. In order to get notifications, you could extend the DocIDServer and take a special look at public boolean isSeenBefore(String url). This method decides whether the URL has already been seen by any crawler thread.
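
A minimal sketch of that idea (assuming the crawler4j 4.x package layout and that isSeenBefore is overridable; the Consumer callback is our own addition, not part of crawler4j):

    import java.util.function.Consumer;

    import com.sleepycat.je.Environment;

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.frontier.DocIDServer;

    // Sketch: a DocIDServer that reports every already-seen URL to a callback.
    public class NotifyingDocIDServer extends DocIDServer {

        private final Consumer<String> duplicateListener;

        public NotifyingDocIDServer(Environment env, CrawlConfig config,
                                    Consumer<String> duplicateListener) {
            super(env, config);
            this.duplicateListener = duplicateListener;
        }

        @Override
        public boolean isSeenBefore(String url) {
            boolean seen = super.isSeenBefore(url);
            if (seen) {
                duplicateListener.accept(url); // notify about the duplicate URL
            }
            return seen;
        }
    }

Getting such an instance into the crawl is the tricky part, because the CrawlController currently creates its own DocIDServer in its constructor (see the discussion below).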

Vineeth-Mohan commented 8 years ago

@rzo1 - Looks good to me. I will go with it. But I feel this is a mainstream feature request. @Chaiavi - Please comment.

Chaiavi commented 8 years ago

@rzo1 I have seen you around here many times. From your experience with crawler4j and crawlers in general, do you think this functionality should be implemented in the crawler?

rzo1 commented 8 years ago

For certain use cases this feature might be very interesting, e.g. let's say we would like to:

  1. Crawl the entire Web (or parts of it using focused crawling) and store the retrieved and cleaned HTML in some kind of document repository. For this purpose, assigning a unique document id is necessary to avoid duplicates.
  2. After - let's say - one month, I would like to update my document repository: with the current implementation (assuming resumable=true), I have to drop my gathered document id database in order to perform an update on my document repository. However, this is not very efficient...

Another use case, from a more scientific point of view:

  1. Keeping track of the occurrence of duplicate URLs might be interesting for statistical modelling in a focused-crawling approach, since this might be a parameter for estimating the size of the Web for a certain domain.

I think it would be beneficial to decouple the hard binding in the CrawlController:

    env = new Environment(envHome, envConfig);
    docIdServer = new DocIDServer(env, config);
    frontier = new Frontier(env, config);

If this were possible in the official sources, every developer would be free to provide and maintain his/her own implementation of the DocIDServer (e.g. overriding isSeenBefore(String url)) and could implement custom notification behaviour for duplicate URLs without doing some dirty value overriding in a custom CrawlController (for this reason and some others, I forked crawler4j into a separate git repo...).

So I think we should first discuss whether we want to decouple the hard bindings in the CrawlController to offer - at least - the possibility to easily implement some custom duplicate-URL behaviour.
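
One way that decoupling could look (purely a sketch of the idea, not code from the official sources) would be to route the construction through overridable factory methods, so that a subclass can swap in its own DocIDServer or Frontier:

    import com.sleepycat.je.Environment;

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.frontier.DocIDServer;
    import edu.uci.ics.crawler4j.frontier.Frontier;

    // Hypothetical refactoring, NOT the official CrawlController: the hard-coded
    // "new DocIDServer(env, config)" is replaced by protected factory methods.
    public class ExtensibleController {

        protected final DocIDServer docIdServer;
        protected final Frontier frontier;

        public ExtensibleController(Environment env, CrawlConfig config) {
            this.docIdServer = createDocIDServer(env, config);
            this.frontier = createFrontier(env, config);
        }

        // Subclasses override this to plug in e.g. a DocIDServer that reports duplicates.
        protected DocIDServer createDocIDServer(Environment env, CrawlConfig config) {
            return new DocIDServer(env, config);
        }

        protected Frontier createFrontier(Environment env, CrawlConfig config) {
            return new Frontier(env, config);
        }
    }

An equally simple alternative would be to accept a pre-built DocIDServer as a constructor argument, which avoids calling overridable methods from the constructor.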

Richa-b commented 6 years ago

Hi, are there any updates on this thread? Has this functionality been implemented?

DieterVDW commented 3 years ago

I just bumped into this, and I'd also appreciate having this behaviour built in! It is not a major issue if it isn't, but I feel like it can catch people by surprise.

For my use case: I am counting how many pages are in each node of a site's sitemap. I only just discovered that many nodes count a lot of duplicates.

This is not hard to solve on my side of course (a sketch of that workaround follows this list), but if this were built in somehow, with a setting to enable it, it would have two benefits:

  1. Very handy for people like me who want unique URLs
  2. Having an explicit setting for this would at least create awareness that crawler4j by default does NOT guarantee to deliver unique URLs
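
For completeness, a minimal sketch of that kind of workaround on the caller's side (the class and method names are made up for illustration, not crawler4j API):

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Workaround sketch: deduplicate URLs before counting them per sitemap node.
    public class UniquePageCounter {

        private final Set<String> seenUrls = ConcurrentHashMap.newKeySet();

        public void onPageVisited(String url, String sitemapNode) {
            // add() returns false if the URL was counted before, so duplicates are skipped
            if (seenUrls.add(url)) {
                countPageForNode(sitemapNode);
            }
        }

        private void countPageForNode(String sitemapNode) {
            // application-specific counting logic would go here
        }
    }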