Open Vineeth-Mohan opened 8 years ago

If the URL to be fetched has already been seen, it is discarded silently. We would like a mechanism to get notified when this happens. Please provide a function that we can override to achieve this functionality.
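For illustration, the requested override point might look something like this (a purely hypothetical signature, not part of crawler4j today):

```java
import edu.uci.ics.crawler4j.url.WebURL;

// Hypothetical callback on WebCrawler, invoked whenever a discovered URL is
// silently discarded because it has been seen before:
protected void onUrlSeenBefore(WebURL url) {
    // e.g. inspect url.getParentUrl() to record one more inbound link
}
```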
I think this functionality is far from the purpose of crawler4j
A part of the core functionality of any crawler - crawler4j included - is to save all of the crawled URLs in a special db (frontier) in order NOT to crawl them again.
I think this functionality shouldn't be included in this project (you can fork it and create your own...)
@yasserg this issue can be closed
@Chaiavi - I feel this is a very valid feature. For example, let's say I need to track the number of inbound links to a web page. If I can get notified on discovering a duplicate page, then by looking at the parent link I can tell whether this inbound link has already been tracked or not. This way, I can maintain the number of inbound links to a page and use it to, say, implement something like Google's PageRank algorithm.
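A rough sketch of that bookkeeping (a hypothetical helper, assuming the crawler hands us a (parentUrl, url) pair for every discovered link, duplicates included):

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: remembers which parent pages link to which URL, so a
// duplicate notification can be turned into an inbound-link count.
public class InboundLinkCounter {
    private final Map<String, Set<String>> inbound = new ConcurrentHashMap<>();

    // Call this for every discovered link, including duplicates; the set
    // ensures each (parent -> url) edge is only counted once.
    public void onLinkDiscovered(String parentUrl, String url) {
        inbound.computeIfAbsent(url, k -> ConcurrentHashMap.newKeySet())
               .add(parentUrl);
    }

    public int inboundLinkCount(String url) {
        Set<String> parents = inbound.get(url);
        return parents == null ? 0 : parents.size();
    }
}
```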
@Vineeth-Mohan I worked for a few months on extending crawler4j to be used over multiple physical machines and ran into similar problems, e.g. you crawl the entire Web and you would like to update Web pages stored in your document repositories.
The class responsible for checking for duplicate URLs is the `DocIDServer`, which is assigned to the `CrawlController` in its constructor. In order to get notifications, you could extend the `DocIDServer` and take a special look at `public boolean isSeenBefore(String url)`. This method decides whether the URL was seen before by any crawler thread.
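A minimal sketch of that approach, assuming your subclass can be wired into the controller (the `onDuplicate` hook is a hypothetical name, not part of crawler4j):

```java
import com.sleepycat.je.Environment;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.frontier.DocIDServer;

// Sketch: a DocIDServer subclass that reports duplicate URLs.
public class NotifyingDocIDServer extends DocIDServer {

    public NotifyingDocIDServer(Environment env, CrawlConfig config) {
        super(env, config);
    }

    @Override
    public boolean isSeenBefore(String url) {
        boolean seen = super.isSeenBefore(url);
        if (seen) {
            onDuplicate(url); // fire a notification for the duplicate
        }
        return seen;
    }

    // Hypothetical hook - replace with your own notification mechanism
    // (listener list, counter, message queue, ...).
    protected void onDuplicate(String url) {
        System.out.println("Duplicate URL discovered: " + url);
    }
}
```

Note that the controller currently hard-wires `new DocIDServer(env, config)` in its constructor, so plugging this subclass in still needs a custom `CrawlController` (see below).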
@rzo1 - Looks good to me, I will go with it. But I feel this is a mainstream feature request. @Chaiavi - Please comment.
@rzo1 I have seen you around here many times; from your experience with crawler4j & crawlers overall, do you think this functionality should be implemented in the crawler?
For certain use-cases this feature might be very interesting, e.g. let's say we would like to update pages already stored in a document repository: even with `resumable=true`, I have to drop my gathered document-id database in order to perform such an update. However, this is not very efficient... Other use-cases exist from a more scientific point of view, e.g. the inbound-link tracking described above.
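For reference, resumable crawling is switched on via the `CrawlConfig` (these two setters are part of the regular crawler4j API):

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;

CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder("/tmp/crawler4j"); // holds the frontier and doc-id databases
config.setResumableCrawling(true);              // keep them across runs instead of starting fresh
```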
I think it would be a benefit to decouple the hard binding in the `CrawlController`:

```java
env = new Environment(envHome, envConfig);
docIdServer = new DocIDServer(env, config);
frontier = new Frontier(env, config);
```
If this were possible in the official sources, every developer would be free to obtain and maintain his/her own implementation of the `DocIDServer` (e.g. overriding `isSeenBefore(String url)`) and could implement custom notification behaviour for duplicate URLs without doing some dirty value overriding in a custom `CrawlController` (for this reason, among others, I forked crawler4j into a separate git repo...).

So I think we should first discuss whether we want to decouple the hard bindings in the `CrawlController` to offer - at least - the possibility to easily implement custom duplicate-URL behaviour.
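One shape the decoupling could take (a hypothetical sketch of a change to the official sources, not the current API) is a protected factory method, so the constructor no longer hard-codes the implementation:

```java
// In CrawlController's constructor, replace the hard binding
//     docIdServer = new DocIDServer(env, config);
// with a call through an overridable seam:
//     docIdServer = createDocIdServer(env, config);

// New protected factory method on CrawlController (hypothetical):
protected DocIDServer createDocIdServer(Environment env, CrawlConfig config) {
    return new DocIDServer(env, config); // default behaviour stays the same
}
```

A custom controller could then override `createDocIdServer(...)` to return e.g. the `NotifyingDocIDServer` sketched above, without any dirty value overriding.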
Hi, are there any updates on this thread? Has this functionality been implemented?
I just bumped into this, and I'd also appreciate this behaviour built-in! It is not a major issue if it isn't, but I feel like this can catch people by surprise.
For my use case: I am counting how many pages are in each node of a site's sitemap. I only discovered just now that many nodes count a lot of duplicates.
Not hard to solve on my side of course, but if this were built in behind a setting, it would save others the same surprise.