Open · Vineeth-Mohan opened this issue 8 years ago
I am not sure that crawler4j is the right tool for this task.
Crawler4j's purpose is to crawl the web, discovering new URLs and new content until it covers the whole web.
For that purpose, crawler4j won't crawl the same URL twice.
You can hack it to do anything you want, but this is far from the original intent of crawler4j, so I am not sure you will find your answer in this project.
You are invited to fork this project and create a monitoring tool using crawler4j,
or to create a slim app that does your task (monitoring a predefined set of URLs to see if there is new content),
or to find an existing tool that already does it (Nagios?).
@yasserg this issue can be closed
I think you are looking for two things:

1. The `DocIDServer` needs to be adapted in order to allow duplicate URLs.
2. For the `nextURL` logic, the clearest thing is to influence the priority of the `WebURL` class in your `shouldVisit` implementation. Based on this byte value, the URLs are retrieved from the underlying Berkeley database.
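For illustration, a minimal sketch of the second point, assuming the `WebURL#setPriority(byte)` setter described above (the URL patterns here are made up):

```java
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class FeedPriorityCrawler extends WebCrawler {
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // Lower byte values are handed out by the frontier first,
        // so feed URLs get fetched before ordinary pages.
        if (url.getURL().contains("/rss") || url.getURL().endsWith(".xml")) {
            url.setPriority((byte) 0);
        } else {
            url.setPriority((byte) 10);
        }
        return true;
    }
}
```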
@rzo1 - That is exactly what I want. I would also love to have one more thing: I would like to pass some context along with the `WebURL` too. As of now, I have modified the `WebURL` class to hold an info field (String) and changed the code to serialize and deserialize `WebURL` instances to accommodate this. Now I will do the following:
```java
WebURL url = new WebURL();
url.setURL(rssLink);
// info is the custom String field added to WebURL above
url.setInfo("{ 'type' : 'rss' , 'source' : 'ibn' }");
// hand the URL straight to the frontier for scheduling
controller.getFrontier().schedule(url);
```
This will allow me to judge the kind of source I am handling and process it accordingly.
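A minimal sketch of the consuming side, assuming the custom `info` field described above (`getInfo()` is the author's own addition, not stock crawler4j):

```java
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;

public class RssAwareCrawler extends WebCrawler {
    @Override
    public void visit(Page page) {
        // getInfo() is the custom accessor added to WebURL by the author above
        String info = page.getWebURL().getInfo();
        if (info != null && info.contains("rss")) {
            // treat page.getContentData() as an RSS feed and extract item links
        } else {
            // regular HTML handling
        }
    }
}
```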
@Vineeth-Mohan While implementing focused crawling I had to influence the `WebURL` attributes and their storage too... However, instances of the `WebURL` class are created at a few points via `new WebURL()`.

@Chaiavi @yasserg Maybe it would be a benefit to introduce a `WebURLFactory` in order to allow custom implementations of `WebURL` and the related serialize and deserialize features. This can be useful if a developer needs some more information about a `WebURL`, e.g. for a focused crawler the probability of being part of the focused domain or not...
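A rough sketch of how the proposed factory could look (these names are illustrative, not existing crawler4j API at the time of this discussion):

```java
import edu.uci.ics.crawler4j.url.WebURL;

// WebURLFactory.java: pluggable factory so the crawler can create WebURL
// subclasses wherever it currently calls new WebURL() directly.
public interface WebURLFactory {
    WebURL newWebURL();
}

// FocusedWebURL.java: example subclass for focused crawling, carrying the
// probability that the URL belongs to the focused domain.
class FocusedWebURL extends WebURL {
    private double focusProbability;

    public double getFocusProbability() { return focusProbability; }
    public void setFocusProbability(double p) { this.focusProbability = p; }
}
```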
@rzo1 - Yes, `WebURLFactory` looks perfect.
Thanks for making the crawler4j application. It is indeed complete and useful. What I am looking for is an RSS crawler which can extract the links from an RSS feed and pick up new ones periodically, but crawler4j does not have support for this.
I was wondering if we could detach the nextURLs logic from the core engine, so that we can extend this mechanism to define how to get the next set of URLs for RSS, a Twitter page, or maybe Facebook pages too.
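A hypothetical sketch of the extension point being requested (the interface and names are illustrative, nothing like this exists in crawler4j):

```java
import java.util.List;
import edu.uci.ics.crawler4j.url.WebURL;

// The engine would ask a pluggable supplier for the next batch of URLs
// instead of hard-coding the frontier logic.
public interface NextUrlSupplier {
    // Return the next batch of URLs to crawl, e.g. items freshly
    // discovered from an RSS feed, a Twitter page, etc.
    List<WebURL> nextUrls(int maxUrls);
}
```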