yasserg / crawler4j

Open Source Web Crawler for Java
Apache License 2.0

Non HTML Crawl Mechanism using plugin architecture #125

Open · Vineeth-Mohan opened 8 years ago

Vineeth-Mohan commented 8 years ago

Thanks for making crawler4j. It is indeed complete and useful. What I am looking for is an RSS crawler that can extract the links from an RSS feed and pick up new ones periodically, but crawler4j does not currently support this.

I was wondering if we can detach the nextURLs logic from the core engine, so that we can extend this mechanism and define how to get the next set of URLs for RSS, Twitter pages, or maybe Facebook pages too.

Chaiavi commented 8 years ago

I am not sure that crawler4j is the right tool for the task

Crawler4j's purpose is to crawl the web, discovering new URLs and new content until it has covered the whole web.

For that purpose, crawler4j won't crawl the same URL twice.

You can hack it to do anything you want, but this is far from the original intent of crawler4j, so I am not sure you will find your answer in this project.

You are invited to fork this one and create a monitoring tool using crawler4j

Or create a slim app that does your task (monitors a predefined set of URLs to see whether there is new content).

Or try to find a tool that does it (Nagios?)

Chaiavi commented 8 years ago

@yasserg this issue can be closed

rzo1 commented 8 years ago

I think you are looking for two things:

  1. A way to modify and manage your own DocIDServer in order to allow duplicate URLs.
  2. To influence the nextURL logic: the cleanest way is to set the priority of the WebURL in your shouldVisit implementation (a sketch follows below). Based on this byte value, the URLs are retrieved from the underlying Berkeley database.
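
A minimal sketch of the second point, assuming the crawler4j 4.x API where WebURL exposes setPriority(byte) and the crawler overrides shouldVisit(Page, WebURL); the feed heuristic and the example domain are made up for illustration:

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class PrioritizingCrawler extends WebCrawler {

        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            String href = url.getURL().toLowerCase();

            // Feed-like links get a lower priority byte; lower values sort
            // first in the Berkeley DB key, so they are fetched earlier.
            if (href.contains("/rss") || href.contains("/feed")) {
                url.setPriority((byte) 0);
            } else {
                url.setPriority((byte) 10);
            }
            return href.startsWith("https://www.example.com/");
        }
    }
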
Vineeth-Mohan commented 8 years ago

@rzo1 - That is exactly what I want. I would also love one more thing: I would like to pass some context along with the WebURL. For now, I have modified the WebURL class to hold an info field (String) and changed the code that serializes and deserializes WebURL instances to accommodate it. Now I will do the following:


        WebURL url = new WebURL();
        url.setURL(rssLink);
        url.setInfo("{ 'type' : 'rss' , 'source' : 'ibn' }");

        controller.getFrontier().schedule(url);

This will allow me to tell what kind of source I am handling and process it accordingly.
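
For completeness, a rough sketch of how a crawler might consume that context; note that getInfo() only exists on the locally modified WebURL class described above, not in stock crawler4j, and the handler methods are placeholders:

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class ProtocolAwareCrawler extends WebCrawler {

        @Override
        public void visit(Page page) {
            WebURL url = page.getWebURL();

            // getInfo() is part of the local WebURL modification, not the
            // upstream API; it carries the JSON context set at schedule time.
            String info = url.getInfo();
            if (info != null && info.contains("'type' : 'rss'")) {
                handleRssFeed(page);   // parse as a feed, schedule entry links
            } else {
                handleHtmlPage(page);  // default HTML handling
            }
        }

        private void handleRssFeed(Page page) { /* feed-specific processing */ }

        private void handleHtmlPage(Page page) { /* regular page processing */ }
    }
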

rzo1 commented 8 years ago

@Vineeth-Mohan While implementing focused crawling, I had to influence the WebURL attributes and their storage as well... However, instances of the WebURL class are created at several points via new WebURL().

@Chaiavi @yasserg Maybe it would be beneficial to introduce a WebURLFactory that allows custom implementations of WebURL along with the related serialize and deserialize logic. This could be useful if a developer needs more information about a WebURL, e.g. for a focused crawler the probability of being part of the focused domain or not...
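
Purely as an illustration of the idea (nothing like this existed in crawler4j at the time, so every name below is hypothetical), such a factory could look like:

    import edu.uci.ics.crawler4j.url.WebURL;

    // Hypothetical factory: every place that currently calls "new WebURL()"
    // would ask the factory instead, so a crawl can plug in a WebURL subclass
    // carrying extra attributes (matching changes to the serialization code
    // would still be needed to persist those attributes).
    public interface WebURLFactory {
        WebURL newWebURL();
    }

    // Example: a focused crawler storing the probability that a URL belongs
    // to the focused domain.
    class FocusedWebURL extends WebURL {
        private double focusProbability;

        public double getFocusProbability() { return focusProbability; }
        public void setFocusProbability(double p) { this.focusProbability = p; }
    }

    class FocusedWebURLFactory implements WebURLFactory {
        @Override
        public WebURL newWebURL() {
            return new FocusedWebURL();
        }
    }
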

Vineeth-Mohan commented 8 years ago

@rzo1 - Yes, a WebURLFactory looks perfect.