xrma / crawler4j

Automatically exported from code.google.com/p/crawler4j

handlePageStatusCode triggers only once for the same broken link? #120

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Assume xyz.com is a broken link that appears on the homepage and on 3 other internal pages.
2. The crawler reports the broken link when it finds it on the homepage.
3. It doesn't report it on the 3 other internal pages.

What is the expected output? What do you see instead?
The broken link should be reported every time it is seen, on every page. The current
implementation is useful only when the user creates a new page that is broken
(since the crawler won't report the link again in the future). But if the user has
permanently removed the target page, fixing the link on one page doesn't solve the
problem: during the next crawl, the link is reported on another page, and so on,
until the broken link no longer appears anywhere.

What version of the product are you using?
Latest from source.

Please provide any additional information below.

Original issue reported on code.google.com by w3engine...@gmail.com on 5 Feb 2012 at 4:49

GoogleCodeExporter commented 9 years ago
Hi Yasser,
Any updates on this?

Original comment by w3engine...@gmail.com on 15 Feb 2012 at 10:52

GoogleCodeExporter commented 9 years ago
This is by design. Crawler4j is designed for crawling domains and extracting
content, not for detecting broken links. This is a need specific to your
application. Of course, you can customize it to support your scenario. For
example, whenever you see a broken link, keep it somewhere (memory, a database, ...),
and whenever a new page is visited, go through its links and check whether any of
the known broken links are among them.

-Yasser

Original comment by ganjisaffar@gmail.com on 17 Feb 2012 at 5:12
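
---

A minimal sketch of the approach Yasser describes, added for illustration. It assumes an in-memory set shared across crawler threads (a database would work as well) and uses crawler4j's `WebCrawler` callbacks `handlePageStatusCode` and `visit`; the class name `BrokenLinkCrawler` is hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class BrokenLinkCrawler extends WebCrawler {

    // Shared across all crawler threads; a persistent store (db, ...)
    // would also let the set survive between crawls.
    private static final Set<String> brokenLinks = ConcurrentHashMap.newKeySet();

    @Override
    protected void handlePageStatusCode(WebURL webUrl, int statusCode,
                                        String statusDescription) {
        // Remember every URL that came back as an HTTP error.
        if (statusCode >= 400) {
            brokenLinks.add(webUrl.getURL());
        }
    }

    @Override
    public void visit(Page page) {
        // On every visited page, report any outgoing link that is
        // already known to be broken.
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlData = (HtmlParseData) page.getParseData();
            for (WebURL link : htmlData.getOutgoingUrls()) {
                if (brokenLinks.contains(link.getURL())) {
                    System.out.println("Broken link " + link.getURL()
                            + " found on page " + page.getWebURL().getURL());
                }
            }
        }
    }
}
```

Note that within a single crawl this only reports a broken link on pages visited after the link was first seen to fail; persisting the set between crawls (e.g., in a database) would cover pages crawled earlier.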