xrma / crawler4j

Automatically exported from code.google.com/p/crawler4j

Crawler never stops and repeats URL #134

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Hello,

I am crawling a website but facing a severe problem: once the crawler starts, it never stops and keeps re-crawling the same URL, because the URL carries different parameter values each time. So I filter each URL before it is crawled, i.e. I strip the unnecessary parameters appended to the URL before it goes to temp storage. For example, in this URL I want to remove everything after "pid":

http://www.flipkart.com/hp-notebook-g-series-g4-1201tu-laptop/p/itmdfx3mf62yjhyd?pid=COMDFX3G9NMHUMZB&_l=Fn4WNNUnIJAhmSh90VjZnw--&_r=OyYYz5HR3HEeh3r+23hvlA--&ref=ec8d9744-24c4-468f-8e24-49ab3cc74dd6

I've tried to patch it with the following change in

http://code.google.com/p/crawler4j/source/browse/src/main/java/edu/uci/ics/crawler4j/parser/HtmlContentHandler.java

after line #71:
if (href != null) {
    // Strip the tracking parameters so the same page is not
    // treated as a new URL on every visit.
    if (href.indexOf("?pid=") > 0) {
        href = href.substring(0, href.indexOf("?pid="));
    }
    outgoingUrls.add(href);
}

The filtering itself works, but the crawling process still never stops and keeps repeating. Please let me know where I should patch the code.
Your effort and suggestions are valuable to me.

Thanks in advance.

Original issue reported on code.google.com by hiren2hi...@gmail.com on 11 Mar 2012 at 5:36

GoogleCodeExporter commented 9 years ago
If you're crawling a single domain, I suggest the following: for each URL that you receive in shouldVisit, extract the initial part and generate a CRC32 checksum for it. Keep the checksums in a HashSet, and for each URL you get in shouldVisit, make sure its checksum has not been seen before. This way you don't need to change the crawler4j code.

-Yasser

Original comment by ganjisaffar@gmail.com on 12 Mar 2012 at 3:11
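
[Editor's note] A minimal sketch of the approach described above, assuming the crawler4j 3.x API of the time (WebCrawler.shouldVisit(WebURL), WebURL.getURL()). The class name, the shared HashSet field, and the "?pid=" normalization rule (taken from the original report) are illustrative, not part of crawler4j:

import java.util.HashSet;
import java.util.Set;
import java.util.zip.CRC32;

import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class PidAwareCrawler extends WebCrawler {

    // Checksums of normalized URLs seen so far, shared by all crawler threads.
    private static final Set<Long> seenChecksums = new HashSet<Long>();

    @Override
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();

        // Normalize: drop everything from "?pid=" onwards, as in the report above.
        int cut = href.indexOf("?pid=");
        String normalized = (cut > 0) ? href.substring(0, cut) : href;

        // CRC32 checksum of the normalized URL, as suggested.
        CRC32 crc = new CRC32();
        crc.update(normalized.getBytes());
        long checksum = crc.getValue();

        // Visit only if this checksum has not been seen before;
        // Set.add returns false for duplicates.
        synchronized (seenChecksums) {
            return seenChecksums.add(checksum);
        }
    }
}

A domain filter would normally be combined with this check; it is omitted here for brevity.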

GoogleCodeExporter commented 9 years ago
Sorry to resurrect an old post. Yasser, can you please give an example of what you're talking about? I have the same issue and need a solution.

Would it not be better to checksum the whole URL rather than the initial part?

Original comment by ldrive...@gmail.com on 17 Feb 2013 at 7:32