xrma / crawler4j

Automatically exported from code.google.com/p/crawler4j

Different keys used for PUT and REMOVE operations on DB #156

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Run a crawler (mine was built from the image crawler example)
2. Watch the logs
3. "Could not remove..." is written out for every database entry.

What is the expected output? What do you see instead?
Anything that goes in must come out; instead, the removal fails for every entry. I created a database helper class to consistently generate the same key for the same URL.

What version of the product are you using?
3.3

Please provide any additional information below.
The database helper class accounts for the key-generation mismatch between WorkQueues.java (put) and InProcessPagesDB.java (removeURL).
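To make the helper idea concrete, here is a minimal sketch assuming Berkeley DB JE keys built from the URL's docid. The class and method names are illustrative only and are not taken from the reporter's patch or from the crawler4j source.

```java
import java.nio.ByteBuffer;

import com.sleepycat.je.DatabaseEntry;
import edu.uci.ics.crawler4j.url.WebURL;

// Hypothetical helper: if both the put side (work queues) and the remove side
// (in-process pages DB) build their keys through this single method, the same
// URL always maps to the same key bytes and the remove can succeed.
public final class DatabaseKeyHelper {

    private DatabaseKeyHelper() {
        // static utility, no instances
    }

    /** Builds a Berkeley DB key from the URL's document id. */
    public static DatabaseEntry keyFor(WebURL url) {
        byte[] keyBytes = ByteBuffer.allocate(4).putInt(url.getDocid()).array();
        return new DatabaseEntry(keyBytes);
    }
}
```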

Original issue reported on code.google.com by dpaulal...@gmail.com on 28 May 2012 at 2:04

GoogleCodeExporter commented 9 years ago
I have a similar problem.
Basically, I downloaded the source from https://code.google.com/p/crawler4j/ and ran the BasicCrawlController test class with config.setResumableCrawling(true); for every entry the warning "Could not remove: http://www.ics.uci.edu/about/ from list of processed pages" is written.
But when I run with config.setResumableCrawling(false), everything works fine.
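For reference, a rough sketch of that reproduction against the crawler4j 3.x API. The storage folder, seed, and crawler count are placeholders, and a trivial WebCrawler subclass stands in for the BasicCrawler used by the bundled BasicCrawlController.

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class ResumableCrawlRepro {

    // Minimal crawler; the default WebCrawler behavior is enough for this sketch.
    public static class ReproCrawler extends WebCrawler {
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j"); // placeholder path
        // Enabling resumable crawling is what surfaces the
        // "Could not remove ..." warnings described above.
        config.setResumableCrawling(true);

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("http://www.ics.uci.edu/");
        controller.start(ReproCrawler.class, 1);
    }
}
```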

Original comment by asset.d...@gmail.com on 6 Nov 2012 at 2:15

GoogleCodeExporter commented 9 years ago
I get the same behavior: I downloaded the source, and when I run with setResumableCrawling(true) the error appears.

Thanks

Original comment by inde...@gmail.com on 21 Nov 2012 at 9:30

GoogleCodeExporter commented 9 years ago
This is fixed in this changelist:
https://code.google.com/p/crawler4j/source/detail?r=3615d8f58ea3072b85007611a547cf21bccbfb18

Thanks,
Yasser

Original comment by ganjisaffar@gmail.com on 2 Mar 2013 at 10:15