xrma / crawler4j

Automatically exported from code.google.com/p/crawler4j

Internal error in WebURL #131

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
While crawling the seed http://eventiesagre.it/ I get the internal error 
reported below.
I guess the issue occurs because the crawler finds a URL without a final "/".

Processing page: [http://eventiesagre.it/]
Processing page: [http://eventiesagre.it/Eventi_Mostre/21033267_Museo+della+Bilancia.html]
Processing page: [http://eventiesagre.it/Eventi_Mostra+Mercato/21055275_Per+Corti+E+Cascine.html]
java.lang.StringIndexOutOfBoundsException: String index out of range: -2
    at java.lang.String.substring(String.java:1937)
    at edu.uci.ics.crawler4j.url.WebURL.setURL(WebURL.java:87)
    at edu.uci.ics.crawler4j.parser.Parser.parse(Parser.java:133)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:276)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:189)
    at java.lang.Thread.run(Thread.java:680)

Original issue reported on code.google.com by michele.mostarda on 2 Mar 2012 at 2:03

GoogleCodeExporter commented 9 years ago
Hi,
Would you please let me know which version you are using? I tried this domain 
and couldn't reproduce this bug. There have been bugs in WebURL that I have 
fixed in the latest version.

Thanks,
Yasser

Original comment by ganjisaffar@gmail.com on 5 Mar 2012 at 6:52

GoogleCodeExporter commented 9 years ago
I've seen this issue since a straight upgrade to 3.3 as well.

Original comment by DarenDa...@gmail.com on 6 Mar 2012 at 11:15

GoogleCodeExporter commented 9 years ago
I also experience this bug with 3.3.

Original comment by try6...@gmail.com on 16 Aug 2012 at 8:40

GoogleCodeExporter commented 9 years ago
I believe this is caused when the URL is empty.
I would add some validation of the URL, to make sure it's not null or empty, 
and also some checks on the value of domainEndIdx, to make sure it's 
not negative or smaller than domainStartIdx, which would cause the substring 
call to fail.

Original comment by try6...@gmail.com on 16 Aug 2012 at 10:19
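The validation suggested in the comment above could be sketched roughly as follows. This is not the actual WebURL.setURL code: the class DomainExtractor and the method extractDomain are hypothetical names made up for illustration, and the way the indices are computed (searching for "//" and then the next "/") is an assumption. Only the names domainStartIdx and domainEndIdx come from the comment.

```java
import java.util.Optional;

// Hypothetical helper, not part of crawler4j.
class DomainExtractor {

    // Returns the host part of a URL, or empty if it cannot be determined.
    static Optional<String> extractDomain(String url) {
        // Reject null or empty input before doing any index arithmetic.
        if (url == null || url.isEmpty()) {
            return Optional.empty();
        }
        // Assumption: the domain starts right after "//".
        int schemeIdx = url.indexOf("//");
        if (schemeIdx < 0) {
            return Optional.empty(); // e.g. "mailto:..." has no "//"
        }
        int domainStartIdx = schemeIdx + 2;
        int domainEndIdx = url.indexOf('/', domainStartIdx);
        // A URL without a final "/" ends the domain at the end of the string.
        if (domainEndIdx < 0) {
            domainEndIdx = url.length();
        }
        // Guard against an empty or inverted range before calling substring.
        if (domainEndIdx <= domainStartIdx) {
            return Optional.empty();
        }
        return Optional.of(url.substring(domainStartIdx, domainEndIdx));
    }

    public static void main(String[] args) {
        System.out.println(extractDomain("http://eventiesagre.it"));
        System.out.println(extractDomain("mailto:type email address here"));
    }
}
```

Returning an empty Optional instead of throwing lets the caller skip a malformed link and keep crawling; whether that is the right policy for crawler4j is a design choice this sketch does not settle.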

GoogleCodeExporter commented 9 years ago
I ran across this issue while trying to crawl boingboing.net. After doing some 
digging, I discovered that the way boingboing does their "share article" -> 
email (javascript) breaks the crawler.

The issue is that the "mailto" link doesn't supply an email address; instead it 
says "type email address here" so the if statement specifying "@" in 
Parser.parse() doesn't get hit.

Original comment by uva...@gmail.com on 2 Jan 2013 at 3:43
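One way to avoid depending on an "@" being present, as described in the comment above, would be to filter links by scheme before they are turned into URLs. The sketch below is only an illustration: LinkFilter and isCrawlable are hypothetical names, this is not the code in Parser.parse(), and the list of skipped schemes is an assumption.

```java
import java.util.Locale;

// Hypothetical helper, not part of crawler4j.
class LinkFilter {

    // Returns true if the href looks like something a web crawler should follow.
    static boolean isCrawlable(String href) {
        if (href == null) {
            return false;
        }
        String normalized = href.trim().toLowerCase(Locale.ROOT);
        if (normalized.isEmpty()) {
            return false;
        }
        // Decide by scheme prefix, so a "mailto:" link is skipped
        // even when it contains placeholder text instead of an address.
        return !(normalized.startsWith("mailto:")
                || normalized.startsWith("javascript:"));
    }

    public static void main(String[] args) {
        System.out.println(isCrawlable("mailto:type email address here"));
        System.out.println(isCrawlable("http://boingboing.net/"));
    }
}
```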

GoogleCodeExporter commented 9 years ago

Original comment by avrah...@gmail.com on 18 Aug 2014 at 3:13

GoogleCodeExporter commented 9 years ago
http://eventiesagre.it/
and
http://boingboing.net/

are now being crawled without any incident.

If anybody still sees an error related to this bug, please report it.

Original comment by avrah...@gmail.com on 19 Aug 2014 at 2:40

GoogleCodeExporter commented 9 years ago
Fixed original IndexOutOfBoundsException in revision: 65954e30f219  

Original comment by avrah...@gmail.com on 19 Aug 2014 at 3:37