xrma / crawler4j

Automatically exported from code.google.com/p/crawler4j

Crawling sites that contain mailto: links #126

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
<b>What steps will reproduce the problem?</b>
1. Crawl a website that contains "mailto:" links.
2. For example, use http://www.heise.de/index.html as the seed.

<b>What is the expected output? What do you see instead?</b>
expected: a successful crawl.
instead: a StringIndexOutOfBoundsException in WebURL.java

<b>What version of the product are you using?</b>
crawler4j 3.3.1

<b>Please provide any additional information below.</b>
The exception is thrown in WebURL.java at line 87, reached from a call in 
Parser.java at line 133.

<b>After changing the code at line 118 in Parser.java from:</b> 
if (!hrefWithoutProtocol.contains("javascript:") &&  
    !hrefWithoutProtocol.contains("@")) {

<b>To:</b>
if (!hrefWithoutProtocol.contains("mailto:") && 
    !hrefWithoutProtocol.contains("javascript:") &&  
    !hrefWithoutProtocol.contains("@")) {

it works for me.
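
For anyone who wants to try the patched condition outside the crawler, here is a minimal standalone sketch. The class name `HrefFilter`, the method name `shouldFollow`, and the sample hrefs are made up for illustration and are not part of crawler4j; only the three `contains` checks come from the change above. Note that the existing "@" check already skips hrefs such as mailto:someone@example.com, so the added "mailto:" check only changes the outcome for mailto links without an "@". Whether that is the kind of link that triggers the exception on the seed page is an assumption, not something verified here.

```java
// Sketch only: HrefFilter and shouldFollow are hypothetical names, not crawler4j API.
final class HrefFilter {

    private HrefFilter() {}

    // Mirrors the patched condition from Parser.java:
    // skip any href that contains "mailto:", "javascript:", or "@".
    static boolean shouldFollow(String hrefWithoutProtocol) {
        return !hrefWithoutProtocol.contains("mailto:")
                && !hrefWithoutProtocol.contains("javascript:")
                && !hrefWithoutProtocol.contains("@");
    }

    public static void main(String[] args) {
        // Sample hrefs are illustrative assumptions, not taken from the seed page.
        String[] samples = {
            "www.heise.de/index.html",
            "mailto:someone@example.com",
            "mailto:?subject=Hello",
            "javascript:void(0)"
        };
        for (String href : samples) {
            System.out.println(href + " -> " + (shouldFollow(href) ? "follow" : "skip"));
        }
    }
}
```

The checks are case-sensitive, exactly like the patched line, so an href written as "MAILTO:" would still pass this filter.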

Original issue reported on code.google.com by nuex...@googlemail.com on 23 Feb 2012 at 10:36

GoogleCodeExporter commented 9 years ago

Original comment by ganjisaffar@gmail.com on 3 Mar 2013 at 8:07