xrma / crawler4j

Automatically exported from code.google.com/p/crawler4j

Error during parsing when a link within a crawled page has a trailing % #115

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Crawl a page that contains a link whose URL has a trailing %.
2. For example, a page with the following link: <a href="http://www.example.com/search?width=100%&height=100%

What is the expected output? What do you see instead?
The page should be parsed without errors. Instead, an IllegalArgumentException is thrown in the Parser code.

Here's the stack trace:
ERROR [Crawler 25] URLDecoder: Incomplete trailing escape (%) pattern, while processing: http://www.xxxxxxx.com/41274/PD/xxxxx.htm
java.lang.IllegalArgumentException: URLDecoder: Incomplete trailing escape (%) pattern
    at java.net.URLDecoder.decode(URLDecoder.java:187)
    at edu.uci.ics.crawler4j.url.URLCanonicalizer.percentEncodeRfc3986(URLCanonicalizer.java:209)
    at edu.uci.ics.crawler4j.url.URLCanonicalizer.canonicalize(URLCanonicalizer.java:191)
    at edu.uci.ics.crawler4j.url.URLCanonicalizer.getCanonicalURL(URLCanonicalizer.java:99)
    at edu.uci.ics.crawler4j.parser.Parser.parse(Parser.java:119)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:262)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:179)
    at java.lang.Thread.run(Thread.java:679)
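
The failure at the top of the trace can be reproduced outside the crawler, since `java.net.URLDecoder.decode` rejects a `%` that is not followed by two hex digits. The sketch below shows that, plus one possible workaround: escaping stray `%` characters as `%25` before decoding. The class `LenientUrlDecoding` and its methods are hypothetical names used for illustration only; this is not crawler4j code and not necessarily how the project fixed the issue.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper for illustration; not part of crawler4j.
public class LenientUrlDecoding {

    // Returns true if the standard decoder rejects the input,
    // as it does in the stack trace above.
    static boolean strictDecodeFails(String s) {
        try {
            URLDecoder.decode(s, "UTF-8");
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be available", e);
        }
    }

    // Assumption: a '%' that does not start a valid escape is meant literally.
    // Such a '%' is rewritten to "%25" so the decoder turns it back into '%'.
    static String decodeLeniently(String s) {
        String escaped = s.replaceAll("%(?![0-9a-fA-F]{2})", "%25");
        try {
            return URLDecoder.decode(escaped, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be available", e);
        }
    }

    public static void main(String[] args) {
        String query = "width=100%&height=100%";
        System.out.println("strict decode fails: " + strictDecodeFails(query));
        System.out.println("lenient decode:      " + decodeLeniently(query));
    }
}
```

Valid escapes such as `%20` are still decoded as usual; only a `%` without two following hex digits is treated as a literal character. Whether that is the right interpretation for a crawler is a design choice, and skipping the malformed link instead would be an equally reasonable option.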

What version of the product are you using?
3.1

Please provide any additional information below.

Original issue reported on code.google.com by raj...@indix.com on 25 Jan 2012 at 9:12

GoogleCodeExporter commented 9 years ago
This issue is already fixed in the current source code, and the fix will be included in the next release.

Thanks for reporting.
-Yasser

Original comment by ganjisaffar@gmail.com on 4 Feb 2012 at 11:41