Hi Felix,
if I understand correctly, you called DefaultExtractor#getText(URL) with a URL
like "http://www.äöü.xyz/".
This seems to be unsupported by Java 6 (see
http://java.sun.com/docs/books/tutorial/i18n/network/iri.html
). Strictly speaking, what you passed was an IRI, not a URL.
A workaround for now could be to create the URLs like this:
URL u = new URL("http://"+IDN.toASCII("www.äöü.xyz")+"/");
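For reference, here is a slightly fuller sketch of that workaround. It assumes only java.net.IDN (available since Java 6); the class name IdnUrls and the helper asciiHttpUrl are made up for illustration and are not part of boilerpipe. The host name is written with Unicode escapes so that the result does not depend on the source file encoding.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URL;

final class IdnUrls {
    // Hypothetical helper (not a boilerpipe API): converts an internationalized
    // host name to its ASCII-compatible form and returns "http://<ascii-host>/".
    static String asciiHttpUrl(String unicodeHost) {
        // IDN.toASCII throws IllegalArgumentException if the host cannot be converted.
        String asciiHost = IDN.toASCII(unicodeHost);
        return "http://" + asciiHost + "/";
    }

    public static void main(String[] args) throws MalformedURLException {
        // "\u00e4\u00f6\u00fc" is "äöü".
        URL u = new URL(asciiHttpUrl("www.\u00e4\u00f6\u00fc.xyz"));
        // The resulting URL contains only ASCII characters.
        System.out.println(u);
    }
}
```

Only the host part goes through IDN.toASCII here; a path or query containing non-ASCII characters would need separate percent-encoding.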
However, since the getText(URL) method is explicitly marked as "show case
only", you might also consider using a dedicated HTTP client library, such as
HttpClient (http://hc.apache.org/), and calling getText(InputSource) instead.
I would not recommend using getText(URL) in a production setup. Sooner or
later you will run into problems that are out of scope for boilerpipe
(robots.txt, broken servers, proxies, ...).
Marking as WontFix.
Best,
Christian
Original comment by ckkohl79
on 24 Jan 2010 at 4:09
Original issue reported on code.google.com by
feliz...@gmx.de
on 21 Jan 2010 at 2:32