What steps will reproduce the problem?
1. Parse the XML file.
2. Observe that the Unicode characters are decoded incorrectly.
I'm not sure how this wasn't noticed as a serious error before.
I don't know what I'm doing differently that requires this change, or how it
works for others. Strange.
For me, passing "UTF8" as the encoding to the InputStreamReader fixed
everything, so the Unicode characters are now read in correctly:
    protected InputSource getInputSource() throws Exception
    {
        BufferedReader br = null;
        if (wikiXMLFile.endsWith(".gz")) {
            br = new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(new FileInputStream(wikiXMLFile)), "UTF8"));
        } else if (wikiXMLFile.endsWith(".bz2")) {
            FileInputStream fis = new FileInputStream(wikiXMLFile);
            byte[] ignoreBytes = new byte[2];
            fis.read(ignoreBytes); // "B", "Z" bytes from commandline tools
            br = new BufferedReader(new InputStreamReader(
                    new CBZip2InputStream(fis), "UTF8"));
        } else {
            br = new BufferedReader(new InputStreamReader(
                    new FileInputStream(wikiXMLFile), "UTF8"));
        }
        return new InputSource(br);
    }
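A possible explanation for why this only shows up on some machines: an InputStreamReader constructed without a charset decodes with the JVM's default charset, which depends on the platform and locale. Where that default is already UTF-8, the original code would appear to work. The sketch below is only an illustration of that effect, not part of the parser. The class name EncodingDemo, the helper decode, and the sample string are made up for the example, and ISO-8859-1 stands in for a non-UTF-8 platform default.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the parser.
final class EncodingDemo {

    // Decode the bytes through an InputStreamReader with an explicit charset,
    // the same way getInputSource() wraps its input streams.
    static String decode(byte[] bytes, Charset charset) throws IOException {
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), charset))) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = br.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        String original = "Z\u00fcrich"; // contains one non-ASCII character
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in round-trips.
        String good = decode(utf8, StandardCharsets.UTF_8);
        // ISO-8859-1 stands in for a platform default that is not UTF-8:
        // the two-byte UTF-8 sequence is read as two separate characters.
        String bad = decode(utf8, StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 round trip matches: " + good.equals(original));
        System.out.println("ISO-8859-1 decode matches: " + bad.equals(original));
        System.out.println("Default charset here: " + Charset.defaultCharset());
    }
}
```

If the last line prints something other than UTF-8 on the affected machine, that would be consistent with the behaviour described above, and passing the charset explicitly, as in the patch, removes the dependency on the platform setting.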
Original issue reported on code.google.com by ianupri...@gmail.com on 1 Apr 2010 at 4:37