yanwong / ganon

Automatically exported from code.google.com/p/ganon

Auto charset conversion for getPlainText()? #16

Closed. GoogleCodeExporter closed this issue 8 years ago

GoogleCodeExporter commented 8 years ago
I'm scraping a web page in iso-8859-1, but my scripts work in UTF-8 (php 
code, mysql databases, etc), so.. if I get the text of a node, getPlainText() 
returns the text in iso-8859-1 (the charset of the loaded html) and I can't make 
equality comparisons in my code.

I solved this (for this particular case) by converting to UTF-8 in the 
getPlainText implementation:

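function getPlainText() {
    // Decode entities as ISO-8859-1 (the page's charset) explicitly, then convert to UTF-8.
    $text = html_entity_decode($this->toString(true, true, true), ENT_QUOTES, 'ISO-8859-1');
    return preg_replace('`\s+`', ' ', utf8_encode($text));
}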

but... I'm thinking... what about automatic detection of the loaded html's 
encoding, plus an option to set the charset of the strings returned by 
getPlainText()?

It's just an idea O:)

Original issue reported on code.google.com by Radika...@gmail.com on 6 Sep 2012 at 7:50

GoogleCodeExporter commented 8 years ago
I also noticed ganon has trouble handling GB2312 (Simplified Chinese). I ended 
up having to use iconv to convert to GBK before parsing, which is pretty slow 
for larger DOMs. Rules for charset conversions can be tricky.

Original comment by sjwood...@gmail.com on 18 Oct 2012 at 1:52

GoogleCodeExporter commented 8 years ago
I don't think it's a good idea to alter getPlainText, but maybe an extra method 
called getPlainTextUTF8? It might be better to just use a local solution, 
though.

Original comment by niels....@gmail.com on 19 Oct 2012 at 4:34
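
The "local solution" mentioned above could be sketched roughly as follows: a small helper that lives in the calling script, leaves ganon untouched, and converts whatever getPlainText() returns to UTF-8. The function name plain_text_to_utf8 and its $sourceCharset parameter are hypothetical and not part of ganon. The sketch assumes the mbstring extension is available and that the caller already knows the page's charset.

```php
<?php
// Hypothetical helper, not part of ganon: convert text returned by
// getPlainText() from a known source charset to UTF-8.
function plain_text_to_utf8($text, $sourceCharset = 'ISO-8859-1') {
    // mb_convert_encoding(string, to_encoding, from_encoding) comes from mbstring.
    $utf8 = mb_convert_encoding($text, 'UTF-8', $sourceCharset);
    // Collapse whitespace runs; the /u modifier makes PCRE treat the input as UTF-8.
    return preg_replace('/\s+/u', ' ', $utf8);
}

// Stand-in for $node->getPlainText() on an ISO-8859-1 page:
// "caf\xE9" is "café" encoded as ISO-8859-1.
$latin1 = "caf\xE9  con\nleche";
$utf8   = plain_text_to_utf8($latin1);
```

With a real ganon node the call would look like plain_text_to_utf8($node->getPlainText()). Keeping the conversion outside the library means getPlainText() keeps returning the page's original bytes for callers that rely on them.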

GoogleCodeExporter commented 8 years ago
The problem is... sometimes I don't know (or I don't want to know) which 
charset the input webpage is in... so any kind of autodetection would be great, 
so I can always use the same charset in my code.

Original comment by Radika...@gmail.com on 19 Oct 2012 at 4:39
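
One way the autodetection requested above might be approximated before the HTML is handed to the parser is sketched below. This is not how ganon works internally. The helpers guess_html_charset and html_to_utf8 are hypothetical names, the ISO-8859-1 fallback is an assumption, and the sketch relies on the mbstring extension. It trusts a charset declared in a meta tag first, then checks whether the bytes are valid UTF-8, and otherwise uses the fallback. A declared charset that mbstring does not recognise would still need separate handling.

```php
<?php
// Hypothetical helper, not part of ganon: guess the charset of raw HTML.
function guess_html_charset($html, $fallback = 'ISO-8859-1') {
    // 1. Prefer a declared charset, e.g. <meta charset="..."> or
    //    <meta http-equiv="Content-Type" content="text/html; charset=...">.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }
    // 2. Without a declaration, bytes that form valid UTF-8 are a strong signal.
    if (mb_check_encoding($html, 'UTF-8')) {
        return 'UTF-8';
    }
    // 3. Otherwise fall back to an assumed default.
    return $fallback;
}

// Hypothetical helper: normalise raw HTML to UTF-8 before parsing.
function html_to_utf8($html) {
    $charset = guess_html_charset($html);
    if ($charset === 'UTF-8') {
        return $html;
    }
    // Note: mb_convert_encoding complains if $charset is not supported.
    return mb_convert_encoding($html, 'UTF-8', $charset);
}
```

Converting once, up front, means every string later pulled out of the DOM is already UTF-8. The meta tag inside the converted document will still name the old charset, which matters only if something downstream reads it again.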

GoogleCodeExporter commented 8 years ago
Added a simple version of getPlainTextUTF8 in rev #76.

Original comment by niels....@gmail.com on 20 Oct 2012 at 10:45