wikimedia / html-metadata

MetaData html scraper and parser for Node.js (supports Promises and callback style)
MIT License

Wrong encoding problem #35

Closed midudev closed 7 years ago

midudev commented 8 years ago

Hi there! First of all, thanks for your great work on this piece of software. :+1:

I'm using it with some websites and it picks up the wrong charset encoding. For example, with http://elmundo.es I get some weird characters in the output. Any advice? I could take a look at the package and send a Pull Request if I'm able to fix it. :+1:

mvolz commented 8 years ago

Yes, this is a problem we've had with the request library as well. It only supports a limited number of encodings, so when it automatically decodes the response Buffer, the text can come out garbled.
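To illustrate the kind of garbling described above, here is a minimal sketch using only Node's built-in Buffer. The sample string and the choice of ISO-8859-1 (latin1) are assumptions made for the illustration; they are not taken from the site mentioned in the report.

```javascript
// Bytes for "España" encoded as ISO-8859-1 (latin1): "ñ" is the single byte 0xF1.
const latin1Bytes = Buffer.from('España', 'latin1');

// Decoding with the wrong charset: 0xF1 followed by "a" is not a valid
// UTF-8 sequence, so the byte is replaced with U+FFFD.
const wrong = latin1Bytes.toString('utf8');

// Decoding with the charset the bytes were actually written in.
const right = latin1Bytes.toString('latin1');

console.log(wrong); // contains the U+FFFD replacement character
console.log(right);
```

Once a byte has been replaced with U+FFFD the original character is lost, which is why the approach in this thread asks for the raw bytes first.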

What we do in citoid is use the iconv-lite Node library, which supports a much wider range of encodings: https://www.npmjs.com/package/iconv-lite

Then we make the request for html in the raw Buffer format, not a decoded format:

var options = {
    url: url,
    encoding: null // return the body as a raw Buffer instead of a decoded string
};

Then we try to figure out how the page is encoded. This is somewhat error-prone: although sites SHOULD declare the encoding in the HTTP Content-Type header, they don't always do so, and sometimes it's only declared in the HTML itself. In that case you have to decode the buffer as UTF-8, scrape the declared encoding from the decoded HTML, and then decode the buffer again, this time using the correct encoding. You can see how we do this in citoid with two functions in this file: https://github.com/wikimedia/citoid/blob/master/lib/Scraper.js. contentTypeFromResponse gets the contentType from the HTTP header, and contentTypeFromBody tries to find the contentType in the HTML.
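The detection order described above can be sketched roughly as follows. This is only an illustration: the helper names charsetFromHeader, charsetFromBody and detectCharset are hypothetical and are not citoid's actual functions, the 4096-byte search window is an arbitrary assumption, and the regular expressions cover only the common cases.

```javascript
// Hypothetical helpers for illustration; NOT citoid's implementation.

// Pull the charset out of a Content-Type header value,
// e.g. "text/html; charset=ISO-8859-1" gives "iso-8859-1".
function charsetFromHeader(contentTypeHeader) {
    if (!contentTypeHeader) { return null; }
    const match = /charset\s*=\s*["']?([^\s;"']+)/i.exec(contentTypeHeader);
    return match ? match[1].toLowerCase() : null;
}

// Look for <meta charset="..."> or
// <meta http-equiv="Content-Type" content="text/html; charset=...">.
// The buffer is provisionally decoded as UTF-8 just for this search:
// the markup itself is ASCII, so it stays readable even if other
// characters in the page come out wrong.
function charsetFromBody(buffer) {
    const head = buffer.subarray(0, 4096).toString('utf8'); // window size is an assumption
    const match = /<meta[^>]+charset\s*=\s*["']?([^\s"'\/>;]+)/i.exec(head);
    return match ? match[1].toLowerCase() : null;
}

// Prefer the HTTP header, then the markup, then fall back to UTF-8.
function detectCharset(headers, buffer) {
    return charsetFromHeader(headers['content-type']) ||
        charsetFromBody(buffer) ||
        'utf-8';
}
```

The returned name would then be passed to the decoder in the next step. Note that this sketch does not handle pages encoded as UTF-16, where the markup is not readable after a provisional UTF-8 decode.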

Then we decode the buffer using iconv-lite:

var iconv = require('iconv-lite');
var cheerio = require('cheerio');
var str = iconv.decode(response.body, contentType);
var $ = cheerio.load(str);

Then we use the html-metadata methods on the cheerio object.

Hope this helps!

midudev commented 8 years ago

Thanks @mvolz for such a detailed and kind answer. :)

I wish this content type problem were handled inside the package, but your solution is good enough for now. I'm using req-fast (https://www.npmjs.com/package/req-fast) to make the request, and it deals with any content type automatically. After that, I'm using the solution you provided. Thank you so much!