Closed by abelabbesnabi 1 month ago
This is a character encoding issue. By default, cheerio treats the HTML as utf-8. We handle other encodings outside of the library.
It looks like the charset is declared in a meta tag in the HTML here: <meta charset="windows-1251">
We basically load the page as utf-8 and check the charset, then decode the raw body with the iconv-lite library and load it again.
Something like
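var cheerio = require('cheerio');
var iconv = require('iconv-lite');
// response.body needs to be the raw bytes (a Buffer), not an already-decoded string
var str = iconv.decode(response.body, 'windows-1251');
var $ = cheerio.load(str);
return parseAll($).then(function(metadata){
  console.log(metadata);
});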
After that, scraping the page should hopefully work. Unfortunately, this means you need to load the page into cheerio twice: once to find the charset, and then again to get the metadata.
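For what it's worth, here is a rough sketch of a variant that reads the declared charset straight from the raw bytes with a regex, so the first cheerio load can be skipped. This is not how the library does it. `sniffCharset` and `decodeHtml` are hypothetical helper names, and Node's built-in `TextDecoder` stands in for iconv-lite here only to keep the sketch dependency-free. Which encodings `TextDecoder` accepts depends on the ICU data in your Node build, so treat that part as an assumption.

```javascript
// Hypothetical helper: find the charset declared in a <meta> tag.
// Falls back to 'utf-8' when no declaration is found.
function sniffCharset(buffer) {
  // latin1 maps each byte to exactly one character, so the ASCII markup
  // around the charset declaration survives whatever the real encoding is.
  const head = buffer.subarray(0, 1024).toString('latin1');
  const match = head.match(/<meta[^>]*charset\s*=\s*["']?([\w-]+)/i);
  return match ? match[1].toLowerCase() : 'utf-8';
}

// Hypothetical helper: decode the raw response body using the sniffed charset.
// Assumes the Node build ships ICU data for that encoding; iconv.decode(buffer, charset)
// from iconv-lite could be swapped in here instead.
function decodeHtml(buffer) {
  const charset = sniffCharset(buffer);
  return new TextDecoder(charset).decode(buffer);
}
```

The decoded string can then be passed to cheerio.load once, provided the response body was fetched as a Buffer rather than as a string.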
Thank you Marielle. I'll give it a try.
When scraping a URL whose page is in a language other than English, such as Russian, the title and description in the metadata are returned as gibberish.
Here is a link example: https://pikabu.ru/story/privet_fsbshniki_mne_drug_byivshiy_sotrudnik_fsb_rasskazal_chto_vyi_tut_sidite__2821880