Closed mvolz closed 5 years ago
Thanks for the code snippets! But I think JSDOM has all the necessary logic to detect the encoding, because it really tries to mimic the browser. It already uses an advanced encoding sniffer that gets the encoding from meta tags or a BOM. The problem is that in this specific case, when we pass a buffer, there is no way to pass the encoding from the header, which I think is incorrect JSDOM behavior. I opened an issue, and we will see how it goes. If they fix this, we won't even need to change anything in the translation-server code, because we are already passing everything JSDOM needs (the content-type header). Otherwise, the only option left would be to move all the encoding detection logic to translation-server and pass JSDOM the already decoded content. But that would defeat the purpose of JSDOM.
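For reference, the fallback of decoding in translation-server before handing the content to JSDOM could look roughly like the sketch below. It is only an illustration: `sniffBom` and `decodeBody` are hypothetical names, not existing translation-server code, the BOM-before-header precedence is an assumption modeled on how browsers sniff encodings, and meta tag sniffing is left out entirely.

```javascript
// Hypothetical helpers, not translation-server's actual code.

// Return the encoding implied by a byte order mark, or null if there is none.
function sniffBom(buffer) {
  if (buffer.length >= 3 && buffer[0] === 0xef && buffer[1] === 0xbb && buffer[2] === 0xbf) {
    return 'utf-8';
  }
  if (buffer.length >= 2 && buffer[0] === 0xff && buffer[1] === 0xfe) {
    return 'utf-16le';
  }
  if (buffer.length >= 2 && buffer[0] === 0xfe && buffer[1] === 0xff) {
    return 'utf-16be';
  }
  return null;
}

// Decode a raw response body. `headerCharset` is the charset taken from the
// Content-Type header, or null/undefined if the header did not specify one.
// Assumed precedence: BOM first, then the header, then UTF-8 as a default.
function decodeBody(buffer, headerCharset) {
  const label = sniffBom(buffer) || headerCharset || 'utf-8';
  try {
    // TextDecoder is a Node.js global; it strips a leading BOM by default.
    return new TextDecoder(label).decode(buffer);
  } catch (err) {
    // Unknown or unsupported label: fall back to UTF-8 rather than failing.
    return new TextDecoder('utf-8').decode(buffer);
  }
}
```

The decoded string could then be passed to JSDOM instead of the buffer. Which legacy encodings `TextDecoder` supports depends on how Node.js was built (ICU data), so this would need checking against the deployed runtime.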
It looks like it's going to take some time until JSDOM fixes this problem on their side. So I made a temporary fix, #81.
We've deployed this and it's working great, thank you so much! Should I close the ticket or are we keeping it open until the jsdom thing goes through?
Not all websites are decoded correctly, so these don't give great results. Examples:
http://nna-leb.gov.lb/ar/show-report/371/ (nonsense returned)
https://www.ynet.co.il/articles/0,7340,L-5037054,00.html (more nonsense, unicode)
https://www.insee.fr/fr/statistiques/zones/2021173 (wrong accents)
https://zpravy.idnes.cz/george-soros-osobnost-roku-the-financial-times-frg-/zahranicni.aspx?c=A181219_100000_zahranicni_kha (wrong Czech accents)
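To illustrate what "wrong accents" and "nonsense" typically mean here, this small sketch decodes bytes with an encoding other than the one they were written in. `misdecode` is a made-up helper for demonstration only, and the encodings are examples, not necessarily the ones these sites use.

```javascript
// Hypothetical demo helper: encode `text` with one encoding, then decode the
// resulting bytes as if they were in another encoding.
function misdecode(text, actualEncoding, assumedEncoding) {
  return Buffer.from(text, actualEncoding).toString(assumedEncoding);
}

// UTF-8 bytes read as Latin-1: each accented letter turns into two characters.
console.log(misdecode('é', 'utf8', 'latin1')); // Ã©

// Latin-1 bytes read as UTF-8: the byte is invalid UTF-8, so it becomes U+FFFD.
console.log(misdecode('é', 'latin1', 'utf8') === '\ufffd'); // true
```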
Here are some snippets of code where we deal with this:
In the Request, we set encoding to null so we get the page back as a Buffer instead of the default, which is a utf-8 string:
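A minimal sketch of that step, assuming the `request` npm package (whose `encoding: null` option returns the body as a Buffer). The names `rawRequestOptions` and `fetchRaw` are made up for this example and are not the poster's actual code.

```javascript
// Hypothetical helper: build options so that the body comes back undecoded.
function rawRequestOptions(url) {
  return {
    url: url,
    // With `encoding: null`, `request` hands back a Buffer instead of
    // decoding the body as a utf-8 string.
    encoding: null
  };
}

// Hypothetical wrapper; assumes the `request` package is installed.
function fetchRaw(url, callback) {
  const request = require('request'); // loaded lazily, only when called
  request(rawRequestOptions(url), function (err, response, body) {
    if (err) {
      return callback(err);
    }
    // `body` is a Buffer here, so the encoding can be detected before decoding.
    callback(null, response, body);
  });
}
```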
And then try to get the content type:
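Roughly, reading the content type could be done as in the sketch below. `parseContentType` is a hypothetical helper written for this example, and it assumes a Node.js-style `headers` object, where header names are already lower-cased.

```javascript
// Hypothetical helper: split a Content-Type header into MIME type and charset.
function parseContentType(headers) {
  const raw = (headers && headers['content-type']) || '';
  const [type, ...params] = raw.split(';').map((part) => part.trim());
  let charset = null;
  for (const param of params) {
    const eq = param.indexOf('=');
    if (eq === -1) {
      continue; // not a name=value parameter
    }
    const name = param.slice(0, eq).trim().toLowerCase();
    if (name === 'charset') {
      // Strip optional surrounding quotes and normalize the case.
      charset = param.slice(eq + 1).trim().replace(/^"|"$/g, '').toLowerCase();
    }
  }
  return { mimeType: type.toLowerCase() || null, charset: charset };
}

// Example with a made-up header value:
console.log(parseContentType({ 'content-type': 'text/html; charset=ISO-8859-2' }));
```

If the header carries no charset, `charset` stays `null`, and the encoding has to come from somewhere else (a BOM, a meta tag, or a default).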