Bad encoding from some websites

mvolz commented 5 years ago

Not all websites are decoded correctly, so the results for these are poor. Examples:

http://nna-leb.gov.lb/ar/show-report/371/ nonsense returned

more nonsense (unicode) https://www.ynet.co.il/articles/0,7340,L-5037054,00.html

https://www.insee.fr/fr/statistiques/zones/2021173 wrong accents

czech accents https://zpravy.idnes.cz/george-soros-osobnost-roku-the-financial-times-frg-/zahranicni.aspx?c=A181219_100000_zahranicni_kha

Here are some snippets of code where we deal with this:

const contentType = require('content-type');
const iconv = require('iconv-lite');

In the request we set encoding to null so that we get the page body back as a Buffer instead of a string decoded with the default, utf-8:

encoding: null,
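
To illustrate why the raw Buffer matters, here is a minimal sketch using only Node's built-in Buffer. The byte values are made up for the example and not taken from any of the sites above. Once bytes in a legacy encoding have been decoded as utf-8, the original bytes can't be recovered, whereas an untouched Buffer can still be decoded again with another charset.

```javascript
// Hypothetical page body: "café" encoded as ISO-8859-1 (é is the single byte 0xE9).
const body = Buffer.from([0x63, 0x61, 0x66, 0xe9]);

// Decoding as utf-8 turns the invalid byte into U+FFFD, the replacement character.
const asUtf8 = body.toString('utf8');

// Re-encoding that string does not give the original bytes back.
const roundTripped = Buffer.from(asUtf8, 'utf8');

// Decoding the untouched Buffer with the right charset still works.
const asLatin1 = body.toString('latin1');

console.log(JSON.stringify(asUtf8), roundTripped.equals(body), asLatin1);
```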

And then we try to get the charset from the content type:


                    const defaultCT = 'utf-8'; // Default content-type
                    let contentType = contentTypeFromResponse(response);

                    // Load html into cheerio object; if necessary, determine
                    // content type from html loaded with default content-type, and
                    // then reload again if non-default content-type is obtained.
                    if (contentType) {
                        // Content Type detected in response
                        try {
                            str = iconv.decode(response.body, contentType);
                            chtml = cheerio.load(str);
                        } catch (e) {
                            logger.log('warn/scraper', e);
                        }
                    } else {
                        str = iconv.decode(response.body, defaultCT);
                        try {
                            chtml = cheerio.load(str);
                            contentType = contentTypeFromBody(chtml);
                            // If contentType is scraped from body and is NOT the default
                            // CT already loaded, re-decode and reload into cheerio.
                            if (contentType && contentType !== defaultCT) {
                                try {
                                    str = iconv.decode(response.body, contentType);
                                    chtml = cheerio.load(str);
                                } catch (e) {
                                    // On failure, defaults to loaded body with default CT.
                                    logger.log('warn/scraper', e);
                                }
                            }
                        } catch (e) {
                            logger.log('warn/scraper', e);
                        }
                    }

/**
 * Get charset from the Content-Type response header
 * @param  {Object} response response object with Buffer body
 * @return {string}          Charset string or null
 */
function contentTypeFromResponse(response) {

    // Try to get content-type from header
    try {
        const obj = contentType.parse(response);// Parsed content-type header
        if (obj.parameters && obj.parameters.charset) {
            return obj.parameters.charset;
        }
    } catch (e) { // Throws a TypeError if the Content-Type header is missing or invalid.
        return null;
    }

    // Header parsed fine but carries no charset parameter
    return null;
}

/**
 * Get content type from the metadata tags in a response
 * object with cheerio loaded body with default encoding
 * @param  {Object} chtml    Cheerio object
 * @return {string}          Content-type string or null
 */
function contentTypeFromBody(chtml) {
    // TODO: Stream and read buffer with regex
    // i.e. <meta charset="iso-8859-1" />
    const charset = chtml('meta[charset]').first().attr('charset');
    if (charset) { return charset; }

    // Case insensitive since content-type may appear as Content-Type or Content-type
    let contentTypeHeader = chtml('meta[http-equiv]').filter(function() {
        // eslint-disable-next-line no-invalid-this
        return (/content-type/i).test(chtml(this).attr('http-equiv'));
    });
    // A cheerio selection is always truthy, so check its length instead
    if (contentTypeHeader.length) {
        // <meta http-equiv="Content-type" content="text/html; charset=iso-8859-1">
        contentTypeHeader = contentTypeHeader.first().attr('content');
    } else { return null; }

    if (contentTypeHeader) {
        try {
            const obj = contentType.parse(contentTypeHeader);// Parsed content-type header
            if (obj.parameters && obj.parameters.charset) {
                return obj.parameters.charset;
            }
        } catch (e) { // Throws a TypeError if the Content-Type header is missing or invalid.
            return null;
        }
    }

    return null;
}
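
The TODO in contentTypeFromBody could be approached roughly like this. This is only a sketch, not something we actually run: sniffCharsetFromBuffer and the 1024-byte limit are made up for the example, and the regex is much cruder than a real HTML prescan (it would not cope with UTF-16 pages, for instance). It reads the first bytes of the Buffer as latin1 and looks for a charset declaration without loading cheerio first.

```javascript
// Hypothetical helper: look for a charset declaration in the first bytes of the raw body.
function sniffCharsetFromBuffer(body, limit = 1024) {
    // latin1 maps each byte to exactly one character, so ASCII markup survives
    // regardless of the page's real (ASCII-compatible) encoding.
    const head = body.subarray(0, limit).toString('latin1');

    // Matches both <meta charset="x"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=x">
    const match = /<meta[^>]*?charset\s*=\s*["']?\s*([\w-]+)/i.exec(head);
    return match ? match[1].toLowerCase() : null;
}

const fromCharsetAttr = sniffCharsetFromBuffer(
    Buffer.from('<html><head><meta charset="ISO-8859-1" /></head></html>', 'latin1'));
const fromHttpEquiv = sniffCharsetFromBuffer(
    Buffer.from('<meta http-equiv="Content-type" content="text/html; charset=windows-1250">', 'latin1'));
const notDeclared = sniffCharsetFromBuffer(Buffer.from('<html><head></head></html>', 'latin1'));

console.log(fromCharsetAttr, fromHttpEquiv, notDeclared);
```

With something like this, the body would only need to be decoded and loaded into cheerio once, using the sniffed charset when there is one and the default otherwise.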

mrtcode commented 5 years ago

Thanks for the code snippets! But I think JSDOM already has all the logic needed to detect the encoding, because it really tries to mimic the browser. It uses an advanced encoding sniffer that gets the encoding from meta tags or the BOM. The problem is that in this specific case, when we pass a buffer, there is no way to pass along the encoding from the header, which I think is incorrect JSDOM behavior. I opened an issue, and we will see how it goes. If they fix this, we won't need to change anything in the translation-server code, because we are already passing everything JSDOM needs (the content-type header). Otherwise the only option left would be to move all the encoding detection logic into translation-server and pass JSDOM the already decoded content, but that would defeat the purpose of using JSDOM.
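
For reference, that last option (decoding on our side and handing over a string) might look roughly like this. It's a sketch under assumptions: decodeBody is a hypothetical name, the header parsing is a crude regex rather than a proper Content-Type parser, and it relies on Node's built-in TextDecoder, whose set of supported charsets depends on how Node was built with ICU.

```javascript
// Hypothetical helper: decode a Buffer using the charset from a Content-Type header.
function decodeBody(body, contentTypeHeader, fallback = 'utf-8') {
    // Crude extraction of the charset parameter, e.g. "text/html; charset=ISO-8859-1"
    const match = /charset\s*=\s*"?([^\s;"]+)/i.exec(contentTypeHeader || '');
    const charset = match ? match[1] : fallback;
    try {
        return new TextDecoder(charset).decode(body);
    } catch (e) {
        // TextDecoder throws a RangeError for labels it does not support
        return new TextDecoder(fallback).decode(body);
    }
}

// "<p>é</p>" with é as a single ISO-8859-1 byte (0xE9)
const html = decodeBody(
    Buffer.from([0x3c, 0x70, 0x3e, 0xe9, 0x3c, 0x2f, 0x70, 0x3e]),
    'text/html; charset=ISO-8859-1');

console.log(html);
```

The resulting string could then be passed on in place of the Buffer, at the cost of bypassing the parser's own encoding sniffing.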

mrtcode commented 5 years ago

It looks like it's going to take some time until JSDOM fixes this problem on their side, so I made a temporary fix: #81.

mvolz commented 5 years ago

We've deployed this and it's working great, thank you so much! Should I close the ticket or are we keeping it open until the jsdom thing goes through?

mrtcode commented 5 years ago

I think we can close this. I'll be watching the JSDOM issue.

zotero / translation-server

Bad encoding from some websites #77