wingman-jr-addon / wingman_jr

This is the official repository (https://github.com/wingman-jr-addon/wingman_jr) for the Wingman Jr. Firefox addon, which filters NSFW images fully client-side in the browser: https://addons.mozilla.org/en-US/firefox/addon/wingman-jr-filter/ It also offers optional DNS-based blocking for families via Cloudflare's 1.1.1.1. Also, check out the blog: https://wingman-jr.blogspot.com/

Cyrillic not working correctly #201

Closed: wingman-jr-addon closed this issue 2 months ago

wingman-jr-addon commented 7 months ago

User Ayaya reported in August 2023:

Wingman Jr. is replacing text characters on rutracker.org: with the extension enabled, all characters on pages of this domain change from legible Cyrillic characters to the same Cyrillic character я, such that Горгулья turns into яяяяяяя.

And indeed I see the same thing even with the recent 3.3.6. (Screenshots comparing the broken and the correct rendering were attached.)

If I were to guess, it's that the charset is neither UTF-8 nor ISO-8859-1/Windows-1252. Could get a little spicy.

arthurmelton commented 2 months ago

I was doing some testing to try to solve the issue, but I got deep in the weeds and lost trying to follow what the program does. What I did figure out, though, was that the text is actually reported as being Windows-1252.

The code follows a pretty normal route but ends up here: https://github.com/wingman-jr-addon/wingman_jr/blob/3d3f3084887a694d698db4a9432f39e86b2b9793/background.js#L878-L900, where all of the Russian characters get turned into 255.

I ended up changing that code to

function TextEncoderISO_8859_1()
{
    this.encode = function(str) {
        var result = "";
        const textEncoder = new TextEncoder();
        for(let i=0; i<str.length; i++) {
            let charCode = str.charCodeAt(i);

            //So this is subtle. ISO-8859-1 is actually interpreted as Windows-1252 on decoding by browsers.
            //When the TextDecoder got instantiated with 'iso-8859-1', it actually used Windows-1252.
            //This means that for the characters in the (decimal) range 128-159, the Unicode point is not
            //in the normal 0-255 range and we need to detect these characters specially to back-convert into
            //Windows-1252 raw encoding masquerading as ISO-8859-1.
            charCode = BK_Windows1252_special_chars[charCode] || charCode;

            result += String.fromCharCode(charCode);
        }

        return textEncoder.encode(result);
    }
}

and this looked to do something. All of the characters are off when rendering on the page, but when doing console logs the text was correct. I could not really figure out how to fix it past this point, because every place I would look and tell it to log what it sees, it always looked correct, even in processor.js. I hope this helps at least a little bit though.

wingman-jr-addon commented 2 months ago

Thanks @arthurmelton , I'll dig in here a bit more as well as comment on how I go about debugging this sort of stuff.

So, first things first for debugging, I need to do two things: 1) determine whether the code I'm looking at is in the main background script context or in the worker context and 2) enable the debug logs.

To determine the context: all code in background.js, and anything called directly from it, runs in the main background script context. In contrast, processor.js is essentially a worker client that connects to the background script. For the main background script, you'll use the usual "Inspect" and Console from the extension. However, if the code is in processor.js, you have to open the hidden page and then do the console work etc. from there.

To enable debug logs, execute the following in the main background Console: bksSetAllLogging(true). You will start seeing many, many more logs.

Now, back to this issue. Once I determined it was in the main background context and enabled logging, I visited rutracker.org and filtered the logs on "CHARSET: Det". The resulting log output leads me to think that the actual "source" charset is Windows-1251, which is Cyrillic. So the charset is indeed detected correctly.

So architecturally, why is this a problem? Well, consider what's happening with the page being served and filtered. The filter takes in a stream of bytes and has to return a stream of bytes. But, in order for us to do anything useful during the scanning step, the bytes need to be converted to a string. So the full workflow is roughly: Encoded page bytes -> Decoded page string -> Do image scanning stuff -> Mutated decoded page string -> Encoded page bytes
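In code, that round trip looks roughly like the sketch below. filterPage and the scanAndMutate callback are hypothetical names for illustration only; the real addon works on a stream of bytes rather than a single buffer.

function filterPage(pageBytes, charsetLabel, scanAndMutate) {
    // Encoded page bytes -> decoded page string
    const html = new TextDecoder(charsetLabel).decode(pageBytes);

    // Do image scanning stuff -> mutated decoded page string
    const mutated = scanAndMutate(html);

    // Mutated decoded page string -> encoded page bytes.
    // This last step is where the trouble starts: TextEncoder is UTF-8 only.
    return new TextEncoder().encode(mutated);
}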

But Mozilla has an asymmetry in the API they provide: TextDecoder supports all the standard character encodings, but TextEncoder does not; it only emits UTF-8. So what happens here (and why your string shows up correctly, as does the warning debug statement about mistranslated characters) is that the page bytes get properly decoded from Windows-1251 into the internal string format, which views characters as code points. So far so good. But the page render fails because TextEncoder does not support Windows-1251, and this is the fallback code path.
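A minimal console illustration of that asymmetry (the byte values are standard Windows-1251 and UTF-8, easy to verify in any browser console):

// TextDecoder accepts the windows-1251 label and decodes correctly:
const pageBytes = new Uint8Array([0xC3, 0xEE, 0xF0]); // "Гор" in Windows-1251
const text = new TextDecoder('windows-1251').decode(pageBytes); // "Гор"

// TextEncoder takes no encoding label at all; it always produces UTF-8:
const reencoded = new TextEncoder().encode(text);
// reencoded is [0xD0, 0x93, 0xD0, 0xBE, 0xD1, 0x80]: six UTF-8 bytes.
// A browser told that the page is windows-1251 then interprets those bytes
// in the wrong charset, producing mojibake.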

TL;DR - the lack of additional TextEncoder charset support creates an asymmetric situation where correctly decoded text can't be re-encoded after filtering.

wingman-jr-addon commented 2 months ago

I'm looking into creating an extended TextEncoder to better support this use case.
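For the single-byte encodings, one possible shape for such an encoder (a sketch only, under my own assumptions; makeSingleByteEncoder is an illustrative name, not code from this repo or its branches) is to invert the browser's own TextDecoder by decoding all 256 byte values and building a reverse lookup:

// Sketch: build an encoder for a single-byte charset by inverting TextDecoder.
function makeSingleByteEncoder(label) {
    // Decode every possible byte value once to learn the byte -> code point map.
    const allBytes = new Uint8Array(256);
    for (let i = 0; i < 256; i++) {
        allBytes[i] = i;
    }
    const decoded = new TextDecoder(label).decode(allBytes);

    // Invert it: code point -> original byte. Single-byte charsets map each
    // byte to a single BMP code point, so decoded has exactly 256 code units.
    // (Bytes left undefined by the charset decode to U+FFFD; a production
    // version would want to handle those explicitly.)
    const reverse = new Map();
    for (let i = 0; i < 256; i++) {
        reverse.set(decoded.charCodeAt(i), i);
    }

    return {
        encode(str) {
            const out = new Uint8Array(str.length);
            for (let i = 0; i < str.length; i++) {
                // Fall back to '?' (0x3F) for characters the charset lacks.
                const cp = str.charCodeAt(i);
                out[i] = reverse.has(cp) ? reverse.get(cp) : 0x3F;
            }
            return out;
        }
    };
}

With this, makeSingleByteEncoder('windows-1251').encode('Гор') yields [0xC3, 0xEE, 0xF0], the original page bytes.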

arthurmelton commented 2 months ago

I did not realize that all of this was because TextEncoder could not encode the text back. Thank you for explaining how you went about understanding the problem, though!

wingman-jr-addon commented 2 months ago

@arthurmelton I'm working on a solution here if you're interested. I haven't removed all the old code or anything, but I did a little bit of work to try to recreate a number of the encodings by leveraging a .NET program to map them out, after matching up to the MDN list of supported decodings: https://github.com/wingman-jr-addon/wingman_jr/tree/extended-text-encoding

arthurmelton commented 2 months ago

I will take a look when I have some time, but that might take until next weekend. I am currently working on a PR for a different project, and that might take all my free time during the week.

wingman-jr-addon commented 2 months ago

All right, I think the approach I've taken here will be a significant improvement. It still does not completely fix all encodings, and the approach still takes a somewhat over-simplistic view of Unicode by not considering multi-codepoint aspects (e.g., characters built from a base letter plus combining marks), but it should generally help with some of the more basic iso-8859-x family of encodings.