wingman-jr-addon / wingman_jr

This is the official repository (https://github.com/wingman-jr-addon/wingman_jr) for the Wingman Jr. Firefox addon, which filters NSFW images in the browser fully client-side: https://addons.mozilla.org/en-US/firefox/addon/wingman-jr-filter/ Optional DNS-blocking using Cloudflare's 1.1.1.1 for families! Also, check out the blog!
https://wingman-jr.blogspot.com/

False characters shown #199

Closed Dragodraki closed 7 months ago

Dragodraki commented 8 months ago

On the website https://buerohaus-ahner.bueroshops.de all € characters turn into ÿ characters when your addon is enabled. Can you make Wingman Jr. allow this font type and display the € characters on those websites again?

wingman-jr-addon commented 7 months ago

Thanks @Dragodraki, I'll start looking into the root cause for this specific site. I've hit so many cases already that I'm curious to see what the new one is. Here's a specific page that has a euro symbol on it: https://buerohaus-ahner.bueroshops.de/artikeldetails/standard/SCA226002/abfallbehaelter-metall-20-liter-wei-wandmontage-moeglich.html

Without the addon: [image]

With the addon: [image]

wingman-jr-addon commented 7 months ago

Well @Dragodraki, this one was interesting, and it has likely been a source of problems for any character in the byte range 0x80 to 0x9F. See the reference here: https://www.i18nqa.com/debug/bug-iso8859-1-vs-windows-1252.html

Here's what happened. I looked into the matter and found that the character encoding declared in the page's header was clearly iso-8859-1. Unlike many of the other bugs, the character set detection logic was working correctly, so this wasn't a new edge case there. However, I noticed that Firefox was listing the Text Encoding as Windows 1252.

It turns out that iso-8859-1 is special for legacy reasons: the iso-8859-1 label is actually treated as an alias for Windows 1252. Windows 1252 differs from iso-8859-1 in the byte range 0x80 to 0x9F, so this seemed like the problem.

But why was this failing? As it turns out, Firefox has a TextDecoder and a TextEncoder. The TextDecoder can be instantiated with any valid label, such as 'windows-1252', but the TextEncoder only supports UTF-8. At a macro level, the addon needs to decode text and then re-encode it in the target charset, so not being able to encode to a specified target charset is a problem.

I had a special routine that would encode iso-8859-1. I was assuming it would only ever be used on text that had already been decoded from iso-8859-1 input, in which case, with true iso-8859-1, all Unicode character codes would be in the range 0-255. As a safety measure, I clamped anything greater to 255. In reality, when windows-1252 masquerading as iso-8859-1 was being decoded, it would generate Unicode character codes greater than 255 for the special cases where windows-1252 and iso-8859-1 differ.

So, knowing that my actual input to the text encoding was windows-1252, I added a lookup table to convert those special characters back into Windows 1252 byte encodings, and now it looks like it's working. I'd like to poke at this further, but at least the page above now renders properly.
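To illustrate the failure and the fix, here is a minimal sketch, not the addon's actual code. The function names `encodeLatin1Clamped` and `encodeWindows1252` are hypothetical, and the fallback to `?` for unencodable characters is an assumption. The table follows the windows-1252 mapping for bytes 0x80 to 0x9F. It also shows why the euro sign turned into ÿ: € is U+20AC, clamping it to 255 produces byte 0xFF, and byte 0xFF is ÿ in both iso-8859-1 and windows-1252.

```javascript
// Code points that windows-1252 assigns to bytes 0x80..0x9F (index 0 is byte 0x80).
// The five bytes windows-1252 leaves unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D)
// are mapped to the C1 control code point of the same value, as browsers do.
const CP1252_HIGH = [
  0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
  0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
  0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
  0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
];

// Reverse lookup: Unicode code point -> windows-1252 byte.
const CODE_POINT_TO_BYTE = new Map(CP1252_HIGH.map((cp, i) => [cp, 0x80 + i]));

// Hypothetical stand-in for the old behavior described above:
// treat the text as true iso-8859-1 and clamp anything above 255.
function encodeLatin1Clamped(text) {
  const out = [];
  for (const ch of text) {
    out.push(Math.min(ch.codePointAt(0), 0xFF));
  }
  return Uint8Array.from(out);
}

// Hypothetical lookup-table encoder for windows-1252.
function encodeWindows1252(text) {
  const out = [];
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (cp < 0x80 || (cp >= 0xA0 && cp <= 0xFF)) {
      out.push(cp); // identical in iso-8859-1 and windows-1252
    } else if (CODE_POINT_TO_BYTE.has(cp)) {
      out.push(CODE_POINT_TO_BYTE.get(cp)); // special 0x80..0x9F range
    } else {
      out.push(0x3F); // assumption: substitute '?' for unencodable characters
    }
  }
  return Uint8Array.from(out);
}
```

With this sketch, `encodeWindows1252("5 €")` yields the bytes 0x35 0x20 0x80, whereas `encodeLatin1Clamped("5 €")` yields 0x35 0x20 0xFF, which the browser then displays as "5 ÿ".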

Dragodraki commented 7 months ago

Thanks for your very quick support, again! :) Since the problem is not solved yet, I suppose you'll fix it in your next version of Wingman, won't you?

wingman-jr-addon commented 7 months ago

Yes, that's the plan @Dragodraki. Right now I'm putting much of my effort into a next-generation detection model over at https://github.com/wingman-jr-addon/model/issues/7.

In the meantime if you're feeling adventurous you can give the branch a try. Setting it up is as easy as 1) cloning, 2) checking out the branch, 3) going to about:debugging -> This Firefox -> Load Temporary Addon and picking the manifest.json file.

Thanks for your continued bug reports - they will make the addon so much better for international users!

Dragodraki commented 7 months ago

So, no fix - it's okay, since it's about this single site only.

You mentioned you are working on a next-generation detection model. Now I am curious: would you mind telling me a bit about it? So many addons and programs in general are forks of others or get pushed aside by standard apps from big manufacturers that there is barely anything new out there - that's why I like to try new programs.

wingman-jr-addon commented 7 months ago

@Dragodraki Well, I'm posting some progress about the new model over at the other issue, https://github.com/wingman-jr-addon/model/issues/7, but in short I'm looking at some of the research from the last 2 years that combines the advances of vision transformers with convolutional networks, as well as some of the robustness pretraining approaches like CLIP and DINO.

wingman-jr-addon commented 7 months ago

(I'm going to leave this open until I have the PR merged)

wingman-jr-addon commented 7 months ago

@Dragodraki I added a test file, see #200 for visual differences in character translation.