Some Downloads not work anymore

wingman-jr-addon / wingman_jr

This is the official repository (https://github.com/wingman-jr-addon/wingman_jr) for the Wingman Jr. Firefox addon, which filters NSFW images in the browser fully client-side: https://addons.mozilla.org/en-US/firefox/addon/wingman-jr-filter/ Optional DNS-blocking using Cloudflare's 1.1.1.1 for families! Also, check out the blog!

https://wingman-jr.blogspot.com/


Some Downloads not work anymore #202

Open Dragodraki opened 6 months ago

Dragodraki commented 6 months ago

On rare pages, downloads no longer work with Wingman Jr. enabled (my current version is 3.3.6, on Firefox 115.8.0 ESR). E.g. here: https://www.ibiblio.org/pub/micro/pc-stuff/freedos/files/distributions/1.0/

When clicking on a link, the website shows cryptic content instead of starting a download (saving the page with Ctrl+S would NOT give the actual download). This needs to be fixed urgently. Can you do this?

SufianBabri commented 6 months ago

I think I'm facing a similar issue with Wingman Jr. extension (v 3.3.6) on Firefox 123.0.1 (64-bit) on macOS.

If I disable this extension, I can download the image from https://dl3.pushbulletusercontent.com/yadayada.jpg (yadayada is not the actual filename obviously) fine.

wingman-jr-addon commented 5 months ago

Thanks for the report @Dragodraki @SufianBabri . @Dragodraki I think I can reproduce what you're talking about for the ISO's there - for example https://www.ibiblio.org/pub/micro/pc-stuff/freedos/files/distributions/1.0/fdboot.img @SufianBabri I'm getting the following XML even without Wingman Jr.: `<Error> <Code>AccessDenied</Code> <Message>Access denied.</Message> </Error>` I'm assuming that's not what you'd expect?

At any rate, I'll take a peek and see if I can figure out what's going on.

wingman-jr-addon commented 5 months ago

Working on bisecting:

3.3.6 - As Dragodraki described, likely just not triggering download
3.3.0 - Same as 3.3.6 but not correctly replacing some of the characters
3.0.0 - Same as 3.3.0
2.0.1 - Same as 3.0.0

So, at least this doesn't seem to be a recent regression. My guess is it's something to do with the content type handling for e.g. octet-stream and friends, but we'll find out.

wingman-jr-addon commented 5 months ago

Ok, so the core issue seems to be that the .IMG isn't serving up a Content-Type at all. Here's what Wingman does:

I found a similar site, with similar type of content: http://ftp.vcu.edu/pub/gnu_linux/archlinux/iso/2024.04.01/ Here's what it does instead:

The application/octet-stream will trigger it to download.

So, big picture: Wingman Jr. has to do things around Content-Type so that it can properly translate/pass through characters. To do this, it has to force a specific content type. However, in this case, the original Content-Type is not specified, so Firefox on its own presumably runs its smarter content type detection and determines it should download. With Wingman Jr. forcing a type, that detection never gets the chance to run.
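To make the "no Content-Type at all" case concrete, here is a minimal sketch of detecting it. It assumes headers arrive in the shape that `webRequest.onHeadersReceived` provides (an array of `{ name, value }` objects); the helper name `getContentType` and the sample header values are hypothetical and not taken from the Wingman Jr. source.

```javascript
// Hypothetical helper, not from the Wingman Jr. codebase.
// `responseHeaders` is assumed to be an array of { name, value } objects,
// as supplied in details.responseHeaders by webRequest.onHeadersReceived.
function getContentType(responseHeaders) {
  for (const header of responseHeaders || []) {
    // HTTP header names are case-insensitive, so normalize before comparing.
    const value = (header.value || "").trim();
    if (header.name.toLowerCase() === "content-type" && value) {
      return value;
    }
  }
  // The server did not send a usable Content-Type.
  return null;
}

// Made-up sample data for illustration only.
const withType = [{ name: "Content-Type", value: "application/octet-stream" }];
const withoutType = [{ name: "Content-Length", value: "1474560" }];

console.log(getContentType(withType));    // application/octet-stream
console.log(getContentType(withoutType)); // null
```

A `null` result is the situation described above: any type the addon forces at this point replaces whatever the browser would otherwise have inferred.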

Now that might be the problem, but I'm not sure what the solution should be yet.

wingman-jr-addon commented 5 months ago

As a further resource for later, the "what to do when type isn't specified" can get complex, see for example https://mimesniff.spec.whatwg.org/#identifying-a-resource-with-an-unknown-mime-type

Dragodraki commented 5 months ago

> Thanks for the report @Dragodraki @SufianBabri . @Dragodraki I think I can reproduce what you're talking about for the ISO's there - for example https://www.ibiblio.org/pub/micro/pc-stuff/freedos/files/distributions/1.0/fdboot.img @SufianBabri I'm getting the following XML even without Wingman Jr.: <Error> <Code>AccessDenied</Code> <Message>Access denied.</Message> </Error> I'm assuming that's not what you'd expect?
>
> At any rate, I'll take a peek and see if I can figure out what's going on.

If the website does not work for you, maybe it is due to a regional restriction (geo-lock); with a VPN or Tor you should be able to bypass that. Sorry, I cannot give another example right now.

With Wingman Jr. enabled, the website tries to display the download content as plain text instead of offering a download. Here is an excerpt of the weird characters:

"ë<LINUX4.1à@ð )ãúD FAT12 úü1ÀŽØ½|¸àŽÀ‰î‰ï¹ó¥ê^|à`ŽØŽÐf û€~$ÿuˆV$ÇFÀÇFÂèéFreeDOS‹v‹~vƒ×‰vÒ‰~ÔŠF˜÷fÆ×‰vÖ‰~Ø‹^±Óë‹F1Ò÷ó‰FÐÆƒ×‰vÚ‰~Ü‹FÖ‹VØ‹~ÐÄ^Zè›r/Ä~Z¹¾ñ}Wó¦_&‹EtƒÇ &€=uçrYPÄ^Z‹~‹FÒ‹VÔèk"

Normally it should look like this (and it does when the addon is disabled):

wingman-jr-addon commented 5 months ago

Yep, thanks @Dragodraki that is what I see too for your example. As noted above, the root of the issue is that the website doesn't send a Content-Type. Wingman Jr adds one, but then that means that Firefox can't use its own logic to properly infer a Content-Type. Usually downloads have a type of "application/octet-stream", but in this case Wingman Jr is incorrectly inferring a text type. I'm still trying to think of the best way to actually fix this.

wingman-jr-addon commented 4 months ago

So mulling on this a bit further, apart from thinking how to modify the code that currently exists, there is still a core conundrum. For documents that don't supply a Content-Type, the default behaviour of browsers is to treat some as documents and some as downloads (as noted previously). Ones that are treated as documents should definitely be scanned, but ones that are not should probably go through the normal download process. However, in order to know which are which, Wingman cannot just defer the guessing process to Firefox. So, the only viable path through is to actually implement Content-Type sniffing. However, I do wonder how thorny this will get: while the stated standard algorithm is complex, I'm concerned that browsers may add their own extra detection logic on top and that forever I'll be reverse engineering that logic; still, it's probably better than the current state.

With respect to implementation, then, there's definitely work to do. Right now only the charset is sniffed, and that is in part dependent on Content-Type detection happening as a prerequisite. Now the case where Content-Type itself is sniffed has to be handled, and early return on the logic based on Content-Type appearing in the header won't suffice, so it could get yucky. However, at least it's clear from the above that Content-Type sniffing must be reimplemented for the rest to work correctly.

Both this and #201 are similar in that the API to do the scanning doesn't really expose enough of what the browser is doing and both require essentially re-implementing a core and complex part of what the browser itself does.

arthurmelton commented 2 months ago

I don't know if this would be the best way of implementing this, but you can have a look at the file command source code. It has magic definitions for pretty much every type of thing imaginable. It uses a "formatting" language to read some specific configuration files to use for checking the type of input. They have support for getting the name, mime, and ext of the data. I doubt you would want to try and write your own interpreter, but with the name being so generic I could not find any libraries to use this in JavaScript.
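As a rough illustration of the table-driven idea behind magic definitions, here is a minimal sketch that matches the first bytes of a resource against a handful of well-known signatures. It is not the `file` command's format and not an existing library; the names `SIGNATURES` and `matchSignature` are hypothetical, and the table is deliberately tiny.

```javascript
// Illustrative signature table. The byte patterns are the well-known
// magic numbers for these formats; the table itself is a hypothetical
// stand-in for a real magic database.
const SIGNATURES = [
  { mime: "image/png", bytes: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a] },
  { mime: "image/jpeg", bytes: [0xff, 0xd8, 0xff] },
  { mime: "image/gif", bytes: [0x47, 0x49, 0x46, 0x38, 0x39, 0x61] }, // "GIF89a"
  { mime: "image/gif", bytes: [0x47, 0x49, 0x46, 0x38, 0x37, 0x61] }, // "GIF87a"
  { mime: "application/pdf", bytes: [0x25, 0x50, 0x44, 0x46, 0x2d] }, // "%PDF-"
];

// Returns the MIME type of the first signature that prefixes `data`
// (a Uint8Array), or null when nothing in the table matches.
function matchSignature(data) {
  for (const { mime, bytes } of SIGNATURES) {
    if (data.length >= bytes.length && bytes.every((b, i) => data[i] === b)) {
      return mime;
    }
  }
  return null;
}

const pngStart = new Uint8Array([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a, 0x00]);
console.log(matchSignature(pngStart)); // image/png
```

A boot image like the one in this issue would match none of these entries, so a table like this only covers the "known document type" half of the problem; something else still has to decide between text and download for everything unmatched.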

Dragodraki commented 2 months ago

First of all, thanks to everyone for helping with comments and details related to this issue.

@wingman-jr-addon The addon doesn't have to be perfect. Like so many things in life, perfection should be the aim but not the reality. Maybe you can make Wingman Jr. a little bit smarter at interpreting the content type, but only as long as the results are worth it. If you find a solution for the more common content types - like in my example - I will be fully satisfied (you might close the issue then).

wingman-jr-addon commented 2 months ago

Thanks @Dragodraki @arthurmelton . I did finally fix #201 , and I think this has some similarities to it, with a key difference being that I can't "sniff" the MIME type after passing on the start of the request, I have to "sniff" before - but preferably only in the case when no MIME type is specified. So, I can probably implement just 7.1 from here: https://mimesniff.spec.whatwg.org/#mime-type-sniffing-algorithm
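A minimal sketch of what "sniff before, but only when no MIME type is specified" could look like, assuming the first bytes of the response are available as a `Uint8Array`. It covers only the final fallback of the unknown-type algorithm in the MIME Sniffing standard (text/plain versus application/octet-stream, decided by the presence of binary data bytes) and skips the pattern tables for HTML, images, and so on that come before it. The function names are hypothetical and the sample bytes are made up, apart from the leading `ë<LINUX` taken from the excerpt quoted earlier in this thread.

```javascript
// Binary data bytes as defined by the MIME Sniffing standard:
// 0x00-0x08, 0x0B, 0x0E-0x1A, 0x1C-0x1F.
function isBinaryDataByte(b) {
  return (
    b <= 0x08 ||
    b === 0x0b ||
    (b >= 0x0e && b <= 0x1a) ||
    (b >= 0x1c && b <= 0x1f)
  );
}

// Hypothetical entry point: trust the server's type when there is one,
// and sniff only when it is missing.
function chooseType(headerType, firstBytes) {
  if (headerType) {
    return headerType;
  }
  // The standard examines at most the first 1445 bytes of the resource.
  const resourceHeader = firstBytes.subarray(0, 1445);
  return resourceHeader.some(isBinaryDataByte)
    ? "application/octet-stream" // should go through the normal download path
    : "text/plain";              // can be treated as a document
}

// Made-up sample: starts like the excerpt above; the trailing zero byte
// is an assumption about what a boot image would contain.
const bootImageStart = new Uint8Array([0xeb, 0x3c, 0x90, 0x4c, 0x49, 0x4e, 0x55, 0x58, 0x00]);
const plainText = new TextEncoder().encode("README for FreeDOS\r\n");

console.log(chooseType(null, bootImageStart)); // application/octet-stream
console.log(chooseType(null, plainText));      // text/plain
```

Keeping the sniffing behind the `headerType` check means responses that already declare a type are handled exactly as before.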