weihanglo / sfz

A simple static file serving command-line tool written in Rust.
https://crates.io/crates/sfz
Apache License 2.0
400 stars 30 forks source link

fix: guess charset naively #77

Closed weihanglo closed 3 years ago

weihanglo commented 3 years ago

Fixes #76.

This bug was introduced by #69. In order not to perform any additional I/O, we guess charset naively from MIME. Mainly

ZacJW commented 3 years ago

Just compiled this and this change does fix #76, so it's fine for my usecase. Can't help thinking that a more thorough check may later prove necessary.

As a quick test I had sfz serve a UTF-16 LE enocded .txt file and the content type header was mistakenly text/plain; charset=utf-8. I would argue that if sfz is not going to verify the file is of any particular charset, it should leave it out of the content-type altogether and let the client figure it out rather than lead them down the wrong path.

If reporting the charset proves necessary for some usecase, maybe sfz could check the encoding of the file (possibly using encoding_rs if introducing a dependency is acceptable) and cache the determined encoding to avoid checking every time.

Thanks for the quick fix, hope I've not made too much work for you.

ZacJW commented 3 years ago

I've done a bit more investigation so I'll share it here. As I understand it, your decision to default to charset=utf-8 was informed by #68 in which they have an issue viewing yaml with non ASCII characters in chrome when served by sfz 0.6.0. However as per yaml 1.2.2 spec (and since 1.0) all yaml character streams are unicode, defaulting to utf-8 encoding. This leads me to believe that the bug in #68 is acutally a bug in chrome, not sfz. I've reproduced that bug with sfz 0.6.0 and with a modified sfz 0.6.1 that never includes a charset in the content-type header. This is all viewed with chromium 94.0.4606.71.

weihanglo commented 3 years ago

Thanks for the review and research! Your concern is reasonable. And yes, this is a thing that client can handle. Unfortunately, the market share of Chromium is large which we cannot simply ignore this inconvenience.

My concern is that Encoding detection is hard and error-prone. It's time to introduce additional dependency to handle this. I plan to add a flag to toggle this auto-detection feature.

ZacJW commented 3 years ago

You're right about not being able to ignore chromium. Looks like they're still using Windows-1252 encoding for this by default :laughing:. Guess I'll go write a bug report for them. Thanks again.

weihanglo commented 3 years ago

I am going to merge this PR since it at least make the blind guess a bit better. Next, I'll try to integrate chardetng, which is used in Firefox. Its design doc is very well-written and worth a read 👍🏾

Thanks for your review again!