whitequark / rack-utf8_sanitizer

Rack::UTF8Sanitizer is a Rack middleware which cleans up invalid UTF-8 characters in the request URI and headers.
MIT License

Sanitisation seems to replace valid UTF-8 when a single invalid character is present #88

Open adamransom opened 1 hour ago

adamransom commented 1 hour ago

I might be misunderstanding the purpose of the gem, but the README states:

"Re-encode it as UTF-8, replacing invalid and undefined characters as U+FFFD."

However, it doesn't replace just the invalid characters; it replaces every multi-byte UTF-8 character in the string. This is even more of a problem when sanitising null bytes, as a single null byte will wipe out all UTF-8 characters.

Expected: "Hello \xE0 World 😁" => "Hello � World 😁"

Actual: "Hello \xE0 World 😁" => "Hello � World ����"
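For comparison, the two outputs above can be reproduced with plain Ruby string methods, without the gem. This is only a sketch: `String#scrub` gives the expected result, while reinterpreting the string as binary before transcoding gives output of the same shape as the actual result. The second approach is an assumption used for illustration, not a statement about how the middleware is implemented.

```ruby
# Input from the report: a lone 0xE0 byte (invalid on its own)
# followed later by a valid four-byte emoji (U+1F601).
input = "Hello \xE0 World \u{1F601}"

# Expected behaviour: only the invalid byte becomes U+FFFD,
# and the valid emoji is left untouched.
expected = input.scrub("\uFFFD")

# Assumed mechanism (illustration only): treat the string as binary
# and transcode it to UTF-8. Every byte >= 0x80 is then undefined in
# the source encoding, so each byte of the emoji is replaced as well.
observed = input.b.encode(
  Encoding::UTF_8,
  invalid: :replace,
  undef: :replace,
  replace: "\uFFFD"
)

puts expected
puts observed
```

The difference is that `scrub` validates the string as UTF-8 and replaces only the byte sequences that fail validation, whereas the binary round trip discards the information that some high bytes formed valid characters.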

whitequark commented 1 hour ago

This is a bug.