whitequark / rack-utf8_sanitizer

Rack::UTF8Sanitizer is a Rack middleware which cleans up invalid UTF8 characters in request URI and headers.
MIT License

sanitize_null_bytes should add the unicode replacement character #81

Open collimarco opened 9 months ago

collimarco commented 9 months ago

Thanks for this useful gem. Currently, if you use the default :replace strategy, invalid characters are replaced with �, but the null byte is replaced with nothing.

This behavior seems unexpected and inconsistent.

Expected: "Hello \x00 world" => "Hello � world"
Actual: "Hello \x00 world" => "Hello world"
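
To make the difference concrete, here is a minimal plain-Ruby sketch (no gem required) of the two steps involved: re-encoding with replacement, then handling null bytes. The `sanitize` helper and its `null_replacement` parameter are hypothetical names used only for illustration, and the null-byte pattern is assumed to match "\x00"; this is not the gem's actual code.

```ruby
# Hypothetical helper for illustration; not part of rack-utf8_sanitizer.
def sanitize(raw, null_replacement)
  null_byte = /\x00/ # assumed equivalent of the gem's null-byte pattern
  text = raw.dup.force_encoding(Encoding::ASCII_8BIT)
  # Bytes 0x80 and above have no mapping from ASCII-8BIT and become U+FFFD;
  # "\x00" maps to U+0000 and survives this step.
  text = text.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
  text.gsub(null_byte, null_replacement)
end

invalid_byte  = sanitize("Hello \xFF world", "")        # invalid byte is replaced
null_removed  = sanitize("Hello \x00 world", "")        # null byte is dropped
null_replaced = sanitize("Hello \x00 world", "\uFFFD")  # null byte is replaced
```

With an empty replacement the null byte simply disappears (the surrounding spaces remain), while the invalid byte is turned into U+FFFD, which is the inconsistency described above.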

collimarco commented 9 months ago

The fix would be straightforward. We just need to change this line:

https://github.com/whitequark/rack-utf8_sanitizer/blob/7dcc1e06786e6b8adfa53a99e407d975b7e39434/lib/rack/utf8_sanitizer.rb#L43

Even the test looks strange (it suggests a replacement, but it actually removes null bytes):

https://github.com/whitequark/rack-utf8_sanitizer/blob/7dcc1e06786e6b8adfa53a99e407d975b7e39434/test/test_utf8_sanitizer.rb#L352

If this choice is by design (is it good? is it bad?), it should be clarified in the documentation in any case, because this is definitely not what you expect from reading the docs.

geoffharcourt commented 7 months ago

Hi @collimarco, this is achievable with a custom strategy without too much code:

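# config/application.rb

# Custom strategy: same as the default :replace strategy, except that
# null bytes become U+FFFD instead of being removed.
replace_null_byte = lambda do |input, sanitize_null_bytes: true|
  # Reinterpret the raw bytes, then transcode to UTF-8, substituting
  # U+FFFD for anything that cannot be converted.
  input.
    force_encoding(Encoding::ASCII_8BIT).
    encode!(Encoding::UTF_8,
            invalid: :replace,
            undef:   :replace)

  if sanitize_null_bytes && input =~ Rack::UTF8Sanitizer::NULL_BYTE_REGEX
    # Replace each null byte with the Unicode replacement character.
    input = input.gsub(Rack::UTF8Sanitizer::NULL_BYTE_REGEX, "\uFFFD")
  end

  input
end

config.middleware.insert 0, Rack::UTF8Sanitizer, sanitize_null_bytes: true, strategy: replace_null_byte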
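
A strategy like this can be checked in isolation by calling the lambda directly, without Rack or the gem loaded. The sketch below assumes a variant that substitutes U+FFFD for each null byte; `null_byte_regex` is a hypothetical local stand-in for `Rack::UTF8Sanitizer::NULL_BYTE_REGEX` and is assumed to match "\x00".

```ruby
# Hypothetical stand-in for Rack::UTF8Sanitizer::NULL_BYTE_REGEX so the
# snippet runs without the gem installed.
null_byte_regex = /\x00/

replace_null_byte = lambda do |input, sanitize_null_bytes: true|
  input.
    force_encoding(Encoding::ASCII_8BIT).
    encode!(Encoding::UTF_8, invalid: :replace, undef: :replace)
  # Swap null bytes for U+FFFD only when the option is enabled.
  input = input.gsub(null_byte_regex, "\uFFFD") if sanitize_null_bytes
  input
end

# Pass copies: force_encoding and encode! modify the string they are called on.
replaced = replace_null_byte.call("a\x00b".dup)
kept     = replace_null_byte.call("a\x00b".dup, sanitize_null_bytes: false)
```

Because `force_encoding` and `encode!` modify their receiver, pass a copy if the caller still needs the original bytes.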