whitequark / rack-utf8_sanitizer

Rack::UTF8Sanitizer is a Rack middleware which cleans up invalid UTF8 characters in request URI and headers.
MIT License
315 stars 53 forks source link

Optimize sanitizing www-form-urlencoded input #58

Open avit opened 3 years ago

avit commented 3 years ago

Using gsub with block syntax is slow. This implementation uses lookup tables to cache the mapping of encoded/decoded characters since these are known subsets / ranges, and handle replacement in a way that doesn't have any overhead for yielding.

Benchmarks for random payloads containing 20% urlencoded characters show a considerable improvement:

Before:

0.000900     0.000565    0.003071    0.033079    0.305175    3.015223

After:

0.000836     0.000488    0.001445    0.011248    0.110888    1.107596

Tested data payload sizes are 10b, 100b, 1kb, 10kb, 100kb, 1mb.

whitequark commented 3 years ago

I don't understand the correctness or the performance properties of this PR to be comfortable signing off on it.

avit commented 3 years ago

Hi @whitequark,

On correctness—I just reverted one change back to the sprintf implementation used in stdlib URI for the tests to pass. (My previous attempt was faster than sprintf, but the output was lowercase hex.)

This is mainly changing gsub(PATTERN, &block) into gsub(PATTERN, lazy_cached_lookup_table), so that every matched byte doesn't need to be re-encoded with new string objects in the yielded block.

I added a test for benchmarking: ruby test/bench_utf8_sanitizer.rb. Running it with RubyProf, with sample data that has a moderate proportion of encoded characters (20%), it shows the code spends over half of its time in reencode_string with many calls to sprintf.

For a 100kb www-form-urlencoded payload, I'm measuring master branch at about 15ms per request, and these revisions at about 5ms.

I'm not yet convinced of this change since I just wrote it. I only want to share it to get more eyes on it & decide if it's pursuing further. I won't be sad if you decide to close it. 😄

sdhull commented 3 years ago

FWIW we have an endpoint that processes a massive form-encoded payload and just this middleware can take anywhere from 1.5s to 11s(!!) to process the request body. So we'd love this (or any performance-focused PR) to land. I can take a closer look later this week, but I'd be curious what it will take to get this merged & released @whitequark

whitequark commented 3 years ago

I can take a closer look later this week

Please try running this PR in your environment, and if it works well, I'll merge it.