Open avit opened 3 years ago
I don't understand the correctness or the performance properties of this PR well enough to be comfortable signing off on it.
Hi @whitequark,
On correctness: I just reverted one change back to the `sprintf` implementation used in stdlib URI, so the tests pass. (My previous attempt was faster than `sprintf`, but the output was lowercase hex.)
This is mainly changing `gsub(PATTERN, &block)` into `gsub(PATTERN, lazy_cached_lookup_table)`, so that every matched byte doesn't need to be re-encoded with new string objects in the yielded block.
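To illustrate the difference (a minimal sketch with hypothetical names, not the PR's actual code): Ruby's `String#gsub` accepts a Hash as the replacement, so each matched byte becomes a single hash lookup instead of a block invocation that allocates intermediate strings.

```ruby
# Bytes that need percent-encoding (illustrative pattern, not the PR's).
UNSAFE = /[^a-zA-Z0-9_.\-~]/

# Block form: the block runs once per match, building a new string each time.
def escape_with_block(str)
  str.b.gsub(UNSAFE) { |ch| format("%%%02X", ch.ord) }
end

# Table form: a 256-entry hash built once; gsub looks each matched byte up
# directly, with no block yield per match.
ESCAPE_TABLE = (0..255).each_with_object({}) do |byte, table|
  table[byte.chr] = format("%%%02X", byte)
end.freeze

def escape_with_table(str)
  str.b.gsub(UNSAFE, ESCAPE_TABLE)
end
```

Both produce the same output; the table form simply avoids the per-match block overhead.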
I added a test for benchmarking: `ruby test/bench_utf8_sanitizer.rb`. Running it with RubyProf on sample data with a moderate proportion of encoded characters (20%) shows the code spends over half of its time in `reencode_string`, with many calls to `sprintf`.
For a 100kb www-form-urlencoded payload, I'm measuring master branch at about 15ms per request, and these revisions at about 5ms.
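A rough sketch of that kind of measurement (this is not the PR's bench script; the payload generator and proportions are illustrative): build a payload of the target size with roughly 20% percent-encoded bytes, then time a decode pass with `Benchmark.realtime`.

```ruby
require "benchmark"

# Build a payload of at least `size` bytes where roughly a fifth of the
# bytes belong to percent-encoded sequences (illustrative mix).
def sample_payload(size)
  tokens = ("a".."z").to_a + ["%20", "%2F"]
  out = +""
  out << tokens.sample while out.bytesize < size
  out
end

payload = sample_payload(100_000)
elapsed = Benchmark.realtime do
  payload.gsub(/%([0-9A-Fa-f]{2})/) { $1.hex.chr }
end
puts format("decode of %d bytes: %.2f ms", payload.bytesize, elapsed * 1000)
```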
I'm not yet fully convinced of this change since I just wrote it. I only want to share it to get more eyes on it and decide whether it's worth pursuing further. I won't be sad if you decide to close it. 😄
FWIW we have an endpoint that processes a massive form-encoded payload and just this middleware can take anywhere from 1.5s to 11s(!!) to process the request body. So we'd love this (or any performance-focused PR) to land. I can take a closer look later this week, but I'd be curious what it will take to get this merged & released @whitequark
> I can take a closer look later this week
Please try running this PR in your environment, and if it works well, I'll merge it.
Using `gsub` with block syntax is slow. This implementation uses lookup tables to cache the mapping of encoded/decoded characters, since these are known subsets/ranges, and handles replacement in a way that avoids any overhead for yielding.
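Since the valid `%XX` sequences form a known, finite range, the decode direction can also be precomputed into a hash once and reused (a sketch with hypothetical names, not the PR's actual table):

```ruby
# Map every "%XX" sequence (all hex-digit case combinations) to its byte.
# 256 bytes with up to 4 case variants each still yields a small hash.
DECODE_TABLE = {}
(0..255).each do |byte|
  hi, lo = format("%02X", byte).chars
  [hi, hi.downcase].uniq.each do |h|
    [lo, lo.downcase].uniq.each do |l|
      DECODE_TABLE["%#{h}#{l}"] = byte.chr
    end
  end
end
DECODE_TABLE.freeze

def unescape(str)
  # Hash replacement: no block is yielded for any match.
  str.b.gsub(/%[0-9A-Fa-f]{2}/, DECODE_TABLE)
end
```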
Benchmarks for random payloads containing 20% urlencoded characters show a considerable improvement:
Before:
After:
Tested data payload sizes are 10b, 100b, 1kb, 10kb, 100kb, 1mb.