Possible null byte sanitization feature

whitequark / rack-utf8_sanitizer

Rack::UTF8Sanitizer is a Rack middleware which cleans up invalid UTF8 characters in request URI and headers.

MIT License


Possible null byte sanitization feature #74

Closed jcoleman closed 1 year ago

jcoleman commented 1 year ago

I know this isn't strictly about UTF-8, but a very similar problem we commonly encounter with web application input is strings containing null bytes. Null bytes are valid in UTF-8, but they're a problem in practice because Postgres rejects them in text and varchar values.
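
To illustrate the point above, here is a minimal Ruby sketch showing that an encoding check alone does not catch a null byte. The variable names are made up for this example, and the cleanup step is only one possible approach, not something the project is confirmed to do.

```ruby
# A string with an embedded null byte (U+0000).
input = "abc\u0000def"

# Ruby treats this as valid UTF-8, so validating the encoding is not enough.
input_is_valid_utf8 = input.valid_encoding?

# Assumed cleanup step: drop every null byte before the value reaches the database.
cleaned = input.delete("\u0000")
```

Here `input_is_valid_utf8` is `true`, and `cleaned` no longer contains the null byte.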

If I added an optional feature to also sanitize null bytes (just as the project currently sanitizes invalid UTF-8 bytes), would you be open to merging it into the project?
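
A rough sketch of what such a feature could look like as a standalone Rack middleware follows. Everything here is hypothetical: the class name `NullByteStripper`, the list of env keys, and the choice to strip only raw null bytes are assumptions for illustration, not the design of rack-utf8_sanitizer or of the eventual PR.

```ruby
# Hypothetical middleware, not part of rack-utf8_sanitizer.
class NullByteStripper
  # Illustrative selection of Rack env keys to clean.
  SANITIZED_KEYS = %w[QUERY_STRING PATH_INFO REQUEST_URI HTTP_REFERER].freeze

  def initialize(app)
    @app = app
  end

  def call(env)
    SANITIZED_KEYS.each do |key|
      value = env[key]
      next unless value.is_a?(String)

      # Strip raw null bytes only; percent-encoded forms such as %00
      # are left alone in this sketch.
      env[key] = value.delete("\0")
    end
    @app.call(env)
  end
end

# Usage with a trivial downstream app that echoes the query string.
echo_app = ->(env) { [200, { "content-type" => "text/plain" }, [env["QUERY_STRING"]]] }
status, _headers, body = NullByteStripper.new(echo_app).call("QUERY_STRING" => "q=a\0b")
```

Because the sketch leaves `%00` untouched, a real implementation would likely also need to handle null bytes that only appear after URL decoding.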

whitequark commented 1 year ago

I guess we could consider that a subset of UTF-8.

jcoleman commented 1 year ago

Well, null bytes are valid in UTF-8, so it wouldn’t be a subset, but as long as you’re comfortable with that as a feature, I’ll work on adding it and open a PR.

whitequark commented 1 year ago

I mean that an alphabet that's UTF-8 minus null bytes is a subset of UTF-8 (which is itself a subset of an alphabet of all bytes).

jcoleman commented 1 year ago

Ah, makes sense.

jcoleman commented 1 year ago

Closed by #75