jcoleman closed 1 year ago
I guess we could consider that a subset of UTF-8.
Well, null bytes are valid in UTF-8, so it wouldn’t be a subset, but as long as you’re comfortable with that as a feature, I’ll work on adding it and open a PR.
I mean that an alphabet that's UTF-8 minus null bytes is a subset of UTF-8 (which is itself a subset of an alphabet of all bytes).
Ah, makes sense.
Closed by #75
I know this isn't strictly about UTF8, but a very similar problem we encounter commonly with input to web applications is strings containing null bytes. Null bytes are valid in UTF8, but they're a problem in practice because Postgres doesn't accept them for `text` or `varchar` data types. If I added an optional feature to also sanitize null bytes (just as the project currently sanitizes invalid UTF8 bytes), would you be open to merging that into the project?
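As a rough illustration of the proposed option, here is a minimal sketch in Python. It is not the project's actual code: the function name `sanitize` and the `strip_null_bytes` flag are hypothetical, and it assumes the existing behavior is to drop invalid UTF-8 byte sequences.

```python
def sanitize(data: bytes, strip_null_bytes: bool = False) -> str:
    """Hypothetical sanitizer: drops invalid UTF-8, optionally drops null bytes."""
    # Assumed existing behavior: silently discard bytes that are not valid UTF-8.
    text = data.decode("utf-8", errors="ignore")
    # Proposed opt-in behavior: also remove U+0000, which is valid UTF-8
    # but is rejected by Postgres text and varchar columns.
    if strip_null_bytes:
        text = text.replace("\x00", "")
    return text
```

With the flag left off, a null byte passes through untouched because it is valid UTF-8; with the flag on, it is removed alongside any invalid bytes. Keeping it opt-in means existing users see no change in behavior.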