ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License

Proposal: Improve Hex Escape Sequence #17376

Open exxjob opened 1 year ago

exxjob commented 1 year ago
Sequence   Name                {N}
\x{N}      hexadecimal value   1 to 32 digits

The hex escape should allow underscores as visual separators, like its int literal counterpart. This gives sound, 8-bit clean collation and identifiers in ZON and elsewhere, such as UUIDs and codepoints, and more symmetry with int literals and templates.

exxjob commented 1 year ago

If this proposal and the solutions in #17385 and #14534 are accepted, we can achieve this representation in ZON:

.{ 
    .@"\x{ ff\  0\  0}" = "red",
    .@"\x{  0\ ff\  0}" = "green",
    .@"\x{  0\  0\ ff}" = "blue"
}
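
To make the intended reading of the example above concrete, here is a minimal Zig sketch of how the body of such an escape (the text between the braces) might be decoded. It rests on assumptions the thread does not confirm: that a backslash separates groups, that each group is a single byte written in hex, and that spaces are only padding. `decodeHexGroups` is a hypothetical helper, not part of Zig's tokenizer or standard library, and `std.mem.splitScalar` assumes Zig 0.11 or later.

```zig
const std = @import("std");

/// Hypothetical helper: decodes the body of a proposed `\x{...}` escape
/// into bytes. Assumes one byte per backslash-delimited group and treats
/// spaces as padding; neither rule is settled by the proposal.
fn decodeHexGroups(body: []const u8, out: []u8) ![]u8 {
    var n: usize = 0;
    var groups = std.mem.splitScalar(u8, body, '\\');
    while (groups.next()) |raw| {
        // Drop the padding spaces around each group.
        const group = std.mem.trim(u8, raw, " ");
        if (n >= out.len) return error.NoSpaceLeft;
        // An empty group or a value above ff is reported as an error.
        out[n] = try std.fmt.parseInt(u8, group, 16);
        n += 1;
    }
    return out[0..n];
}

test "decode the body of the `red` key" {
    var buf: [32]u8 = undefined;
    // The Zig string below spells the body ` ff\  0\  0`.
    const bytes = try decodeHexGroups(" ff\\  0\\  0", &buf);
    try std.testing.expectEqualSlices(u8, &[_]u8{ 0xff, 0, 0 }, bytes);
}
```

Under these assumptions the three keys decode to the byte triples ff 00 00, 00 ff 00, and 00 00 ff.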

Similarly, the following line: https://github.com/ziglang/zig/blob/153ba46a5b20f178d48ef2f09e0e638a3749af0e/lib/std/zig/tokenizer.zig#L1477 becomes:

try testTokenize("//\x{f4\ 8f\ bf\ bf}", &.{}); 

or alternatively:

try testTokenize("//\u{10FF_FF}", &.{});

exxjob commented 7 months ago

This proposal, motivated by use cases in declarative ZON, originally also covered binary, octal, and decimal escapes. That is likely excessive for scenarios where plaintext ZON is exchanged, and there isn't much demand for it in regular strings. A sufficient solution may be editor plugins that overlay other radices on hex escapes where needed.

I'm leaving this issue open, however, since I think \u{N} should be consolidated into \x{N}. Ideally, this would work with the above-mentioned delimiters and visual underscore separators, and should be capped at 32 hex digits, which corresponds to a 128-bit UUID. Issues updated to reflect changes in the proposal.

mnemnion commented 4 months ago

I'm not a fan of half of this proposal. It would remove a useful affordance. No reason to do so is given, so I must conclude it's out of a misguided sense of minimalism. \u{xxxx} escape sequences are useful for generating test data, and representing codepoints which might otherwise display as tofu in a way which can be looked up.

Extending the \x notation we have with an extended \x{ffff} notation for byte sequences of arbitrary length is a fine idea. I see no reason to cap it at any particular length, that part seems arbitrary. It would give a clean mechanism to generate arbitrary byte data. The underscores are also a good idea.

Making the language worse for no reason, is not. UTF-8 centrism is Good, actually. It's the encoding Zig chose for literal string data, and source code. It's the encoding everyone should be using. Privileging it with a codepoint-specific escape sequence is rational, forcing people to figure out how to represent that codepoint in hexadecimal is not rational, it would be regressive.

If you have a compelling reason for that half of your proposal, let's hear it. Otherwise, I suggest closing this issue, it being a bad practice to combine two unrelated changes into one issue, and opening one specific to the \x{ffff} proposal instead.

exxjob commented 4 months ago

@mnemnion

> No reason to do so is given.

It says "mistyping an invalid codepoint is like mistyping a valid one". That pretty much sums it up. For the sake of DOTADIW, bring your own sanitizer / coverage / whatever.

> Making the language worse for no reason.

Naughty.

> UTF-8 centrism is good, actually. It's the encoding everyone should be using.

Intoxicating.

> Forcing people to figure out how to represent that codepoint in hexadecimal is not rational.

I'm sorry, you do know that \u escapes are hexadecimal...?

I don't understand most of your comment, ergo I'm too busy making the language worse for no reason. Remember that strings and slices are fairly interchangeable concepts in Zig, and there's the bastard ZON. I explained my strife with the state of escseq and canonicalization in my other issues too if you need further reference. I don't care for hot takes on Unicode.

andersen commented 4 months ago

@mnemnion’s concern appears to be that removing the \u syntax would require that, e.g., \u{10FFFF} be written as \x{f4\ 8f\ bf\ bf} instead. However, the proposal also mentions \x{10FF_FF} as an alternative (not meaning \x{10\ FF\ FF}). Is the idea to encode as UTF-8 for values above FF?

mnemnion commented 4 months ago

> However, the proposal also mentions \x{10FF_FF} as an alternative (not meaning \x{10\ FF\ FF}). Is the idea to encode as UTF-8 for values above FF?

It would be very weird if "\x10\xff\xff" and \x{10ff_ff} meant different things. I propose, instead, that the latter be a different escape sequence. For example, we could call it \u, for Unicode, indicating that it represents a UTF-8 encoded Unicode codepoint. So "\x10\xff\xff" and "\x{10ff_ff}" would refer to the same three bytes, but \u{10ff_ff} would be equivalent to "\xf4\x8f\xbf\xbf", or "\x{f4_8f_bf_bf}".
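
The distinction drawn here can be checked against the escapes Zig already accepts today. The following is a small sketch using only the existing `\xNN` and `\u{NNNNNN}` forms (none of the proposed syntax), with `std.testing` as the only dependency; the test name is arbitrary.

```zig
const std = @import("std");

test "byte escapes versus codepoint escapes" {
    // Three raw bytes, stored exactly as written.
    const raw = "\x10\xff\xff";
    try std.testing.expectEqual(@as(usize, 3), raw.len);

    // One codepoint, U+10FFFF, stored as its four-byte UTF-8 encoding.
    const scalar = "\u{10ffff}";
    try std.testing.expectEqual(@as(usize, 4), scalar.len);
    try std.testing.expectEqualStrings("\xf4\x8f\xbf\xbf", scalar);
}
```

Run with `zig test` on the file containing the block.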

> I don't understand most of your comment

If you don't understand Unicode well enough to understand my comment, your proposal to change Zig's Unicode handling should be disregarded. It would make the language worse for no reason.

You were given a chance to explain yourself, and chose to snark, without making any technical point of substance. I am forced to conclude that you have no good reasons for this proposal.

exxjob commented 4 months ago

@andersen yes, the syntax in my mind is like so: \x{f4\ 8f\ bf\ bf} or \x{10FF_FF} with the latter's underscore being an optional visual separator, like with int literals. The backslash is used as a delimiter instead of comma to convey that you're inside an escape sequence https://github.com/ziglang/zig/issues/17385

@mnemnion

> It would be very weird if "\x10\xff\xff" and \x{10ff_ff} meant different things.

Taken together, the proposals suggest the syntax "\x{10\ ff\ ff}" and "\x{10ff_ff}". You only need delimiters to communicate that 255255255 and 255, 255, 255 are different things.

I'm not the one being snarky. I'm writing a thesis on character encodings. I'm sorry we don't see eye to eye on rhyme and reason. Check the purpose statement http://exxjob.cc/#utf9

andersen commented 4 months ago

Thanks for confirming. Does that mean that the range U+0080 to U+00FF under your proposal would have to be spelt out as \x{C2\ 80} to \x{C3\ BF}, or do you have some other shorthand syntax in mind?
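
For reference, the byte sequences in question can be confirmed with the escapes Zig accepts today. This is only an illustrative sketch using the existing `\u{...}` and `\xNN` forms and `std.testing`; it does not exercise the proposed syntax.

```zig
const std = @import("std");

test "U+0080 through U+00FF need two UTF-8 bytes" {
    // First and last codepoints of the range, as UTF-8.
    try std.testing.expectEqualStrings("\xc2\x80", "\u{80}");
    try std.testing.expectEqualStrings("\xc3\xbf", "\u{ff}");

    // The byte escape \xff is a single raw byte, not U+00FF.
    try std.testing.expectEqual(@as(usize, 1), "\xff".len);
}
```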

exxjob commented 4 months ago

@andersen those are the UTF-8 byte sequences, thanks for pointing this out. I'll update the issue.

mnemnion commented 4 months ago

Since this proposal now takes no stance on the use of \u, it consists, from where I stand, entirely of good ideas. Glad we worked that out.

exxjob commented 4 months ago

@mnemnion a good idea is for you to communicate better https://github.com/ziglang/zig/issues/20152#issuecomment-2179291206

mnemnion commented 4 months ago

That's odd feedback coming from a Rick Astley quoter.

I consider this interaction successfully concluded, because you listened, and incorporated that feedback to make the proposal better. I'd offer you advice on how to do that better, but I suspect you can figure that out on your own.

andersen commented 2 months ago

@exxjob Thanks for your clarifications and updates. The example \x{10FF_FF} still appears in an early comment; it should probably be changed to \u{10FF_FF} (or perhaps \u{10_FF_FF} or \u{10_FFFF}) to match the current proposal.