whatwg / infra

Infra Standard
https://infra.spec.whatwg.org/

String "is"/"identical to" are harder to use than what they replaced #344

Open tabatkins opened 3 years ago

tabatkins commented 3 years ago

Previously, HTML defined the term "case-sensitive" to mean something had equality based on codepoint-by-codepoint comparison. This was removed in https://github.com/whatwg/html/issues/5067, in favor of Infra defining "is"/"identical to".

I was cleaning up the build of a spec that used the old term and found the new ones, and phew, they're a lot more awkward to use!

Previously, I could write text like:

Such identifiers are fully [=case-sensitive=] (meaning they’re compared by codepoint), even in the ASCII range (e.g. example and EXAMPLE are two different, unrelated user-defined identifiers).

Now, my best attempt at rewriting this is:

Such identifiers are fully case-sensitive (only equal if they're [=identical to=] each other, meaning they're compared codepoint by codepoint)...

Issue here is that the term switched grammatical roles entirely, from an adjective to a verb or, uh, whatever you call what "identical to" is (a partial adjectival phrase?), and the ways in which one would naturally use "case-sensitive" do not cleanly translate to the ways one would naturally use "identical to".

(If you were previously using a construction like "if |foo| is a [=case-sensitive=] match to 'foo', then...", then switching is easy and probably clearer: "if |foo| is [=identical to=] 'foo', then...". But I'm often using it in the form described above.)
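For concreteness, here is a minimal Python sketch of the comparison under discussion. The names `identical_to` and `ascii_case_insensitive_match` are hypothetical helpers made up for illustration; they are not defined by Infra or HTML. The sketch assumes Python's `str`, which is a sequence of code points, is a close enough stand-in for the strings in the spec text.

```python
def identical_to(a: str, b: str) -> bool:
    """Hypothetical helper: True only if a and b are the same sequence of code points."""
    if len(a) != len(b):
        return False
    # Compare position by position; any differing code point means "not identical".
    return all(ord(x) == ord(y) for x, y in zip(a, b))


def ascii_case_insensitive_match(a: str, b: str) -> bool:
    """Hypothetical contrast: fold only ASCII uppercase letters, then compare."""
    def fold(s: str) -> str:
        return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)
    return identical_to(fold(a), fold(b))


print(identical_to("example", "EXAMPLE"))                  # False
print(ascii_case_insensitive_match("example", "EXAMPLE"))  # True
```

Under the first comparison, example and EXAMPLE are two unrelated identifiers, which is the behavior the old "case-sensitive" wording was describing.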

aphillips commented 3 years ago

I see the problem and you're right: it does require a rewrite (so do most of the other changes I can think of that address the original problem in 5067). Would it help to provide "identical" also? (e.g. Such identifiers are fully [=identical=] (meaning they're compared by codepoint), even in the ASCII range etc...)

"is" was supposed to be the verb form, so "if |foo| [=is=] 'foo', then..." would be the answer to your last example. Most spec authors are probably cautious about relying on the formal use of "is" being distinct from just plain old "is".

Alternatively... I think we didn't replace "case-sensitive" with some other similar formulation because we wanted to avoid trying to pack all the meaning into one phrase. I realize that most of the time most spec authors are thinking of the case variation part of the problem (and the example you give above demonstrates this). Perhaps "identity-sensitive" (though that might seem to imply something about credentials)? Or "codepoint-sensitive"?

We could also restore "case-sensitive" as an option, but then my I18N colleagues (and I) would probably wander around wanting spec authors to expand on specific examples to note that it's more than just case, or suggesting pointers to String Matching. This isn't precisely evil, but it would be good to get a formulation that calls out the other things where useful without making it absurdly hard to write specs.
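As one illustration of "more than just case", the following Python sketch compares two spellings of the same visible text that differ only in Unicode normalization form. The sample strings are made up for illustration. Python's `==` on `str` compares code points, which is assumed here to approximate the codepoint-by-codepoint comparison being discussed.

```python
import unicodedata

precomposed = "caf\u00e9"   # "é" as the single code point U+00E9
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# Codepoint-by-codepoint comparison treats these as different strings,
# even though neither differs from the other in case.
same_code_points = precomposed == decomposed

# Normalizing both to NFC first makes them compare equal.
same_after_nfc = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)

print(same_code_points)  # False
print(same_after_nfc)    # True
```

A term that mentions only case gives readers no hint that these two strings also fail to match.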

annevk commented 3 years ago

You could perhaps rephrase it as "Note that the identifiers are compared using identical to, i.e., code point by code point."

tabatkins commented 3 years ago

The problem with that is - compared to what? Saying "compared to each other" is okay, but a bit awkward; I'm not happy with that phrasing in my reworded version already. And it's not quite right anyway; really it's when comparing the identifier's value with any other string.

Coming back to an adjective form, if we're avoiding "case-sensitive" (i18n's complaints are reasonable), perhaps "codepoint-sensitive"? After all, the original "case-sensitive" word meant literally "it's sensitive to differences in case", and we then extended that to mean all differences in codepoint, potentially confusing the term; bringing it back, just with a better literal meaning ("it's sensitive to differences in codepoints") would work, I think?

In terms of spec, maybe something like (added right after the "all string comparisons use 'is'" paragraph):

Alternately, a string in some context can be said to be <dfn>codepoint-sensitive</dfn>, meaning when it is compared to any other string, they're equal only if they're [=identical to=] each other.