Status: Closed (osiegmar closed this issue 3 years ago)
That's a draft. I did find a published RFC that has UTF-8 definitions: Section 3 of RFC 6532, https://tools.ietf.org/html/rfc6532
Their syntax is much simpler.
Section 3.1 of RFC 6532 refers to RFC 3629, or did I misunderstand something? What TEXTDATA grammar do you have in mind?
The difference would be whether we spell out the actual values or import them by reference:
TEXTDATA = HTAB / %x20-21 / %x23-2B / %x2D-7F / UTF8-2 / UTF8-3 / UTF8-4
UTF8-2 = <Defined in Section 4 of RFC3629>
UTF8-3 = <Defined in Section 4 of RFC3629>
UTF8-4 = <Defined in Section 4 of RFC3629>
TEXTDATA = HTAB / %x20-21 / %x23-2B / %x2D-7F / UTF8-2 / UTF8-3 / UTF8-4
If we're skipping the control characters from %x00-1F why should we allow %x7F? On the other hand UTF8-tail contains Unicode C1 control characters (%x80-9F).
UTF8-octets = *( UTF8-char )
UTF8-char = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
UTF8-1 = %x00-7F
UTF8-2 = %xC2-DF UTF8-tail
UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
%xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8-4 = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
%xF4 %x80-8F 2( UTF8-tail )
UTF8-tail = %x80-BF
Also note that Section 4 of RFC 3629 says:
NOTE -- The authoritative definition of UTF-8 is in [UNICODE]. This grammar is believed to describe the same thing Unicode describes, but does not claim to be authoritative. Implementors are urged to rely on the authoritative source, rather than on this ABNF.
So I'm wondering how this is handled by other (standard) RFCs.
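For readers who want to experiment with the RFC 3629 grammar quoted above, here is a minimal sketch that transcribes it rule by rule into a Python bytes regex. The names `is_utf8_octets`, `_TAIL` and `_UTF8_CHAR` are made up for this illustration and are not part of any RFC or library. As the RFC's own note says, the ABNF is not authoritative, so treat this as a way to poke at the grammar rather than as a reference validator.

```python
import re

# UTF8-tail = %x80-BF
_TAIL = rb"[\x80-\xBF]"

# One alternative per branch of the RFC 3629 Section 4 ABNF.
_UTF8_CHAR = b"|".join([
    rb"[\x00-\x7F]",                        # UTF8-1
    rb"[\xC2-\xDF]" + _TAIL,                # UTF8-2
    rb"\xE0[\xA0-\xBF]" + _TAIL,            # UTF8-3, first branch
    rb"[\xE1-\xEC]" + _TAIL + rb"{2}",
    rb"\xED[\x80-\x9F]" + _TAIL,            # excludes surrogates
    rb"[\xEE-\xEF]" + _TAIL + rb"{2}",
    rb"\xF0[\x90-\xBF]" + _TAIL + rb"{2}",  # UTF8-4, first branch
    rb"[\xF1-\xF3]" + _TAIL + rb"{3}",
    rb"\xF4[\x80-\x8F]" + _TAIL + rb"{2}",  # caps at U+10FFFF
])

# UTF8-octets = *( UTF8-char )
_UTF8_OCTETS = re.compile(rb"(?:" + _UTF8_CHAR + rb")*")


def is_utf8_octets(data: bytes) -> bool:
    """Return True if data matches the UTF8-octets rule of RFC 3629."""
    return _UTF8_OCTETS.fullmatch(data) is not None
```

Overlong forms such as `b"\xC0\x80"` and encoded surrogates such as `b"\xED\xA0\x80"` are rejected because no branch of the grammar admits them.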
Seen by chance: https://tools.ietf.org/html/rfc7159#section-7
unescaped = %x20-21 / %x23-5B / %x5D-10FFFF
I see this carried over into RFC 8259 as well. I like this approach; it is much simpler than the alternative. Perhaps we should go with this?
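To make the code-point style of the `unescaped` rule concrete, here is a small sketch that checks a single decoded character against the three ranges quoted above. The function name `is_unescaped` is hypothetical, and the check works on Unicode code points, not on UTF-8 bytes.

```python
def is_unescaped(ch: str) -> bool:
    """Check one character against: %x20-21 / %x23-5B / %x5D-10FFFF."""
    cp = ord(ch)
    return (
        0x20 <= cp <= 0x21
        or 0x23 <= cp <= 0x5B
        or 0x5D <= cp <= 0x10FFFF
    )
```

Note that this rule accepts both U+007F and the C1 range U+0080 to U+009F, since they fall inside `%x5D-10FFFF`.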
I drafted PR #18. I'm still unsure about my comment above:
If we're skipping the control characters from %x00-1F why should we allow %x7F? On the other hand UTF8-tail contains Unicode C1 control characters (%x80-9F).
This also seems ambiguous in RFC 8259.
For ASCII, I would skip %x7F as per the existing definition in RFC 4180. For Unicode C1 characters, let's leave them in but open a separate issue and keep it open until we get a better idea.
Can you adjust the PR to exclude %x7F?
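As a quick sanity check on the control characters under discussion, the sketch below prints the UTF-8 encoding and Unicode general category of a few boundary code points. The helper `describe` is made up for this illustration. It shows that U+007F is a single byte covered by UTF8-1, while the C1 controls U+0080 to U+009F are two-byte sequences (lead byte `C2` plus a tail byte) covered by UTF8-2.

```python
import unicodedata


def describe(cp: int) -> tuple:
    """Return (UTF-8 bytes as hex, Unicode general category) for a code point."""
    ch = chr(cp)
    return ch.encode("utf-8").hex(), unicodedata.category(ch)


# Boundary code points around the C0, DEL and C1 control ranges.
for cp in (0x1F, 0x7F, 0x80, 0x9F, 0xA0):
    encoded, category = describe(cp)
    print(f"U+{cp:04X} bytes={encoded} category={category}")
```

Excluding `%x7F` at the ASCII level therefore does not touch the C1 range; excluding C1 would need a restriction on the second byte after `%xC2`, or a code-point-level grammar.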
The new version says:
This is good and common practice nowadays. However, the ABNF has to be updated accordingly, because currently we have:
TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
My first proposal for further discussion is:
TEXTDATA = %x20-21 / %x23-2B / %x2D-D7FF / %xE000-FFFD / %x10000-10FFFF
Possibly relevant: Unicode in ABNF (DRAFT), RFC 3629.
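To see what the first proposal for TEXTDATA would accept, here is a minimal sketch that encodes its five code-point ranges as a table. The names `TEXTDATA_RANGES`, `is_textdata` and `is_textdata_field` are hypothetical and exist only for this illustration; the ranges are copied from the proposal as written.

```python
# TEXTDATA = %x20-21 / %x23-2B / %x2D-D7FF / %xE000-FFFD / %x10000-10FFFF
TEXTDATA_RANGES = (
    (0x20, 0x21),        # space and "!", skipping DQUOTE (%x22)
    (0x23, 0x2B),        # "#" through "+", skipping COMMA (%x2C)
    (0x2D, 0xD7FF),      # "-" up to just before the surrogate block
    (0xE000, 0xFFFD),    # after the surrogates, excluding U+FFFE and U+FFFF
    (0x10000, 0x10FFFF), # supplementary planes
)


def is_textdata(ch: str) -> bool:
    """Return True if a single character falls inside one of the ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in TEXTDATA_RANGES)


def is_textdata_field(text: str) -> bool:
    """Return True if every character of an unquoted field is TEXTDATA."""
    return all(is_textdata(ch) for ch in text)
```

Under these ranges, `%x7F` and the C1 controls are accepted because they sit inside `%x2D-D7FF`, while surrogate code points, U+FFFE and U+FFFF are rejected.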