Align ABNF grammar with UTF-8 encoding

yakovsh / rfc4180-bis

Repository for work regarding the new version of RFC 4180


Align ABNF grammar with UTF-8 encoding #13

Closed osiegmar closed 3 years ago

osiegmar commented 3 years ago

The new version says:

Default charset and line break values

Since the initial publication of {{!RFC4180}}, the default charset for "text/*" media types has been changed to UTF-8 (as per {{!RFC6657}}).

This is good and common practice nowadays. However, the ABNF has to be updated accordingly, because currently we have:

TEXTDATA = %x20-21 / %x23-2B / %x2D-7E

My first proposal for further discussion is:

TEXTDATA = %x20-21 / %x23-2B / %x2D-D7FF / %xE000-FFFD / %x10000-10FFFF
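As a rough illustration of what this proposal admits, the ranges can be read as Unicode code points rather than octets. The Python sketch below does just that; the names `PROPOSED_TEXTDATA` and `is_textdata` are made up for illustration and are not part of any draft.

```python
# Code-point ranges copied from the proposed TEXTDATA rule (inclusive bounds).
PROPOSED_TEXTDATA = [
    (0x20, 0x21),         # space and "!"; skips DQUOTE (0x22)
    (0x23, 0x2B),         # "#" through "+"; skips COMMA (0x2C)
    (0x2D, 0xD7FF),       # "-" up to the last code point before the surrogates
    (0xE000, 0xFFFD),     # resumes after the surrogates; stops before U+FFFE
    (0x10000, 0x10FFFF),  # supplementary planes
]


def is_textdata(ch: str) -> bool:
    """Return True if the single character ch falls in one of the ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PROPOSED_TEXTDATA)
```

Under this reading, `is_textdata("é")` is true and `is_textdata(",")` is false. Note that `is_textdata("\x7f")` is also true: the range %x2D-D7FF covers DEL, which the current %x2D-7E does not.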

Possibly relevant: Unicode in ABNF (DRAFT), RFC 3629.

nightwatchcyber commented 3 years ago

That's a draft, but I did find a published RFC that has UTF-8 definitions: Section 3 of RFC 6532, https://tools.ietf.org/html/rfc6532

Their syntax is much simpler.

osiegmar commented 3 years ago

Section 3.1 of RFC 6532 refers to RFC 3629, or did I misunderstand something? What TEXTDATA grammar do you have in mind?

nightwatchcyber commented 3 years ago

The difference would be whether we spell out the actual values or import them by reference.

nightwatchcyber commented 3 years ago

The difference would be:

TEXTDATA = HTAB / %x20-21 / %x23-2B / %x2D-7F / UTF8-2 / UTF8-3 / UTF8-4
UTF8-2   = <Defined in Section 4 of RFC3629>
UTF8-3   = <Defined in Section 4 of RFC3629>
UTF8-4   = <Defined in Section 4 of RFC3629>

osiegmar commented 3 years ago

TEXTDATA = HTAB / %x20-21 / %x23-2B / %x2D-7F / UTF8-2 / UTF8-3 / UTF8-4

If we're skipping the control characters from %x00-1F, why should we allow %x7F? On the other hand, UTF8-tail contains Unicode C1 control characters (%x80-9F).

UTF8-octets = *( UTF8-char )
UTF8-char   = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
UTF8-1      = %x00-7F
UTF8-2      = %xC2-DF UTF8-tail
UTF8-3      = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
              %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
              %xF4 %x80-8F 2( UTF8-tail )
UTF8-tail   = %x80-BF
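To make the octet-level rules concrete, here is a sketch that translates the TEXTDATA candidate quoted above, together with UTF8-2, UTF8-3 and UTF8-4, into a Python bytes regex. The names `TEXTDATA_RUN` and `is_textdata_bytes` are hypothetical, and the regex is only my reading of the ABNF, not an authoritative validator.

```python
import re

# Byte-level translation of the RFC 3629 rules (Section 4).
UTF8_TAIL = rb"[\x80-\xBF]"
UTF8_2 = rb"[\xC2-\xDF]" + UTF8_TAIL
UTF8_3 = (
    rb"\xE0[\xA0-\xBF]" + UTF8_TAIL
    + rb"|[\xE1-\xEC]" + UTF8_TAIL + rb"{2}"
    + rb"|\xED[\x80-\x9F]" + UTF8_TAIL
    + rb"|[\xEE-\xEF]" + UTF8_TAIL + rb"{2}"
)
UTF8_4 = (
    rb"\xF0[\x90-\xBF]" + UTF8_TAIL + rb"{2}"
    + rb"|[\xF1-\xF3]" + UTF8_TAIL + rb"{3}"
    + rb"|\xF4[\x80-\x8F]" + UTF8_TAIL + rb"{2}"
)

# HTAB / %x20-21 / %x23-2B / %x2D-7F from the TEXTDATA candidate.
ASCII_PART = rb"[\x09\x20-\x21\x23-\x2B\x2D-\x7F]"

# Zero or more TEXTDATA characters; all alternatives share one flat group.
TEXTDATA_RUN = re.compile(
    rb"(?:" + rb"|".join([ASCII_PART, UTF8_2, UTF8_3, UTF8_4]) + rb")*"
)


def is_textdata_bytes(data: bytes) -> bool:
    """Return True if every octet sequence in data matches the candidate rule."""
    return TEXTDATA_RUN.fullmatch(data) is not None
```

With this translation, `is_textdata_bytes("\u0085".encode("utf-8"))` is true: a C1 control such as U+0085 is encoded as %xC2 %x85, so it is admitted through UTF8-2, while a bare %x85 octet is rejected. DEL (`b"\x7f"`) is accepted as well, and `b","` is not.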

Also note that Section 4 of RFC 3629 says:

NOTE -- The authoritative definition of UTF-8 is in [UNICODE]. This grammar is believed to describe the same thing Unicode describes, but does not claim to be authoritative. Implementors are urged to rely on the authoritative source, rather than on this ABNF.

So I'm wondering how this is handled by other (standard) RFCs.

osiegmar commented 3 years ago

Seen by chance: https://tools.ietf.org/html/rfc7159#section-7

unescaped = %x20-21 / %x23-5B / %x5D-10FFFF

nightwatchcyber commented 3 years ago

I see this carried over into RFC 8259 as well. I like this approach; it is much simpler than the alternative. Perhaps we should go with this?

osiegmar commented 3 years ago

I drafted PR #18. I'm still unsure about my comment above:

If we're skipping the control characters from %x00-1F, why should we allow %x7F? On the other hand, UTF8-tail contains Unicode C1 control characters (%x80-9F).

This also seems ambiguous in RFC 8259.

nightwatchcyber commented 3 years ago

For ASCII, I would skip %x7F as per the existing definition in RFC 4180. For Unicode C1 characters, let's leave them in but open a separate issue and keep it open until we get a better idea.

Can you adjust the PR to exclude %x7F?
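For illustration only, one way the outcome of this discussion could look in code-point terms is sketched below in Python. The ranges are an assumption: they combine the JSON-style open-ended upper range with DEL excluded and the C1 controls left in. They are not the wording of PR #18, and `SKETCH_TEXTDATA`, `in_sketch_textdata` and `rejected_chars` are hypothetical names.

```python
# Hypothetical ranges (inclusive): DEL (0x7F) excluded, C1 controls
# (0x80-0x9F) left in, single open-ended range above ASCII as in JSON.
# This is an assumption for discussion, not the text of PR #18.
SKETCH_TEXTDATA = [
    (0x20, 0x21),      # space and "!"; skips DQUOTE (0x22)
    (0x23, 0x2B),      # "#" through "+"; skips COMMA (0x2C)
    (0x2D, 0x7E),      # "-" through "~"; stops before DEL
    (0x80, 0x10FFFF),  # everything above ASCII, C1 controls included
]


def in_sketch_textdata(cp: int) -> bool:
    """Return True if code point cp falls in one of the sketch ranges."""
    return any(lo <= cp <= hi for lo, hi in SKETCH_TEXTDATA)


def rejected_chars(field: str) -> list[str]:
    """List the characters of an unquoted field that fall outside the sketch."""
    return [ch for ch in field if not in_sketch_textdata(ord(ch))]
```

For example, `rejected_chars("caf\u00e9")` returns an empty list, while `rejected_chars("a,b\x7f")` reports the comma and DEL. A C1 control such as U+0085 passes, which is the part left for a separate issue.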