whatwg / html

HTML Standard
https://html.spec.whatwg.org/multipage/

RFC 2388 has been obsoleted by RFC 7578 #398

Closed: cvrebert closed this issue 8 years ago

cvrebert commented 8 years ago

Regarding multipart/form-data, HTML currently refers to RFC 2388, but that RFC has been obsoleted by RFC 7578. HTML should thus be updated to refer to the new RFC. The new RFC has an appendix explaining the changes since the old RFC: https://tools.ietf.org/html/rfc7578#appendix-A

domenic commented 8 years ago

I really wish we had some assurances that the changes being made were actually web compatible, or tested against the real world. As is, reading that appendix, I have no idea if the changes bring the RFC more in line with browsers, or less.

I guess one thing that happened is that our note:

In particular, this means that multiple files submitted as part of a single <input type=file multiple> element will result in each file having its own field; the "sets of files" feature ("multipart/mixed") of RFC 2388 is not used.

is no longer as necessary, since apparently they realized multipart/mixed was a no-go.
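As an illustration of what that note describes, here is a minimal Python sketch in which every file from one multi-file input becomes its own part under the same field name, with no nested multipart/mixed part. The function name `build_multipart`, the field name `upload`, and the fixed boundary are hypothetical, and the sketch assumes names contain no quotes or line breaks, so it does no escaping.

```python
import uuid


def build_multipart(field_name, files, boundary=None):
    """Serialize files as multipart/form-data, one part per file.

    files is a list of (filename, content_type, data_bytes) tuples.
    Every part reuses the same field name; nothing is wrapped in a
    nested multipart/mixed part.
    """
    boundary = boundary or uuid.uuid4().hex
    chunks = []
    for filename, content_type, data in files:
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{field_name}"; '
            f'filename="{filename}"\r\n'
            f"Content-Type: {content_type}\r\n\r\n"
        )
        # Assumption: names are sent as UTF-8 and need no escaping.
        chunks.append(header.encode("utf-8") + data + b"\r\n")
    # Closing delimiter for the whole body.
    chunks.append(f"--{boundary}--\r\n".encode("ascii"))
    return boundary, b"".join(chunks)


boundary, body = build_multipart(
    "upload",
    [("a.txt", "text/plain", b"A"), ("b.txt", "text/plain", b"B")],
    boundary="XYZ",
)
```

Both files end up as sibling parts named `upload`, which is the shape a receiver should expect from `<input type=file multiple>`.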

I wonder about our

field names in particular do not get converted to a 7-bit safe encoding as suggested in RFC 2388

and

User agents must not use the RFC 2231 encoding suggested by RFC 2388

cvrebert commented 8 years ago

User agents must not use the RFC 2231 encoding suggested by RFC 2388

Regarding filenames, the new RFC no longer refers to RFC 2231. Instead it offers:

In most multipart types, the MIME header fields in each part are restricted to US-ASCII; for compatibility with those systems, file names normally visible to users MAY be encoded using the percent-encoding method [...] Some commonly deployed systems use multipart/form-data with file names directly encoded including octets outside the US-ASCII range. The encoding used for the file names is typically UTF-8, although HTML forms will use the charset associated with the form.

The only remaining reference to RFC 2231 anywhere in the new RFC is in §5.1.3.

HTML5 forms' character encoding scheme is referenced in §5.1.2.
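To make the two file name options in that passage concrete, here is a rough Python sketch that builds the same Content-Disposition header both ways: with the file name's UTF-8 octets sent directly, and with those octets percent-encoded so the header stays within US-ASCII. The helper names are hypothetical, the field name is assumed to be ASCII, and `urllib.parse.quote` is used as one plausible reading of "the percent-encoding method", not as something the RFC prescribes.

```python
from urllib.parse import quote


def content_disposition_raw(field, filename):
    # File name octets sent directly as UTF-8, as some commonly
    # deployed systems do.
    line = f'Content-Disposition: form-data; name="{field}"; filename="{filename}"'
    return line.encode("utf-8")


def content_disposition_percent(field, filename):
    # Percent-encode the UTF-8 octets of the file name so that the
    # header contains only US-ASCII (assumes an ASCII field name).
    line = f'Content-Disposition: form-data; name="{field}"; filename="{quote(filename)}"'
    return line.encode("ascii")


raw = content_disposition_raw("f", "r\u00e9sum\u00e9.txt")
pct = content_disposition_percent("f", "r\u00e9sum\u00e9.txt")
```

A receiver cannot tell from the header alone which form was used, which is part of why the parsing side is underspecified.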

cvrebert commented 8 years ago

field names in particular do not get converted to a 7-bit safe encoding as suggested in RFC 2388

I assume this is referring to RFC 2388, §5.4 Non-ASCII field names:

Note that MIME headers are generally required to consist only of 7-bit data in the US-ASCII character set. Hence field names should be encoded according to the method in RFC 2047 if they contain characters outside of that set.


The new RFC's appendix says:

The handling of non-ASCII field names has changed -- the method described in RFC 2047 is no longer recommended

And summarizing the new RFC's §5.1. Non-ASCII Field Names and Values:

While RFC 2388 suggested that non-ASCII field names be encoded according to the method in RFC 2047, this practice doesn't seem to have been followed widely. [...] For broadest interoperability with existing deployed software, those creating forms SHOULD avoid non-ASCII field names. [...] If non-ASCII field names are unavoidable, form or application creators SHOULD use UTF-8 uniformly.

5.1.2. Interpreting Forms and Creating multipart/form-data Data [Explicitly refers to and describes HTML5 forms' character encoding scheme]

5.1.3. Parsing and Interpreting Form Data [Parsing is kind of a mess] In particular, some multipart/form-data generators might have followed the previous advice of RFC 2388 and used the "encoded-word" method of encoding non-ASCII values, as described in RFC 2047 [Mentions other possibilities seen in the wild]


So it seems that the referenced note could be removed.
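For the parsing side summarized from §5.1.3, a receiver might tolerate both styles. The following is only a sketch under assumptions: `decode_field_name` is a hypothetical helper, raw octets are assumed to be UTF-8 (undecodable bytes are replaced rather than rejected), and RFC 2047 encoded-words are detected with a simple prefix and suffix check before being handed to the standard library's `email.header.decode_header`.

```python
from email.errors import HeaderParseError
from email.header import decode_header


def decode_field_name(raw: bytes) -> str:
    # Assumption: anything that is not an encoded-word is raw UTF-8.
    text = raw.decode("utf-8", errors="replace")
    # Legacy generators may have followed RFC 2388 and produced an
    # RFC 2047 encoded-word such as =?UTF-8?Q?caf=C3=A9?=
    if text.startswith("=?") and text.endswith("?="):
        try:
            parts = decode_header(text)
            return "".join(
                part.decode(charset or "ascii", errors="replace")
                if isinstance(part, bytes)
                else part
                for part, charset in parts
            )
        except (HeaderParseError, LookupError):
            # Malformed encoded-word or unknown charset: keep the text as sent.
            return text
    return text
```

A field name that legitimately begins with `=?` and ends with `?=` would be misread by this check, which is one example of the ambiguity the RFC acknowledges.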

cvrebert commented 8 years ago

@domenic So does that answer your questions?

domenic commented 8 years ago

I think it does; thank you, and sorry for the delay. Would you be willing to do a pull request updating the reference and removing the now-obsolete notes and such?