BrowserRequest fails to decode URL-encoded form values with charset="UTF-8"

Now that all of Launchpad's appservers have finally been upgraded to Python 3, https://bugs.launchpad.net/launchpad/+bug/1937345 reports a problem with submitting URL-encoded form data containing non-ISO-8859-1 values. For Content-Type: application/x-www-form-urlencoded; charset=UTF-8 and QUERY_STRING=:ws.op=newMessage&content=%E2%80%9Ccomment%E2%80%9D, the traceback looks like this:

Traceback (most recent call last):
  File "/home/cjwatson/src/canonical/launchpad/git/review/env/lib/python3.5/site-packages/zope/publisher/publish.py", line 139, in publish
    request.processInputs()
  File "/home/cjwatson/src/canonical/launchpad/git/review/env/lib/python3.5/site-packages/zope/publisher/browser.py", line 379, in processInputs
    self.__processItem(key, item)
  File "/home/cjwatson/src/canonical/launchpad/git/review/env/lib/python3.5/site-packages/zope/publisher/browser.py", line 452, in __processItem
    item = self._decode(item)
  File "/home/cjwatson/src/canonical/launchpad/git/review/lib/lp/services/webapp/servers.py", line 697, in _decode
    text = super(LaunchpadBrowserRequest, self)._decode(text)
  File "/home/cjwatson/src/canonical/launchpad/git/review/env/lib/python3.5/site-packages/zope/publisher/browser.py", line 284, in _decode
    text = text.encode('latin-1')
UnicodeEncodeError: 'latin-1' codec can't encode character '\u201c' in position 0: ordinal not in range(256)

(There's a bit of Launchpad in this traceback, but it's only some fallback code and is in practice irrelevant here.)
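
The failure at the bottom of the traceback can be illustrated outside zope.publisher with a minimal sketch. It assumes the form value has already been decoded as UTF-8 by the time it reaches the ISO-8859-1 re-encoding step; urllib.parse.parse_qsl stands in for whatever parser is actually in use, so this shows the mechanism rather than zope.publisher's exact code path.

```python
from urllib.parse import parse_qsl

query_string = ":ws.op=newMessage&content=%E2%80%9Ccomment%E2%80%9D"

# Case 1: the value was decoded as UTF-8, so it contains U+201C and U+201D.
utf8_fields = dict(parse_qsl(query_string, encoding="utf-8"))
content = utf8_fields["content"]

# Re-encoding as ISO-8859-1 only works for code points below 256,
# so the curly quotes cannot be encoded.
error = None
try:
    content.encode("latin-1")
except UnicodeEncodeError as exc:
    error = str(exc)

# Case 2: the value was decoded as ISO-8859-1, which maps each byte to
# one code point. Only then does the round trip recover the real text.
latin1_fields = dict(parse_qsl(query_string, encoding="latin-1"))
round_tripped = latin1_fields["content"].encode("latin-1").decode("utf-8")
```

The round trip in the second case only works if every layer agrees that the intermediate string is ISO-8859-1 "bytes in disguise"; as soon as one layer decodes properly, the re-encode step breaks.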

zope.publisher plays a complicated game of core wars with the underlying libraries it uses to decode form input, presumably for ancient historical reasons. I didn't quite dare to make significant changes to that part of the code when converting zope.publisher to use multipart, but I think revamping it is going to be necessary in order to fix this bug. There are two fundamental problems:

Guessing the request encoding based on Accept-Charset seems fundamentally misguided, since that defines the browser's preferred response encoding. (However, I can accept that it may be necessary on old browsers or something.)
The approach of telling the underlying libraries to decode form values as ISO-8859-1, then re-encoding as ISO-8859-1 and decoding using some other encoding, is horribly confusing. It also can't possibly work correctly for multipart form data, because different parts are allowed to have different encodings! I think a better approach, and certainly one that would be significantly easier to understand, would be for the bulk of processInputs to leave as bytes any values whose encoding may require guesswork, and then _decode could leave anything that's already Unicode alone.
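
A rough sketch of that idea follows. The function names decode_known and decode_fallback and the candidate_charsets parameter are hypothetical, chosen only for illustration; they are not zope.publisher APIs, and the eventual fix may be structured quite differently.

```python
def decode_known(raw, charset=None):
    """Decode raw bytes only when the charset is actually known.

    Hypothetical stand-in for the input-processing step: values with
    no declared charset stay as bytes for later guesswork.
    """
    if charset is None:
        return raw
    return raw.decode(charset)


def decode_fallback(value, candidate_charsets=("utf-8",)):
    """Hypothetical stand-in for a _decode-style fallback.

    Anything that is already text is returned untouched; guessing is
    applied only to values that are still bytes.
    """
    if isinstance(value, str):
        return value
    for charset in candidate_charsets:
        try:
            return value.decode(charset)
        except UnicodeDecodeError:
            continue
    # ISO-8859-1 maps every byte to a code point, so this never fails.
    return value.decode("latin-1")


raw = "\u201ccomment\u201d".encode("utf-8")

# Charset declared: decoded once, then left alone by the fallback.
declared = decode_fallback(decode_known(raw, "utf-8"))

# Charset unknown: kept as bytes, then guessed by the fallback.
guessed = decode_fallback(decode_known(raw))

# Bytes that are not valid UTF-8 fall through to ISO-8859-1.
legacy = decode_fallback(decode_known(b"\xe9"))
```

With this shape, no value is ever decoded twice, and each part of a multipart body could be passed its own charset.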

I've reproduced this in zope.publisher's test suite, and should be able to come up with a PR soon.
