UnicodeEncodeError: 'latin-1' codec can't encode character '\u0173' in position 9: ordinal not in range(256) on Python 3.7, while parsing POST requests containing non-ASCII field values

zopefoundation / zope.publisher

Map requests from HTTP/WebDAV clients, web browsers, XML-RPC and FTP clients onto Python objects


UnicodeEncodeError: 'latin-1' codec can't encode character '\u0173' in position 9: ordinal not in range(256) on Python 3.7, while parsing POST requests containing non-ASCII field values #41

Closed mgedmin closed 5 years ago

mgedmin commented 5 years ago

BrowserRequest.processInputs() is working on the assumption that cgi.FieldStorage will give it UTF-8 text encoded to Latin-1, as per PEP-3333. This is not the case: on POST requests on Python 3.7 I'm seeing the FieldStorage hold native str objects containing already-decoded Unicode text, so when BrowserRequest._decode() tries to text = text.encode('latin-1'), things fail.
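To make the failure concrete, here is a minimal sketch of the two cases described above. It uses only the character from the traceback (`'\u0173'`) and built-in `str`/`bytes` methods. The variable names are illustrative and do not come from zope.publisher.

```python
# The character from the traceback: LATIN SMALL LETTER U WITH OGONEK.
original = "\u0173"

# Case 1: what the conversion logic expects (the PEP 3333 convention).
# The UTF-8 bytes of the value are carried inside a str, one byte per
# code point, i.e. as if they had been decoded with Latin-1.
wsgi_style = original.encode("utf-8").decode("latin-1")
assert wsgi_style == "\xc5\xb3"

# Undoing the convention recovers the real text.
recovered = wsgi_style.encode("latin-1").decode("utf-8")
assert recovered == original

# Case 2: what was observed on POST. The value is already real Unicode
# text, so the same encode("latin-1") step cannot represent it.
try:
    original.encode("latin-1")
except UnicodeEncodeError as exc:
    error = str(exc)
```

In the second case `error` holds a message of the same kind as the one in the issue title, because U+0173 lies outside the 0-255 range that Latin-1 can encode.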

mgedmin commented 5 years ago

I'm wondering if this is the same issue as #40, only in that case I happened to have Unicode characters that were encodable to Latin-1?

d-maurer commented 5 years ago

Marius Gedminas wrote at 2019-7-2 07:37 -0700:

> BrowserRequest.processInputs() is working on the assumption that cgi.FieldStorage will give it UTF-8 text encoded to Latin-1, as per PEP-3333. This is not the case: on POST requests on Python 3.7 I'm seeing the FieldStorage hold native str objects containing already-decoded Unicode text, so when BrowserRequest._decode() tries to text = text.encode('latin-1'), things fail.

cgi.FieldStorage returns "latin-1" decoded bytes if it is correctly called. The parameter is named encoding.

mgedmin commented 5 years ago

> cgi.FieldStorage returns "latin-1" decoded bytes if it is correctly called. The parameter is named encoding.

I do not understand what you mean by that? Are you saying that zope.publisher is calling cgi.FieldStorage incorrectly?

mgedmin commented 5 years ago

My current theory is that the encode('latin-1') is right for parsing GET requests (with QUERY_STRING coming from the WSGI environment directly), but wrong for parsing POST requests (where the data comes from the wsgi.input BytesIO object).

d-maurer commented 5 years ago

Marius Gedminas wrote at 2019-7-3 03:19 -0700:

> > cgi.FieldStorage returns "latin-1" decoded bytes if it is correctly called. The parameter is named encoding.

> I do not understand what you mean by that? Are you saying that zope.publisher is calling cgi.FieldStorage incorrectly?

Yes -- if you want "latin-1" decoded bytes as values.
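A rough sketch of what the `encoding` parameter changes. The `cgi` module is deprecated and absent from recent Python versions, so this sketch calls `urllib.parse.parse_qsl` instead. It accepts an `encoding` argument with the same meaning and, as far as I can tell, is what `cgi.FieldStorage` uses for URL-encoded data. The form body and the field name `name` are made up for illustration.

```python
from urllib.parse import parse_qsl

# A URL-encoded form body carrying '\u0173' as percent-encoded UTF-8.
body = "name=%C5%B3"

# With encoding="latin-1" the parser returns "latin-1 decoded bytes":
# each byte of the UTF-8 sequence becomes one code point.
as_latin1 = dict(parse_qsl(body, encoding="latin-1"))

# With the default encoding (UTF-8) the value is already real text.
as_utf8 = dict(parse_qsl(body))

# Only the Latin-1 flavoured value survives a later
# .encode("latin-1").decode("utf-8") conversion step.
decoded = as_latin1["name"].encode("latin-1").decode("utf-8")
```

Here `as_utf8["name"]` is the kind of value that makes a later `encode('latin-1')` fail, while `as_latin1["name"]` is the kind that such a conversion step expects.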

d-maurer commented 5 years ago

Marius Gedminas wrote at 2019-7-3 04:11 -0700:

> My current theory is that the encode('latin-1') is right for parsing GET requests (with QUERY_STRING coming from the WSGI environment directly), but wrong for parsing POST requests (where the data comes from the wsgi.input BytesIO object).

"cgi.FieldStorage" handles the differences between "GET" and "POST" correctly.

mgedmin commented 5 years ago

I've a patch in progress that fixes my application by dropping zope.publisher's conversion logic and using the Unicode values produced by cgi.FieldStorage directly, when on Python 3.

It breaks zope.publisher's test suite quite badly. I'll have to investigate why the existing tests do not match real-world usage.

mgedmin commented 5 years ago

> It breaks zope.publisher's test suite quite badly.

Actually that was just one failing test, repeated for almost every tox environment, producing scary amounts of terminal scrollback.