Closed wumpus closed 7 years ago
Yeah, this is consistent with most other Python libraries it seems, except aiohttp
it seems. Probably because httplib
, and therefore urllib3
and requests
all return HTTP headers as native strings. PEP 3333 also specifies that headers are to be sent as native strings.
I don't know the reasoning for this and perhaps this wasn't the right call.. (for example, requests just recently fixed an issue related to header encoding: https://github.com/kennethreitz/requests/issues/3888) but seems to be the more consistent thing to do in relation to general Python 3 treatment of http headers? @nlevitt do you have any thoughts on this?
aiohttp's normal response headers object is str, but it has done processing like removing the set cookies into a separate object. So I guess a normal user of aiohttp would always see str. Just another nit that an archivist would run into.
(On the request side, aiohttp computes the full outgoing headers, makes the request, and then discards the request object. Gah.)
I like bytes
.
I'm inclined to keep it as is for now. Switching warcio.StatusAndHeaders
to use bytes
over str
would be a significant, breaking change and the implications of the change aren't clear yet.
The library was written with the read use case in mind first, and writing was added later. As Python 3 generally treats headers as str
, it is not clear that this change would make the warcio
as a library easier to use. Currently, the statusline and headers can be sent back to a WSGI start_response
directly:
start_response(record.http_headers.statusline, record.http_headers.headers)
with the change, the user would then have to deal with the conversion from bytes->str
Don't really have time to work on this now, but can keep this open to come back at a later time.
Perhaps a simple conversion function to convert from a byte-tuple list could be added to help this use case. Or, can automatically check the headers
in the constructor and auto-convert. Do you have time/want to add a simple PR for that @wumpus ?
Yeah, I can do that, it's code I was just writing so I could use the existing API :-)
Thanks for the PR! Released in warcio 1.4.0!
I'm working through trying to emit headers that are as unprocessed as possible. aiohttp's response object has a raw_headers variable that's a list of 2ples (good) and the values are bytes, not str. That ought to be good, right? But the warcio code appears to assume that the headers are str (python3).
While I could certainly decode the headers before calling warcio, and the charset iso-8859-1 is safe for round-tripping like that, I'm just wondering if taking str is a bad interface. You aren't doing https://tools.ietf.org/html/rfc5987 processing, for example, if non-ISO-8859-1 codepoints are in the str.