webrecorder / warcio

Streaming WARC/ARC library for fast web archive IO
https://pypi.python.org/pypi/warcio
Apache License 2.0
369 stars 58 forks source link

headers as bytes #16

Closed wumpus closed 7 years ago

wumpus commented 7 years ago

I'm working through trying to emit headers that are as unprocessed as possible. aiohttp's response object has a raw_headers variable that's a list of 2ples (good) and the values are bytes, not str. That ought to be good, right? But the warcio code appears to assume that the headers are str (python3).

While I could certainly decode the headers before calling warcio, and the charset iso-8859-1 is safe for round-tripping like that, I'm just wondering if taking str is a bad interface. You aren't doing https://tools.ietf.org/html/rfc5987 processing, for example, if non-ISO-8859-1 codepoints are in the str.

ikreymer commented 7 years ago

Yeah, this is consistent with most other Python libraries it seems, except aiohttp it seems. Probably because httplib, and therefore urllib3 and requests all return HTTP headers as native strings. PEP 3333 also specifies that headers are to be sent as native strings.

I don't know the reasoning for this and perhaps this wasn't the right call.. (for example, requests just recently fixed an issue related to header encoding: https://github.com/kennethreitz/requests/issues/3888) but seems to be the more consistent thing to do in relation to general Python 3 treatment of http headers? @nlevitt do you have any thoughts on this?

wumpus commented 7 years ago

aiohttp's normal response headers object is str, but it has done processing like removing the set cookies into a separate object. So I guess a normal user of aiohttp would always see str. Just another nit that an archivist would run into.

(On the request side, aiohttp computes the full outgoing headers, makes the request, and then discards the request object. Gah.)

nlevitt commented 7 years ago

I like bytes.

ikreymer commented 7 years ago

I'm inclined to keep it as is for now. Switching warcio.StatusAndHeaders to use bytes over str would be a significant, breaking change and the implications of the change aren't clear yet.

The library was written with the read use case in mind first, and writing was added later. As Python 3 generally treats headers as str, it is not clear that this change would make the warcio as a library easier to use. Currently, the statusline and headers can be sent back to a WSGI start_response directly:

start_response(record.http_headers.statusline, record.http_headers.headers)

with the change, the user would then have to deal with the conversion from bytes->str

Don't really have time to work on this now, but can keep this open to come back at a later time.

ikreymer commented 7 years ago

Perhaps a simple conversion function to convert from a byte-tuple list could be added to help this use case. Or, can automatically check the headers in the constructor and auto-convert. Do you have time/want to add a simple PR for that @wumpus ?

wumpus commented 7 years ago

Yeah, I can do that, it's code I was just writing so I could use the existing API :-)

ikreymer commented 7 years ago

Thanks for the PR! Released in warcio 1.4.0!