webrecorder / warcio

Streaming WARC/ARC library for fast web archive IO
https://pypi.python.org/pypi/warcio
Apache License 2.0

warcio does not preserve HTTP header whitespace #129

Open JustAnotherArchivist opened 3 years ago

JustAnotherArchivist commented 3 years ago
import io
import warcio

output = io.BytesIO()
writer = warcio.warcwriter.WARCWriter(output, gzip = False)
payload = io.BytesIO()
payload.write(b'HTTP/1.1 200 OK\r\nDate: Thu, 27 May 2021 22:03:54 GMT\r\nContent-Length: 0\r\nX-custom:  header with two spaces before the value and a tab after\t\r\n\r\n')
payload.seek(0)
record = writer.create_warc_record('http://example.org/', 'response', payload = payload)
writer.write_record(record)
print(output.getvalue())

Expected output for the custom header (where \t is a literal tab):

X-custom:  header with two spaces before the value and a tab after\t

Actual output (only one space between the colon and the value, and the tab after the header is lost):

X-custom: header with two spaces before the value and a tab after

ikreymer commented 3 years ago

This is sort of an edge case. The whitespace was at one point used to indicate multi-line headers (which are now deprecated, but which warcio still supports). I'm not sure that the whitespace is significant anymore from a parsing perspective. Similar to #128, perhaps there could be a 'raw' mode flag that preserves the whitespace here, if desired, when capturing HTTP traffic.

ikreymer commented 3 years ago

FWIW, I've never seen an HTTP server that returns a header like this, so (I hope) it's not very common :)

JustAnotherArchivist commented 3 years ago

The whitespace on the line with the field-name has never been significant semantically as far as I know. Neither the whitespace after the colon nor the one at the end of the line is part of the actual field value content. And even with continuation lines: the optional whitespace at the end of a line, CRLF, and leading space/tab on the continuation line are overall equivalent to a single space. But yeah, same as #128, this is about correctly preserving the data sent by the server, not the semantic meaning. I've suggested a possible solution there because they are indeed very similar and have essentially the same root cause.

Yeah, it is fortunately not very common, but I have seen it before, sadly enough. There are a lot of weird HTTP servers out there that operate at the edges of or beyond the specifications...
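
To illustrate the difference between semantic equivalence and byte preservation described above, here is a minimal sketch using only the standard library. It does not use warcio and is not how warcio parses headers; `semantic_value` is a hypothetical helper written for this example, and the unfolding rule is an assumption based on the description of continuation lines given above.

```python
import re
from typing import Tuple


def semantic_value(raw_header: bytes) -> Tuple[bytes, bytes]:
    """Hypothetical helper: return (field-name, field-value) without optional whitespace."""
    name, _, rest = raw_header.partition(b":")
    # A line fold (optional whitespace, CRLF, then leading space/tab on the
    # continuation line) is treated as equivalent to a single space.
    unfolded = re.sub(rb"[ \t]*\r\n[ \t]+", b" ", rest)
    # Whitespace after the colon and at the end of the line is not part of the value.
    return name, unfolded.strip(b" \t\r\n")


raw = b"X-custom:  header with two spaces before the value and a tab after\t\r\n"
name, value = semantic_value(raw)

# Rebuilding the line from the parsed pair keeps the meaning...
normalized = name + b": " + value + b"\r\n"

# ...but it is no longer the byte sequence that the server sent.
assert normalized != raw
```

Keeping `raw` alongside the parsed pair, rather than re-serialising from `name` and `value`, is one way a writer could retain the original bytes while still exposing the parsed header.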