webrecorder / warcio

Streaming WARC/ARC library for fast web archive IO
https://pypi.python.org/pypi/warcio
Apache License 2.0
375 stars 58 forks source link

Multiple Cookies are prolematic when parsing warc files #53

Open nmunro opened 5 years ago

nmunro commented 5 years ago

When reading a warc file that contains 'Set-Cookie' header and there are multiple cookies present on subsequent lines, the parsing logic breaks the line on the first colon, which appears to be fine for headers, but when the line is actually a continuation of cookies from the previous line, they're incorrectly being added to the http_headers property.

Having played with the warcio code here (https://github.com/webrecorder/warcio/blob/master/warcio/statusandheaders.py#L262) and adding the following:

while line:
    if line.startswith('Set-Cookie:'):
        print('Testing Set-Cookie line -> ', line)

I can see that the cookies on subsequent lines are not picked up, which is expected of course, but I always like to test my hypothesis before just assuming.

Here are the http headers that I have in a warc files:

HTTP/1.1 200 OK 
Cache-Control: private
Content-Length: 25858
Content-Type: text/html; charset=utf-8
Vary: Accept-Encoding
Server: Microsoft-IIS/10.0
Set-Cookie: ASP.NET_SessionId=xxx; path=/; secure; HttpOnly
COOKIE_A=xxx|False; domain=xxx; expires=Tue, 03-Oct-2028 23:42:12 GMT; path=/; secure; HttpOnly
COOKIE_B=xxx;Path=/;HttpOnly;Domain=xxx
X-Frame-Options: SAMEORIGIN
Date: Sat, 06 Oct 2018 23:42:11 GMT

This is the code I used to access the headers:

>>> for header in a.items[0].record.http_headers.headers:
...     print(header)
...
('Cache-Control', 'private')
('Content-Length', '25858')
('Content-Type', 'text/html; charset=utf-8')
('Vary', 'Accept-Encoding')
('Server', 'Microsoft-IIS/10.0')
('Set-Cookie', 'ASP.NET_SessionId=xxx; path=/; secure; HttpOnly')
('COOKIE_A=xxx|False; domain=xxx; expires=Tue, 03-Oct-2028 23', '42:12 GMT; path=/; secure; HttpOnly')
('X-Frame-Options', 'SAMEORIGIN')
('Date', 'Sat, 06 Oct 2018 23:42:11 GMT')

I'm looking into how pywb 0.33 did this before this was extracted to see if there's a difference in behavior.

wumpus commented 5 years ago

Didn't the header continuation lines originally have whitespace at the start? Can you mention the actual url so we can look at the actual headers?

nmunro commented 5 years ago

I'm in the UK so out of the office now, but I'll see what I can do to provide more information, thanks!

ikreymer commented 5 years ago

Do you have an example where a server actually sends a multi-line Set-Cookie header in this way?

I've never actually seen http header continuations used at all in practice, and in fact, it appears they have been deprecated: https://stackoverflow.com/questions/31237198/is-it-possible-to-include-multiple-crlfs-in-a-http-header-field/31324422#31324422

Yes, the continuation line is supposed to start with whitespace. The support for this has probably been dropped as its never been used, and is now generally deprecated.

Multiple cookies are pretty much always served like this with multiple Set-Cookie headers:

Set-Cookie: ASP.NET_SessionId=xxx; path=/; secure; HttpOnly
Set-Cookie: COOKIE_A=xxx|False; domain=xxx; expires=Tue, 03-Oct-2028 23:42:12 GMT; path=/; secure; HttpOnly
Set-Cookie: COOKIE_B=xxx;Path=/;HttpOnly;Domain=xxx

But, if there is a use case, it wouldn't be too hard to add header continuations for legacy reasons.

wumpus commented 5 years ago

Requests can have multiple cookies in one header https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cookie but not responses. The thing quoted is a response.

nmunro commented 5 years ago

We have identified the warc as NOT having spaces at the beginning of the line, manually patching the warc to have spaces at the beginning results in this:

('Cache-Control', 'private')
('Content-Length', '25858')
('Content-Type', 'text/html; charset=utf-8')
('Vary', 'Accept-Encoding')
('Server', 'Microsoft-IIS/10.0')
('Set-Cookie', 'ASP.NET_SessionId=xxx; path=/; secure; HttpOnly COOKIE_A=xxx|False; domain=xxx; expires=Tue, 03-Oct-2028 23:42:12 GMT; path=/; secure; HttpOnly COOKIE_B=xxx;Path=/;HttpOnly;Domain=xxx')
('X-Frame-Options', 'SAMEORIGIN')
('Date', 'Sat, 06 Oct 2018 23:42:11 GMT')

I'm afraid for customer confidentiality reasons there is information I've had to redact, apologies about that, but at this time I'm not allowed to release it, however we have tested the site and it returns cookies one by one each with the 'Set-Cookie:' prefix, so we're confident that the issue is not with the site. It's worth noting that this warc was not written by warcio , but another tool we're using.

It maybe that this tool writes using the old, deprecated form, however, I imagine we may very well have other warcs and maybe arcs that use the old form, so support for it would be very useful to us, and I imagine many others. We will also see about getting our patches to other tools in while we're at it.

I'm happy to develop a fix for this, if it's acceptable to you?

ikreymer commented 5 years ago

If the site correctly returns cookie one a time with the Set-Cookie header, eg:

Set-Cookie: ASP.NET_SessionId=xxx; path=/; secure; HttpOnly
Set-Cookie: COOKIE_A=xxx|False; domain=xxx; expires=Tue, 03-Oct-2028 23:42:12 GMT; path=/; secure; HttpOnly
Set-Cookie: COOKIE_B=xxx;Path=/;HttpOnly;Domain=xxx

and you have written them to a warc as:

Set-Cookie: ASP.NET_SessionId=xxx; path=/; secure; HttpOnly COOKIE_A=xxx|False; domain=xxx; expires=Tue, 03-Oct-2028 23:42:12 GMT; path=/; secure; HttpOnly COOKIE_B=xxx;Path=/;HttpOnly;Domain=xxx

then you likely have improperly written warcs that should be fixed ;) The above would not be a valid Set-Cookie header that a server has served, so it seems like unfortunately the warc writer that you have has written the Set-Cookie header incorrectly.

You can use the warcio library to parse out the set-cookie header, split the cookie on the newline, and write out a new warc with a set-cookie header per cookie. It sounds like this would be best approach and also accurately represent what the actual server supplied. The newline continuations would not be needed for this. While this may be more work, its the only way to guarantee that you have valid WARCs going forward.

This seems fairly specific to your use case, so I would say you should just create a script that does this outside of warcio for now. This could be added as a part of a warc 'fixup' process later if warcio were to support that (currently warcio only does fixing of warcs related to compression, not to http headers).

ikreymer commented 5 years ago

Just to add, it looks like warcio does still support continuation lines (https://github.com/webrecorder/warcio/blob/master/warcio/statusandheaders.py#L275), so the above works as expected after adding a space. But, if you're already patched the warc to add a space, just patch it with a Set-Cookie: prefix instead of a space and you should be good!

nmunro commented 5 years ago

That broke the parsing of the warc file, I'm guessing because of byte offsets? I didn't generate the warc file, so I'm tracing where it came from and how it was generated, we use a number of tools to generate warc files, so it'll have to be tomorrow before I know who created it and how.