webrecorder / archiveweb.page

A High-Fidelity Web Archiving Extension for Chrome and Chromium based browsers!
https://chrome.google.com/webstore/detail/webrecorder/fpeoodllldobpkbkabpblcfaogecpndd
GNU Affero General Public License v3.0
850 stars 60 forks source link

Incorrect handling of multiple headers #19

Closed phiresky closed 4 months ago

phiresky commented 3 years ago

While implementing firefox support, I noticed the following:

The CDP Network.Headers interface returns headers as a JSON object of keys and values, while the webRequest API returns a list of keys and values.

HTTP Headers can appear multiple times though, so how does CDP handle multiple of the same header?

I tested this with a tiny test server, and when the response contains two values for the same header:

HTTP/1.1 200 OK
X-Powered-By: Express
Set-Cookie: foo=bar; Path=/; Expires=Sat, 06 Mar 2021 14:28:57 GMT
Set-Cookie: baz=hio; Path=/; Expires=Sat, 06 Mar 2021 14:28:57 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 12
ETag: W/"c-Lve95gjOVATpfV8EL5X4nxwjKHE"
Date: Sat, 06 Mar 2021 14:28:57 GMT
Connection: keep-alive
Keep-Alive: timeout=5

The following is stored in the database:

respHeaders:
Connection: "keep-alive"
Content-Length: "12"
Content-Type: "text/html; charset=utf-8"
Date: "Sat, 06 Mar 2021 14:28:57 GMT"
ETag: "W/"c-Lve95gjOVATpfV8EL5X4nxwjKHE""
Keep-Alive: "timeout=5"
Set-Cookie: "foo=bar; Path=/; Expires=Sat, 06 Mar 2021 14:28:57 GMT, baz=hio; Path=/; Expires=Sat, 06 Mar 2021 14:28:57 GMT"
X-Powered-By: "Express"

The headers are concatted simply with comma. This seems to be "correct" for most headers, but not all of them. For set-cookie for example, the cookies are mangled and not parseable. This means that the resulting web archive will set mangled cookies.

I found some more info here: https://stackoverflow.com/a/38406581/2639190

ikreymer commented 3 years ago

Thanks for pointing this out! Probably should add a way to split these up when storing the headers. Cookies are a bit tricky, and actually are not supported being set from a service worker, so the Set-Cookie headers are not replayed. Instead, what happens is that the actual Cookie from the request is saved in another header, and then wabac.js injects that cookie into the page via document.cookie.

This seems sufficient for replay, and actually to be more accurate it should probably save the cookies from document.cookie itself, as only those cookies that are accessible to the document are needed for replay...

phiresky commented 3 years ago

actual Cookie from the request is saved in another header

Huh, did not see that. That makes sense. So for replay it doesn't really matter (assuming the Cookie header doesn't have the same problem with concatenation), though it does decrease the fidelity of the resulting web archive somewhat.

ikreymer commented 4 months ago

Haven't seen any issues related to this in a while, likely fixed by the referenced commit.