cosmicexplorer commented 4 months ago

Hello, I love this project!!! I was reviewing the verification mechanisms for WACZ files at https://specs.webrecorder.net/wacz-auth/0.1.0/#proof-of-authenticity, and noticed a peculiar wording in their requirements definition (emphasis mine):

> Proving web archive authenticity can be difficult. Ideally, proof of authenticity could guarantee that any web server served a particular URL at a particular point in time. Unfortunately, this is not currently possible with existing web standards, as even TLS does not provide "non-repudiation".

The rest of the document goes on to describe the difficulty with verifying timestamps (and their great mechanism to address that), which I understand is necessary. However, if I created a WACZ archive of some pages, is it currently possible for the website operator to simply claim the WACZ was falsified when it was first created (e.g. the archive was created against a fake site with the same domain)? Even if the HTTPS cert doesn't verify timestamps, it seems extremely useful to be able to say "this website definitely absolutely served this content at some point" by incorporating the HTTPS certificate, and then relying on the rest of the authenticity spec to provide additional confirmation of its contents.

Do I understand this problem correctly, or does WACZ already incorporate the HTTPS certificate at time of archive creation in a way that verifies the content was actually downloaded from the remote server? It seems like the wacz-auth spec tries very hard to solve a more specific problem with timestamping and may have missed the opportunity to add this additional layer of verification, but I'm not sure.

cosmicexplorer commented 4 months ago

I'm very much not a TLS expert, and after thinking/reading up more I realize this is probably a much bigger project than I initially expected. To restate, the goal is to establish cryptographic proof that a web page's content was unmodified in transit from the declared domain.

Motivation

I believe (please correct me!) that currently a compliant WACZ archive with arbitrary content can be generated for any arbitrary domain (e.g. by editing /etc/hosts to point to your own IP). So if I created an archive right now of example.com and self-signed it according to wacz-auth, the owner of example.com would still be able to claim I simply fabricated the evidence when I created the archive, because WARC/WACZ doesn't appear to retain TLS data.

I believe (not sure!) that TLS/HTTPS is intended to provide the exact kind of source authentication guarantees I want from WARC/WACZ, even if the current standards do not cryptographically verify timestamps like the wacz-auth spec authors would prefer.

Implementation

I still need to figure out which data we would need to add to WARC, and which outputs we need to grab from TLS. It's possible that the cryptographic guarantees I want cannot be provided from TLS, but I think they can.

Wikipedia on TLS describes:

> TLS supports many different methods for exchanging keys, encrypting data, and authenticating message integrity. As a result, secure configuration of TLS involves many configurable parameters, and not all choices provide all of the privacy-related properties described in the list above (see the tables below § Key exchange, § Cipher security, and § Data integrity).

However, I am very much under the impression that this kind of complexity should be handled by libcurl or similar. I strongly suspect "generate cryptographic proof of webpage authenticity (which can later be verified offline against the public HTTPS cert)" is not a new idea, so I'm really hoping we can "just" adapt some existing code and add a few new fields to WARC or WACZ.

- [ ] Characterize the cryptographic trail left by TLS/HTTPS (asymmetric key used to derive ephemeral symmetric key (?)).
  - [ ] Confirm this cryptographically proves content was unmodified from the source domain (the owner of the TLS cert private key), even if it does not verify timestamps (wacz-auth handles timestamps great and we don't need to change it at all!).
- [ ] Prototype the above by modifying an existing WARC crawler (which one?) to record the necessary TLS outputs/intermediates.

Please comment if you believe that I've misrepresented the authentication guarantees of TLS or otherwise missed a reason this won't actually work the way I want it to!

rneilson commented 4 months ago

The problem with that is, short version, once both parties complete the handshake and have a shared symmetric key, all data transmitted can be forged by either party -- nothing in an HTTPS transport is signed except for parts of the handshake.

What you would need is for the server to actually produce a signature over the content using an asymmetric keypair (presumably the same one as in the TLS certificate), which is not the same as the shared symmetric key produced by the asymmetric key exchange. That is essentially what the signed HTTP exchanges proposal, linked from the WACZ signing doc, provides.
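
To make the forgeability point concrete, here is a minimal toy sketch in Python. It is not how TLS actually frames or protects records (modern TLS uses AEAD ciphers and separate per-direction keys, all of which both endpoints can derive); `shared_key`, `tag`, and `verify` are hypothetical names standing in for "any integrity check keyed by a secret both sides hold". It only illustrates that such a check cannot tell a third party which side produced the data.

```python
import hashlib
import hmac
import os

# Hypothetical stand-in for the symmetric session secret that BOTH the
# client and the server hold once the handshake has completed.
shared_key = os.urandom(32)


def tag(key: bytes, message: bytes) -> bytes:
    """Compute an integrity tag over message with the shared key."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, message: bytes, mac: bytes) -> bool:
    """Check a tag; succeeds for anyone who knows the shared key."""
    return hmac.compare_digest(tag(key, message), mac)


# What the server really sent, tagged with the shared key.
server_body = b"<html>what the server actually sent</html>"
server_tag = tag(shared_key, server_body)

# The client (the archivist) holds the same key, so it can produce an
# equally valid tag over content the server never sent.
forged_body = b"<html>what the client made up afterwards</html>"
forged_tag = tag(shared_key, forged_body)

genuine_ok = verify(shared_key, server_body, server_tag)
forged_ok = verify(shared_key, forged_body, forged_tag)
print(genuine_ok, forged_ok)
```

Both checks pass, so a recorded session alone cannot show a third party that the server, rather than the archivist, produced the body. A signature made with the server's private key would not have this problem, because the archivist never holds that key.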

cosmicexplorer commented 4 months ago

> The problem with that is, short version, once both parties complete the handshake and have a shared symmetric key, all data transmitted can be forged by either party -- nothing in an HTTPS transport is signed except for parts of the handshake.

Ah, ok—this makes perfect sense! I was hoping that the symmetric key itself would be a form of signature somehow, but I absolutely see now how symmetric session encryption is intrinsically forgeable because the same key is available to both participants.

> What you would need is for the server to actually produce a signature over the content using an asymmetric keypair (presumably the same one as in the TLS certificate), which is not the same as the shared symmetric key produced by the asymmetric key exchange. That is essentially what the signed HTTP exchanges proposal, linked from the WACZ signing doc, provides.

Thank you so much! I believe I mistakenly conflated the (very cool and necessary) timestamp verification mechanisms developed for wacz-auth with their very short dismissal of TLS/HTTPS, and wasn't sure whether NIH was at play (because I'm not an expert in networking yet). But it's clear to me now why signed HTTP is actually a very direct answer to this.

I will maybe look to see if I can improve the wording here to avoid mistakes like mine, but I'm not sure if that's necessary now that I have more context.

ikreymer commented 4 months ago

@cosmicexplorer it's a good question (perhaps we should cover it somewhere) and @rneilson thanks for the quick response!

Yes, unfortunately, TLS lacks 'non-repudiation', so it is not possible to use TLS to prove that a particular decoded response was in fact served by the server: the session data is protected by a symmetric key that both the client and the server hold.

The best we can do is prove that a particular party (the observer/witness) created the archive. We use TLS certs to extend the existing PKI and certificate transparency logs, so that we can say a particular archive was created by whoever owns a particular TLS cert. E.g., we can create an archive and sign it with a cert for a domain such as signing.webrecorder.net, and it can then be proven that the archive was created by whoever owns that domain. Others can do the same and distribute WACZ files. Trust can then be built by extending trust in domains.

Unfortunately, this still requires domain ownership, and we don't yet have a clear solution for ascertaining identity without a domain.

webrecorder / specs

question: incorporation of HTTPS cert data for additional authenticity check #147
