webrecorder / browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0

Add flag to proceed only with "secure" SSL connection #510

Open · benoit74 opened 8 months ago

benoit74 commented 8 months ago

Currently, the crawler proceeds with all HTTPS websites, no matter how secure the HTTPS connection is; e.g. certificates might be invalid.

We (Kiwix) would like to be able to ensure (when requested by the user) that the crawler proceeds only with valid HTTPS connections, i.e. we probably need a CLI flag to ensure that the browser does not accept insecure HTTPS connections.

Is this feasible? Is it easy to implement, and could we help by at least drafting a PR?

ikreymer commented 8 months ago

Sure, yes, this is actually fairly easy to add in the 1.x version, since we're not relying on a MITM proxy. It's just a matter of switching this flag in fact: https://github.com/webrecorder/browsertrix-crawler/blob/main/src/util/browser.ts#L101 Maybe it should even default to false.
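
For illustration, a minimal sketch of how such a flag might be wired through to that Puppeteer launch option (the flag name, the plumbing, and the paths here are hypothetical, not the crawler's actual code):

```ts
// Hypothetical sketch only: wiring a CLI flag to Puppeteer's
// ignoreHTTPSErrors launch option; not the actual browsertrix-crawler code.
import puppeteer from "puppeteer-core";

interface CrawlOpts {
  sslOnly: boolean; // hypothetical flag name, e.g. --sslOnly on the CLI
}

async function launchBrowser(opts: CrawlOpts) {
  return puppeteer.launch({
    executablePath: "/usr/bin/chromium-browser", // path assumed for illustration
    // When sslOnly is set, stop ignoring certificate errors so that pages
    // with invalid certs fail to load instead of being archived.
    ignoreHTTPSErrors: !opts.sslOnly,
  });
}
```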

However, I wanted to point out that, even without that, we do save the Certificate Transparency info from https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-SecurityDetails in the WARC records.

For a valid HTTPS site, it should have something like this:

```
WARC-JSON-Metadata: {"ipType":"Public","cert":{"issuer":"DigiCert Global G2 TLS RSA SHA256 2020 CA1","ctc":"1"}}
```

with the ctc flag indicating whether the browser considers the request compliant according to the Certificate Transparency logs, which I think means it's a trusted cert: https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-CertificateTransparencyCompliance

For a non-compliant, non-valid cert, this data would look like (from https://expired.badssl.com/):

```
{"ipType":"Public","cert":{"issuer":"COMODO RSA Domain Validation Secure Server CA","ctc":"0"}}
```

or (from https://self-signed.badssl.com/):

```
{"ipType":"Public","cert":{"issuer":"*.badssl.com","ctc":"0"}}
```
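
As a sketch of where those values come from, the same SecurityDetails can be observed over a raw CDP session; the event and field names below are from the DevTools protocol pages linked above, and the mapping of "compliant" to ctc:"1" is an assumption based on this comment:

```ts
// Sketch: observe the CDP SecurityDetails that the crawler condenses into
// WARC-JSON-Metadata. Event and field names are from the DevTools protocol.
import puppeteer from "puppeteer-core";

async function main() {
  const browser = await puppeteer.launch({
    executablePath: "/usr/bin/chromium-browser", // assumed path
    ignoreHTTPSErrors: true, // let bad-cert pages load so we can inspect them
  });
  const page = await browser.newPage();
  const cdp = await page.createCDPSession();
  await cdp.send("Network.enable");

  cdp.on("Network.responseReceived", ({ response }) => {
    const sec = response.securityDetails;
    if (sec) {
      // "compliant" appears to map to ctc:"1", everything else to ctc:"0"
      console.log(sec.issuer, sec.certificateTransparencyCompliance);
    }
  });

  await page.goto("https://self-signed.badssl.com/");
  await browser.close();
}

main();
```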

benoit74 commented 8 months ago

Great, I will have a look and probably propose a PR then!

What the default value should be (true/false, secure/insecure) has been a long discussion on our side, and I'm not sure we have any kind of alignment even now. Details are in the linked issue; if you have some popcorn, it might be fun (or not) ^^

Thank you for the additional details, great to know. It might be interesting at some point; I don't know how our arguments will settle in the future. For now I think we will prefer to fail the scraper immediately if secure mode is requested and HTTPS errors arise. But I can imagine that some day we might want to run in secure mode yet ignore HTTPS errors on sub-resources ... we'll see.
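
For what it's worth, a rough sketch of that split, assuming strict certificate checking (ignoreHTTPSErrors: false) so errors actually surface; everything here is hypothetical illustration, not crawler code:

```ts
import type { Page } from "puppeteer-core";

// Rough sketch: fail hard on a bad cert for the main document, but treat
// sub-resource cert failures separately. Assumes the browser was launched
// with ignoreHTTPSErrors: false; error strings are Chromium net errors.
async function loadSecurely(page: Page, url: string): Promise<void> {
  page.on("requestfailed", (req) => {
    const err = req.failure()?.errorText ?? "";
    if (err.includes("ERR_CERT")) {
      // A sub-resource failed certificate validation; a future relaxed
      // secure mode might log and continue here instead of aborting.
      console.warn(`cert error on sub-resource ${req.url()}: ${err}`);
    }
  });

  try {
    await page.goto(url, { waitUntil: "load" });
  } catch (e) {
    // Main document failed, e.g. net::ERR_CERT_DATE_INVALID:
    // in secure mode, abort the whole crawl immediately.
    throw new Error(`secure mode: refusing to archive ${url} (${e})`);
  }
}
```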