The idea is to migrate away from the current pywb-based MITM proxy and instead capture network traffic directly from the browser via the Chrome DevTools Protocol (CDP). CDP now allows for fairly sophisticated request/response interception.
This approach includes a number of benefits that come with allowing the browser to handle the networking/data transfer:
Ability to capture HTTP/2 and HTTP/3 network traffic, as the browser handles all connections with the actual site, rather than an HTTP/1.1 MITM proxy retrieving the data.
Ability to rely on the browser's certificate checking and to store whether or not the browser trusts a particular TLS certificate (since there is no MITM proxy).
Possibly improved performance - can write WARCs and CDX indexes directly in one process and generate WACZ without reindexing.
More flexibility in choosing what to archive and in adding custom WARC headers.
Better data locality: all HTTP traffic for a page can be stored together in one WARC file.
Able to load pages that the pywb proxy may have issues with but a browser can handle (old TLS configs, etc.).
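To make the "WARC and index in one process" and "custom WARC headers" points more concrete, here is a minimal sketch of serializing captured responses into WARC records while collecting offsets and lengths in the same pass. All names (`CapturedResponse`, `buildWarcRecord`, `PageWarcWriter`) are hypothetical and not taken from browsertrix-crawler. Per-record gzip compression and the actual CDX line format are left out, and the record ID and date are assumed to be supplied by the caller.

```typescript
// Hypothetical types and names, for illustration only.
interface CapturedResponse {
  url: string;
  status: number;
  statusText: string;
  headers: Array<{ name: string; value: string }>;
  body: Uint8Array; // payload as handed over by the browser
}

interface IndexEntry {
  url: string;
  offset: number; // byte offset of the record within the WARC
  length: number; // byte length of the record
}

const CRLF = "\r\n";
const encoder = new TextEncoder();

// Concatenate byte chunks into one array.
function concat(parts: Uint8Array[]): Uint8Array {
  const total = parts.reduce((n, p) => n + p.length, 0);
  const out = new Uint8Array(total);
  let pos = 0;
  for (const p of parts) {
    out.set(p, pos);
    pos += p.length;
  }
  return out;
}

// Serialize one uncompressed WARC "response" record.
function buildWarcRecord(
  resp: CapturedResponse,
  recordId: string, // a UUID generated by the caller
  date: string, // ISO 8601 timestamp
  extraWarcHeaders: Record<string, string> = {}
): Uint8Array {
  // The HTTP block is written in HTTP/1.1 style regardless of the wire protocol.
  const httpHead =
    `HTTP/1.1 ${resp.status} ${resp.statusText}${CRLF}` +
    resp.headers.map((h) => `${h.name}: ${h.value}${CRLF}`).join("") +
    CRLF;
  const block = concat([encoder.encode(httpHead), resp.body]);

  const warcHeaders: Record<string, string> = {
    "WARC-Type": "response",
    "WARC-Record-ID": `<urn:uuid:${recordId}>`,
    "WARC-Date": date,
    "WARC-Target-URI": resp.url,
    "Content-Type": "application/http; msgtype=response",
    ...extraWarcHeaders, // custom headers go here
    "Content-Length": String(block.length), // set last so it cannot be overridden
  };
  const warcHead =
    `WARC/1.1${CRLF}` +
    Object.keys(warcHeaders)
      .map((k) => `${k}: ${warcHeaders[k]}${CRLF}`)
      .join("") +
    CRLF;

  // A record ends with two CRLFs after the block.
  return concat([encoder.encode(warcHead), block, encoder.encode(CRLF + CRLF)]);
}

// Appends records for one page and builds index entries in the same pass,
// so no separate reindexing step is needed afterwards.
class PageWarcWriter {
  private chunks: Uint8Array[] = [];
  private offset = 0;
  readonly index: IndexEntry[] = [];

  add(
    resp: CapturedResponse,
    recordId: string,
    date: string,
    extra: Record<string, string> = {}
  ): void {
    const record = buildWarcRecord(resp, recordId, date, extra);
    this.index.push({ url: resp.url, offset: this.offset, length: record.length });
    this.chunks.push(record);
    this.offset += record.length;
  }

  bytes(): Uint8Array {
    return concat(this.chunks);
  }
}
```

Keeping one writer per page is one way to get the data locality mentioned above, since every record for that page ends up in the same byte stream.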
Other trade-offs:
No transfer-encoding or content-encoding data, as the browser returns responses with these encodings already removed.
The browser could change data in other ways, e.g. header casing.
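One possible way to handle these trade-offs when storing a response is sketched below. It assumes the header list reported by the browser may still mention the original encodings even though the body has already been decoded, so the stored headers are adjusted to describe the bytes actually written. The function name and the choice to drop (rather than rename) the headers are assumptions for illustration, not the approach taken in the crawler.

```typescript
interface Header {
  name: string;
  value: string;
}

// Headers that describe the on-the-wire encoding, which the browser has already undone.
const WIRE_ONLY = ["content-encoding", "transfer-encoding"];

// Return headers that match a decoded body of `decodedLength` bytes.
function headersForDecodedBody(headers: Header[], decodedLength: number): Header[] {
  const out: Header[] = [];
  for (const h of headers) {
    // Compare case-insensitively, since header casing may have been changed.
    const lower = h.name.toLowerCase();
    if (WIRE_ONLY.indexOf(lower) !== -1) continue; // no longer true of the stored body
    if (lower === "content-length") continue; // re-added below with the decoded size
    out.push(h);
  }
  out.push({ name: "Content-Length", value: String(decodedLength) });
  return out;
}
```

The original wire encoding cannot be recovered this way, which is exactly the loss described above.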
Substantial work on this has been done in the recorder-work branch: https://github.com/webrecorder/browsertrix-crawler/tree/recorder-work
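Independently of the code in that branch, the basic capture flow can be sketched with the CDP `Fetch` domain: pause each request at the response stage, read the body, hand it to a recorder, then let the request continue. This is a minimal sketch, not the branch's implementation. `CDPLike`, `Captured`, `decodeBody` and `captureResponses` are hypothetical names; the CDP methods and event used (`Fetch.enable`, `Fetch.requestPaused`, `Fetch.getResponseBody`, `Fetch.continueRequest`) are the protocol's own. Error handling is deliberately thin.

```typescript
// Minimal shape of a CDP session (hypothetical interface for this sketch).
interface CDPLike {
  send(method: string, params?: Record<string, unknown>): Promise<any>;
  on(event: string, handler: (params: any) => void): unknown;
}

interface Captured {
  url: string;
  status: number;
  headers: Array<{ name: string; value: string }>;
  body: Uint8Array;
}

// Convert the result of Fetch.getResponseBody into raw bytes.
function decodeBody(result: { body: string; base64Encoded: boolean }): Uint8Array {
  if (!result.base64Encoded) {
    return new TextEncoder().encode(result.body);
  }
  const bin = atob(result.body);
  const out = new Uint8Array(bin.length);
  for (let i = 0; i < bin.length; i++) {
    out[i] = bin.charCodeAt(i);
  }
  return out;
}

// Pause every request at the response stage and pass the response to onCapture.
async function captureResponses(
  cdp: CDPLike,
  onCapture: (c: Captured) => void
): Promise<void> {
  cdp.on("Fetch.requestPaused", async (event) => {
    const { requestId, request, responseStatusCode, responseHeaders } = event;
    try {
      // responseStatusCode is only present when paused at the response stage.
      if (responseStatusCode !== undefined) {
        const result = await cdp.send("Fetch.getResponseBody", { requestId });
        onCapture({
          url: request.url,
          status: responseStatusCode,
          headers: responseHeaders || [],
          body: decodeBody(result),
        });
      }
    } catch (e) {
      // Some responses (for example redirects) may have no body to fetch;
      // this sketch simply skips them.
    } finally {
      // Always resume, otherwise the page would hang on this request.
      await cdp.send("Fetch.continueRequest", { requestId }).catch(() => undefined);
    }
  });

  await cdp.send("Fetch.enable", {
    patterns: [{ urlPattern: "*", requestStage: "Response" }],
  });
}
```

With Puppeteer, a session obtained from `page.createCDPSession()` should fit the `CDPLike` shape, so a call such as `await captureResponses(session, (c) => recorder.add(c))` would be the entry point, where `recorder` stands in for whatever writes the WARC.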