webrecorder / browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0
627 stars 81 forks

Switch to archiving directly via CDP protocol instead of MITM proxy via pywb #343

Closed: ikreymer closed 3 months ago

ikreymer commented 1 year ago

The idea is to migrate away from the current pywb-based MITM proxy and instead capture network traffic directly from the browser via the Chrome DevTools Protocol (CDP). CDP now allows for fairly sophisticated request/response interception. This approach has a number of benefits that come from letting the browser handle the networking and data transfer:
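To make the interception idea concrete, here is a minimal sketch of the flow using the CDP `Fetch` domain: enable interception at the `Response` stage, read the body of each paused request with `Fetch.getResponseBody`, then resume it with `Fetch.continueRequest`. This is not the crawler's actual implementation. The `Recorder`, `Capture`, and `FakeSession` names are hypothetical, the session interface (`send` / `on`) is an assumption, and a real CDP session would be asynchronous and attached to a live browser rather than an in-memory stand-in.

```python
import base64
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass
class Capture:
    """One recorded response (hypothetical shape, not a WARC record)."""
    url: str
    status: int
    headers: List[Tuple[str, str]]
    body: bytes


class Recorder:
    """Records responses by pausing them at the CDP Fetch 'Response' stage."""

    def __init__(self, session: Any) -> None:
        # Assumption: `session` exposes send(method, params) -> dict
        # and on(event, handler), similar to common CDP client wrappers.
        self.session = session
        self.captures: List[Capture] = []

    def start(self) -> None:
        self.session.on("Fetch.requestPaused", self._on_paused)
        # Pause every request once its response headers are available.
        self.session.send(
            "Fetch.enable",
            {"patterns": [{"urlPattern": "*", "requestStage": "Response"}]},
        )

    def _on_paused(self, event: Dict[str, Any]) -> None:
        request_id = event["requestId"]
        try:
            # responseStatusCode is only present at the Response stage.
            if "responseStatusCode" in event:
                result = self.session.send(
                    "Fetch.getResponseBody", {"requestId": request_id}
                )
                raw = result["body"]
                body = (
                    base64.b64decode(raw)
                    if result.get("base64Encoded")
                    else raw.encode("utf-8")
                )
                headers = [
                    (h["name"], h["value"])
                    for h in event.get("responseHeaders", [])
                ]
                self.captures.append(
                    Capture(
                        event["request"]["url"],
                        event["responseStatusCode"],
                        headers,
                        body,
                    )
                )
        finally:
            # Always resume, otherwise the page would hang on this request.
            self.session.send("Fetch.continueRequest", {"requestId": request_id})


class FakeSession:
    """In-memory stand-in for a browser CDP session (demonstration only)."""

    def __init__(self, bodies: Dict[str, bytes]) -> None:
        self.bodies = bodies  # requestId -> response body
        self.handlers: Dict[str, Callable[[Dict[str, Any]], None]] = {}
        self.sent: List[str] = []

    def on(self, event: str, handler: Callable[[Dict[str, Any]], None]) -> None:
        self.handlers[event] = handler

    def send(self, method: str, params: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
        self.sent.append(method)
        if method == "Fetch.getResponseBody" and params is not None:
            data = self.bodies[params["requestId"]]
            return {"body": base64.b64encode(data).decode("ascii"), "base64Encoded": True}
        return {}

    def emit(self, event: str, payload: Dict[str, Any]) -> None:
        self.handlers[event](payload)


# Simulate the browser pausing one response and the recorder capturing it.
session = FakeSession({"req-1": b"<html>hi</html>"})
recorder = Recorder(session)
recorder.start()
session.emit(
    "Fetch.requestPaused",
    {
        "requestId": "req-1",
        "request": {"url": "https://example.com/"},
        "responseStatusCode": 200,
        "responseHeaders": [{"name": "Content-Type", "value": "text/html"}],
    },
)
```

Because the browser performs the actual fetch, the recorder only observes and copies what the page received; nothing sits between the browser and the network the way a MITM proxy does.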

Other trade-offs:

Substantial work on this has been done in the recorder-work branch: https://github.com/webrecorder/browsertrix-crawler/tree/recorder-work

ikreymer commented 3 months ago

This was implemented in 1.0!