phiresky opened this issue 3 years ago
Wow, thanks for the thorough analysis! I think that's mostly correct, nice work!
The Fetch domain is the main handler, although Network is also used in case some headers are missing. (In older versions of Chrome, some requests/responses did not get to Fetch but did go through Network, so both are being intercepted)
The media event was for an edge case where the URL was not otherwise being intercepted for some reason; it may not be an issue with webRequest. Some of the rewriting is needed for both capture and replay, to get a fixed resolution. That should also be possible with webRequest.
One question I'd have is around timing with webNavigation and webRequest. Hopefully, the webNavigation occurs before the first request on that page, so that all subsequent requests can be associated with the new page.
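If that ordering holds, the association could be sketched roughly like this (a minimal sketch; the helper names and the per-tab map are hypothetical, not from the codebase):

```js
// Track the most recent top-level navigation per tab, so later requests in
// that tab can be attributed to the page that is currently loading.
const currentPageByTab = new Map();

function onNavigation(tabId, url, frameId) {
  // frameId 0 is the top-level frame, i.e. a new page in this tab.
  if (frameId === 0) currentPageByTab.set(tabId, { url, requests: [] });
}

function onRequest(tabId, requestUrl) {
  // If webNavigation fired first (as hoped), the page entry already exists.
  const page = currentPageByTab.get(tabId);
  if (page) page.requests.push(requestUrl);
  return page;
}

// In the extension these would be wired up roughly as:
// browser.webNavigation.onBeforeNavigate.addListener(
//   (d) => onNavigation(d.tabId, d.url, d.frameId));
// browser.webRequest.onBeforeRequest.addListener(
//   (d) => onRequest(d.tabId, d.url), { urls: ["<all_urls>"] });
```

If webRequest can fire before the matching webNavigation event, the fallback would be to buffer unattributed requests and reconcile them once the navigation arrives.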
And yes, would probably have a separate 'firefox-recorder' that uses this system instead of the CDP approach used in extension and electron app.
So this all could be done, at least to test it out. But now I think the lack of service worker is actually the bigger issue, as you then can't really see what you've archived, and that messes up the whole workflow. I think it would need to be supported somehow for the extension to be fully usable..
Do you have any thoughts on how to solve that? My only thought was to install the service worker on perhaps a replay.archiveweb.page domain, and then have the service worker proxy to the actual extension via runtime.Port (https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/API/runtime/Port), so that the extension can read from its IndexedDB.
However, I'm not certain a service worker can get access to that port.. possibly need a content script which gets the port and then transfers it to the service worker.. This would require some work in wabac.js as well... But, would be curious to know if this is possible.
I don't have the time to work on this now, but happy to provide guidance and accept a PR if you're willing to try this out..
I don't have the time or knowledge to help port this right now, but I'd love to use this extension in Firefox too!
Thanks for your response. I'm not sure if and when I'll have time for this, so we'll see if something happens ;)
What I really want is a permanent complete log and archive of every site I visit (#13), with full-text search indexing of everything. That would probably need storage in an external database (I'm thinking PostgreSQL or Elasticsearch), so it's an even larger effort for this to be really useful to me.
Do you have any thoughts on how to solve that? My only thought was to install the service worker on perhaps a replay.archiveweb.page domain, and then have the service worker proxy to the actual extension
I tested it, and it does work by creating a content script in the webextension that adds an iframe of the webextension to replayweb.page. The content script creates a MessageChannel, which it then passes to the service worker and to the iframe. The service worker then sends RPC calls to the iframe, and the iframe reads from the IndexedDB. It's kinda ugly though: I couldn't figure out how to create a MessageChannel between the service worker and the extension background page directly (so the iframe wouldn't be needed), because the chrome.runtime.sendMessage() API does not support "structured clone", which is needed to transfer MessagePorts, and window.postMessage doesn't work between a content script and a background page.
Example: `content_script.js`

```js
const iframeurl = chrome.runtime.getURL("iframe.html");
console.log(
  "creating message channel between website service worker and webextension"
);
const channel = new MessageChannel();
const iframe = document.createElement("iframe");
iframe.style.display = "none";
iframe.src = iframeurl;
iframe.onload = () => {
  console.log("moz-ext iframe loaded");
  iframe.contentWindow.postMessage(
    { msg_type: "port-to-replayweb", port: channel.port1 },
    iframe.src,
    [channel.port1]
  );
  navigator.serviceWorker.controller.postMessage(
    {
      msg_type: "port-to-webextension",
      port: channel.port2,
    },
    [channel.port2]
  );
};
document.body.appendChild(iframe);
```
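For completeness, the receiving side in the service worker might look roughly like this (a sketch only, assuming the same message shapes as the content script above; `handleRpcResponse` and the `get-resource` RPC format are made up for illustration, not from wabac.js):

```js
// Hold the MessagePort transferred from the content script; RPC calls to the
// extension iframe go out on it, and responses come back on it.
let extensionPort = null;

function handleClientMessage(event) {
  if (event.data && event.data.msg_type === "port-to-webextension") {
    extensionPort = event.data.port;
    // Responses to RPC calls arrive on the same port.
    extensionPort.onmessage = (e) => handleRpcResponse(e.data);
  }
}

function handleRpcResponse(data) {
  // e.g. resolve a pending lookup into the extension's IndexedDB
  console.log("rpc response", data);
}

// An RPC call asking the iframe to read a resource from IndexedDB:
function requestResource(url) {
  if (extensionPort) extensionPort.postMessage({ msg_type: "get-resource", url });
}

// In the actual service worker this would be registered as:
// self.addEventListener("message", handleClientMessage);
```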
I have wanted something like this, too, but I'd be kind of worried about how quickly it'd fill up, and about recursion when the sites I visit are themselves archiving sites...
So it would be great if this extension could support Firefox. You wrote on HN:
I tried to start implementing it, though I didn't get that far. So here's how I understand the structure of the code and my thoughts for now:
From the popup, pressing "record" calls the startRecorder function in bg.js, which creates a new BrowserRecorder. BrowserRecorder extends Recorder with a few browser-specific things (as opposed to ElectronRecorder). This recorder attaches the Chrome debugger to the currently focused tab and adds handlers for a lot of different events:

- Network.enable is used to intercept and save request metadata into RequestResponseInfo objects. Finally, in the Network.loadingFinished handler, the request info object is saved to the IndexedDB. This functionality seems very similar to the webRequest API (https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/API/webRequest), which seems to be supported in Chrome as well. What's missing from that API here?
- Fetch.enable blocks all requests, then uses Fetch.getResponseBody to get the request and response body of each network request, and continues them. In some special cases, the responses are modified before being continued in rewriteResponse. These rewrites are mostly from wabac.js. Some of them make sense to me, like changing the video resolution, but there's also stuff like JS rewrites. Are these run during capture or only during replay? This functionality could maybe be replaced with the StreamFilter API in Firefox. It seems simple to use; example at the bottom. Sadly it's not supported in Chrome, so it would need separate code paths.
- Media.enable, for a single special case of handling some "kLoad" event. Seems to be related to watching for video / audio load events (Chromium source code). Maybe this could be replaced with just listening to the JS media events client-side in the injected autofetcher.js script?
- Page.enable. This is to handle page navigation events etc. These seem like they should all be supported by the normal webextension APIs like webNavigation?
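A pass-through capture with Firefox's StreamFilter could look roughly like this (a sketch, assuming an MV2 background script with the webRequest/webRequestBlocking permissions; `saveBody` is a hypothetical helper, not from the codebase):

```js
// Join ArrayBuffer chunks from the filter into one Uint8Array for storage.
function concatChunks(chunks) {
  const total = chunks.reduce((n, c) => n + c.byteLength, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const c of chunks) {
    out.set(new Uint8Array(c), offset);
    offset += c.byteLength;
  }
  return out;
}

// browser.webRequest.onBeforeRequest.addListener((details) => {
//   const filter = browser.webRequest.filterResponseData(details.requestId);
//   const chunks = [];
//   filter.ondata = (event) => {
//     chunks.push(event.data);    // capture the body…
//     filter.write(event.data);   // …and pass it through unchanged
//   };
//   filter.onstop = () => {
//     filter.disconnect();
//     saveBody(details, concatChunks(chunks)); // hypothetical save helper
//   };
// }, { urls: ["<all_urls>"] }, ["blocking"]);
```

Rewriting a response (as rewriteResponse does) would work the same way, except the modified bytes are written to the filter instead of the original chunks.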
There seem to be four methods by which requests are fetched:
Since methods 1-3 all use Fetch.getResponseBody, adding StreamFilter should make those work on Firefox. Method 4 directly uses the fetch API, which is supported in Firefox anyway.
For Firefox support, these things would need to be changed:

- The Fetch.* API, used to get and modify the actual payload / body of the request responses.
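As a rough sketch of what a webRequest-based replacement for the Network.* metadata capture could look like (the field names are illustrative, loosely modeled on RequestResponseInfo, and the listener wiring is commented out):

```js
// Accumulate per-request metadata keyed by requestId, analogous to the
// RequestResponseInfo objects built from the Network.* CDP events.
const infoByRequestId = new Map();

function onSendHeaders(details) {
  infoByRequestId.set(details.requestId, {
    url: details.url,
    method: details.method,
    requestHeaders: details.requestHeaders || [],
  });
}

function onHeadersReceived(details) {
  const info = infoByRequestId.get(details.requestId);
  if (info) {
    info.status = details.statusCode;
    info.responseHeaders = details.responseHeaders || [];
  }
}

function onCompleted(details) {
  // Analogous to the Network.loadingFinished handler: take the finished
  // record out of the map; this is where it would be written to IndexedDB.
  const info = infoByRequestId.get(details.requestId);
  infoByRequestId.delete(details.requestId);
  return info;
}

// Wiring, with the extraInfoSpec needed to receive headers:
// browser.webRequest.onSendHeaders.addListener(onSendHeaders,
//   { urls: ["<all_urls>"] }, ["requestHeaders"]);
// browser.webRequest.onHeadersReceived.addListener(onHeadersReceived,
//   { urls: ["<all_urls>"] }, ["responseHeaders"]);
// browser.webRequest.onCompleted.addListener(onCompleted,
//   { urls: ["<all_urls>"] });
```

The open question from above still applies: whatever the CDP Network domain provides beyond this (e.g. bodies, which webRequest alone does not expose) would have to come from StreamFilter on Firefox.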