webrecorder / archiveweb.page

A High-Fidelity Web Archiving Extension for Chrome and Chromium-based browsers!
https://chrome.google.com/webstore/detail/webrecorder/fpeoodllldobpkbkabpblcfaogecpndd
GNU Affero General Public License v3.0

Firefox support and implementation details #16

Open · phiresky opened this issue 3 years ago

phiresky commented 3 years ago

So it would be great if this extension could support Firefox. You wrote on HN:

What prevents this from working on non-Chromium-based browsers?

At this point, mostly time constraints maintaining two very different implementations.

The archiving is done via the CDP Fetch domain (https://chromedevtools.github.io/devtools-protocol/tot/Fetch...), as it requires intercepting and sometimes modifying the response body of a request to make it more replayable.

Firefox doesn't currently support this (https://bugzilla.mozilla.org/show_bug.cgi?id=1587426), although it does have webRequest.StreamFilter instead (https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/Web...), which is lacking in Chromium.

It should probably be possible to achieve this functionality in Firefox using this API, but it would unfortunately require a new implementation that uses webRequest instead of CDP. Probably worth looking into, though!

The archive replay using ReplayWeb.page should work in Firefox and Safari.

Edit: Another limitation on Firefox is the lack of service worker support for extension origins (https://bugzilla.mozilla.org/show_bug.cgi?id=1344561). This is needed to view what you've archived in the extension. Would need to work around this issue somehow until that is supported, so probably a bit of work, unfortunately.

I tried to start implementing it, though I didn't get that far. So here's how I understand the structure of the code and my thoughts for now:

When "record" is pressed in the popup, the startRecorder function in bg.js is called, which creates a new BrowserRecorder. BrowserRecorder extends Recorder with a few browser-specific things (as opposed to ElectronRecorder). This recorder attaches the Chrome debugger to the currently focused tab and adds handlers for a lot of different events.
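
A minimal sketch of what that attachment looks like, assuming the standard chrome.debugger API (the wiring below is illustrative, not the actual code):

    // Attach the debugger to the tab being recorded and enable the Fetch
    // domain, pausing requests at the Response stage so the body is available.
    const target = { tabId }; // tabId of the focused tab
    chrome.debugger.attach(target, "1.3", () => {
      chrome.debugger.sendCommand(target, "Fetch.enable", {
        patterns: [{ urlPattern: "*", requestStage: "Response" }],
      });
    });

    chrome.debugger.onEvent.addListener((source, method, params) => {
      if (method === "Fetch.requestPaused") {
        // Read the response body, then let the request continue.
        chrome.debugger.sendCommand(
          source,
          "Fetch.getResponseBody",
          { requestId: params.requestId },
          (body) => {
            // ... rewrite/store the payload here ...
            chrome.debugger.sendCommand(source, "Fetch.continueRequest", {
              requestId: params.requestId,
            });
          }
        );
      }
    });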


There seem to be four ways requests are fetched:

  1. The requests the page itself makes, intercepted by Fetch.getResponseBody
  2. Some files that the website refers to (images etc.) are fetched by the injected autofetcher.js script. The responses are thrown away, since they are then intercepted by (1) as well.
  3. In doAsyncFetchInBrowser a fetch script is injected into the page to request the full data. The data is also captured in (1). This is called for popup windows.
  4. doAsyncFetchDirect. Used to work around partial (HTTP 206) responses, favicons, and the kLoad media event special case. The data here is captured directly and written to the DB instead of going through (1) like the other methods.

Since (1)-(3) all use Fetch.getResponseBody, adding StreamFilter should make those work on Firefox. (4) directly uses the fetch API, which is supported in Firefox anyway.
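
For illustration, (4) boils down to something like this sketch (recordResponse is a hypothetical helper standing in for the actual DB-writing code):

    // Fetch a URL directly from the extension and write the payload
    // straight to the archive DB, bypassing the Fetch interception.
    async function fetchDirect(url) {
      const resp = await fetch(url, { credentials: "include" });
      const payload = new Uint8Array(await resp.arrayBuffer());
      await recordResponse(url, resp.status, resp.headers, payload);
    }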


For Firefox support, these things would need to be changed:

  1. recorder.js and browser-recorder.js are both pretty tightly bound to the Chrome debugger protocol. Those parts of the code would probably have to be abstracted out into a separate class.
  2. The StreamFilter API would be an alternative implementation to the chrome debugger Fetch.* api, used to get and modify the actual payload / body of the request responses.
  3. All the other chrome.* API calls could be investigated to see whether they are actually needed or could be replaced with the chrome.webRequest API. Maybe not possible, because that strips some security-related headers?
  4. Regarding the missing support for service workers in the extension, I've encountered that issue before, and I hope they'll fix it. I think for now the easiest workaround would be to just have the WARC export button without the interactive browser. Or have a hosted website (or static HTML file) that can access the data from the extension via an API?
    // example of StreamFilter
    // Intercept all requests in the tab being recorded.
    browser.webRequest.onBeforeRequest.addListener(
      listener,
      { urls: ["<all_urls>"], tabId: this.debuggee.tabId },
      ["blocking"]
    );

    function listener(details) {
      // Attach a StreamFilter so the response body can be read
      // (and modified) as it streams in.
      const filter = browser.webRequest.filterResponseData(details.requestId);

      const data = [];
      filter.ondata = (event) => {
        data.push(event.data);
      };

      filter.onstop = async (event) => {
        // Reassemble the full body once the stream ends.
        const blob = new Blob(data, { type: "application/octet-stream" });
        const payload = await blob.arrayBuffer();
        // do filter stuff as in chrome code
        filter.write(payload);
        filter.close();
      };
    }
ikreymer commented 3 years ago

Wow, thanks for the thorough analysis! I think that's mostly correct, nice work!

The Fetch domain is the main handler, although Network is also used in case some headers are missing. (In older versions of Chrome, some requests/responses did not reach Fetch but did go through Network, so both are being intercepted.)
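
Roughly, that dual interception means enabling both domains, along these lines (a simplified sketch, not the actual code):

    // Enable the Network domain too, so header data is still captured
    // when a request bypasses Fetch (as on older Chrome versions).
    chrome.debugger.sendCommand(target, "Network.enable", {});
    chrome.debugger.onEvent.addListener((source, method, params) => {
      if (method === "Network.responseReceived") {
        // params.response.headers can fill in headers that were
        // missing from the corresponding Fetch.requestPaused event
      }
    });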

The media event was for an edge case where the URL was not otherwise being intercepted for some reason; it may not be an issue with webRequest. Some of the rewriting is needed for both capture and replay, to get a fixed resolution. That should also be possible with webRequest.

One question I'd have is around timing with webNavigation and webRequest. Hopefully, the webNavigation occurs before the first request on that page, so that all subsequent requests can be associated with the new page.
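
For example, a page-association sketch, assuming webNavigation.onBeforeNavigate does fire before the page's first request (pageForTab and both listeners are illustrative, not existing code):

    // Remember the most recent top-frame navigation per tab, and tag
    // each request with the page that (presumably) issued it.
    const pageForTab = new Map();

    browser.webNavigation.onBeforeNavigate.addListener((details) => {
      if (details.frameId === 0) {
        pageForTab.set(details.tabId, { url: details.url, ts: details.timeStamp });
      }
    });

    browser.webRequest.onBeforeRequest.addListener(
      (details) => {
        const page = pageForTab.get(details.tabId);
        // associate details.requestId with `page` in the archive
      },
      { urls: ["<all_urls>"] }
    );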

And yes, would probably have a separate 'firefox-recorder' that uses this system instead of the CDP approach used in extension and electron app.

So this all could be done, at least to test it out. But now I think the lack of service worker is actually the bigger issue, as you then can't really see what you've archived, and that messes up the whole workflow. I think it would need to be supported somehow for the extension to be fully usable..

Do you have any thoughts on how to solve that? My only thought was to install the service worker on perhaps a replay.archiveweb.page domain, and then have the service worker proxy to the actual extension via https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/API/runtime/Port, so that the extension can read from its IndexedDB.

However, I'm not certain a service worker can get access to that port.. possibly need a content script which gets the port and then transfers it to the service worker.. This would require some work in wabac.js as well... But would be curious to know if this is possible.

I don't have the time to work on this now, but happy to provide guidance and accept a PR if you're willing to try this out..

fivestones commented 3 years ago

I don't have the time or knowledge to help port this right now, but I'd love to use this extension in Firefox too!

phiresky commented 3 years ago

Thanks for your response. I'm not sure if and when I'll have time for this, so we'll see if something happens ;)

What I really want is a permanent, complete log and archive of every site I visit (#13), with full-text search indexing of everything. That would probably need storage in an external database (I'm thinking PostgreSQL or Elasticsearch), so it's an even larger effort for this to be really useful to me.

phiresky commented 3 years ago

Do you have any thoughts on how to solve that? My only thought was to install the service worker on perhaps a replay.archiveweb.page domain, and then have the service worker proxy to the actual extension

I tested it, and it does work by creating a content script in the webextension that adds an iframe of the webextension to replayweb.page. The content script creates a MessageChannel, which it then passes to the service worker and to the iframe. The service worker then sends RPC calls to the iframe, and the iframe reads from the IndexedDB. It's kinda ugly though. I couldn't figure out how to create a MessageChannel between the service worker and the extension background page (which would make the iframe unnecessary), because the chrome.runtime.sendMessage() API does not support structured clone, which is needed to transfer MessagePorts, and window.postMessage doesn't work between a content script and a background page.

example: content_script.js

    // Create a hidden iframe pointing at an extension page, then hand one
    // end of a MessageChannel to the iframe and the other end to the
    // page's service worker.
    const iframeurl = chrome.runtime.getURL("iframe.html");
    console.log(
        "creating message channel between website service worker and webextension"
    );
    const channel = new MessageChannel();
    const iframe = document.createElement("iframe");
    iframe.style.display = "none";
    iframe.src = iframeurl;
    iframe.onload = () => {
        console.log("moz-ext iframe loaded");

        // Transfer port1 to the extension iframe...
        iframe.contentWindow.postMessage(
            { msg_type: "port-to-replayweb", port: channel.port1 },
            iframe.src,
            [channel.port1]
        );
        // ...and port2 to the page's service worker, so the two can
        // talk to each other directly from now on.
        navigator.serviceWorker.controller.postMessage(
            {
                msg_type: "port-to-webextension",
                port: channel.port2,
            },
            [channel.port2]
        );
    };
    document.body.appendChild(iframe);
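
The receiving ends would then look something like this sketch, assuming the message shapes from the content script above (the RPC names are made up):

    // In the replayweb.page service worker: receive port2 and use it for RPC.
    self.addEventListener("message", (event) => {
      if (event.data && event.data.msg_type === "port-to-webextension") {
        const port = event.data.port;
        port.onmessage = (e) => {
          // handle RPC responses from the extension iframe
        };
        port.postMessage({ rpc: "getAllPages" }); // made-up RPC call
      }
    });

    // In iframe.html (extension origin, so it can reach the IndexedDB):
    window.addEventListener("message", (event) => {
      if (event.data && event.data.msg_type === "port-to-replayweb") {
        const port = event.data.port;
        port.onmessage = async (e) => {
          const result = null; // ... read the answer from IndexedDB here ...
          port.postMessage({ rpc: e.data.rpc, result });
        };
      }
    });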
cooljeanius commented 2 months ago

Thanks for your response. I'm not sure if and when I'll have time for this, so we'll see if something happens ;)

What I really want is a permanent, complete log and archive of every site I visit (#13), with full-text search indexing of everything. That would probably need storage in an external database (I'm thinking PostgreSQL or Elasticsearch), so it's an even larger effort for this to be really useful to me.

I have wanted something like this, too, but I'd be kind of worried about how quickly it'd fill up, and about recursion when the sites I visit are themselves archiving sites...