webrecorder / wombat

Wombat.js client-side rewriting library
GNU Affero General Public License v3.0
81 stars 30 forks source link

warc2zim: rewriting iframe with srcdoc does not work as intended #155

Closed benoit74 closed 3 months ago

benoit74 commented 3 months ago

I'm here to seek help on https://github.com/openzim/warc2zim/issues/293

Investigating a bit this issue, it looks like it is linked to the fact that the website uses an <iframe> with a dynamically populated srcdoc attribute, containing the code the user just edited on the left.

Wombat is somehow "breaking" this in current warc2zim setup (and there is a significant chance it might be "our fault").

When we initialize wombat in warc2zim, we use the isSW property set to true, because it seems to be the proper way to configure wombat for proper operation in our setup. Unfortunately, it is still unclear why we need to set this to true, this is purely black magic for us for now. Btw, if you could have a look at https://github.com/openzim/warc2zim/issues/239 and explain a bit these "black magic" parameters or at least confirm they make sense, it would be greatly appreciated.

It looks like this setting is however unfortunate for iframe with srcdoc property ; it removes the srcdoc attribute of the iframe, and replaces it with a src attribute pointing to what looks like a "magic" _mp/blob url, which in our case is not handled in the ZIM.

Investigating a bit further, I achieved to nail down the problem to the overrideHtmlAssignSrcDoc function of wombat, and more specifically to this section:

if (wombat.wb_info.isSW) {
  wombat.blobUrlForIframe(this, orig);
  return orig;
} else {
  return wombat.rewriteHTMLAssign(this, orig_setter, orig);
}

If I disable the switch to call rewriteHTMLAssign instead of blobUrlForIframe (typically by writting if (false && wombat.wb_info.isSW) {) and use this modified wombat.js in warc2zim, then the <iframe> of MDN works like a charm.

Unfortunately, I'm far from having a reasonable understanding of the situation, so your guidance is more than welcomed: does it make any sense? should we add a new switch to wombat config to specifically disable this? is this going to break in other conditions and we should avoid doing this (I typically imagine doing what I've done means that no more URL rewriting is done in the srcdoc)? Should we implement something in warc2zim / libzim to handle these blobs? (but then they cannot be modified dynamically by the website anymore)

ikreymer commented 3 months ago

The wombat.wb_info.isSW is to indicate that its loaded in a service worker. For your replay, you should not be setting that field or set it to false when you declare the wbinfo in the <head>. It is an option for wombat to indicate service-worker based replay, so the other path is correct. There's a couple of other places that check for that as well.

ikreymer commented 3 months ago

This should be false for your use case: https://github.com/openzim/warc2zim/blob/main/javascript/src/wombatSetup.js#L291

benoit74 commented 3 months ago

Issue we had when setting isSW to false was that some rewriting was not working properly. For instance Youtube player was not working anymore. Is it correct that means we should find the root cause to these problems and aim for isSW: false anyway, because isSW: true is going to lead to more problems than solutions?

ikreymer commented 3 months ago

Issue we had when setting isSW to false was that some rewriting was not working properly. For instance Youtube player was not working anymore. Is it correct that means we should find the root cause to these problems and aim for isSW: false anyway, because isSW: true is going to lead to more problems than solutions?

Yes, I think so, since isSW assumes there is a service worker available for certain types of requests, such as loading blobs dynamically, and deals with other restrictions around service workers, which don't occur in non-service worker based replay.

Youtube is unrelated to this, working on a fix for that.

ikreymer commented 3 months ago

Closing this for now as I think it answers the original question, and the non-service worker path is working as intended.

benoit74 commented 3 months ago

OK, thank you! When I mentioned Youtube, I meant "the old Youtube player" which was working in replayweb and in warc2zim. I'm totally fine with this been closed, it addresses the problem described here properly. I might need to open an issue for the non-service worker path, but this is not sure at all.