webrecorder / wabac.js

wabac.js - Web Archive Browsing Augmentation Client
https://replayweb.page
GNU Affero General Public License v3.0

Support HTML RX rewriting #182

Closed ikreymer closed 1 month ago

ikreymer commented 1 month ago

Exports the HTML Rx rewriter so it can be used at capture time. For now, it is intended only for external use in Browsertrix Crawler and ArchiveWeb.page. This is needed to fix YouTube capture and replay; see issue #181.

Adds a new YouTube-specific rule that disables MediaSource.isTypeSupported():

<script>window.MediaSource.isTypeSupported = () => false;</script>

This may need to change later, but it is an initial implementation to solve an urgent issue.
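For readers unfamiliar with this kind of rule, here is a minimal sketch of how a URL-scoped HTML rule could inject the override above. It is not the code in this PR: it assumes "Rx" refers to regex-based rewriting, and the names `htmlRules`, `urlMatch`, `pattern`, `replace`, and `rewriteHtml` are hypothetical, chosen only for illustration.

```javascript
// Illustrative only: a hypothetical rule table, not the wabac.js API.
const DISABLE_MEDIASOURCE =
  "<script>window.MediaSource.isTypeSupported = () => false;</script>";

const htmlRules = [
  {
    // Assumed URL scope for the YouTube-specific rule.
    urlMatch: /^https?:\/\/(www\.)?youtube\.com\//,
    // Match the opening <head> tag (with or without attributes), but not <header>.
    pattern: /<head(\s[^>]*)?>/i,
    // Append the override right after the tag so it precedes later scripts.
    replace: (match) => match + DISABLE_MEDIASOURCE,
  },
];

// Apply every rule whose URL pattern matches; other pages pass through unchanged.
function rewriteHtml(url, html) {
  let result = html;
  for (const rule of htmlRules) {
    if (rule.urlMatch.test(url)) {
      result = result.replace(rule.pattern, rule.replace);
    }
  }
  return result;
}
```

Calling `rewriteHtml("https://www.youtube.com/watch?v=abc", html)` would return the HTML with the script placed immediately after the opening `<head>` tag, while a non-matching URL returns the input untouched.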

benoit74 commented 1 month ago

I confirm this fix is working as intended in warc2zim. Thanks a lot for the quick fix!

One last question: do I understand correctly that all DS rewriting rules are automatically applied by Browsertrix Crawler? So far, our understanding at Kiwix was that we needed to reimplement both the fuzzy rules and the DS rewriting rules in warc2zim, since we rely only on wombat.js. But this PR comment, the fact that the fix works just by updating to the latest crawler, and the fact that the rewriting is indeed already present in the WARC all suggest that we were wrong and that only the fuzzy rules are needed.

ikreymer commented 1 month ago

Yes, the DS rewriting rules are applied at crawl time, and we now store the rewritten response. (This was changed at some point, maybe even in ArchiveWeb.page, to make replay easier, since otherwise the same rewriting would need to be applied at replay.) I believe this means these rules don't need to be applied again at replay time. We do still apply them for JS at replay, but not the new HTML rule that is added here.
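The split described above can be pictured with a small sketch. Everything in it is an assumption made for illustration: the `ruleSets` layout, the `stages` field, the `rewrite` helper, and the toy JS rule are hypothetical and do not reflect how Browsertrix Crawler or wabac.js are actually structured.

```javascript
// Illustrative only: hypothetical rule sets tagged with the stages where they run.
const ruleSets = {
  html: {
    // The new HTML rule: applied at capture time only.
    stages: ["capture"],
    rewrite: (text) =>
      text.replace(
        /<head>/i,
        "<head><script>window.MediaSource.isTypeSupported = () => false;</script>"
      ),
  },
  js: {
    // JS rules: applied at capture time and again at replay time.
    stages: ["capture", "replay"],
    // Toy stand-in for a real rule; it has no further effect once applied.
    rewrite: (text) => text.replace(/ENABLE_FEATURE = true/g, "ENABLE_FEATURE = false"),
  },
};

// Run a rule set only if it is registered for the given stage.
function rewrite(stage, contentType, text) {
  const ruleSet = ruleSets[contentType];
  if (!ruleSet || !ruleSet.stages.includes(stage)) {
    return text;
  }
  return ruleSet.rewrite(text);
}

// Capture: the rewritten body is what gets stored in the WARC.
const storedHtml = rewrite("capture", "html", "<head></head>");
// Replay: the stored HTML is served without re-running the HTML rule.
const servedHtml = rewrite("replay", "html", storedHtml);
```

In this sketch `servedHtml` is identical to `storedHtml`, which mirrors the point that a consumer reading the stored WARC already sees the rewritten HTML.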

benoit74 commented 1 month ago

OK, thank you for the explanation!