Closed ikreymer closed 1 month ago
I confirm this fix is working as intended in warc2zim. Thanks a lot for the quick fix!
One last question: do I understand correctly that all DS rewriting rules are automatically applied by Browsertrix Crawler? So far, Kiwix's understanding was that we needed to reimplement both the fuzzy rules and the DS rewriting rules in warc2zim, since we rely only on wombat.js. But this PR comment, the fact that the fix works just by updating to the latest crawler, and the fact that the rewriting is indeed already present in the WARC all seem to prove that we were wrong, and that only the fuzzy rules are needed.
Yes, the DS rewriting rules are applied at crawl time, and we now actually store the rewritten response. (This was changed at some point, maybe even in ArchiveWeb.page, to make replay easier, since otherwise the same rewriting would need to be applied at replay time.) I believe this means these rules don't need to be applied again at replay time. We do still apply them for JS, but not the new HTML rule that is added here.
OK, thank you for the explanation!
Export the HTML Rx rewriter, to be used at capture time. For now, it is only meant to be used externally, in Browsertrix Crawler and ArchiveWeb.page. Necessary to fix YouTube capture and replay (issue #181).
Supports a new YT-specific rule of disabling MediaSource.isTypeSupported():
This may need to change later, but it is a first implementation to solve an urgent issue.
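As a rough illustration of what a capture-time HTML rule of this kind could look like, here is a minimal sketch. It is not the code from this PR: `injectDisableMediaSource`, `DISABLE_MSE_SNIPPET` and `HEAD_OPEN_RX` are hypothetical names, and the exact regex and injected script are assumptions. The only thing taken from the description above is the idea of disabling `MediaSource.isTypeSupported()` through a regex-based HTML rewrite.

```typescript
// Hypothetical sketch, NOT the actual rewriter API from this PR.
// Assumed injected script: makes MediaSource.isTypeSupported() always return false.
const DISABLE_MSE_SNIPPET =
  "<script>if (window.MediaSource) { window.MediaSource.isTypeSupported = () => false; }</script>";

// Matches the opening <head> tag, with optional attributes, case-insensitively.
// It does not match tags such as <header>.
const HEAD_OPEN_RX = /<head(\s[^>]*)?>/i;

// Inserts the override right after the first opening <head> tag, so that it
// runs before the page's own scripts. HTML without a <head> tag is returned unchanged.
function injectDisableMediaSource(html: string): string {
  // A replacer function is used so that nothing in the snippet is
  // interpreted as a special "$" replacement pattern.
  return html.replace(HEAD_OPEN_RX, (match) => match + DISABLE_MSE_SNIPPET);
}
```

Under these assumptions, the rewritten HTML is what gets stored in the WARC at crawl time, which is why a replay tool reading that WARC would not need to apply the same HTML rule again.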