webrecorder / browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0

What exactly is '--blockRules' blocking? Entire URLs where an element like an iframe matches a regex, or only the matching part of a page? #574

Open steph-nb opened 4 months ago

steph-nb commented 4 months ago

Hi, it looks to me as if '--blockRules' blocks an entire page when the URL of a subelement, such as an iframe's content, matches one of the given regexes. Is that correct? If not, what is the exact mechanism?

If my assumption is correct, would it be possible to add an option that excludes only the matching elements but still collects the rest of the page?

Many thanks

tw4l commented 4 months ago

Hi @steph-nb, the block rules target requests from specific URLs. For example, if you have a page at example.com with an iframe loading content from othersite.com and add a block rule matching othersite.com, the overall page at example.com should still be captured, but the iframe content from othersite.com should be blocked.

If you're seeing behavior that deviates from this, I'm happy to look into it further!
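The request-level behavior described above can be modeled with a small Python sketch. This is a simplified illustration, not the crawler's actual implementation: the helper `blocked_requests`, the sample URLs, and the use of unanchored regex search are all assumptions made for the example.

```python
import re


def blocked_requests(block_patterns, request_urls):
    """Return the request URLs matched by at least one block pattern.

    Hypothetical helper: it models blocking per request URL,
    using an unanchored regex search (an assumption).
    """
    compiled = [re.compile(pattern) for pattern in block_patterns]
    return [
        url for url in request_urls
        if any(rx.search(url) for rx in compiled)
    ]


# Hypothetical requests issued while loading a single page.
page_url = "https://example.com/"
requests_made = [
    page_url,
    "https://example.com/style.css",
    "https://othersite.com/embed/widget",
]

blocked = blocked_requests([r"othersite\.com"], requests_made)

print(blocked)              # only the iframe request is listed
print(page_url in blocked)  # the page itself is not blocked
```

In this model the page at example.com is unaffected because its own URL does not match the rule; only the request to othersite.com is dropped.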

steph-nb commented 4 months ago

Hi @tw4l, many thanks for your answer. I am not yet sure whether the behaviour of browsertrix-crawler or my syntax for crawler_extra_args in browsertrix is wrong. How would you pass multiple regexes to blockRules in crawler_extra_args in the values.yaml of browsertrix, so that all matching content is blocked on every page visited?

For example, I want to use these regexes: [image]

BR and thanks a lot!

steph-nb commented 3 months ago

Hi @tw4l, I retried several ways to configure this parameter via the values.yaml of browsertrix. Here are some examples:

a) crawler_extra_args: '--rolloverSize 100000000 --blockRules [".youtube.",".facebook.",".stats\.i-web\.ch.",".stats4\.i-web\.ch.",".onLogin.",".start_date.",".matomo."]'

b) crawler_extra_args: '--rolloverSize 100000000 --blockRules ["youtube"]'

It always resulted in much more being blocked than just the desired page elements. See for instance: [image]
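One way to narrow this down is to check which URLs the patterns from example a) would match on their own. The sketch below is only a rough check under stated assumptions: the patterns are copied as they appear in this comment, the sample URLs are hypothetical, and it assumes the rules are applied as unanchored regex searches against each request URL, which may not be how the crawler evaluates them.

```python
import re

# Patterns copied as they appear in example a) above.
PATTERNS = [
    r".youtube.",
    r".facebook.",
    r".stats\.i-web\.ch.",
    r".stats4\.i-web\.ch.",
    r".onLogin.",
    r".start_date.",
    r".matomo.",
]


def matching_patterns(url, patterns=PATTERNS):
    """Return every pattern that matches anywhere in the URL.

    Hypothetical helper; unanchored search is an assumption.
    """
    return [p for p in patterns if re.search(p, url)]


# Hypothetical URLs a crawl of example.com might request.
SAMPLE_URLS = [
    "https://example.com/about",
    "https://www.youtube.com/embed/abc",
    "https://example.com/js/matomo.js",
    "https://example.com/events?start_date=2024-01-01",
    "https://example.com/blog/why-i-left-youtube-for-good",
]

for url in SAMPLE_URLS:
    print(url, "->", matching_patterns(url))
```

Under these assumptions, a pattern such as ".start_date." or ".youtube." also matches URLs on the crawled site itself, so a page or resource whose own URL contains that text would be blocked along with the embedded content.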

Question 1: How would you pass this parameter via values.yaml?

Question 2: If my approaches are already correct, could you perhaps rework the functionality so that only the matching elements are excluded?

Many thanks and BR