webrecorder / browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
https://browsertrix.com
GNU Affero General Public License v3.0

[Feature]: auto cookie handling (as a behaviour) #1408

Open thsm-kb opened 10 months ago

thsm-kb commented 10 months ago

Context

From a European point of view, cookies are troublesome. Most sites are forced to ask the user to accept cookies due to the ePrivacy Directive, and we don't want to make a browser profile for every single domain just to accept them. Content can be left out if cookies are not accepted.

What change would you like to see?

As a user I would like to be able to select a behaviour (auto_cookies) that handles cookies for me, like https://www.i-dont-care-about-cookies.eu/ does, so that I get good content with minimal effort (that's who I am!)

Requirements

No response

Todo

No response

tw4l commented 10 months ago

Now that Browsertrix Crawler is based on Brave, it's possible to take advantage of its in-built features to hide cookie popups. For reference, see: https://brave.com/privacy-updates/21-blocking-cookie-notices/

The way we built Brave, you won't see the option to click yes, but instead you can enter the URL for the Easylist Cookie List under "Add custom filter lists" at brave://settings -> Shields -> Content filters.

This should let you make a single browser profile that will hide cookie popups no matter the domain.

This needs further investigation, but I believe the Brave filter works by blocking some URLs and merely hiding the cookie popup for others, so it's possible that popups might still show up in replay. In some quick testing, I found that the cookie popup for BnF's site (https://www.bnf.fr/fr) was not present in replay, while the European Central Bank's (https://www.ecb.europa.eu/home/html/index.en.html) was still present.
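To make the blocking-versus-hiding distinction concrete, here is a rough sketch that sorts Adblock-style filter lines into network rules (which block a request) and cosmetic rules (which only hide an element already on the page). It relies on the common convention that element-hiding rules contain a `##`-style separator. The function name and the sample rules are made up for illustration and are not taken from the Easylist Cookie List or from Brave's implementation.

```python
def classify_filter_rule(rule: str) -> str:
    """Return 'comment', 'cosmetic', or 'network' for one Adblock-style filter line.

    Heuristic only: a real filter parser handles many more cases.
    """
    line = rule.strip()
    # Blank lines, '!' comments and the '[Adblock Plus x.y]' header carry no rule.
    if not line or line.startswith("!") or line.startswith("["):
        return "comment"
    # Element-hiding rules use '##' (or the '#@#' / '#?#' variants) as separator.
    for separator in ("#@#", "#?#", "##"):
        if separator in line:
            return "cosmetic"
    # Everything else is treated as a request-blocking (network) rule.
    return "network"


# Hypothetical sample rules, not copied from any real list.
sample_rules = [
    "! Title: example cookie list",
    "||cdn.example.com/cookie-consent.js^",
    "example.org###cookie-banner",
    "##.cookie-overlay",
]

for rule in sample_rules:
    print(f"{classify_filter_rule(rule):9} {rule}")
```

If this guess about the mechanism is right, a site whose popup is removed by a network rule never loads the popup script, so nothing is captured, whereas a site covered only by a cosmetic rule still has the popup markup in the archived page, where it can reappear in replay.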

tuehlarsen commented 7 months ago

On most Danish websites, not accepting the cookies blocks the integrated ads and other website "features" from being shown and harvested. So it is not only a question of suppressing the cookie dialog but of actually accepting it. The cookie dialog plugins also change very often (not by the website, I guess, but by the plugin producer), so you are forced to maintain the accept step over time. I followed daily crawls of 4 news sites with cookies enabled for a month, and for some of the sites I had to update the cookie-accepting browser profile almost every third day! It was very annoying... see e.g. https://tv2.dk/ and https://ekstrabladet.dk/
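As a sketch of what an "accept" step could look for, the snippet below picks the button to click from a list of visible button labels by matching known accept phrases and skipping reject-style ones. Everything in it is an assumption for illustration: the phrase lists, the function name, and the idea of matching on label text are not an existing Browsertrix behaviour, and the labels are examples rather than what tv2.dk or ekstrabladet.dk actually show.

```python
# Hypothetical phrase lists; real dialogs vary and change over time.
ACCEPT_PHRASES = ("accept all", "allow all", "tillad alle", "accepter alle", "accept", "accepter")
REJECT_PHRASES = ("reject", "decline", "afvis", "only necessary", "kun nødvendige")


def find_accept_button(labels):
    """Return the index of the most likely 'accept' button, or None if there is none."""
    # Lowercase and collapse whitespace so "  Tillad   ALLE " matches "tillad alle".
    normalized = [" ".join(label.lower().split()) for label in labels]
    # Try the more specific phrases first ("accept all" before plain "accept").
    for phrase in ACCEPT_PHRASES:
        for index, text in enumerate(normalized):
            if any(reject in text for reject in REJECT_PHRASES):
                continue  # never pick a button that looks like a refusal
            if phrase in text:
                return index
    return None


print(find_accept_button(["Kun nødvendige", "Tilpas", "Tillad alle"]))
print(find_accept_button(["Reject all", "Accept all"]))
print(find_accept_button(["Read more"]))
```

Matching on label text rather than on plugin-specific selectors might survive some plugin updates, but the phrase lists would still need maintenance, which is the same upkeep problem in a different place.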