webrecorder / browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0

Failing to create a login profile for www.solidarite-numerique.fr #581

Open benoit74 opened 5 months ago

benoit74 commented 5 months ago

I'm trying to create a login profile for www.solidarite-numerique.fr in order to set cookies that hide the banners highlighted in green in the screenshot below.

[Screenshot: the two banners, highlighted in green]

Banner 1 is removed when a cookie named VisitorAgree with value ACCEPTED is set.

Banner 2 is removed when a cookie named BandeauCheck with value MASQUEE is set.

(Currently, setting either of these cookies seems to be sufficient to remove both banners. This may be a bug; at least it did not work like this a few weeks ago. In any case, this is not the problem.)
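One way to check the cookie behavior described above outside the crawler is to request the page with and without the two cookies and compare the responses. The following is only a minimal sketch: the cookie names and values come from this issue, while `fetch_html`, `banner_in_html`, and the `marker` argument are hypothetical helpers, and the text that identifies a banner in the HTML is not known from this thread.

```python
from __future__ import annotations

from urllib.request import Request, urlopen

# Cookie names and values as reported in this issue.
BANNER_COOKIES = {"VisitorAgree": "ACCEPTED", "BandeauCheck": "MASQUEE"}


def build_cookie_header(cookies: dict[str, str]) -> str:
    """Serialize name/value pairs into a single Cookie request header value."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


def fetch_html(url: str, cookies: dict[str, str] | None = None) -> str:
    """Fetch a page, optionally sending cookies (performs a network request)."""
    headers = {"Cookie": build_cookie_header(cookies)} if cookies else {}
    with urlopen(Request(url, headers=headers), timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def banner_in_html(html: str, marker: str) -> bool:
    """Return True if the banner marker (a hypothetical string) is in the HTML."""
    return marker in html


# Example usage (not run here because it needs network access):
# plain = fetch_html("https://www.solidarite-numerique.fr/")
# with_cookies = fetch_html("https://www.solidarite-numerique.fr/", BANNER_COOKIES)
# print(banner_in_html(plain, "<marker>"), banner_in_html(with_cookies, "<marker>"))
```

If the marker is missing from both responses, the banners are probably injected by JavaScript in the browser, and this check says nothing about them.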

I start the creation of the login profile with the following command:

docker run -p 9223:9223 -p 6080:6080 -v $PWD/output/profiles/www.solidarite-numerique.fr:/crawls/profiles/ --rm -it webrecorder/browsertrix-crawler:1.1.2 create-login-profile --url "https://www.solidarite-numerique.fr/"
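For repeated attempts against different sites, the same command can be assembled from a script. This is a rough sketch that only reuses the flags, ports, and image tag from the command above; the function name `build_profile_command` and its parameters are made up for illustration.

```python
import shlex
from pathlib import Path


def build_profile_command(
    url: str,
    profile_dir: Path,
    image: str = "webrecorder/browsertrix-crawler:1.1.2",
) -> list:
    """Return the docker argument list for an interactive create-login-profile run."""
    return [
        "docker", "run",
        "-p", "9223:9223",  # same port mappings as in the command above
        "-p", "6080:6080",
        "-v", f"{profile_dir}:/crawls/profiles/",  # host directory that receives the profile
        "--rm", "-it",
        image,
        "create-login-profile",
        "--url", url,
    ]


host = "www.solidarite-numerique.fr"
cmd = build_profile_command(f"https://{host}/", Path.cwd() / "output" / "profiles" / host)
# Print a shell-quoted version; subprocess.run(cmd) would execute it instead.
print(shlex.join(cmd))
```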

The first problem is that when I open the browser at http://localhost:9223, the website is displayed inside the Brave window as expected, but the banners are already gone, so I cannot click the "close" button to set the cookies.

It would be fine if the banners were also absent from the crawl, but that is not the case: the banners are present "inside the WARC". I tried disabling Brave Shields in case that was the cause, but it changed nothing.

Any idea how to work around this? For now I will probably add custom CSS on top of the existing ones to hide the banners (we are lucky to have this feature in warc2zim ^^), but I would like to understand why creating the login profile is not working as expected.

The second problem is that I get a strange message in the logs when creating the profile. It is just a warning and the profile is still created, but it seems a bit worrying:

{"timestamp":"2024-05-22T06:37:29.844Z","logLevel":"warn","context":"general","message":"Page Load Failed/Interrupted","details":{"type":"exception","message":"Navigating frame was detached","stack":"Error: Navigating frame was detached\n    at #onFrameDetached (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/LifecycleWatcher.js:97:47)\n    at file:///app/node_modules/puppeteer-core/lib/esm/third_party/mitt/mitt.js:36:7\n    at Array.map (<anonymous>)\n    at Object.emit (file:///app/node_modules/puppeteer-core/lib/esm/third_party/mitt/mitt.js:35:20)\n    at CdpFrame.emit (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/EventEmitter.js:77:23)\n    at #removeFramesRecursively (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/FrameManager.js:393:15)\n    at #onClientDisconnect (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/FrameManager.js:88:42)"}}
rgaudin commented 5 months ago

Another temporary workaround might be to manually set those two cookies via the Brave UI…

benoit74 commented 5 months ago

> Another temporary workaround might be to manually set those two cookies via the Brave UI…

I tried, but it got worse: other errors about bad cookies appeared when saving the profile.

ikreymer commented 4 months ago

> The second problem is that I get a strange message in the logs when creating the profile. It is just a warning and the profile is still created, but it seems a bit worrying:

This warning can be ignored; it just means the browser is still loading things when the profile is saved and closed. Since we don't care about the page loading fully here, that should be fine.
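When scanning the profile-creation logs, a small filter can separate this benign warning from entries that deserve a closer look. The sketch below relies only on the JSON fields visible in the log line quoted in this issue (`logLevel`, `details.message`). The set of benign messages is an assumption drawn from the comment above, and the sample line is abbreviated (the stack trace is omitted).

```python
import json
from typing import Optional

# Assumption: warnings whose details carry one of these messages are harmless
# during profile creation, based on the maintainer's comment above.
BENIGN_DETAIL_MESSAGES = {"Navigating frame was detached"}


def parse_log_line(line: str) -> Optional[dict]:
    """Parse one JSON log line; return None for non-JSON or non-object lines."""
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        return None
    return entry if isinstance(entry, dict) else None


def is_benign_warning(entry: dict) -> bool:
    """Return True for 'warn' entries whose detail message is known to be harmless."""
    details = entry.get("details")
    message = details.get("message") if isinstance(details, dict) else None
    return entry.get("logLevel") == "warn" and message in BENIGN_DETAIL_MESSAGES


# Abbreviated copy of the log line from this issue (stack trace left out).
sample = (
    '{"timestamp":"2024-05-22T06:37:29.844Z","logLevel":"warn",'
    '"context":"general","message":"Page Load Failed/Interrupted",'
    '"details":{"type":"exception","message":"Navigating frame was detached"}}'
)

entry = parse_log_line(sample)
if entry is not None and is_benign_warning(entry):
    print("ignorable:", entry["message"])
```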

I wasn't quite able to repro this - perhaps the banners are different now, or are based on location? Is this still an issue?

benoit74 commented 4 months ago

> This warning can be ignored

OK, perfect

> I wasn't quite able to repro this - perhaps the banners are different now, or are based on location? Is this still an issue?

Do you mean that when you open the browser at http://localhost:9223, the banners are visible on your setup? I just tried again and the banners are still not visible. This is not a location issue: I ran the container on my laptop (so the same IP as my "normal" browser), and the banners still don't appear when creating the profile, but they do appear in my "normal" browsing.