webrecorder / browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
https://browsertrix.com
GNU Affero General Public License v3.0

[Bug]: No ads in replay on some sites even though the ads are shown in the Brave profile or online #1606

Open tuehlarsen opened 3 months ago

tuehlarsen commented 3 months ago

Browsertrix Version

v1.9.4-08ee857

What did you expect to happen? What happened instead?

After the last upgrade to 1.9.4, ads are no longer shown in replay for tv2.dk, even though they are visible in the browser profile. Here is the replay of tv2.dk: image
And here is the browser profile (tv2.dk with cookies accepted): image

The same happens for berlingske.dk, but here it is not possible to see the ads in the browser profile either, even though I have disabled all shields in the Brave settings. Here is a snip of ads from berlingske.dk online: image And here is the browser profile: image And here are the brave:settings for shields: image I can see all ads in Brave with shields disabled here:

image image

Strangely enough, it still works for politiken.dk in replay: image

Reproduction instructions

see above

Screenshots / Video

No response

Environment

No response

Additional details

No response

tuehlarsen commented 3 months ago

I checked again this morning, and only the replay of berlingske.dk fails to show the ads. tv2.dk and politiken.dk replay some of the ads. Any hints as to what could be wrong with the setup for berlingske.dk concerning ads?

tw4l commented 3 months ago

Hi @tuehlarsen, in 1.9.4 we changed the default crawler version to the latest 1.0.0 beta, which may be responsible for the change. Could you try that crawl again with the "Previous" crawler channel (which is set to 0.12.4) to see if that works? You can find the crawler channel selector in Edit Workflow under Browser Settings:

Screen Shot 2024-03-18 at 1 28 05 PM

Here's the relevant section in the docs: https://docs.browsertrix.cloud/user-guide/workflow-setup/#crawler-release-channel

tuehlarsen commented 3 months ago

I tried berlingske.dk with the previous crawler version: it completely ignores the browser profile and the cookie acceptance. With the default crawler it crashes again and again with interrupt: 139.

ikreymer commented 3 months ago

I tried berlingske.dk with the previous crawler version: it completely ignores the browser profile and the cookie acceptance. With the default crawler it crashes again and again with interrupt: 139.

The crash in this case was due to sitemap parsing. A fix is coming shortly in webrecorder/browsertrix-crawler#496. In the meantime, disable 'Use Sitemap' for this crawl and try again.

tuehlarsen commented 3 months ago

Now it runs, but berlingske.dk shows no ads or traces of ads in replay. I saw the ads during the crawl and there was no cookie-accept popup, so it should be using the browser profile. It is almost the same with ekstrabladet.dk. In replay, there are a few ads in the middle column of the front page and only empty black columns to the left and right. The crawler saw all the ads in the left, right, and middle columns, but almost no ads are shown in replay. Here are online snips: image image

Here are some snips from the crawl: image image

And a snip from replay: image

tuehlarsen commented 3 months ago

I can see all ads in a Brave browser from a Danish IP with shields deactivated, so perhaps this is a Browsertrix replay issue?

tuehlarsen commented 3 months ago

The different news sites use different ad providers/frameworks, e.g. displaying iframes with HTML. information.dk does not use Google ads but https://www.adnami.io and shows no ads in replay, only empty spaces, while tv2.dk uses a mix of Google ads and https://betterbannerscloud.com. berlingske.dk also uses a mix of Google ads and https://www.adnami.io/ but uses the Google framework in a way that replay cannot handle. https://jyllands-posten.dk/ uses a mix of https://www.adnami.io/ and Google ads. The best showcases of ad replay are front-page crawls of politiken.dk and tv2.dk, even though some ads are missing and we are also crawling from non-Danish IPs. It seems to be hard work to support these ad frameworks, but I think it is important that the most dominant ones are supported when replaying a news site's "look and feel", because they interact with and overrun the news content so massively.
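To compare sites along these lines, one option is to tally the request URLs seen during a crawl (or in the live page) by ad provider and see which providers are missing in replay. Below is a minimal Python sketch, assuming you already have a plain list of URLs from somewhere; how that list is exported is not covered here. The names `AD_PROVIDERS`, `provider_for`, and `count_ad_requests` are hypothetical, and only adnami.io and betterbannerscloud.com come from this thread. The two Google hosts are assumptions and may not match what these sites actually load.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical mapping from host suffix to a provider label.
# adnami.io and betterbannerscloud.com are named in this thread;
# the Google ad hosts below are assumptions, adjust as needed.
AD_PROVIDERS = {
    "adnami.io": "adnami",
    "betterbannerscloud.com": "betterbanners",
    "doubleclick.net": "google ads (assumed host)",
    "googlesyndication.com": "google ads (assumed host)",
}


def provider_for(url):
    """Return the provider label for a URL, or None if it matches no known host."""
    host = (urlsplit(url).hostname or "").lower()
    for suffix, label in AD_PROVIDERS.items():
        # Match the host itself or any subdomain of it.
        if host == suffix or host.endswith("." + suffix):
            return label
    return None


def count_ad_requests(urls):
    """Count how many URLs in the list belong to each known ad provider."""
    counts = Counter()
    for url in urls:
        label = provider_for(url)
        if label is not None:
            counts[label] += 1
    return counts
```

Running `count_ad_requests` once on the URLs from the live page and once on the URLs in the archived crawl would give two tallies to compare; a provider present online but absent from the crawl would point to a capture problem rather than a replay problem.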

tuehlarsen commented 3 months ago

Re berlingske.dk: When I use the ArchiveWeb.page desktop version from Oct. 2023 [ArchiveWeb.page-0.11.3.exe], I can replay traces of the ads and play the videos in the audio/video list: https://beta.browsertrix.cloud/orgs/kb/items/upload/upload-55d89b6d-7561-43e1-a392-76c9ecd89a4f#replay

tuehlarsen commented 2 months ago

Progress: in version 1.9.7, information.dk shows Danish ads or traces of them in offline replay in the ReplayWeb.page desktop app, instead of empty placeholders! See:
image