webrecorder / replayweb.page

Serverless replay of web archives directly in the browser
https://replayweb.page
GNU Affero General Public License v3.0

[Bug]: Missing ads on news sites #266

Open tuehlarsen opened 9 months ago

tuehlarsen commented 9 months ago

Browsertrix Cloud Version

v1.8.0-beta.4-7d985a9

What did you expect to happen? What happened instead?

Ads are missing on the most-used news sites. Replay of news sites is missing most of the ads: some show "Archived Page Not Found", some are not displayed at all, and only a few are displayed. All ads can be seen in the watch crawl window.

Step-by-step reproduction instructions

e.g.

politiken.dk crawl: "pol frontpage with all context"
https://beta.browsertrix.cloud/orgs/netarkivet-det-kgl-bibliotek/items/crawl/sched-bb9b135d-357-28341060?workflowId=bb9b135d-3573-4901-bdef-a80d35a15741#replay
Replay shows: "Archived Page Not Found. Sorry, this page was not found in this archive:"
https://0e9755db0ca066211b5983705fdb4922.safeframe.googlesyndication.com/safeframe/1-0-40/html/container.html?n=2

tv2.dk crawl: tv2.dk frontpage complete context incl. ads https://beta.browsertrix.cloud/orgs/netarkivet-det-kgl-bibliotek/items/crawl/manual-20231118064936-03e01f26-37d?workflowId=03e01f26-37dd-4fa6-880f-db7bd6dd6679

berlingske.dk frontpage with context crawl: https://beta.browsertrix.cloud/orgs/netarkivet-det-kgl-bibliotek/items/crawl/manual-20231118095211-a4e6bc32-473?workflowId=a4e6bc32-4733-4a3f-8231-43b6df1c4031#replay

Additional details

No response

Shrinks99 commented 9 months ago

This may be a result of switching to the Brave browser, which has more aggressive privacy settings by default. It should be possible to disable these on a per-browser-profile basis, but they should likely be off by default unless the "block ads" setting has been enabled by the user.

In the meantime, try creating a browser profile with some of Brave's "Shields" settings disabled.

tuehlarsen commented 9 months ago

I’m using the beta.browsertrix GUI v. 1,8* with no blocking of ads, and I can’t change the crawling browser. To me, ad replay seems much better than it was a year ago. During the crawl I can see all the ads in the crawl window, so the crawler sees the ads. Some of the ads are replayed fine, but not all. I think it’s “only” a question of harvesting URLs and replay 😊.

Best regards Tue

tuehlarsen commented 5 months ago

If you download https://beta.browsertrix.cloud/orgs/kb/items/crawl/manual-20240323083932-bb9b135d-357?workflowId=bb9b135d-3573-4901-bdef-a80d35a15741#files:~:text=20240323084140064%2Dbb9b135d%2D357%2D0.wacz
and load the WACZ file offline with ReplayWeb.page 2.00.beta, it replays the ads that were harvested. But if you unzip the file and load only the WARC file, kb-pol-frontpage-with-all-context-manual-20240323083932-bb9b135d-357-20240323083954557-0.warc.gz, the replay of https://politiken.dk does not show any ads, even though they are all in the warc.gz file and can be replayed using the image/audio/video URL list. The same happens in pywb: no ads in replay.
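
For anyone comparing the two cases above, here is a minimal Python sketch (standard library only) that lists what a WACZ carries besides the WARC data. It assumes the usual WACZ layout, i.e. a ZIP file with archive/, indexes/ and pages/ directories plus a datapackage.json. The function name summarize_wacz and the file name in the usage line are hypothetical, and the sketch only shows which parts are dropped when the warc.gz is loaded on its own; it does not by itself explain the missing ads.

```python
import zipfile
from collections import defaultdict


def summarize_wacz(path):
    """Group the member names of a WACZ (a ZIP file) by top-level directory.

    Files stored at the root of the ZIP (e.g. datapackage.json) are
    collected under the key ".".
    """
    groups = defaultdict(list)
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            # "archive/foo.warc.gz" -> "archive"; "datapackage.json" -> "."
            top = name.split("/", 1)[0] if "/" in name else "."
            groups[top].append(name)
    return dict(groups)


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python summarize_wacz.py crawl.wacz
    if len(sys.argv) > 1:
        for top, names in sorted(summarize_wacz(sys.argv[1]).items()):
            print(f"{top}/: {len(names)} file(s)")
            for name in names:
                print(f"  {name}")
```

Everything listed outside archive/ (the index files, the page list, the package metadata) is what a replay tool no longer receives when only the extracted warc.gz is loaded.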