webrecorder / browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0

Make screenshot after custom behaviors #486

Open cmillet2127 opened 6 months ago

cmillet2127 commented 6 months ago

Currently, it seems screenshots are taken before custom behaviors run.

It would be very useful to be able to take a screenshot after custom behaviors have run, for example to capture the page after the "accept cookies" modals have been removed.

ikreymer commented 6 months ago

We are using Brave, and the accept cookies modals are actually removed by the browser before our custom behaviors are run, so the screenshot should actually reflect that, I believe. But your point stands that it could be interesting to take a screenshot after autoscroll, etc.

cmillet2127 commented 6 months ago

In my situation, I'm encountering an unusual behavior while running Docker on Windows with the most recent image release. When utilizing the browsertrix-crawler, the 'accept cookies' modal persists. However, when navigating manually with Brave browser, the modal does not appear. Initially, I suspected that my image was still employing Chrome, but your confirmation of its use of Brave has led me to reconsider.

Example: docker run -v c:\tmp\crawls\:/crawls/ -it webrecorder/browsertrix-crawler crawl --url https://www.abarth.fr --generateWACZ final-to-warc --text --wait-until domcontentloaded --screenshot thumbnail,view,fullPage --scopeType page --blockAds

cmillet2127 commented 6 months ago

> We are using Brave, and the accept cookies modals are actually removed by the browser before our custom behaviors are run, so the screenshot should actually reflect that, I believe. But, your point stands that it could be interesting to take a screenshot after autoscroll, etc..

Indeed, an additional suggestion might involve capturing a screenshot through a custom behavior using a 'utils' method. This approach would allow us to incorporate it into the WARC file, aligning with the methodology used for other screenshots.
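
To make the idea concrete, here is a rough sketch of the ordering I have in mind. Everything in it is hypothetical: `ScreenshotUtils`, `captureScreenshot` and `runThenCapture` are names made up for illustration and are not part of browsertrix-crawler or browsertrix-behaviors. It only shows that the behavior steps finish first and the screenshot is requested afterwards, with the crawler side (not shown) assumed to be responsible for writing the image to the WARC.

```typescript
// Hypothetical sketch: none of these names exist in browsertrix-crawler.

// Assumed helper the crawler could expose to custom behaviors.
interface ScreenshotUtils {
  // Asks the crawler to take a screenshot and store it with the given label.
  captureScreenshot(label: string): Promise<void>;
}

// Runs each behavior step in order, then requests a single screenshot.
// Returns a small log of what happened, in order.
async function runThenCapture(
  steps: Array<() => Promise<void>>,
  utils: ScreenshotUtils,
  label: string = "post-behavior",
): Promise<string[]> {
  const log: string[] = [];
  for (let i = 0; i < steps.length; i++) {
    // e.g. dismiss a cookie modal, or scroll to trigger lazy-loaded images
    await steps[i]();
    log.push("step " + i);
  }
  // Only after every step has finished is the screenshot requested.
  await utils.captureScreenshot(label);
  log.push("screenshot:" + label);
  return log;
}
```

A custom behavior would pass its own steps (closing the modal, scrolling, and so on) and whatever object the crawler provides for `utils`. Whether such a hook fits the existing behavior API is an open question.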

ikreymer commented 6 months ago

If you run webrecorder/browsertrix-crawler it will use webrecorder/browsertrix-crawler:latest, which currently still points to the non-Brave version, unless you check out the repo and build it locally. You can try the latest beta release with webrecorder/browsertrix-crawler:1.0.0-beta.7. We hope to release the 1.0.0 version soon and then it will be latest.

fservida commented 1 month ago

@ikreymer is this still of interest? It would be extremely useful for us: some websites load images dynamically during scrolling, so those images are missing when the full-page screenshot is taken before custom behaviors run. I am unfamiliar with JS and quite lost in the code, but if someone points me to the screenshot logic, I can try something out and submit a PR.