webrecorder / browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0

Debugging mode with short videos #133

Open despens opened 2 years ago

despens commented 2 years ago

The screencast option is very useful to observe how websites might cause the crawler to hang, for instance because of cookie banners, captchas, etc.

It would be great if there were a mode that, instead of capturing a web archive, captured a video of a single worker crawling. This could be used to check whether any issues are to be expected during crawling. As the crawling browser doesn't feature a full user interface that displays the current URL, a plain-text subtitle file (in SRT format or similar) could be generated so that the URL appears in the video.

Obviously it would be best limited to a small number of pages or a short overall crawl time.
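
The subtitle idea could be sketched roughly as follows. This is only an illustration and not part of browsertrix-crawler: the `visits` list of (seconds since the video started, URL) pairs and both function names are hypothetical, and it assumes the crawler could report when each page load began relative to the start of the recording. Each URL stays on screen until the next page starts loading.

```python
from typing import List, Tuple


def format_timestamp(seconds: float) -> str:
    """Format an offset in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def build_srt(visits: List[Tuple[float, str]], total_duration: float) -> str:
    """Build SRT text with one cue per visited URL.

    `visits` is a hypothetical list of (start offset in seconds, URL) pairs,
    sorted by start offset. A cue ends when the next one starts; the last
    cue ends at `total_duration`.
    """
    cues = []
    for i, (start, url) in enumerate(visits):
        end = visits[i + 1][0] if i + 1 < len(visits) else total_duration
        cues.append(
            f"{i + 1}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{url}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)


if __name__ == "__main__":
    example = [(0.0, "https://example.com/"), (4.25, "https://example.com/about")]
    print(build_srt(example, total_duration=10.0))
```

Saving the result next to the video under the same base name (for example `crawl.mp4` and `crawl.srt`) is a common convention that many players pick up automatically, though that behaviour depends on the player.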

ikreymer commented 2 years ago

Yes, it's an interesting idea. I was thinking of perhaps a 'video log' to accompany a crawl, which could be treated as a log in addition to the regular text log. Of course, for long-running crawls this would need to be broken up into smaller chunks, e.g. a crawl could be running for several days!

despens commented 2 years ago

A video of a full crawl could indeed grow to a massive size. But for starters, relying on the user to limit the crawl according to their available capacity could be enough, especially if it is conceptualized as a debugging tool. This could be an m3u list with subtitled videos in the end. :)
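
A minimal sketch of what such a playlist could look like, assuming the video log has already been split into chunk files. The file names, durations, and the `build_m3u` helper are hypothetical and not something the crawler produces today. It writes the extended M3U form (`#EXTM3U` header, one `#EXTINF` line per entry).

```python
from typing import List, Tuple


def build_m3u(chunks: List[Tuple[str, float, str]]) -> str:
    """Build an extended M3U playlist.

    `chunks` is a hypothetical list of (file name, duration in seconds, title)
    tuples, one per video chunk, in playback order.
    """
    lines = ["#EXTM3U"]
    for filename, duration, title in chunks:
        # EXTINF carries the duration in whole seconds and a display title.
        lines.append(f"#EXTINF:{int(round(duration))},{title}")
        lines.append(filename)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    playlist = build_m3u([
        ("crawl-000.mp4", 600.0, "Crawl video log, part 1"),
        ("crawl-001.mp4", 312.4, "Crawl video log, part 2"),
    ])
    print(playlist)
```

The playlist itself does not reference subtitle files; whether a player loads a `crawl-000.srt` sitting next to `crawl-000.mp4` is up to the player.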