webrecorder / browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
https://browsertrix.com
GNU Affero General Public License v3.0

Config scheduling - Hourly #613

Open SB-JM opened 1 year ago

SB-JM commented 1 year ago

I'm missing more options under Scheduling/Frequency, such as multiple harvests per day: for example, harvesting every hour ("once_a_time"). This is important if you want to follow how the front page of a news outlet changes continuously over the course of a day.

Shrinks99 commented 1 year ago

Will probably be addressed around the same time as https://github.com/webrecorder/browsertrix-cloud/issues/389 ?

Hourly requires a little more thought than yearly though... What would you expect to happen if a crawl is still running on the hour when the config is set to auto start again? Should it stop the existing crawl? Should it continue the existing crawl and start a new one?

EDIT: Currently, because workflows can only be in a running or not-running state, a new crawl will not start if the schedule fires while a workflow is already running. Ideally this would produce a notification in the future telling users that a scheduled crawl was skipped.
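The skip-on-overlap behavior described above could be sketched roughly as follows. This is a minimal illustration, not Browsertrix's actual implementation; the `Workflow` class and `on_schedule_fire` function are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass, field

@dataclass
class Workflow:
    """Hypothetical stand-in for a crawl workflow with a single running flag."""
    name: str
    running: bool = False
    notifications: list = field(default_factory=list)

def on_schedule_fire(wf: Workflow) -> bool:
    """Handle a scheduled trigger: if a crawl is already running, skip the
    new run and record a notification instead of starting a second crawl."""
    if wf.running:
        wf.notifications.append(
            f"Scheduled crawl for '{wf.name}' was skipped: a crawl is still running"
        )
        return False
    wf.running = True  # in a real system this would launch the crawl
    return True
```

Under this policy a second trigger arriving mid-crawl is dropped but leaves an audit trail, which matches the "notify users that a scheduled crawl was skipped" idea.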

tuehlarsen commented 2 months ago

At the Royal Danish Library we need to crawl e.g. news sites' front pages many times per day because they change very often, especially during breaking news. Today we have Heritrix jobs running on many different schedules and depths, e.g. front pages up to 12 times per day, 2 hops down once per week, 3-4 hops down once per month. If a job is still running when a new scheduled run tries to start, it should notify the user or admin about the postponed schedule. It should never start a new crawl with the same crawl name.
see also https://github.com/webrecorder/browsertrix/issues/1372
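The mix of schedules described above could be expressed as standard cron expressions. The concrete expressions and workflow names below are illustrative assumptions (e.g. picking 06:00-17:00 for the 12 daily front-page crawls), not anything Browsertrix or the Royal Danish Library actually configures:

```python
# Hypothetical schedules mirroring the Heritrix setup described above,
# as standard five-field cron expressions plus a crawl depth (hops).
SCHEDULES = [
    {"name": "frontpage-12x-daily", "cron": "0 6-17 * * *", "hops": 0},  # hourly, 06:00-17:00
    {"name": "frontpage-weekly",    "cron": "0 3 * * 1",    "hops": 2},  # Mondays at 03:00
    {"name": "site-monthly",        "cron": "0 4 1 * *",    "hops": 4},  # 1st of month at 04:00
]

def fires_per_day(cron: str) -> int:
    """Count daily firings from the hour field of a cron expression,
    assuming minute is fixed and the date fields are '*'.
    Minimal parser: supports '*', 'a-b' ranges, and single hour values."""
    hour = cron.split()[1]
    if hour == "*":
        return 24
    if "-" in hour:
        lo, hi = map(int, hour.split("-"))
        return hi - lo + 1
    return 1
```

The hour-range form (`6-17`) yields the "up to 12 times per day" cadence without needing sub-hourly granularity.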