webrecorder / browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
https://browsertrix.com
GNU Affero General Public License v3.0

Stop crawls if org storage space is capped & full #579

Closed ldko closed 1 year ago

ldko commented 1 year ago

Heritrix has a Disk space monitor that "Monitors the available space on the paths configured. If the available space drops below a specified threshold a crawl pause is requested."

I would find something similar helpful for local deployments of Browsertrix Cloud: if little space is left on the volume where crawl files are being written, crawls would be paused. The size of crawl content can currently be configured, but if a crawl tries to exceed what is actually available and fills the disk to 100%, then in a deployment where WACZ files are written to the same place as the microk8s cluster etc., it takes down the whole system.

May be related to #427.

tw4l commented 1 year ago

May want to implement in crawler - see https://github.com/webrecorder/browsertrix-crawler/issues/242

tw4l commented 1 year ago

This has been implemented in the crawler, which will gracefully stop if disk utilization passes, or is projected to pass, a certain threshold (90% by default). This threshold is configurable in the crawler, and we can make it settable via the Helm chart values in Browsertrix Cloud.
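
For anyone wanting to reason about the behaviour described above, here is a minimal Python sketch of a "current or projected utilization" check. It is not the crawler's actual code: the function names `exceeds_threshold` and `should_stop_crawl`, the `pending_bytes` parameter, and the way the projection is computed are all assumptions made for illustration. Only the 90% default comes from this thread.

```python
import shutil

# 90% is the default threshold mentioned in this thread.
DEFAULT_THRESHOLD = 0.90


def exceeds_threshold(used_bytes, total_bytes, pending_bytes=0,
                      threshold=DEFAULT_THRESHOLD):
    """Return True if current or projected disk utilization passes the threshold.

    pending_bytes is a hypothetical estimate of data still to be written
    (for example, the size of files not yet flushed to disk).
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    current = used_bytes / total_bytes
    projected = (used_bytes + pending_bytes) / total_bytes
    return current > threshold or projected > threshold


def should_stop_crawl(path, pending_bytes=0, threshold=DEFAULT_THRESHOLD):
    """Check the volume holding `path` (hypothetical helper, not a crawler API)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return exceeds_threshold(usage.used, usage.total, pending_bytes, threshold)
```

A caller could poll `should_stop_crawl("/crawls")` periodically (the path is only an example) and begin a graceful stop as soon as it returns `True`, which leaves headroom before the volume is actually full.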