webrecorder / browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
https://browsertrix.com
GNU Affero General Public License v3.0

[Bug]: Running out of space halts jobs; unable to restart them after more space is available. #2035

Open thsm-kb opened 3 weeks ago

thsm-kb commented 3 weeks ago

Browsertrix Version

v1.11.0-4aca107

What did you expect to happen? What happened instead?

Ran out of space. Expected jobs to pause gracefully and resume once space was freed. Instead, jobs halted and could not be resumed. There was also no warning or status message before or when storage ran out.

Reproduction instructions

1. Run out of space
2. See what happens

Screenshots / Video

No response

Environment

No response

Additional details

No response

Shrinks99 commented 3 weeks ago

This is intended behavior, as we currently have no way of resuming a stopped crawl and picking up where it left off. The feature issue for resuming a stopped crawl covers a few other points of friction, and I have now added "out of disk space" as a user story. Until that is addressed, however, our best advice is to keep a reasonable amount of space free within your org to avoid this happening in the future!

Dupe of #1753

ikreymer commented 3 weeks ago

Henry is referring to our hosted service; however, I think this is about a self-deployment, if I understand correctly?

@thsm-kb it would be helpful if you provided more information about your configuration and how you're running out of space, as I understand you're self-deploying this.

The architecture of the system is to allocate temporary volumes of a fixed size and upload the data to an S3 bucket once the crawl stores enough data on the volume or reaches a certain percentage of the volume's disk. The default requested size is configured at 26Gi to support WACZ files up to 10GB; however, it's possible to set these values differently, e.g. you could have a maximum WACZ file size of 1GB and allocate 10GB of storage, just in case. Of course, when crawling large files is involved, the WACZ files can exceed that size, as we can't split larger video files, etc.

In either case, the crawl should not halt, but rather create a WACZ file and upload it to the S3 bucket when the temporary space exceeds the threshold. The system will also wait until it can allocate a volume of the requested size, which is guaranteed storage. The crawler does assume that it can upload to the S3 bucket without any limits.

Without more information, it's hard to understand what is happening and where the 'running out of space' occurs.
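For a self-hosted deployment, the sizes described above are set through the Helm chart values. Below is a minimal sketch of how a larger temporary volume might be requested; the key name and default shown are assumptions based on the description in this comment and may differ between chart versions, so verify against the chart's values.yaml before applying.

```yaml
# custom-values.yaml -- illustrative sketch only; verify key names against
# the chart's values.yaml for your Browsertrix version.

# Assumed key: size of the temporary volume requested for each crawler pod.
# Per the comment above, the default (~26Gi) is sized to hold a WACZ of up
# to 10GB plus in-progress crawl data before rollover and upload to S3.
crawler_storage: "26Gi"
```

The important relationship is that the temporary volume must comfortably exceed the maximum WACZ size configured for the crawl, so the crawler can package and upload a WACZ before the volume fills; the volume itself is only working space, and the long-term capacity limit is the S3 bucket.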

thsm-kb commented 3 weeks ago

Correct, it is a self-deployment. We simply ran out of storage while crawling 580,000 .dk front pages.