Open thsm-kb opened 3 months ago
This is intended behavior, as we currently have no way of resuming a stopped crawl and picking up where you left off. The feature issue for resuming a stopped crawl covers a few other points of friction, and I have now added "out of disk space" as a user story. Until that is addressed, however, our best advice is to ensure you have a reasonable amount of space left within your org to avoid this occurring in the future!
Dupe of #1753
Henry is referring to our hosted service; however, I think this is about self-deployment, if I understand correctly?
@thsm-kb it would be helpful if you provided more information about your configuration and how you're running out of space, as I understand you're self-deploying this. The system allocates temporary volumes of a fixed size and uploads the data to an S3 bucket when the crawl has stored enough data on the volume or the volume reaches a certain percentage of its capacity. The default requested size is configured at 26Gi to support WACZ files up to 10GB; however, it's possible to set these values differently, e.g. you could set a maximum WACZ file size of 1GB and allocate 10GB of storage, just in case. Of course, when crawling large files, a WACZ file can exceed the size limit, as we can't split larger video files, etc. In either case, the crawl should not halt, but rather create a WACZ file and upload it to the S3 bucket if the temporary space exceeds the threshold. The system will also wait until it can allocate a volume of a certain size, which is guaranteed storage. The crawler does assume that it can upload to the S3 bucket without any limits. Without more information, it's hard to understand what is happening and where the 'running out of space' occurs.
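To make the rollover rule described above concrete, here is a minimal Python sketch of the decision, not the actual Browsertrix implementation. The names `VolumePolicy` and `should_upload` are hypothetical, and the 80% usage threshold is an assumed value chosen only for illustration; the 26Gi volume and 10GB WACZ limit are the defaults mentioned in the comment (treated here as binary units for simplicity).

```python
from dataclasses import dataclass

GIB = 1024 ** 3  # bytes in one gibibyte


@dataclass
class VolumePolicy:
    """Hypothetical container for the sizing values discussed in the thread."""
    volume_capacity_bytes: int   # size of the temporary crawl volume
    wacz_size_limit_bytes: int   # target maximum size of one WACZ file
    usage_threshold: float       # assumed fraction of the volume that triggers an upload


def should_upload(policy: VolumePolicy, pending_bytes: int, volume_used_bytes: int) -> bool:
    """Return True when the crawler should finalize a WACZ and upload it.

    Either condition is enough: the pending data reached the WACZ size
    limit, or the volume usage reached the configured percentage.
    """
    size_hit = pending_bytes >= policy.wacz_size_limit_bytes
    usage_hit = volume_used_bytes >= policy.usage_threshold * policy.volume_capacity_bytes
    return size_hit or usage_hit


# Defaults from the thread: 26Gi volume, WACZ files up to 10GB.
# The 0.8 threshold is an assumption, not a documented default.
default_policy = VolumePolicy(
    volume_capacity_bytes=26 * GIB,
    wacz_size_limit_bytes=10 * GIB,
    usage_threshold=0.8,
)
```

Under this sketch a single large file that cannot be split may still push a WACZ past the size limit, which is why the volume is sized well above the limit. Neither check covers the S3 bucket itself, since the crawler assumes uploads are unlimited.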
Correct, it is self-deployment. We simply ran out of storage while crawling 580,000 .dk front pages.
Browsertrix Version
v1.11.0-4aca107
What did you expect to happen? What happened instead?
Ran out of space. Expected jobs to pause gracefully and resume once space was freed. Instead, jobs halted and were unable to resume. There was also no warning or status message before or when storage ran out.
Reproduction instructions
1. Run out of space
2. See what happens
Screenshots / Video
No response
Environment
No response
Additional details
No response