Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
In the current setup, a crawl runs as a K8s Job that runs to completion. If a pod fails, the crawl can be restarted as long as the volume is still available (eg. a shared NFS). A crawl can also be scaled up and down by adding or removing parallel pods in the job. However, each job always runs to completion.
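For context, a minimal sketch of what such a per-crawl Job could look like via the kubernetes Python client; the image args, PVC name, namespace, and parallelism values are illustrative and not the actual Browsertrix deployment:

```python
from kubernetes import client, config

# Illustrative sketch: a parallel crawl Job sharing one NFS-backed PVC, so a
# restarted pod can pick up the crawl data already written to the volume.
config.load_kube_config()

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="crawl-job-example"),
    spec=client.V1JobSpec(
        parallelism=2,       # scale the crawl up/down by adjusting parallel pods
        backoff_limit=10,    # failed pods are replaced; data persists on the shared volume
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="crawler",
                        image="webrecorder/browsertrix-crawler:latest",
                        # assumed config path; actual flags depend on the crawl config
                        args=["crawl", "--config", "/crawls/config.yaml"],
                        volume_mounts=[
                            client.V1VolumeMount(name="crawl-data", mount_path="/crawls")
                        ],
                    )
                ],
                volumes=[
                    client.V1Volume(
                        name="crawl-data",
                        persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                            claim_name="crawl-nfs-pvc"   # hypothetical shared NFS PVC
                        ),
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="crawlers", body=job)
```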
This is not scalable for several reasons: really long-running crawls can block other crawls from starting, interrupted crawls are not guaranteed to restart, the state saved in redis can grow fairly large, and the final WACZ files can get fairly large.
This issue is to track a more scalable approach to large crawls:
browsertrix-crawler will periodically 'commit' crawled data once a certain size or time threshold is reached, eg. after 5GB or every hour, to an 'intermediate' s3 target (see the first sketch after this list).
the crawl state is also written to this s3 target, and the done queue in redis can be cleared after the crawl state yaml is written.
the job can be restarted from the last saved state in the s3 bucket at any time, even if all jobs are interrupted or die.
when a crawl is done, the many smaller WACZ files can be downloaded and concatenated into a smaller number of WACZ files of a given size. This would require a final 'reducer' job of sorts (see the second sketch after this list).
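To make the first three steps concrete, here is a minimal sketch of the periodic 'commit', assuming boto3 and redis-py clients; the bucket name, key layout, redis key names, thresholds, and the maybe_commit() helper are all illustrative, not actual browsertrix-crawler internals:

```python
import time
import boto3
import redis
import yaml

COMMIT_SIZE = 5 * 1024 ** 3    # eg. commit once ~5GB of new data has accumulated
COMMIT_INTERVAL = 60 * 60      # ...or once an hour, whichever comes first

s3 = boto3.client("s3", endpoint_url="https://s3.example.com")  # 'intermediate' s3 target (assumed)
rds = redis.Redis(host="redis")

def maybe_commit(crawl_id, wacz_path, bytes_since_commit, last_commit_time):
    """Upload an intermediate WACZ + crawl state yaml, then clear the redis done queue."""
    if bytes_since_commit < COMMIT_SIZE and time.time() - last_commit_time < COMMIT_INTERVAL:
        return False

    # 1. upload the data crawled since the last commit as a small WACZ
    s3.upload_file(wacz_path, "crawls", f"{crawl_id}/parts/{int(time.time())}.wacz")

    # 2. snapshot the crawl state (queue, done count) as yaml on the same target,
    #    so a restarted job can resume from it even if every pod dies
    state = {
        "queued": [u.decode() for u in rds.lrange(f"{crawl_id}:q", 0, -1)],
        "done": rds.llen(f"{crawl_id}:d"),
    }
    s3.put_object(
        Bucket="crawls",
        Key=f"{crawl_id}/state.yaml",
        Body=yaml.safe_dump(state).encode(),
    )

    # 3. the done queue is now persisted in the state yaml, so it can be cleared
    #    to keep redis memory bounded
    rds.delete(f"{crawl_id}:d")
    return True
```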
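And a rough sketch of how the final 'reducer' could plan its work, grouping the intermediate WACZ files into batches of roughly a target size before merging each batch; the bucket layout, TARGET_SIZE, and the merge step itself are assumptions here:

```python
import boto3

TARGET_SIZE = 10 * 1024 ** 3   # desired size of each merged WACZ (illustrative)

def plan_merge_groups(crawl_id, bucket="crawls"):
    """Group the intermediate WACZs into batches of roughly TARGET_SIZE each."""
    s3 = boto3.client("s3")
    objs = s3.list_objects_v2(Bucket=bucket, Prefix=f"{crawl_id}/parts/").get("Contents", [])

    groups, current, current_size = [], [], 0
    for obj in sorted(objs, key=lambda o: o["Key"]):
        if current and current_size + obj["Size"] > TARGET_SIZE:
            groups.append(current)
            current, current_size = [], 0
        current.append(obj["Key"])
        current_size += obj["Size"]
    if current:
        groups.append(current)

    # each group would then be downloaded and combined by the reducer job, eg.:
    #   merge_waczs(group, output=f"{crawl_id}/final/{i}.wacz")   # hypothetical merge step
    return groups
```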
As an additional consideration, it would be interesting to explore the idea of time-shared, round-robin jobs: eg. if a cluster can only support 10 concurrent crawls but users submit 100, there needs to be a way to rotate them out at fixed intervals.
It seems like the new Job suspension feature is designed to support exactly this sort of use case, though it requires additional work, eg. a 'higher order controller': https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/2232-suspend-jobs#story-2
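For illustration, a minimal sketch of what such a 'higher order controller' pass could look like with the kubernetes Python client, toggling the batch/v1 Job `suspend` field (available since K8s 1.21); the namespace, label selector, and MAX_ACTIVE limit are illustrative assumptions:

```python
from kubernetes import client, config

MAX_ACTIVE = 10   # eg. the cluster can only support 10 concurrent crawls

def rotate_crawl_jobs(namespace="crawlers"):
    """Suspend the currently active crawl Jobs and resume the next batch in line."""
    config.load_incluster_config()
    batch = client.BatchV1Api()

    jobs = batch.list_namespaced_job(namespace, label_selector="role=crawl").items
    pending = [j for j in jobs if j.spec.suspend and not j.status.succeeded]
    active = [j for j in jobs if not j.spec.suspend and not j.status.succeeded]

    # pause the jobs whose time slice is up...
    for job in active:
        batch.patch_namespaced_job(job.metadata.name, namespace, {"spec": {"suspend": True}})

    # ...and un-suspend up to MAX_ACTIVE waiting jobs for the next interval
    for job in pending[:MAX_ACTIVE]:
        batch.patch_namespaced_job(job.metadata.name, namespace, {"spec": {"suspend": False}})
```

In practice something like this would run on a fixed interval (eg. from a CronJob), and would only make sense once crawls can save and resume their state from s3 as described above, since suspending a Job deletes its running pods.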