webrecorder / browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
https://webrecorder.net/browsertrix
GNU Affero General Public License v3.0

New Incremental Crawling + Time-Slicing Paradigm #219

Open ikreymer opened 2 years ago

ikreymer commented 2 years ago

In the current setup, each crawl runs to completion as a single job. A crawl can be scaled up and down (in K8s) by adding or removing parallel pods in the job. If a pod fails, the crawl can be restarted if the volume is still available (e.g. a shared NFS).

This is not scalable for several reasons: very long-running crawls can block other crawls from starting, interrupted crawls are not guaranteed to restart, the state saved in Redis can grow fairly large, and the final WACZ files can get fairly large.

This issue is to track a more scalable approach to large crawls:

An additional consideration: it would be interesting to explore time-shared round-robin jobs, e.g. if a cluster can only support 10 crawls but users submit 100, we need a way to rotate them out at fixed intervals. The new Kubernetes job suspension feature seems designed to support exactly this sort of use case, though it requires additional work, e.g. a 'higher order controller': https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/2232-suspend-jobs#story-2
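
To make the round-robin idea more concrete, here is a rough sketch of how such a rotation could work. It is a pure in-memory simulation, not Browsertrix code and not a real controller: `CrawlJob`, `RoundRobinController`, `remaining` and `tick` are all hypothetical names, and the `suspended` flag only stands in for what a real controller would presumably do by toggling `spec.suspend` on a Kubernetes Job.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class CrawlJob:
    """Hypothetical stand-in for one crawl job."""
    name: str
    remaining: int          # assumed: number of time slices of work left
    suspended: bool = True  # stands in for the K8s Job `spec.suspend` flag


class RoundRobinController:
    """Toy 'higher order controller': at most `capacity` crawls run per slice."""

    def __init__(self, jobs, capacity):
        self.capacity = capacity
        self.waiting = deque(jobs)  # suspended crawls, in rotation order
        self.running = []           # crawls active in the current slice
        self.finished = []          # names of completed crawls

    def tick(self):
        """Advance one time slice and return the names of crawls that ran."""
        # Suspend everything from the previous slice and send it to the
        # back of the queue so that waiting crawls get the next turn.
        for job in self.running:
            job.suspended = True
            self.waiting.append(job)
        self.running = []

        # Resume crawls from the front of the queue, up to cluster capacity.
        while self.waiting and len(self.running) < self.capacity:
            job = self.waiting.popleft()
            job.suspended = False
            self.running.append(job)

        ran = [job.name for job in self.running]

        # Simulate one slice of crawling; completed crawls leave the rotation.
        unfinished = []
        for job in self.running:
            job.remaining -= 1
            if job.remaining <= 0:
                self.finished.append(job.name)
            else:
                unfinished.append(job)
        self.running = unfinished
        return ran
```

With a capacity of 10 and 100 submitted crawls, each call to `tick` would run the next 10 crawls in the queue, so every crawl gets a slice once per 10 intervals. A real version would also need to persist crawl state between slices, which is what the incremental approach in this issue is meant to address.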

ikreymer commented 2 years ago

Already implemented now:

Not yet implemented: