Pausing / Scaling Crawls - Githubissues

webrecorder / browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!

https://webrecorder.net/browsertrix

GNU Affero General Public License v3.0


Pausing / Scaling Crawls #2

Closed ikreymer closed 3 years ago

ikreymer commented 3 years ago

Support increasing or decreasing the number of pods running a crawl. Requires:

[ ] Generate Crawl ID separately, not based on job/docker container id
[x] Use Shared Redis for Crawl, instead of local one in Browsertrix Crawler (for how long??)
[ ] Crawl doc supports multiple file entries instead of just one
[ ] Decide which approach to take, 1 or 2:

1. Scale via Pause and Restart

To scale:

Job is stopped gracefully and a WACZ is written.
Crawl doc is set to 'partial_complete', with a file entry added for each completed pod.
Job is restarted with the same crawl id and the shared Redis state.
Final pod adds the final WACZ and sets the crawl state to 'complete'.
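
The crawl doc transitions described above can be sketched roughly as follows. This is only an in-memory illustration, not the Browsertrix implementation: the `CrawlDoc` class, its field names, the `add_wacz` helper, and the initial 'running' state are all hypothetical. Only the 'partial_complete' and 'complete' states come from the steps above.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlDoc:
    """Hypothetical crawl record that can hold multiple file entries."""
    crawl_id: str
    state: str = "running"  # assumed initial state, not from the issue
    files: list = field(default_factory=list)


def add_wacz(doc, filename, is_final):
    """Record one pod's WACZ on the crawl doc and update its state.

    is_final is True only for the last pod of a crawl that has
    actually finished, as opposed to a graceful stop for scaling.
    """
    doc.files.append(filename)
    doc.state = "complete" if is_final else "partial_complete"
    return doc
```

Because the crawl id is kept across the restart, every WACZ lands on the same doc, and only the final pod moves it to 'complete'.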

Pros:

Maintain one job per crawl at a time.
K8s takes care of parallelism, works with cron job.

Cons:

Scaling up or down requires stopping the job and restarting it with more or fewer pods.
Harder to support via Docker only.

2. Add more jobs / remove jobs to scale

To scale up:

New job added with crawl id of existing Redis state.

To scale down:

One or more existing jobs stopped (graceful stop)
crawl doc updated with new WACZ and 'partial_complete'
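
As a rough illustration of approach 2, the bookkeeping might look like the sketch below. Everything here is hypothetical (the `CrawlJobs` class, the job naming scheme, the WACZ file names). It only models the idea that several jobs share one crawl id and that each gracefully stopped job contributes a WACZ.

```python
class CrawlJobs:
    """In-memory stand-in for the jobs attached to one crawl id."""

    def __init__(self, crawl_id):
        self.crawl_id = crawl_id
        self.jobs = []           # names of jobs currently running
        self.files = []          # WACZ entries recorded on the crawl doc
        self.state = "running"   # assumed initial state
        self._counter = 0

    def scale_up(self, count=1):
        """Add new jobs that reuse the existing crawl id (shared Redis state)."""
        for _ in range(count):
            self._counter += 1
            self.jobs.append(f"{self.crawl_id}-job-{self._counter}")
        return list(self.jobs)

    def scale_down(self, count=1):
        """Gracefully stop jobs; each one adds a WACZ to the crawl doc."""
        for _ in range(min(count, len(self.jobs))):
            job = self.jobs.pop()
            self.files.append(f"{job}.wacz")
            self.state = "partial_complete"
        return list(self.jobs)
```

For example, `CrawlJobs("crawl-1").scale_up(3)` leaves three running jobs, and a later `scale_down(1)` stops the most recently added one and records its WACZ.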

Pros:

Scaling up and down without any interruption
Can be implemented in similar way w/o K8S

Cons:

Multiple jobs per crawl.
Unclear how to handle cron jobs.

ikreymer commented 3 years ago

Using a hybrid approach:

K8S - Scale via Adjusting Parallelism

Based on 1 above.

Parallelism of existing jobs can be adjusted to scale up or down
Parallelism can be set on cronjob
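
A minimal sketch of adjusting parallelism, assuming `batch_api` is a `kubernetes.client.BatchV1Api` instance from the official Python client (any object with a compatible `patch_namespaced_job` method works). The helper names and the job name and namespace values are hypothetical, and this is not necessarily how Browsertrix does it.

```python
def job_parallelism_patch(parallelism):
    """Build a patch body that sets spec.parallelism on a Job."""
    if parallelism < 0:
        raise ValueError("parallelism must be non-negative")
    return {"spec": {"parallelism": parallelism}}


def cronjob_parallelism_patch(parallelism):
    """Build the equivalent patch body for a CronJob's job template."""
    return {"spec": {"jobTemplate": job_parallelism_patch(parallelism)}}


def scale_job(batch_api, name, namespace, parallelism):
    """Patch an existing Job so that it runs `parallelism` pods."""
    body = job_parallelism_patch(parallelism)
    return batch_api.patch_namespaced_job(
        name=name, namespace=namespace, body=body
    )
```

Usage would be along the lines of `scale_job(batch_api, "crawl-job", "crawlers", 3)`. The CronJob body is meant to be passed to the client's `patch_namespaced_cron_job` in the same way.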

Docker

TODO - Probably will need to be based on 2 above.

ikreymer commented 3 years ago

The {crawl}/scale endpoint can be used to scale w/o pausing; supported via K8s only for now. Pausing is no longer needed to scale, so closing.