webrecorder / browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
https://webrecorder.net/browsertrix
GNU Affero General Public License v3.0

Pausing / Scaling Crawls #2

Closed ikreymer closed 3 years ago

ikreymer commented 3 years ago

Support increasing or decreasing the number of pods running a crawl. Two possible approaches:

1. Scale via Pause and Restart

To scale:

  1. The job is stopped gracefully and a WACZ is written.
  2. The crawl doc is set to 'partial_complete', and files are added for each completed pod.
  3. The crawl is restarted with the same crawl id and shared redis state.
  4. The final pod adds the final WACZ and sets the crawl state to 'complete'.

Pros:

Cons:

2. Add more jobs / remove jobs to scale

To scale up:

To scale down:

Pros:

Cons:

ikreymer commented 3 years ago

Using a hybrid approach:

K8S - Scale via Adjusting Parallelism

Based on 1 above.
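As a rough illustration of what "adjusting parallelism" could look like, the sketch below builds a patch body for the `spec.parallelism` field of a Kubernetes batch/v1 Job. It assumes the crawl runs as a Job, which the issue does not state explicitly. `parallelism_patch` is a hypothetical helper, and how the patch is sent to the cluster is left out.

```python
import json


def parallelism_patch(num_pods: int) -> dict:
    """Build a merge-patch body that sets a Job's spec.parallelism.

    Hypothetical helper: assumes the crawl is a Kubernetes batch/v1 Job.
    A parallelism of 0 leaves the Job with no running pods, which
    effectively pauses it until parallelism is raised again.
    """
    if num_pods < 0:
        raise ValueError("num_pods must be non-negative")
    return {"spec": {"parallelism": num_pods}}


# Usage: pause the crawl (0 pods), then resume it with 3 pods.
pause_body = json.dumps(parallelism_patch(0))
resume_body = json.dumps(parallelism_patch(3))
```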

Docker

TODO - Probably will need to be based on 2 above.

ikreymer commented 3 years ago

The {crawl}/scale endpoint can now be used to scale without pausing; it is supported on k8s only for now. Pausing is no longer needed to scale, so closing.