webrecorder / browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
https://browsertrix.com
GNU Affero General Public License v3.0

Speed up the startup time for a crawl job by reducing Redis service discovery time #447

Open leepro opened 1 year ago

leepro commented 1 year ago

Currently, we use a dedicated Redis instance for each crawl job. It takes some scheduling time and overhead (CPU/memory) to bring up its own Redis instance. There are two approaches we could try for this issue: 1) create a Redis container in the crawler pod, or 2) share one or more Redis instances per node. Either way, we can reduce the startup time for crawl jobs.

[Screenshot: diagrams of Scenario 1 (Redis container in the crawler pod) and Scenario 2 (shared Redis instances per node)]

ikreymer commented 1 year ago

The initial implementation had Scenario 2. The reason for moving away from this was that it was a 'single point of failure': if the Redis instance became over-burdened, it would break all crawls. Ex: one crawl ends up with a queue of 100,000+ pages, and there is no memory for any other crawls, so all other crawls break. Of course, we could address this with better limits, but it still means all crawls share a point of failure.

With Scenario 1, the issue was supporting dynamic scaling of the StatefulSet. Currently, we are just using a single Redis per crawl and that works well. But if the Redis is part of the StatefulSet pod, then all crawlers must connect to pod index 0 (since only that pod is writable). This means if we have, say, 4 crawler replicas, but pod 0 is down, the other three can't run, because Redis is writable only on 0. Or, we need to make sure there is proper failover / use Redis Cluster, which means only an odd number of replicas...

The current architecture was designed for flexibility, so even if just one crawler instance is available, it can still run, and Redis pods can be on a different node (eg. dedicated vs spot instance). I wonder if there are other things we can do to speed up startup time? Redis itself is very fast, it is just the service discovery that takes longer..

Another idea: what if we pre-create a pool of Redis pods, and just assign one to a crawler as it comes in, so we don't have to launch it on demand?
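For illustration only, here is a minimal sketch of what claiming a pre-created Redis pod could look like with the official kubernetes Python client. The label names (`btrix-redis-pool`, `btrix-crawl-id`) and the claim-by-label-patch approach are assumptions for this sketch, not how Browsertrix actually works:

```python
# Hypothetical sketch: claim a pre-created Redis pod from a warm pool by
# flipping a label, so a new crawl doesn't wait for pod scheduling.
from kubernetes import client, config


def claim_redis_pod(namespace: str, crawl_id: str) -> str:
    """Return the IP of a free pooled Redis pod, marking it as claimed."""
    config.load_incluster_config()  # or config.load_kube_config() outside the cluster
    core = client.CoreV1Api()

    pods = core.list_namespaced_pod(
        namespace, label_selector="btrix-redis-pool=free"
    )
    for pod in pods.items:
        if pod.status.phase != "Running":
            continue
        # Mark the pod as claimed by this crawl so no other job grabs it.
        # (A real implementation would need to guard against races here.)
        core.patch_namespaced_pod(
            pod.metadata.name,
            namespace,
            {"metadata": {"labels": {
                "btrix-redis-pool": "claimed",
                "btrix-crawl-id": crawl_id,
            }}},
        )
        return pod.status.pod_ip

    raise RuntimeError("no free Redis pods in the pool")
```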

leepro commented 1 year ago

Now I see what has been considered for the current design.

The initial intention of this ticket is to reduce the start-up time, because the service discovery takes a few seconds (this is due to the k8s scheduler's delay, induced by the scheduler/controller picking up and provisioning the resource).

I think we can consider a pre-created Redis pool as you mentioned, but we can split it into two classes (or more) depending on the size of the jobs, i.e. the number of pages:

How about this?

ikreymer commented 1 year ago

Now I see what has been considered for the current design.

The initial intention of this ticket is to reduce the start-up time, because the service discovery takes a few seconds (this is due to the k8s scheduler's delay, induced by the scheduler/controller picking up and provisioning the resource).

I wonder if there's any way to speed this up, esp. on local deployments? The pods launch quickly; it's just the service discovery layer that is a little bit slower.

I think we can consider a pre-created Redis pool as you mentioned, but we can split it into two classes (or more) depending on the size of the jobs, i.e. the number of pages:

  • For single-page or few-page jobs (N pages, e.g. N <= M for some threshold M), use 10% of the pool as shared Redis instances.

    • Quick response time for the end user and not much burden on the shared instances.
    • By having at least 2 instances for this class, we can mitigate any single-point-of-failure situation.

Hm, this could work, but it seems like it adds more complexity as there are two types of Redis instances now.. Also, estimating pages may be hard; unless it's a fixed URL list crawl or there's a fixed limit, we really don't know how many pages a crawl could take.

  • For larger jobs (N >= M pages), use 90% of the pool as dedicated Redis instances (1:1 assignment).

    • Actually, we can keep the current way of using Redis for this class of jobs.
  • The backend (API) server manages the assignment.

    • Initially, it creates a number of Redis instances (X) and assigns their TCP ports from 4001 to 4000+X. Internally, port ranges are mapped to the classes (small class: 4001 ~ 4002 vs large class: 4003 ~ 4020, for example).

Hm, I was thinking it would just be a pool of stateful sets, or maybe even a single stateful set, if we wipe the volume for each new crawl (though maybe that's risky).

  • The backend assigns one port number (based on the job size) and bookkeeps it so it can keep track of the usage of the Redis pool.
  • The backend regularly pings all Redis instances, so it knows their health and does the assignment accordingly.

The k8s healthcheck can help with this.

  • Note: The implementation for maintaining the Redis pool can vary: backend vs k8s controller.

I think it could work well with a statefulset, just keeping track of available indexes in a list. The downside is it will be harder to scale down, eg. if we ran 100 crawls, but only crawls 3, 4, and 97 are still running, we won't be able to scale down until index 97 completes.

Or, we could just have randomly generated names with a deployment / service (my original idea), and the backend keeps track of them and shuts them down if there are too many. The way it would work is we would always have N more instances than there are crawls running, so we can easily assign additional ones to new crawls and start a new one.
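As a very rough sketch of that bookkeeping (all names, the spare count, and the deployment-creation helper are hypothetical placeholders, not Browsertrix code), the backend could track idle and assigned instances and always keep a few spares ahead of demand:

```python
import secrets

SPARES = 3                      # keep this many unassigned Redis instances warm
free_pool: list[str] = []       # names of idle Redis deployments
assigned: dict[str, str] = {}   # crawl_id -> Redis deployment name


def create_redis_deployment() -> str:
    """Placeholder: create a randomly named Redis Deployment + Service via the k8s API."""
    name = f"redis-{secrets.token_hex(4)}"
    # ... actual k8s API calls would go here ...
    return name


def top_up_pool() -> None:
    # Always keep SPARES idle instances ready so new crawls can start immediately.
    while len(free_pool) < SPARES:
        free_pool.append(create_redis_deployment())


def assign_redis(crawl_id: str) -> str:
    top_up_pool()
    name = free_pool.pop()
    assigned[crawl_id] = name
    return name


def release_redis(crawl_id: str) -> None:
    # When the crawl finishes, return the instance to the pool if we are short
    # on spares, otherwise delete it so the pool doesn't grow unbounded.
    name = assigned.pop(crawl_id)
    if len(free_pool) < SPARES:
        free_pool.append(name)   # a real version would flush its data first
    # else: delete the Deployment/Service via the k8s API
```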

The startup time we're talking about is 20-30 seconds, so it shouldn't be too bad, I think!

How about this?

tw4l commented 1 year ago

Reading the StatefulSet documentation, it seems that there may be other options worth considering for reducing the service discovery time without changing how Redis is currently deployed:

"If you need to discover Pods promptly after they are created, you have a few options:

As mentioned in the limitations section, you are responsible for creating the Headless Service responsible for the network identity of the pods."

ikreymer commented 1 year ago

Reading the StatefulSet documentation, it seems that there may be other options worth considering for reducing the service discovery time without changing how Redis is currently deployed:

"If you need to discover Pods promptly after they are created, you have a few options:

  • Query the Kubernetes API directly (for example, using a watch) rather than relying on DNS lookups.
  • Decrease the time of caching in your Kubernetes DNS provider (typically this means editing the config map for CoreDNS, which currently caches for 30 seconds).

As mentioned in the limitations section, you are responsible for creating the Headless Service responsible for the network identity of the pods."

Thanks for finding this! Indeed, lowering the CoreDNS cache does speed things up locally, and we could possibly also have the job watch for pods; maybe this is all we need here!!
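For reference, a minimal sketch of the "query the API with a watch" option using the official kubernetes Python client; the label selector here is made up and would need to match whatever labels the crawl job actually sets on its Redis pod:

```python
# Wait for the crawl's Redis pod via the Kubernetes API instead of DNS lookups.
from kubernetes import client, config, watch


def wait_for_redis_ip(namespace: str, crawl_id: str, timeout: int = 60) -> str:
    """Block until the crawl's Redis pod is Running and return its IP."""
    config.load_incluster_config()
    core = client.CoreV1Api()
    w = watch.Watch()

    # Stream pod events matching the (hypothetical) Redis labels for this crawl.
    for event in w.stream(
        core.list_namespaced_pod,
        namespace=namespace,
        label_selector=f"role=redis,crawl={crawl_id}",
        timeout_seconds=timeout,
    ):
        pod = event["object"]
        if pod.status.phase == "Running" and pod.status.pod_ip:
            w.stop()
            return pod.status.pod_ip

    raise TimeoutError("Redis pod did not become ready in time")
```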

tw4l commented 1 year ago

Moving back to TODO because the cache changes helped enough for now, but we might still want to keep this on our radar.

ikreymer commented 1 year ago

Here's an idea that I think is worth trying, after the operator refactor: when a crawl runs at scale 1, use a Redis instance embedded in the (single) crawler pod, and only switch to a dedicated Redis deployment when scaling above 1.

This would mean more manual volume management, but it should be easier with a custom operator. When scaling up from 1 to >1, we will need to shut down the crawler, unmount the volume, and mount it in a new Redis deployment. This may potentially be tricky, and will probably slow down scaling to >1, but that is the less common operation generally. Similarly, when scaling down to 1, we will need to wait for all crawlers to shut down, then restart with a local Redis in the last container. This would've been fairly hard to do w/o a custom operator, but that opens up more possibilities for this type of complex management. If we assume most crawls will start at a scale of 1, or have a fixed scale that doesn't change, maybe it's worth exploring..
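To make those transitions concrete, here is a purely illustrative sketch of the decision the operator would face on each reconcile; none of these function names exist in the Browsertrix operator, and the string results stand in for the actual volume/pod operations described above:

```python
# Hypothetical decision table for where the crawl's Redis should live,
# given the desired crawler scale and where Redis currently runs.
def reconcile_redis(scale: int, currently_embedded: bool) -> str:
    if scale == 1 and currently_embedded:
        return "keep embedded Redis in the single crawler pod"
    if scale > 1 and currently_embedded:
        # Scaling up: stop the crawler, unmount its Redis volume, and mount it
        # into a new dedicated Redis deployment that all replicas connect to.
        return "move embedded Redis volume into a dedicated Redis deployment"
    if scale == 1 and not currently_embedded:
        # Scaling down: wait for all crawlers to exit, then restart one crawler
        # with a local Redis re-mounting the same volume.
        return "move dedicated Redis volume back into the crawler pod"
    return "keep dedicated Redis deployment"
```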

tw4l commented 1 year ago

When queueing up large numbers of crawls, the Redis instances all spin up immediately, even for crawls that are still pending, which can result in excessive memory usage. We may want to look into waiting to spin up the Redis instance until the crawler is ready to go / no longer pending.

@ikreymer's suggestion above to use the embedded Redis instance in the crawler when scale=1 would also help.
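A small sketch of that deferral, assuming the controller can list the crawler pods for a crawl (the label selector is hypothetical): only create the Redis deployment once at least one crawler pod has left Pending.

```python
# Defer Redis startup until the crawler is actually scheduled, so queued
# crawls don't hold idle Redis pods in memory.
from kubernetes import client, config


def crawler_is_scheduled(namespace: str, crawl_id: str) -> bool:
    """True once at least one crawler pod for this crawl has left Pending."""
    config.load_incluster_config()
    core = client.CoreV1Api()
    pods = core.list_namespaced_pod(
        namespace, label_selector=f"role=crawler,crawl={crawl_id}"
    )
    return any(pod.status.phase != "Pending" for pod in pods.items)


# The controller would create the Redis deployment for a crawl only after
# crawler_is_scheduled(...) returns True, instead of creating it up front.
```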