zerodha / dungbeetle

A distributed job server built specifically for queuing and executing heavy SQL read jobs asynchronously. Separate out reporting layer from apps. MySQL, Postgres, ClickHouse.
MIT License

[Design Feature Request for performance] on-demand scaling of workers with redundancy #47

Closed RohitKumarGit closed 1 month ago

RohitKumarGit commented 1 month ago

Hi @knadh. While studying the design of dungbeetle, I came across a scenario where, as I understand it, the design might hit a bottleneck.

As I understand it, a worker fetches one job from the queue and writes the result to the primary_db and the result_db (the latter for frequent reads, with results removed after a time-to-live). The result_db is a single database, chosen randomly from the list declared in the config file when the process starts.

Now consider the worst-case like this

  1. N requests come in, much higher than anticipated.
  2. This produces a large number of operations on the result_db.
  3. Because of this, even the result_db starts becoming a bottleneck.
  4. Writes to the DB may then slow down (I assume the result DB is not of very large capacity), degrading overall performance.

In short, the result_db is effectively a single point of failure, irrespective of the number of workers we have.

Proposed Solution

  1. The broker selects one result_db at startup.
  2. When a result_db encounters some kind of congestion, it feeds this back to the broker.
  3. Once the feedback crosses a threshold, the broker runs a health check on its other result_db options and promotes one as the new primary result database.
  4. The broker then writes fresh job IDs to this new primary result database.

If we implement this feedback on the broker, then autoscalers like the one in Kubernetes could use it to scale our workers on a particular queue. For this, we would need a way to accept new workers and create their task tables, similar to how Red Hat OpenShift does this for its worker nodes.

knadh commented 1 month ago

DB load balancing is a very complex task and it is not viable (or necessary) to add it to the core.

This scenario can be solved easily with many approaches.

a) Launch multiple DungBeetle instances that speak to different result DBs and load/queue balance across them.

b) Put a robust DB load balancing solution between DungBeetle and the results DB, eg: pgBouncer.
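Approach (a) can be done entirely client-side. A minimal sketch of round-robin queue balancing across instances, assuming hypothetical instance URLs (each instance configured with its own result DB in its config file):

```go
package main

import "fmt"

// balancer distributes job submissions round-robin across several
// DungBeetle instances. The URLs below are illustrative placeholders.
type balancer struct {
	instances []string
	next      int
}

// pick returns the next instance in rotation; a real client would
// POST the job request to this instance's HTTP API.
func (b *balancer) pick() string {
	url := b.instances[b.next]
	b.next = (b.next + 1) % len(b.instances)
	return url
}

func main() {
	b := &balancer{instances: []string{
		"http://dungbeetle-1:6060", // instance configured with result DB A
		"http://dungbeetle-2:6060", // instance configured with result DB B
	}}
	for i := 0; i < 4; i++ {
		fmt.Println("submit job to:", b.pick())
	}
}
```

Because each instance writes to a different result DB, write load on any single results database is bounded by the share of jobs routed to it, without adding failover logic to the core.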