zerodha / dungbeetle

A distributed job server built specifically for queuing and executing heavy SQL read jobs asynchronously. Separate out reporting layer from apps. MySQL, Postgres, ClickHouse.
MIT License

[Feature Request] Debug Shell for dungbeetle #46

Open RohitKumarGit opened 2 months ago

RohitKumarGit commented 2 months ago

dungbeetle currently has a client API and HTTP endpoints for doing things like

Use Case:

Let's consider a case where one of the queues has started reporting errors, or where a system is running 40 workers but we still see delays in requests.

Now if we want to see

Then our only option is to make HTTP calls, which is not the most efficient way and requires technical knowledge of the endpoints.

What if we had a tool where we could run a command like this:

dungeecli workers --wide

Suggested output:

worker    queue_length   last_execution_time   health
worker1   10             2.5s                  OK
worker2   20             3.5s                  NOT_OK

This would make debugging much easier.

As you may have guessed, it is very similar to the oc command for OpenShift or kubectl for Kubernetes.

Extra info: In our projects we pin processes to particular cores, e.g. process1 runs only on core0 and core1. But sometimes the core assignment changes for some reason, such as a crash. In that case the debug shell tells us which process is running on which core and when that last changed. Doing this with Linux commands was not efficient, hence the debug shell.

@knadh What is your opinion on this? Is it something that could be explored?

knadh commented 2 months ago

CLI output with stats would indeed be useful.

Aside: DB should emit metrics also! CC @mr-karan @kalbhor

RohitKumarGit commented 2 months ago

Hi @knadh

I’ve been thinking about a few approaches for implementing metrics collection, each with its own advantages. Here are the three options I’ve come up with (apologies in advance for the long read):

1. Push-based metric collection (Quickest to implement)

We already have some metrics available via API calls, like number_of_jobs and queue_health. We could spin up a thread on each machine (wherever the workers are running) that ingests these stats into a time-series database like InfluxDB. This DB could then easily integrate with observability tools like Grafana or Datadog, which are commonly used in organizations. (I have worked on a system like this before.)

This approach doesn’t require much work on the database metrics front since we already have observable metrics available through plugins.

2. CLI Client (More intuitive)

This would be similar to how kubectl works. We could run a kube-apiserver-like process on each node that returns metrics when queried, just like kubectl get pods returns pod information.

While intuitive, this option involves a lot of work to develop and maintain, especially when the machines are distributed over an IP network.

3. Generic Debug Shell (A bit complex, but versatile)

This might be overkill, but I’ve seen an implementation (not open source) and thought about incorporating something similar. The idea is to have a generic debug shell, essentially a thread, that responds to debug commands from any application. The key benefit here is that each application wouldn’t need to build its own CLI for metrics.

For example, say you have the commands show statistics serviceA and show statistics serviceB (where serviceA could be DungBeetle, and serviceB could be another Zerodha internal process). Each service would inform the debug shell that it supports certain commands and what arguments it requires to return metrics.

The debug shell would then standardize the way this information is fetched, making it reusable across applications without each needing its own dedicated CLI. I know point number 3 is unclear; it's difficult to explain this 😓 (I have also been thinking about it lately).

I could work on these suggestions if one of them suits the purpose. I am not sure whether something like this is actually useful in real-world projects or just a luxury.

knadh commented 2 months ago

A debug shell built into the core doesn't make sense. The correct approach is to have HTTP APIs that expose all necessary stats. Then someone can build a CLI shell or a web app etc. that uses the APIs.

RohitKumarGit commented 2 months ago

Got it.

Can I contribute to this in any way, if it is already in the pipeline?

knadh commented 2 months ago

DungBeetle doesn't expose a metrics endpoint. We can integrate the VictoriaMetrics lib (like here: https://github.com/zerodha/kaf-relay/blob/main/main.go) and expose useful metrics for job statuses, task names etc.

RohitKumarGit commented 2 months ago

Yes, makes sense!

@knadh are there any Zerodha open source projects where there is scope for contribution?

kalbhor commented 2 months ago

+1 to HTTP APIs for stats.

@RohitKumarGit if you are interested in adding support, you can refer to the example @knadh has shared, e.g. source_pool.go and target.go.

Separate entities maintain their own metrics sets, and those are written to a single common metrics HTTP server. Off the top of my head, I think for dungbeetle the metrics could be added to core for now.

I am also of the opinion that metrics per registered task can be useful (but adding this can be up for discussion). We can use parameterized metrics for this, e.g.: https://github.com/zerodha/kaf-relay/blob/b5059eb0e05a38a3c0fcd2f133f6359f27259ec4/internal/relay/common.go#L29
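A parameterized metric here could mean a metric-name template whose labels are filled in per task, as in the sketch below. The metric name, the label names and the task name are assumptions for illustration and are not copied from kaf-relay. Go's %q verb quotes and escapes the value, which is adequate for ordinary task names.

```go
package main

import (
	"fmt"
	"sort"
)

// jobsTotalFmt is a hypothetical metric-name template with two labels.
const jobsTotalFmt = `dungbeetle_jobs_total{task=%q,status=%q}`

// jobMetric fills the template for one task and status.
func jobMetric(task, status string) string {
	return fmt.Sprintf(jobsTotalFmt, task, status)
}

func main() {
	// A plain map stands in for a real metrics set.
	counts := map[string]int{}
	counts[jobMetric("monthly_report", "success")]++
	counts[jobMetric("monthly_report", "success")]++
	counts[jobMetric("monthly_report", "failed")]++

	names := make([]string, 0, len(counts))
	for n := range counts {
		names = append(names, n)
	}
	sort.Strings(names)
	for _, n := range names {
		fmt.Printf("%s %d\n", n, counts[n])
	}
}
```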

RohitKumarGit commented 2 months ago

Hi @kalbhor, I was studying the kaf-relay code (at a high level) and here is what I understand:

kaf-relay gathers metrics from nodes that each have some ID. dungbeetle would be like one such node, and kaf-relay makes some sort of HTTP calls to gather metrics and ingests them into a metrics server. Am I correct?

Why are we coupling metric gathering with kaf-relay? @knadh said we could go with a CLI, since dungbeetle already exposes metrics over HTTP calls (I am thinking of how Kubernetes does this with kubectl).

RohitKumarGit commented 2 months ago

@kalbhor I guess we could achieve this as follows:

Create a separate project, metrics-cli.

metrics-cli takes its input as YAML/JSON/TOML, like this:

endpoint: [base HTTP endpoint of the metrics server of any service, e.g. dungbeetle]
metrics:
    url: /cpu_usage
    type: INTEGER
    unit: %
    formatter: Fn()

Similarly, we define metrics endpoints for any number of processes/elements like dungbeetle.

And yes, this config file is similar to what we define for metrics scraping in Prometheus.

When we run the CLI tool, it reads this config file, makes the calls to gather the metrics, and handles printing all of them as defined in the config file.

This design is extensible and could be reused for any future service like dungbeetle!

Suggestions, please? @knadh @kalbhor

knadh commented 2 months ago

Hm, the suggestion was to refer to the kaf-relay implementation as an example and simply replicate it here in a similar manner. It does not make sense to create a separate CLI or custom handlers. It has to be the standard Prometheus-style metrics, like in kaf-relay.

RohitKumarGit commented 2 months ago

Okay, got it. But then how will we print the output in a CLI when we want to? We will have to use some GUI like Grafana, I guess?

Anyway, for the metrics part, yes, I understand: you are suggesting that we expose Prometheus-style metrics and let a metrics server like VictoriaMetrics do the scraping and management. Please correct me if I am wrong.

I could come up with these metrics to begin with. Is there anything you would like to add?

  1. Number of errored jobs in a queue
  2. Number of jobs in a queue
  3. Where the worker is running
  4. Health check of workers
  5. Number of workers reachable by the broker
  6. Priority of worker threads

@knadh