Open RohitKumarGit opened 2 months ago
CLI output with stats would indeed be useful.
Aside: DB should emit metrics also! CC @mr-karan @kalbhor
Hi @knadh
I’ve been thinking about a few approaches for implementing metrics collection, each with its own advantages. Here are the three options I’ve come up with (apologies in advance for the long read):
We already have some metrics available via API calls like number_of_jobs and queue_health. We could spin up a thread on each machine (wherever the workers are running) that ingests these stats into a time-series database like InfluxDB. This DB could then easily integrate with observability tools like Grafana or Datadog, which are commonly used in organizations. (I have worked on a system like this before.)
This approach doesn’t require much work on the database metrics front since we already have observable metrics available through plugins.
This would be similar to how kubectl works. We could run a kube-apiserver-like process on each node that returns metrics when queried, just like kubectl get pods returns pod information.
While intuitive, this option involves a lot of work to develop and maintain, especially when you have distributed machines over an IP network.
This might be overkill, but I’ve seen an implementation (not open source) and thought about incorporating something similar. The idea is to have a generic debug shell, essentially a thread, that responds to debug commands from any application. The key benefit here is that each application wouldn’t need to build its own CLI for metrics.
For example, say you have the commands show statistics serviceA and show statistics serviceB (where serviceA could be DungBeetle, and serviceB could be another Zerodha internal process). The service would inform the debug shell that it supports certain commands and what arguments it requires to return metrics.
The debug shell would then standardize the way this information is fetched, making it reusable across applications without each needing its own dedicated CLI. I know point number 3 is unclear; it's difficult to explain 😓 (I have also been thinking about it lately).
I could work on these suggestions if one suits the purpose. I am unsure whether something like this is even useful in real-world projects or just a luxury to have.
A debug shell built into the core doesn't make sense. The correct approach is to have HTTP APIs that expose all necessary stats. Then someone can build a CLI shell or a web app etc. that uses the APIs.
Got it.
Can I contribute to this in any way, if it is something in the pipeline?
DungBeetle doesn't expose a metrics endpoint. We can integrate the VictoriaMetrics lib (like here: https://github.com/zerodha/kaf-relay/blob/main/main.go) and expose useful metrics for job statuses, task names etc.
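For anyone picturing what such an endpoint serves, here is a minimal sketch in Go using only the standard library, not the VictoriaMetrics library itself. The metric name dungbeetle_jobs_total, the jobCounters type, and the status values are all assumptions made for illustration; the kaf-relay code linked above remains the reference to follow.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sort"
	"strings"
	"sync"
)

// jobCounters is a hypothetical in-memory store of job counts keyed by status.
type jobCounters struct {
	mu     sync.Mutex
	counts map[string]uint64
}

func newJobCounters() *jobCounters {
	return &jobCounters{counts: make(map[string]uint64)}
}

// Inc increments the counter for a job status such as "queued" or "failed".
func (c *jobCounters) Inc(status string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[status]++
}

// render returns the counters as Prometheus-style text lines, sorted by status
// so the output is stable. The metric name is an assumption.
func (c *jobCounters) render() string {
	c.mu.Lock()
	defer c.mu.Unlock()

	statuses := make([]string, 0, len(c.counts))
	for s := range c.counts {
		statuses = append(statuses, s)
	}
	sort.Strings(statuses)

	var sb strings.Builder
	for _, s := range statuses {
		fmt.Fprintf(&sb, "dungbeetle_jobs_total{status=%q} %d\n", s, c.counts[s])
	}
	return sb.String()
}

// handler serves the rendered counters as plain text.
func (c *jobCounters) handler() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		fmt.Fprint(w, c.render())
	})
}

func main() {
	counters := newJobCounters()
	counters.Inc("queued")
	counters.Inc("queued")
	counters.Inc("failed")

	mux := http.NewServeMux()
	mux.Handle("/metrics", counters.handler())

	// A recorder keeps the sketch self-terminating; a real service would pass
	// this mux to http.ListenAndServe instead.
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/metrics", nil)
	mux.ServeHTTP(rec, req)
	fmt.Print(rec.Body.String())
}
```

A scraper such as VictoriaMetrics or Prometheus would poll /metrics and store the samples, so the service itself only has to keep counters up to date.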
Yes, makes sense!
@knadh is there any open source work at Zerodha where there is scope for contribution?
+1 to HTTP APIs for stats.
@RohitKumarGit if you are interested in adding support, maybe you can refer to the example @knadh has shared, e.g. source_pool.go and target.go.
Separate entities maintain their own metrics sets, and those are written to a single common metrics HTTP server. Off the top of my head, I think for dungbeetle the metrics could be added to core for now.
I am also of the opinion that metrics per registered task can be useful (but adding this can be up for discussion). We can use parameterized metrics for this, e.g.: https://github.com/zerodha/kaf-relay/blob/b5059eb0e05a38a3c0fcd2f133f6359f27259ec4/internal/relay/common.go#L29
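As a rough illustration of the parameterized idea, the metric name can be a format string whose labels are filled in per task, so each registered task gets its own series. This is a standard-library sketch, not the kaf-relay code: the metric name, the registry type, and the task names below are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// jobMetricFmt is a hypothetical parameterized metric name. The task and
// status labels are filled in with fmt.Sprintf when a job is counted.
const jobMetricFmt = `dungbeetle_jobs_total{task="%s", status="%s"}`

// registry keeps one counter per fully expanded metric name.
type registry struct {
	mu       sync.Mutex
	counters map[string]uint64
}

func newRegistry() *registry {
	return &registry{counters: make(map[string]uint64)}
}

// inc expands the metric name for the given task and status and increments it.
func (r *registry) inc(task, status string) {
	name := fmt.Sprintf(jobMetricFmt, task, status)
	r.mu.Lock()
	defer r.mu.Unlock()
	r.counters[name]++
}

// lines returns "name value" pairs sorted by metric name.
func (r *registry) lines() []string {
	r.mu.Lock()
	defer r.mu.Unlock()

	names := make([]string, 0, len(r.counters))
	for n := range r.counters {
		names = append(names, n)
	}
	sort.Strings(names)

	out := make([]string, 0, len(names))
	for _, n := range names {
		out = append(out, fmt.Sprintf("%s %d", n, r.counters[n]))
	}
	return out
}

func main() {
	r := newRegistry()
	// Task names are made up for the example.
	r.inc("monthly_report", "done")
	r.inc("monthly_report", "done")
	r.inc("daily_summary", "failed")

	for _, l := range r.lines() {
		fmt.Println(l)
	}
}
```

One thing to keep in mind with this design is that every distinct task and status combination becomes a separate series, so the number of series grows with the number of registered tasks.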
Hi @kalbhor, I was studying the code of kaf-relay (at a high level) and here is what I understand:
kaf-relay gathers metrics from nodes which have some ID; dungbeetle is like one node, and kaf-relay makes some sort of HTTP calls to gather metrics and ingests them into some metrics server... am I correct?
Why are we clubbing metric gathering with kaf-relay? @knadh said we could go with some CLI, as dungbeetle already exposes metrics over HTTP calls (I can think of how Kubernetes does this with kubectl).
@kalbhor I guess we could achieve this as follows:
Create a separate project, metrics-cli.
The metrics CLI gets its input in YAML/JSON/TOML like this:
endpoint: [base HTTP endpoint of the metrics server of any service, like dungbeetle]
metrics:
  url: /cpu_usage
  type: INTEGER
  unit: %
  formatter: Fn()
.... similarly, we define metrics endpoints for any number of processes/elements like dungbeetle.
And yes, this config file is similar to what we define for metrics scraping in Prometheus.
When we run the CLI tool, it will read this config file, make calls to gather the metrics, and handle printing all metrics as defined in the config file.
This design is extensible and can be used for any future things like dungbeetle!
suggestions please? @knadh @kalbhor
Hm, the suggestion was to refer to the kaf-relay implementation as an example and simply replicate it here in a similar manner. It does not make sense to create a separate CLI or create custom handlers. It has to be the standard Prometheus-style metrics like in kaf-relay.
Okay, got it, but then how will we print the output in the CLI when we want to? We will have to use some GUI like Grafana, I guess?
Anyway, for the metrics part, yes, I understand: you are suggesting that we expose Prometheus-style metrics and let a metrics server like VictoriaMetrics do the scraping and management. Please correct me if I am wrong.
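On the terminal question: Prometheus-style output is plain text, so it can be read without a GUI. The Go sketch below parses such text and prints it as an aligned table. It is only an illustration under simplifying assumptions (one sample per line, no trailing timestamps), and the sample metric names in main are made up.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
	"text/tabwriter"
)

// sample is one metric line: the name (including any labels) and its value.
type sample struct {
	name  string
	value float64
}

// parseSamples reads Prometheus-style text, skipping blank lines and comment
// lines that start with '#'. It assumes "name value" with no timestamp.
func parseSamples(text string) ([]sample, error) {
	var out []sample
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The value is whatever follows the last space on the line.
		i := strings.LastIndex(line, " ")
		if i < 0 {
			return nil, fmt.Errorf("malformed sample line: %q", line)
		}
		v, err := strconv.ParseFloat(line[i+1:], 64)
		if err != nil {
			return nil, fmt.Errorf("bad value in %q: %w", line, err)
		}
		out = append(out, sample{name: strings.TrimSpace(line[:i]), value: v})
	}
	return out, sc.Err()
}

func main() {
	// Stand-in for the body of a GET on a metrics endpoint; names are made up.
	scraped := "# TYPE dungbeetle_jobs_total counter\n" +
		"dungbeetle_jobs_total{status=\"failed\"} 1\n" +
		"dungbeetle_jobs_total{status=\"queued\"} 2\n"

	samples, err := parseSamples(scraped)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	tw := tabwriter.NewWriter(os.Stdout, 0, 4, 2, ' ', 0)
	fmt.Fprintln(tw, "METRIC\tVALUE")
	for _, s := range samples {
		fmt.Fprintf(tw, "%s\t%g\n", s.name, s.value)
	}
	tw.Flush()
}
```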
I could figure out these metrics to begin with; is there anything you would like to add?
@knadh
dungbeetle currently has a client API and HTTP endpoints for doing things like
Use case:
Let's consider the case when one of the queues starts reporting errors, or when a system is running 40 workers but we still see delays in requests.
Now if we want to see
Then our only option is to make HTTP calls, which is not the most efficient way and requires technical knowledge about endpoints etc.
What if we had something where we could run a command like this:
dungeecli workers --wide
Suggested output
This would make debugging so easy.
And I guess you would have guessed it: it's very similar to the oc command for OpenShift or kubectl for Kubernetes.
Extra info: In our projects we pin processes to particular cores, e.g. process1 to run only on core0 and core1. But sometimes the core changes for some random reason, like a crash. In this case the debug shell tells us which process is running on which core and when it was last changed. Doing this using Linux commands was not efficient, hence the debug shell.
@knadh What is your opinion on this? Is it something which could be explored?