Open busecolak opened 4 years ago
Hi @busecolak, thanks a lot for the report. The other day I was trying to gather metrics related to how many metrics Druid is emitting to the exporter, but I think that those are only available in more recent versions (I run 0.12.3 sadly, but hopefully will migrate soon).
I am reading https://druid.apache.org/docs/latest/configuration/index.html#emitting-metrics, and I'd be curious to know if you have an idea about how big are your batches of datapoints, and how frequent they are. As we discussed previously the exporter might need some performance enhancement for big workloads, so it would be good to have an idea about what is the volume of datapoints that your druid daemons are emitting.
Moreover, I'd also test with lowering druid.emitter.http.flushMillis
, in 300s there could be a ton of datapoints accumulated, and 100ms of timeout could be too tight. Maybe more frequent flush intervals (like once every 30s/60s) could help.
Let me know!
There are other two interesting settings:
druid.emitter.http.flushCount
- How many messages the internal message buffer can hold before flushing (sending).
druid.emitter.http.maxBatchSize
- The maximum batch size, in bytes.
So in theory, if you didn't change the latter, the max batch size is 500. I assume that when it gets to that number of datapoints, then it forces a flush (even if the 300s are not reached). How big are those datapoints though? Does it happen with specific daemons or with all?
Another thing that I'd do would be to lower down the max batch size to something like 100, and see how it goes.
@busecolak one thing that could be useful is getting some info about your workload with a graph like:
The emitter publishes a metric called druid_exporter_datapoints_registered_total
, this could be a good indicator about your workload..
@elukey thanks for the replies.
The errors occur mostly with MiddleManager and Historical daemons, but most of them have the same issue.
For MiddleManagers, druid_exporter_datapoints_registered_total
is around 4-5 millions. But I am not sure how to map this to the batch size of the emitter.
@busecolak can you put a irate..[5m] or similar and report the rps value? Just to have an idea of the volume and how it varies.
@busecolak any news?
Hi guys,
I've faced the same issue.
irate(druid_exporter_datapoints_registered_total[5m])
@freemanlutsk thanks a lot for the feedback, my use case is way less datapoints/s so the exporter was surely not designed for such big use cases. The idea that I have in mind is to see if we can make async the HTTP ingestion of the datapoints, push them to a python queue and then let the exporter process data with a different pace. Will try to research something, if anybody has ideas or suggestions based on experience I am all ears :)
Ok I have to admit that I didn't really check at the time how many threads the uwsgi make_server call would create. It seems that it creates only one, that serves requests synchronously, so the current issue is well explained.
Would you be able to fix it ? :) Thank You!
Yes yes I am currently working on it, I think I found a solution that scales more, I'll try to file a pull request asap. After that any test and user report would be really appreciated :)
First attempt: https://gerrit.wikimedia.org/r/#/c/operations/software/druid_exporter/+/599011/
Currently testing it but so far preliminary results are very good :)
@freemanlutsk any chance that you could apply the above patch and test how things look like?
Sure. I will test it in 1 hour. All changes in the master?
@freemanlutsk not yet, the code is ready and I tested it, but before committing I wanted some feedback if possible, this is why I was asking (so you'd need to take the above patch and apply it manually on the current master). If not possible not problem!
I've deployed yours fixes into my environment. Seems like it works as expected. Would be good to keep it working couple of hours. I'll be back with result tomorrow :) Many thanks for quick response!
Unfortunately problem still exists (
2020-05-28T06:00:36,172 ERROR [HttpPostEmitter-1] org.apache.druid.java.util.emitter.core.HttpPostEmitter - Timing out emitter batch send, last batch fill time [33] ms, timeout [66] ms
2020-05-28T06:00:36,172 WARN [HttpPostEmitter-1] org.apache.druid.java.util.common.RetryUtils - Retrying (1 of 2) in 884ms.
java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Request timeout to localhost.localdomain/127.0.0.1:8000 after 66 ms
@freemanlutsk thanks a lot for the report and the testing.
I guess that the patch alleviates the problem but we still see some issue right?
What I'd like to check now is if some tuning is needed on the Druid front. There are a lot of tuning parameters in: https://druid.apache.org/docs/latest/configuration/index.html#http-emitter-module
Do you have any special config? If I am reading the logs correct they say timeout [66] ms
, that is very tight, meanwhile the default is 60s.
Also, how many druid daemons are pushing to the same druid exporter? At some point one solution to consider would be to have one or more "dedicated" instances for high-traffic daemons (say historical, middle managers), like @busecolak mentioned before.
https://druid.apache.org/docs/latest/configuration/index.html#http-emitter-module
Unfortunately I cannot easily play with those parameters. Currently Broker, Historical, Router, MiddleManager - are storing metrics to the same exporter. (I'm on the way to changing this architecture). Basically problems appear after adding the Historical daemon.
@freemanlutsk thanks, let me know how it goes. Today I was thinking that beyond a certain limit, we should rely on Kafka. I discovered today the kafka emitter, so I have in mind to add a flag to this exporer to just pulling from kafka if requested (instead of ingesting datapoints directly from Druid). Would it be something usable for you? In this way the exporter would not be concerned about traffic from Druid daemons, and we'd re-use something battle tested for it (namely Kafka).
Sounds good! Let's try, I'm willing to testing.
Ok so I'll try to schedule some time during the next days to work on this, shouldn't take much. In the meantime, if you could test using a dedicated exporter only for the historical daemon it would be great :)
@busecolak have you tried the new version by any chance? If so, how does it look in your environment?
The idea that I have in mind is the following: https://gerrit.wikimedia.org/r/#/c/operations/software/druid_exporter/+/600295/
Still not tested, I need to verify the format of datapoints in kafka etc.., hopefully will be able to do it during the next days.
@freemanlutsk I had some time to test the above patch and it seems working, but I wasn't able to set Druid to use the Kafka emitter extension since I don't have it packaged yet. I manually pushed some json content to a kafka topic and it works, so if Druid doesn't do anything peculiar when emitting datapoints to Kafka I'd say that it should work fine. If you have time/patience to test the above and let me know if it is an acceptable/performant solution for you I'd be grateful.
Great! Thank You! In 2-3 weeks, I'm going to change my Druid architecture. After that I will be able to restart my nodes with changed configs without any problems and to do test with production load.
Hi @elukey,
Sorry for delay.
I’ve enabled emitting metrics to the Kafka cluster with clarity-kafka
extension help.
Format is:
{"feed":"metrics","timestamp":"2020-07-23T15:19:30.444Z","service":"druid/middlemanager","host":"druid-backup2-int.company.com:8102","version":"0.18.0-iap12","metric":"jvm/cpu/user","value":1410,"dataSource":["company-errors"],"id":["index_kafka_company-errors_8825d8e2b4415f1_cpidiogh"],"implyNodeType":"druid/peon","implyCluster":"clarity-collection-cluster","implyDruidVersion":"0.18.0-iap12","implyVersion":"3.3.7"} {"feed":"metrics","timestamp":"2020-07-23T15:19:30.445Z","service":"druid/middlemanager","host":"druid-backup2-int.company.com:8102","version":"0.18.0-iap12","metric":"jvm/cpu/percent","value":0.024666666666666667,"dataSource":["company-errors"],"id":["index_kafka_company-errors_8825d8e2b4415f1_cpidiogh"],"implyNodeType":"druid/peon","implyCluster":"clarity-collection-cluster","implyDruidVersion":"0.18.0-iap12","implyVersion":"3.3.7"} {"feed":"metrics","timestamp":"2020-07-23T15:19:30.445Z","service":"druid/middlemanager","host":"druid-backup2-int.company.com:8102","version":"0.18.0-iap12","metric":"jvm/cpu/sys","value":70,"dataSource":["company-errors"],"id":["index_kafka_company-errors_8825d8e2b4415f1_cpidiogh"],"implyNodeType":"druid/peon","implyCluster":"clarity-collection-cluster","implyDruidVersion":"0.18.0-iap12","implyVersion":"3.3.7"} {"feed":"metrics","timestamp":"2020-07-23T15:19:30.446Z","service":"druid/middlemanager","host":"druid-backup2-int.company.com:8102","version":"0.18.0-iap12","metric":"jvm/threads/livePeak","value":123,"dataSource":["company-errors"],"id":["index_kafka_company-errors_8825d8e2b4415f1_cpidiogh"],"implyNodeType":"druid/peon","implyCluster":"clarity-collection-cluster","implyDruidVersion":"0.18.0-iap12","implyVersion":"3.3.7"}
Will it work with your exporter ?
Thank You!
@freemanlutsk in theory yes, but better to test it before enabling it for Production use cases. In case you encounter any error please report it in here and I'll try to fix.
@elukey
Is that a right way to run an exporter ?
It seems something went wrong, because exposed port (8000) does not work.
docker run --rm -v /root/druid.json:/exporter/conf/druid.json registry.company.com/bidmachine/druid-exporter:kafka_v1 -t bidmachine-druid-metrics-stage -b kafka1-ext.example.com:9092 kafka2-ext.example.com:9092 kafka3-ext.example.com:9092 kafka4-ext.example.com:9092 kafka5-ext.example.com:9092 kafka6-ext.example.com:9092 kafka7-ext.example.com:9092 kafka8-ext.example.com:9092 kafka9-ext.example.com:9092 -g druid-metrics-stage -d conf/druid.json
INFO:__main__:Using Kafka config: {'topic': 'bidmachine-druid-metrics-stage', 'bootstrap_servers': ['kafka1-ext.example.com:9092', 'kafka2-ext.example.com:9092', 'kafka3-ext.example.com:9092', 'kafka4-ext.example.com:9092', 'kafka5-ext.example.com:9092', 'kafka6-ext.example.com:9092', 'kafka7-ext.example.com:9092', 'kafka8-ext.example.com:9092', 'kafka9-ext.example.com:9092'], 'group_id': 'druid-metrics-stage'}
INFO:__main__:Starting druid_exporter on :8000
INFO:__main__:Reading metrics configuration from conf/druid.json
INFO:__main__:Checking consistency of metrics config file..
DEBUG:druid_exporter.collector:Process datapoints thread starting..
@freemanlutsk what issue do you see? Timeout while contacting the port 8000? Or something different? I'd check if port 8000 is correctly exposed by your docker container to localhost (where I guess you are trying to test it). Also you need to push datapoints to the topic to make it process something, since by default the kafka client starts from the tail of the topic. Thanks for testing!
netstat -antp|grep LIST|grep 8000
is empty.
curl 127.0.0.1:8000/metrics
curl: (7) Failed to connect to 127.0.0.1 port 8000: Connection refused
@elukey
oh it was my mistake, —net host
nevermind to the previous my messages )
@elukey
It seems consumer does not grab a data from the Kafka stream (kafkacat shows there are lot of metrics in the topic).
Also I noticed, after starting the consumer (exporter), there was not created a consumer group in the Kafka cluster.
Perhaps problem with the clarity-kafka
Please, advice. Thank You!
@freemanlutsk do you see, among the debug logs, something like: Kafka datapoints puller thread starting..
?
If not, I think that the kafka-python dependency is not pulled in, ending up in https://github.com/wikimedia/operations-software-druid_exporter/blob/master/druid_exporter/collector.py#L27-L30. I haven't updated the Dockerfile yet, are you including kafka-python in the one that you are using by any chance?
If this is the problem maybe I can add a log to inform the user of what is happening, like Looks like you added a kafka config but your Kafka client dependencies are not deployed, so I'll skip to create the kafka puller thread..
@elukey Thanks! kafka-python
was cause the previous error.
Would you be able to look through another error ?
DEBUG:kafka.consumer.fetcher:Adding fetched record for partition TopicPartition(topic='druid-metrics-stage', partition=8) with offset 410380 to buffered record list
DEBUG:kafka.consumer.fetcher:Adding fetched record for partition TopicPartition(topic='druid-metrics-stage', partition=1) with offset 410377 to buffered record list
DEBUG:kafka.consumer.fetcher:Adding fetched record for partition TopicPartition(topic='druid-metrics-stage', partition=2) with offset 410370 to buffered record list
DEBUG:kafka.consumer.fetcher:Adding fetched record for partition TopicPartition(topic='druid-metrics-stage', partition=7) with offset 410371 to buffered record list
DEBUG:kafka.consumer.fetcher:Adding fetched record for partition TopicPartition(topic='druid-metrics-stage', partition=5) with offset 410372 to buffered record list
DEBUG:kafka.consumer.fetcher:Adding fetched record for partition TopicPartition(topic='druid-metrics-stage', partition=6) with offset 410368 to buffered record list
DEBUG:kafka.consumer.fetcher:Adding fetched record for partition TopicPartition(topic='druid-metrics-stage', partition=4) with offset 410377 to buffered record list
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/druid_exporter-0.10-py3.8.egg/druid_exporter/collector.py", line 298, in pull_datapoints_from_kafka
json_message = json.loads(message.value.decode())
NameError: name 'json' is not defined
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.8/site-packages/druid_exporter-0.10-py3.8.egg/druid_exporter/collector.py", line 305, in pull_datapoints_from_kafka
except json.JSONDecodeError:
NameError: name 'json' is not defined
@freemanlutsk this is a bug, can you try the last version of the code? (just committed a fix)
Thank You! Looks like all works fine!
But coordinator's metrics are grabbing 3 times (if you have 3 masters). For metrics from masters we have to use some functions like avg
.
And I should try, how will feel an exporter in the production env.
@freemanlutsk thanks a lot for the tests! Yes the coordinator metrics are still something that I haven't solved, only one coordinator is the master at any given time, that publishes metrics. Due to any number of reasons, it can swap anytime (say due to a restart/problem/etc..).
Keep also in mind https://github.com/wikimedia/operations-software-druid_exporter#known-limitations, that I still haven't solved. Basically the main issue is that when the coordinator master moves from one host to the other one, then due to how prometheus work both will keep returning metrics: the "former" master will keep returning the last metrics state, and the new one will emit the new metrics. I have in mind some workarounds, but still didn't find a good compromise in the code. If it turns to be a problem for you I'll try to prioritize 👍
Hello,
I am using Druid 0.18.0 with this exporter. There is one druid exporter instance for each druid component. Exporter seems to having trouble catching up on emitter. The exporter causes some druid components to pop errors related to emitter.
Druid emitter configuration is as follows:
In exporter logs, I see consecutive Post requests even if the emitter configuration as above.
[20/May/2020 06:39:36] "POST / HTTP/1.1" 200 0
[20/May/2020 06:39:45] "POST / HTTP/1.1" 200 0
[20/May/2020 06:39:56] "POST / HTTP/1.1" 200 0
[20/May/2020 06:39:56] "POST / HTTP/1.1" 200 0
Some error log on Druid components:
ERROR [HttpPostEmitter-1] org.apache.druid.java.util.emitter.core.HttpPostEmitter - Timing out emitter batch send, last batch fill time [25] ms, timeout [50] ms
ERROR [HttpPostEmitter-1] org.apache.druid.java.util.emitter.core.HttpPostEmitter - Failed to send events to url
ERROR [HttpPostEmitter-1] org.apache.druid.java.util.emitter.core.HttpPostEmitter - failedBuffers queue size reached the limit [50], dropping the oldest failed buffer
Is there any suggestion, how to resolve this?