wikimedia / operations-software-druid_exporter

This is a mirror from https://gerrit.wikimedia.org. See https://www.mediawiki.org/wiki/Developer_access for contributing.
Apache License 2.0

Emitter Timeout Errors #11

Open busecolak opened 4 years ago

busecolak commented 4 years ago

Hello,

I am using Druid 0.18.0 with this exporter, with one exporter instance per Druid component. The exporter seems to be having trouble keeping up with the emitter, and it causes some Druid components to log emitter-related errors.

Druid emitter configuration is as follows:

In the exporter logs I see consecutive POST requests, even with the emitter configuration above:

[20/May/2020 06:39:36] "POST / HTTP/1.1" 200 0
[20/May/2020 06:39:45] "POST / HTTP/1.1" 200 0
[20/May/2020 06:39:56] "POST / HTTP/1.1" 200 0
[20/May/2020 06:39:56] "POST / HTTP/1.1" 200 0

Some error logs on the Druid components:

ERROR [HttpPostEmitter-1] org.apache.druid.java.util.emitter.core.HttpPostEmitter - Timing out emitter batch send, last batch fill time [25] ms, timeout [50] ms
ERROR [HttpPostEmitter-1] org.apache.druid.java.util.emitter.core.HttpPostEmitter - Failed to send events to url
ERROR [HttpPostEmitter-1] org.apache.druid.java.util.emitter.core.HttpPostEmitter - failedBuffers queue size reached the limit [50], dropping the oldest failed buffer

Is there any suggestion, how to resolve this?

elukey commented 4 years ago

Hi @busecolak, thanks a lot for the report. The other day I was trying to gather metrics about how many datapoints Druid is emitting to the exporter, but I think those are only available in more recent versions (I run 0.12.3, sadly, but hopefully I'll migrate soon).

I am reading https://druid.apache.org/docs/latest/configuration/index.html#emitting-metrics, and I'd be curious to know if you have an idea of how big your batches of datapoints are and how frequent they are. As we discussed previously, the exporter might need some performance enhancements for big workloads, so it would be good to have an idea of the volume of datapoints that your Druid daemons are emitting.

Moreover, I'd also test lowering druid.emitter.http.flushMillis: in 300s a ton of datapoints can accumulate, and a 100ms timeout could be too tight. Maybe more frequent flush intervals (like once every 30s/60s) could help.

Let me know!

elukey commented 4 years ago

There are two other interesting settings:

So in theory, if you didn't change the latter, the max batch size is 500. I assume that when it reaches that number of datapoints it forces a flush (even if the 300s haven't elapsed). How big are those datapoints, though? Does it happen with specific daemons or with all of them?

Another thing I'd do is lower the max batch size to something like 100 and see how it goes.
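
For reference, a hedged sketch of what this kind of tuning (more frequent flushes and a smaller batch) could look like in a daemon's runtime.properties; the property names are from the Druid HTTP emitter docs, and the values are purely illustrative, not a recommendation:

```properties
druid.emitter=http
druid.emitter.http.recipientBaseUrl=http://localhost:8000
# Flush more often than every 300s so batches stay small.
druid.emitter.http.flushMillis=30000
# Flush after fewer queued datapoints than the default of 500.
druid.emitter.http.flushCount=100
```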

elukey commented 4 years ago

@busecolak one thing that could be useful is getting some info about your workload with a graph like:

https://grafana.wikimedia.org/d/000000538/druid?panelId=62&fullscreen&orgId=1&refresh=1m&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=druid_analytics&var-druid_datasource=All

The exporter publishes a metric called druid_exporter_datapoints_registered_total; this could be a good indicator of your workload.

busecolak commented 4 years ago

@elukey thanks for the replies.

The errors occur mostly with the MiddleManager and Historical daemons, but most of the daemons have the same issue.

For MiddleManagers, druid_exporter_datapoints_registered_total is around 4-5 million, but I am not sure how to map this to the emitter's batch size.

elukey commented 4 years ago

@busecolak can you put an irate(..[5m]) or similar on it and report the rps value? Just to get an idea of the volume and how it varies.

elukey commented 4 years ago

@busecolak any news?

freemanlutsk commented 4 years ago

Hi guys, I've faced the same issue.

irate(druid_exporter_datapoints_registered_total[5m])

[screenshot of the resulting graph]

elukey commented 4 years ago

@freemanlutsk thanks a lot for the feedback. My use case has far fewer datapoints/s, so the exporter was surely not designed for such big workloads. The idea that I have in mind is to make the HTTP ingestion of the datapoints asynchronous: push them to a Python queue and then let the exporter process the data at a different pace. I'll try to research something; if anybody has ideas or suggestions based on experience I am all ears :)
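
As a rough illustration of the pattern (a minimal sketch, not the exporter's actual code; all the Prometheus plumbing is left out):

```python
import json
import queue
import threading

# Unbounded queue decoupling HTTP ingestion from datapoint processing.
datapoints_queue = queue.Queue()

def wsgi_app(environ, start_response):
    """Accept a POST from the Druid HTTP emitter and return immediately."""
    size = int(environ.get('CONTENT_LENGTH') or 0)
    batch = json.loads(environ['wsgi.input'].read(size)) if size else []
    datapoints_queue.put(batch)
    start_response('200 OK', [])
    return [b'']

def process_datapoints():
    """Consume batches from the queue at the exporter's own pace."""
    while True:
        batch = datapoints_queue.get()
        for datapoint in batch:
            # Update the relevant Prometheus counters/histograms here.
            pass

threading.Thread(target=process_datapoints, daemon=True).start()
```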

elukey commented 4 years ago

Ok, I have to admit that I didn't really check at the time how many threads the WSGI make_server call would create. It turns out it creates only one, which serves requests synchronously, so the current issue is well explained.
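
For illustration only, a minimal sketch of the difference, assuming the standard library's wsgiref server: make_server defaults to the single-threaded WSGIServer, while a ThreadingMixIn subclass serves requests concurrently (this is just to show the difference; the queue-based approach described above is another option).

```python
from socketserver import ThreadingMixIn
from wsgiref.simple_server import WSGIServer, make_server

class ThreadingWSGIServer(ThreadingMixIn, WSGIServer):
    """Handle each incoming POST in its own thread instead of serially."""
    daemon_threads = True

def app(environ, start_response):
    # Placeholder WSGI app; the real exporter does its datapoint handling here.
    start_response('200 OK', [])
    return [b'']

# make_server() uses the single-threaded WSGIServer unless told otherwise.
httpd = make_server('', 8000, app, server_class=ThreadingWSGIServer)
httpd.serve_forever()
```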

freemanlutsk commented 4 years ago

Would you be able to fix it? :) Thank you!

elukey commented 4 years ago

Yes yes, I am currently working on it. I think I found a solution that scales better; I'll try to file a pull request asap. After that, any test and user report would be really appreciated :)

elukey commented 4 years ago

First attempt: https://gerrit.wikimedia.org/r/#/c/operations/software/druid_exporter/+/599011/

Currently testing it but so far preliminary results are very good :)

elukey commented 4 years ago

@freemanlutsk any chance that you could apply the above patch and test how things look?

freemanlutsk commented 4 years ago

Sure, I will test it in 1 hour. Are all the changes in master?

elukey commented 4 years ago

@freemanlutsk not yet: the code is ready and I tested it, but before committing I wanted some feedback if possible, which is why I was asking (so you'd need to take the above patch and apply it manually on top of the current master). If it's not possible, no problem!

freemanlutsk commented 4 years ago

I've deployed your fixes into my environment. It seems to work as expected. It would be good to keep it running for a couple of hours; I'll be back with results tomorrow :) Many thanks for the quick response!

freemanlutsk commented 4 years ago

Unfortunately, the problem still exists (

2020-05-28T06:00:36,172 ERROR [HttpPostEmitter-1] org.apache.druid.java.util.emitter.core.HttpPostEmitter - Timing out emitter batch send, last batch fill time [33] ms, timeout [66] ms
2020-05-28T06:00:36,172 WARN [HttpPostEmitter-1] org.apache.druid.java.util.common.RetryUtils - Retrying (1 of 2) in 884ms.
java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Request timeout to localhost.localdomain/127.0.0.1:8000 after 66 ms

elukey commented 4 years ago

@freemanlutsk thanks a lot for the report and the testing.

I guess that the patch alleviates the problem, but we still see some issues, right?

What I'd like to check now is if some tuning is needed on the Druid front. There are a lot of tuning parameters in: https://druid.apache.org/docs/latest/configuration/index.html#http-emitter-module

Do you have any special config? If I am reading the logs correctly, they say timeout [66] ms, which is very tight, while the default is 60s.

elukey commented 4 years ago

Also, how many Druid daemons are pushing to the same druid exporter? At some point, one solution to consider would be to have one or more "dedicated" instances for high-traffic daemons (say Historical, MiddleManagers), like @busecolak mentioned before.
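
As a hedged sketch (the property names are from the Druid docs, the host/port purely illustrative), a high-traffic daemon like the Historical could point to its own exporter instance via its runtime.properties:

```properties
druid.emitter=http
# Hypothetical dedicated exporter instance, listening on its own port.
druid.emitter.http.recipientBaseUrl=http://localhost:8001
```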

freemanlutsk commented 4 years ago

https://druid.apache.org/docs/latest/configuration/index.html#http-emitter-module

Unfortunately I cannot easily play with those parameters. Currently the Broker, Historical, Router, and MiddleManager are sending metrics to the same exporter (I'm in the process of changing this architecture). Basically, the problems appeared after adding the Historical daemon.

elukey commented 4 years ago

@freemanlutsk thanks, let me know how it goes. Today I was thinking that beyond a certain volume we should rely on Kafka. I just discovered the Kafka emitter, so I have in mind adding a flag to this exporter to pull from Kafka when requested (instead of ingesting datapoints directly from Druid). Would that be usable for you? This way the exporter would not have to worry about traffic from the Druid daemons, and we'd reuse something battle-tested for it (namely Kafka).
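
Roughly what I have in mind, as a sketch only (the topic and broker names are made up, and process_datapoint stands in for the exporter's existing processing path), using the kafka-python client:

```python
import json
from kafka import KafkaConsumer  # kafka-python

def process_datapoint(datapoint):
    # Stand-in for the exporter's existing datapoint handling.
    print(datapoint.get('metric'), datapoint.get('value'))

consumer = KafkaConsumer(
    'druid-metrics',                       # hypothetical topic name
    bootstrap_servers=['localhost:9092'],  # hypothetical broker
    group_id='druid-exporter',
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
)

# Each Kafka message is expected to carry one JSON datapoint emitted by Druid.
for message in consumer:
    process_datapoint(message.value)
```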

freemanlutsk commented 4 years ago

Sounds good! Let's try it; I'm willing to test.

elukey commented 4 years ago

Ok, I'll try to schedule some time over the next few days to work on this; it shouldn't take much. In the meantime, if you could test using a dedicated exporter only for the Historical daemon, that would be great :)

elukey commented 4 years ago

Added https://github.com/wikimedia/operations-software-druid_exporter#performance-considerations

elukey commented 4 years ago

@busecolak have you tried the new version by any chance? If so, how does it look in your environment?

elukey commented 4 years ago

The idea that I have in mind is the following: https://gerrit.wikimedia.org/r/#/c/operations/software/druid_exporter/+/600295/

It's still not tested; I need to verify the format of the datapoints in Kafka, etc. Hopefully I'll be able to do it during the next few days.

elukey commented 4 years ago

@freemanlutsk I had some time to test the above patch and it seems to work, but I wasn't able to set Druid to use the Kafka emitter extension since I don't have it packaged yet. I manually pushed some JSON content to a Kafka topic and it works, so if Druid doesn't do anything peculiar when emitting datapoints to Kafka, I'd say that it should work fine. If you have the time/patience to test the above and let me know whether it is an acceptable/performant solution for you, I'd be grateful.
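
For anyone who wants to reproduce that kind of manual test, a sketch along these lines should do (the topic name and datapoint fields are made up for illustration; the real Kafka emitter output may differ):

```python
import json
from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],  # hypothetical broker
    value_serializer=lambda d: json.dumps(d).encode('utf-8'),
)

# Push a fake datapoint in roughly the shape Druid emits, then check /metrics.
producer.send('druid-metrics', {
    'feed': 'metrics',
    'metric': 'query/time',
    'service': 'druid/broker',
    'host': 'localhost:8082',
    'value': 42,
    'dataSource': 'test',
})
producer.flush()
```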

freemanlutsk commented 4 years ago

Great, thank you! In 2-3 weeks I'm going to change my Druid architecture. After that I will be able to restart my nodes with changed configs without any problems and test with production load.

freemanlutsk commented 4 years ago

Hi @elukey, sorry for the delay. I've enabled emitting metrics to the Kafka cluster with the help of the clarity-kafka extension. The format is:

{"feed":"metrics","timestamp":"2020-07-23T15:19:30.444Z","service":"druid/middlemanager","host":"druid-backup2-int.company.com:8102","version":"0.18.0-iap12","metric":"jvm/cpu/user","value":1410,"dataSource":["company-errors"],"id":["index_kafka_company-errors_8825d8e2b4415f1_cpidiogh"],"implyNodeType":"druid/peon","implyCluster":"clarity-collection-cluster","implyDruidVersion":"0.18.0-iap12","implyVersion":"3.3.7"}
{"feed":"metrics","timestamp":"2020-07-23T15:19:30.445Z","service":"druid/middlemanager","host":"druid-backup2-int.company.com:8102","version":"0.18.0-iap12","metric":"jvm/cpu/percent","value":0.024666666666666667,"dataSource":["company-errors"],"id":["index_kafka_company-errors_8825d8e2b4415f1_cpidiogh"],"implyNodeType":"druid/peon","implyCluster":"clarity-collection-cluster","implyDruidVersion":"0.18.0-iap12","implyVersion":"3.3.7"}
{"feed":"metrics","timestamp":"2020-07-23T15:19:30.445Z","service":"druid/middlemanager","host":"druid-backup2-int.company.com:8102","version":"0.18.0-iap12","metric":"jvm/cpu/sys","value":70,"dataSource":["company-errors"],"id":["index_kafka_company-errors_8825d8e2b4415f1_cpidiogh"],"implyNodeType":"druid/peon","implyCluster":"clarity-collection-cluster","implyDruidVersion":"0.18.0-iap12","implyVersion":"3.3.7"}
{"feed":"metrics","timestamp":"2020-07-23T15:19:30.446Z","service":"druid/middlemanager","host":"druid-backup2-int.company.com:8102","version":"0.18.0-iap12","metric":"jvm/threads/livePeak","value":123,"dataSource":["company-errors"],"id":["index_kafka_company-errors_8825d8e2b4415f1_cpidiogh"],"implyNodeType":"druid/peon","implyCluster":"clarity-collection-cluster","implyDruidVersion":"0.18.0-iap12","implyVersion":"3.3.7"}

Will it work with your exporter ?

Thank You!

elukey commented 4 years ago

@freemanlutsk in theory yes, but it's better to test it before enabling it for production use cases. If you encounter any error, please report it here and I'll try to fix it.
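
A quick way to sanity-check the format before wiring everything up could be something like this (a sketch only; the field names come from the sample messages above, trimmed for brevity):

```python
import json

sample = (
    '{"feed":"metrics","timestamp":"2020-07-23T15:19:30.444Z",'
    '"service":"druid/middlemanager","host":"druid-backup2-int.company.com:8102",'
    '"metric":"jvm/cpu/user","value":1410,"dataSource":["company-errors"]}'
)

datapoint = json.loads(sample)
# The extra clarity-specific fields (implyNodeType, implyCluster, ...) should
# simply be ignored; what matters is that these basic fields are present.
print(datapoint['feed'], datapoint['service'], datapoint['metric'], datapoint['value'])
```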

freemanlutsk commented 4 years ago

@elukey
Is this the right way to run the exporter? It seems something went wrong, because the exposed port (8000) does not work.

docker run --rm -v /root/druid.json:/exporter/conf/druid.json registry.company.com/bidmachine/druid-exporter:kafka_v1 -t bidmachine-druid-metrics-stage -b kafka1-ext.example.com:9092 kafka2-ext.example.com:9092 kafka3-ext.example.com:9092 kafka4-ext.example.com:9092 kafka5-ext.example.com:9092 kafka6-ext.example.com:9092 kafka7-ext.example.com:9092 kafka8-ext.example.com:9092 kafka9-ext.example.com:9092 -g druid-metrics-stage  -d conf/druid.json
INFO:__main__:Using Kafka config: {'topic': 'bidmachine-druid-metrics-stage', 'bootstrap_servers': ['kafka1-ext.example.com:9092', 'kafka2-ext.example.com:9092', 'kafka3-ext.example.com:9092', 'kafka4-ext.example.com:9092', 'kafka5-ext.example.com:9092', 'kafka6-ext.example.com:9092', 'kafka7-ext.example.com:9092', 'kafka8-ext.example.com:9092', 'kafka9-ext.example.com:9092'], 'group_id': 'druid-metrics-stage'}
INFO:__main__:Starting druid_exporter on :8000
INFO:__main__:Reading metrics configuration from conf/druid.json
INFO:__main__:Checking consistency of metrics config file..
DEBUG:druid_exporter.collector:Process datapoints thread starting..

elukey commented 4 years ago

@freemanlutsk what issue do you see? A timeout while contacting port 8000? Or something different? I'd check whether port 8000 is correctly exposed by your Docker container to localhost (where I guess you are trying to test it). Also, you need to push datapoints to the topic to make it process something, since by default the Kafka client starts from the tail of the topic. Thanks for testing!

freemanlutsk commented 4 years ago

netstat -antp|grep LIST|grep 8000 is empty.

curl 127.0.0.1:8000/metrics
curl: (7) Failed to connect to 127.0.0.1 port 8000: Connection refused

@elukey

freemanlutsk commented 4 years ago

Oh, it was my mistake: --net host. Never mind my previous messages )

freemanlutsk commented 4 years ago

@elukey It seems the consumer does not grab any data from the Kafka stream (kafkacat shows there are a lot of metrics in the topic). Also, I noticed that after starting the consumer (the exporter), no consumer group was created in the Kafka cluster.
Perhaps it's a problem with clarity-kafka.

Please advise. Thank you!

elukey commented 4 years ago

@freemanlutsk do you see, among the debug logs, something like: Kafka datapoints puller thread starting.. ?

If not, I think the kafka-python dependency is not pulled in, ending up in https://github.com/wikimedia/operations-software-druid_exporter/blob/master/druid_exporter/collector.py#L27-L30. I haven't updated the Dockerfile yet; are you including kafka-python in the one that you are using, by any chance?

If this is the problem, maybe I can add a log to inform the user of what is happening, like: "Looks like you added a Kafka config but your Kafka client dependencies are not deployed, so I'll skip creating the Kafka puller thread."
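
Something along these lines (a sketch of the idea, not the exact collector.py code):

```python
import logging

log = logging.getLogger(__name__)

# kafka-python is an optional dependency, only needed when a Kafka config is set.
try:
    from kafka import KafkaConsumer
except ImportError:
    KafkaConsumer = None

def start_kafka_puller(kafka_config):
    """Start the Kafka puller thread only if the client library is available."""
    if kafka_config is None:
        return
    if KafkaConsumer is None:
        log.warning(
            'A Kafka config was provided but the Kafka client dependencies '
            'are not deployed, skipping the creation of the Kafka puller thread.')
        return
    # ... create the KafkaConsumer and start the puller thread here ...
```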

freemanlutsk commented 4 years ago

@elukey Thanks! The missing kafka-python was the cause of the previous error. Would you be able to look through another error?

DEBUG:kafka.consumer.fetcher:Adding fetched record for partition TopicPartition(topic='druid-metrics-stage', partition=8) with offset 410380 to buffered record list
DEBUG:kafka.consumer.fetcher:Adding fetched record for partition TopicPartition(topic='druid-metrics-stage', partition=1) with offset 410377 to buffered record list
DEBUG:kafka.consumer.fetcher:Adding fetched record for partition TopicPartition(topic='druid-metrics-stage', partition=2) with offset 410370 to buffered record list
DEBUG:kafka.consumer.fetcher:Adding fetched record for partition TopicPartition(topic='druid-metrics-stage', partition=7) with offset 410371 to buffered record list
DEBUG:kafka.consumer.fetcher:Adding fetched record for partition TopicPartition(topic='druid-metrics-stage', partition=5) with offset 410372 to buffered record list
DEBUG:kafka.consumer.fetcher:Adding fetched record for partition TopicPartition(topic='druid-metrics-stage', partition=6) with offset 410368 to buffered record list
DEBUG:kafka.consumer.fetcher:Adding fetched record for partition TopicPartition(topic='druid-metrics-stage', partition=4) with offset 410377 to buffered record list
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/druid_exporter-0.10-py3.8.egg/druid_exporter/collector.py", line 298, in pull_datapoints_from_kafka
    json_message = json.loads(message.value.decode())
NameError: name 'json' is not defined

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/site-packages/druid_exporter-0.10-py3.8.egg/druid_exporter/collector.py", line 305, in pull_datapoints_from_kafka
    except json.JSONDecodeError:
NameError: name 'json' is not defined

elukey commented 4 years ago

@freemanlutsk this is a bug; can you try the latest version of the code? (I just committed a fix.)

freemanlutsk commented 4 years ago

Thank you! It looks like everything works fine!

But the coordinator metrics show up 3 times (if you have 3 masters). For metrics from the masters we have to use some aggregation function like avg.

And I still need to see how the exporter behaves in the production environment.

elukey commented 4 years ago

@freemanlutsk thanks a lot for the tests! Yes, the coordinator metrics are still something that I haven't solved: only one coordinator is the master at any given time, and that is the one that publishes metrics. For any number of reasons it can swap at any time (say due to a restart, a problem, etc.).

Also keep in mind https://github.com/wikimedia/operations-software-druid_exporter#known-limitations, which I still haven't solved. Basically, the main issue is that when the coordinator master moves from one host to another, then due to how Prometheus works both will keep returning metrics: the "former" master will keep returning its last metrics state, and the new one will emit the new metrics. I have some workarounds in mind, but I still haven't found a good compromise in the code. If it turns out to be a problem for you I'll try to prioritize it 👍