zendesk / ruby-kafka

A Ruby client library for Apache Kafka
http://www.rubydoc.info/gems/ruby-kafka
Apache License 2.0
1.27k stars · 339 forks

Offset is getting reset. #554

Closed bradurani closed 6 years ago

bradurani commented 6 years ago

In production, I occasionally see the offsets for all partitions in a topic take a precipitous nosedive: (screenshot). These drops always occur during flurries of group join and group sync events. I'm passing start_from_beginning: false.

Occasionally, but not always, a partition will stop getting consumed, and I'll see in the logs:

ERROR -- : Invalid offset for analytic_events_production/7, resetting to default offset

I've since upgraded to 0.5.4. I'll let you know what I see.

dasch commented 6 years ago

👍 let's hope it's already fixed.

klippx commented 6 years ago

@bradurani any updates on this?

bradurani commented 6 years ago

Yes, it happened again this weekend: (screenshot). I'm trying to make sense of the numbers here, but I can't. Total lag across all partitions was about 30M messages: (screenshot). But that only represents about 2.5 days' worth of messages, and the cluster is set to 1 week of retention, so I don't think it's trying to consume from the beginning, but I need to do some more work to be sure. Also, notice the smaller drop later in the day? I'm not sure what's happening there.

bradurani commented 6 years ago

What's your intuition on this one? Is it possible it's a bug in the gem related to compressed batches? I can't think of another way to explain it. The last comment was with 0.5.4, btw.

dasch commented 6 years ago

Are you still seeing invalid offset errors in the logs? Would you be able to enable debug logging and see which offset was being requested?

An invalid offset error is returned by the broker if the client asks for an "invalid" offset – to my knowledge, that means an offset outside the range the partition currently retains: either larger than the latest offset, or older than the earliest offset still kept under the retention policy. So it sounds like offsets are being computed incorrectly. It could be some special behavior when compression is enabled – the rules around how to compute offsets for compressed batches have become very complex.

If you can figure out exactly what offset is being fetched from, and correlate with the highwater mark offset for that partition, it would be very helpful.
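The range check described above can be sketched as a tiny model (this is an illustration of the broker-side rule, not actual ruby-kafka or broker code; the helper name and example offsets are made up):

```ruby
# Simplified model of the broker's offset validity check (hypothetical
# helper, not part of ruby-kafka). A fetch offset is only valid inside
# the retained range [earliest_offset, highwater_mark]; anything outside
# it triggers an out-of-range error, which the client logs as
# "Invalid offset ..., resetting to default offset".
def offset_in_range?(requested, earliest_offset, highwater_mark)
  requested >= earliest_offset && requested <= highwater_mark
end

# Example: a partition currently retaining offsets 1_000..5_000
offset_in_range?(4_200, 1_000, 5_000) # => true  (normal fetch)
offset_in_range?(6_000, 1_000, 5_000) # => false (ahead of the log end)
offset_in_range?(100,   1_000, 5_000) # => false (already deleted by retention)
```

Correlating the requested offset with the two ends of that range tells you which failure mode you're in: too high suggests a miscomputed offset, too low suggests retention outran the consumer.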

dasch commented 6 years ago

I've introduced improved logging in the v0.5-stable branch – if you point your Gemfile at that you should be able to see the offset that caused the problem.
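For reference, pointing the Gemfile at that branch looks something like this (standard Bundler git-source syntax; the branch name is the one mentioned above):

```ruby
# Gemfile — pull ruby-kafka from the v0.5-stable branch instead of the
# released gem, to pick up the improved offset logging.
gem "ruby-kafka", github: "zendesk/ruby-kafka", branch: "v0.5-stable"
```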

bradurani commented 6 years ago

Ok, I've been able to get more visibility. Here's an example log sequence:

E, [2018-04-12T06:55:14.586602 #1] ERROR -- : Timed out reading data from server
E, [2018-04-12T06:55:14.586772 #1] ERROR -- : /usr/local/bundle/gems/rest-client-2.0.2/lib/restclient/request.rb:733:in `rescue in transmit'
/usr/local/bundle/gems/rest-client-2.0.2/lib/restclient/request.rb:642:in `transmit'
/usr/local/bundle/gems/rest-client-2.0.2/lib/restclient/request.rb:145:in `execute'
/usr/local/bundle/gems/rest-client-2.0.2/lib/restclient/request.rb:52:in `execute'
/usr/src/app/pipes/base_kinesis_adapter.rb:8:in `perform_track'
/usr/src/app/pipes/base_analytics_adapter.rb:26:in `block in track'
/usr/src/app/pipes/base_analytics_adapter.rb:64:in `block in monitor_send'
/usr/local/bundle/gems/dogstatsd-ruby-3.3.0/lib/datadog/statsd.rb:241:in `time'
/usr/src/app/pipes/base_analytics_adapter.rb:63:in `monitor_send'
/usr/src/app/pipes/base_analytics_adapter.rb:25:in `track'
/usr/src/app/pipes/consumer.rb:72:in `block in consume'
/usr/local/bundle/gems/ruby-kafka-0.5.4/lib/kafka/consumer.rb:309:in `block (3 levels) in each_batch'
/usr/local/bundle/gems/activesupport-5.1.4/lib/active_support/notifications.rb:166:in `block in instrument'
/usr/local/bundle/gems/activesupport-5.1.4/lib/active_support/notifications/instrumenter.rb:21:in `instrument'
/usr/local/bundle/gems/activesupport-5.1.4/lib/active_support/notifications.rb:166:in `instrument'
/usr/local/bundle/gems/ruby-kafka-0.5.4/lib/kafka/instrumenter.rb:19:in `instrument'
/usr/local/bundle/gems/ruby-kafka-0.5.4/lib/kafka/instrumenter.rb:33:in `instrument'
/usr/local/bundle/gems/ruby-kafka-0.5.4/lib/kafka/consumer.rb:307:in `block (2 levels) in each_batch'
/usr/local/bundle/gems/ruby-kafka-0.5.4/lib/kafka/consumer.rb:292:in `each'
/usr/local/bundle/gems/ruby-kafka-0.5.4/lib/kafka/consumer.rb:292:in `block in each_batch'
/usr/local/bundle/gems/ruby-kafka-0.5.4/lib/kafka/consumer.rb:375:in `block in consumer_loop'
/usr/local/bundle/gems/activesupport-5.1.4/lib/active_support/notifications.rb:168:in `instrument'
/usr/local/bundle/gems/ruby-kafka-0.5.4/lib/kafka/instrumenter.rb:19:in `instrument'
/usr/local/bundle/gems/ruby-kafka-0.5.4/lib/kafka/instrumenter.rb:33:in `instrument'
/usr/local/bundle/gems/ruby-kafka-0.5.4/lib/kafka/consumer.rb:374:in `consumer_loop'
/usr/local/bundle/gems/ruby-kafka-0.5.4/lib/kafka/consumer.rb:284:in `each_batch'
/usr/src/app/pipes/consumer.rb:63:in `consume'
/usr/src/app/pipes/consumer.rb:43:in `start'
/usr/src/app/start:10:in `<main>'
E, [2018-04-12T06:55:14.620553 #1] ERROR -- : Error committing offsets: Kafka::UnknownMemberId
E, [2018-04-12T06:55:14.623089 #1] ERROR -- : Failed to join group; resetting member id and retrying in 1s...
W, [2018-04-12T06:55:22.102364 #1]  WARN -- : Connection has been unused for too long, re-connecting...
W, [2018-04-12T06:55:22.276319 #1]  WARN -- : Connection has been unused for too long, re-connecting...
W, [2018-04-12T06:55:22.439806 #1]  WARN -- : Connection has been unused for too long, re-connecting...

I don't see any OffsetOutOfRange errors

I haven't received any ERROR -- : Invalid offset for ... messages since I upgraded to 0.5.4.

dasch commented 6 years ago

OK. Add a comment if you do see it happen again.

bradurani commented 6 years ago

@dasch the thing is, even without ERROR -- : Invalid offset for messages, I'm still seeing these huge offset drops. Just today, this is what I got: (screenshot). I don't understand what causes the offset to drop like that. Any ideas?

dasch commented 6 years ago

@bradurani you're sure there aren't two groups reporting those metrics? I've made that mistake in the past, not separating metrics from two different data centers.

bradurani commented 6 years ago

Ok... I have automatically_mark_as_processed: false, offset_commit_interval is still at its default of 10 seconds, and I'm not calling consumer.commit_offsets. So no offsets are being committed at all. I assumed offset_commit_interval would still be respected because the README does not explicitly say it isn't, but I guess that's not the case 😊 My mistake!
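The interaction described above can be sketched with a small model (a simplified illustration of the behavior, not the actual ruby-kafka internals): the periodic commit timer only flushes offsets that have been *marked* as processed, so with automatic marking disabled and no manual marking or committing, nothing ever reaches the broker, and after a rebalance the group falls back to the default offset.

```ruby
# Simplified model (not real ruby-kafka code) of why no offsets were
# committed: the offset_commit_interval timer only flushes offsets that
# were previously marked as processed.
class OffsetTracker
  def initialize
    @marked    = {} # { [topic, partition] => next offset to commit }
    @committed = {}
  end

  # Called automatically after each message when
  # automatically_mark_as_processed is true; with it set to false, the
  # application has to do the marking (or committing) itself.
  def mark_as_processed(topic, partition, offset)
    @marked[[topic, partition]] = offset + 1
  end

  # Fires every offset_commit_interval seconds — a no-op when nothing
  # has been marked.
  def commit!
    @committed.merge!(@marked)
  end

  def committed_offset(topic, partition)
    # :none stands in for "no committed offset"; the consumer then
    # resets to the default offset on its next (re)join.
    @committed.fetch([topic, partition], :none)
  end
end

tracker = OffsetTracker.new
tracker.commit! # the timer fires, but nothing was ever marked...
tracker.committed_offset("analytic_events_production", 7) # => :none
```

Under this model, the offset "drops" in the graphs are the group rejoining with no committed offsets and resetting, rather than anything being rewound on the broker.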

dasch commented 6 years ago

😂