Closed: bradurani closed this issue 6 years ago
👍 let's hope it's already fixed.
@bradurani any updates on this?
Yes, it happened again this weekend. I'm trying to make sense of the numbers here, but I can't. Total lag across all partitions was about 30m, but that only represents about 2.5 days' worth of messages, and the cluster is set to 1 week of retention, so I don't think it's trying to consume from the beginning — but I need to do some more work to be sure. Also, notice the smaller drop later in the day? I'm not sure what's happening.
What's your intuition on this one? Is it possible it's a bug in the gem related to compressed batches? I can't think of another way to explain it. The last comment was with 0.5.4.
btw
Are you still seeing invalid offset errors in the logs? Would you be able to enable debug logging and see which offset was being requested?
An invalid offset error is returned by the broker if the client asks for an "invalid" offset – to my knowledge, that can only be an offset that is larger than the largest offset in the partition, so it sounds like offsets are being computed incorrectly. It could be some special behavior when compression is enabled – the rules around how to compute offsets have become very complex.
If you can figure out exactly what offset is being fetched from, and correlate with the highwater mark offset for that partition, it would be very helpful.
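To make the rule above concrete, here is a hypothetical sketch of the validity check being described (illustrative only, not actual broker code): a requested fetch offset is "invalid" when it falls outside the range bounded by the log start offset and the high watermark.

```ruby
# Sketch of the broker-side check described above (not Kafka source code):
# a fetch offset is out of range if it is below the earliest retained offset
# or beyond the partition's high watermark.
def offset_out_of_range?(requested_offset, log_start_offset, high_watermark)
  requested_offset < log_start_offset || requested_offset > high_watermark
end

offset_out_of_range?(105, 0, 100)  # asking past the high watermark => true
offset_out_of_range?(50, 0, 100)   # within the retained range => false
```

Correlating the requested offset from the debug logs against these two bounds for the affected partition would show which side of the range the client is falling off.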
I've introduced improved logging in the `v0.5-stable` branch – if you point your Gemfile at that you should be able to see the offset that caused the problem.
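For reference, pointing a Gemfile at a branch looks something like this (the repository URL is an assumption – substitute the actual ruby-kafka repository you install from):

```ruby
# Gemfile: track the v0.5-stable branch instead of a released gem version.
# The git URL below is assumed; point it at the real ruby-kafka repository.
gem "ruby-kafka", git: "https://github.com/zendesk/ruby-kafka.git", branch: "v0.5-stable"
```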
Ok, I've been able to get more visibility. Here's an example log sequence:
E, [2018-04-12T06:55:14.586602 #1] ERROR -- : Timed out reading data from server
E, [2018-04-12T06:55:14.586772 #1] ERROR -- : /usr/local/bundle/gems/rest-client-2.0.2/lib/restclient/request.rb:733:in `rescue in transmit'
/usr/local/bundle/gems/rest-client-2.0.2/lib/restclient/request.rb:642:in `transmit'
/usr/local/bundle/gems/rest-client-2.0.2/lib/restclient/request.rb:145:in `execute'
/usr/local/bundle/gems/rest-client-2.0.2/lib/restclient/request.rb:52:in `execute'
/usr/src/app/pipes/base_kinesis_adapter.rb:8:in `perform_track'
/usr/src/app/pipes/base_analytics_adapter.rb:26:in `block in track'
/usr/src/app/pipes/base_analytics_adapter.rb:64:in `block in monitor_send'
/usr/local/bundle/gems/dogstatsd-ruby-3.3.0/lib/datadog/statsd.rb:241:in `time'
/usr/src/app/pipes/base_analytics_adapter.rb:63:in `monitor_send'
/usr/src/app/pipes/base_analytics_adapter.rb:25:in `track'
/usr/src/app/pipes/consumer.rb:72:in `block in consume'
/usr/local/bundle/gems/ruby-kafka-0.5.4/lib/kafka/consumer.rb:309:in `block (3 levels) in each_batch'
/usr/local/bundle/gems/activesupport-5.1.4/lib/active_support/notifications.rb:166:in `block in instrument'
/usr/local/bundle/gems/activesupport-5.1.4/lib/active_support/notifications/instrumenter.rb:21:in `instrument'
/usr/local/bundle/gems/activesupport-5.1.4/lib/active_support/notifications.rb:166:in `instrument'
/usr/local/bundle/gems/ruby-kafka-0.5.4/lib/kafka/instrumenter.rb:19:in `instrument'
/usr/local/bundle/gems/ruby-kafka-0.5.4/lib/kafka/instrumenter.rb:33:in `instrument'
/usr/local/bundle/gems/ruby-kafka-0.5.4/lib/kafka/consumer.rb:307:in `block (2 levels) in each_batch'
/usr/local/bundle/gems/ruby-kafka-0.5.4/lib/kafka/consumer.rb:292:in `each'
/usr/local/bundle/gems/ruby-kafka-0.5.4/lib/kafka/consumer.rb:292:in `block in each_batch'
/usr/local/bundle/gems/ruby-kafka-0.5.4/lib/kafka/consumer.rb:375:in `block in consumer_loop'
/usr/local/bundle/gems/activesupport-5.1.4/lib/active_support/notifications.rb:168:in `instrument'
/usr/local/bundle/gems/ruby-kafka-0.5.4/lib/kafka/instrumenter.rb:19:in `instrument'
/usr/local/bundle/gems/ruby-kafka-0.5.4/lib/kafka/instrumenter.rb:33:in `instrument'
/usr/local/bundle/gems/ruby-kafka-0.5.4/lib/kafka/consumer.rb:374:in `consumer_loop'
/usr/local/bundle/gems/ruby-kafka-0.5.4/lib/kafka/consumer.rb:284:in `each_batch'
/usr/src/app/pipes/consumer.rb:63:in `consume'
/usr/src/app/pipes/consumer.rb:43:in `start'
/usr/src/app/start:10:in `<main>'
E, [2018-04-12T06:55:14.620553 #1] ERROR -- : Error committing offsets: Kafka::UnknownMemberId
E, [2018-04-12T06:55:14.623089 #1] ERROR -- : Failed to join group; resetting member id and retrying in 1s...
W, [2018-04-12T06:55:22.102364 #1] WARN -- : Connection has been unused for too long, re-connecting...
W, [2018-04-12T06:55:22.276319 #1] WARN -- : Connection has been unused for too long, re-connecting...
W, [2018-04-12T06:55:22.439806 #1] WARN -- : Connection has been unused for too long, re-connecting...
I don't see any `OffsetOutOfRange` errors.
I haven't received `ERROR -- : Invalid offset for` since I upgraded to 0.5.4.
OK. Add a comment if you do see it happen again.
@dasch thing is, even without `ERROR -- : Invalid offset for` messages, I'm still seeing these huge offset drops. Just today, this is what I got:
I don't understand what causes the offset to drop like that. Any ideas?
@bradurani you're sure there aren't two groups reporting those metrics? I've made that mistake in the past, not separating metrics from two different data centers.
Ok... I have `automatically_mark_as_processed: false` and `offset_commit_interval` is still set at its default of 10, and I'm not calling `consumer.commit_offsets`. No offsets are being committed. I assumed `offset_commit_interval` is still respected because the README does not explicitly say it's not, but I guess that's not the case 😊
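To make the mix-up concrete: with `automatically_mark_as_processed: false` and nothing ever marked, the commit timer has nothing to flush when it fires. A hypothetical sketch of that bookkeeping (illustrative only, not ruby-kafka internals):

```ruby
# Sketch of the commit bookkeeping described above (not ruby-kafka code):
# the commit-interval timer can only commit offsets that have been marked
# as processed; if nothing is ever marked, the timer commits nothing.
class OffsetTracker
  def initialize
    @marked = {}     # partition => last offset marked as processed
    @committed = {}  # partition => last committed offset
  end

  def mark_as_processed(partition, offset)
    @marked[partition] = offset
  end

  # Called on every commit-interval tick.
  def commit!
    @marked.each { |partition, offset| @committed[partition] = offset }
    @committed.dup
  end
end

tracker = OffsetTracker.new
tracker.commit!  # => {} : nothing was marked, so nothing is committed
```

This is why the interval alone isn't enough: something has to mark messages as processed (or call `consumer.commit_offsets` explicitly) for the timer to have anything to commit.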
My mistake!
😂
In production, I occasionally see the offsets for all partitions in a topic take a precipitous nose dive. These drops always occur during flurries of group join and group sync events. I'm passing `start_from_beginning: false`.
Occasionally, but not always, a partition will stop getting consumed, and I'll see in the logs:
ERROR -- : Invalid offset for analytic_events_production/7, resetting to default offset
I've since upgraded to 0.5.4. I'll let you know what I see.