Hammadk opened this issue 3 years ago
If your session timeout is 5 minutes, that means that a consumer can theoretically process messages for 5 minutes before figuring out that the group has been re-synchronized. At that point, the consumer cannot commit its offsets, so the messages will be re-processed.
What is the offset commit config for your consumers?
Also: by opting to use rolling updates you'll get these sorts of issues. I'm also not quite sure why you'd get more availability – the whole group needs to re-sync every time a member comes or goes, so you'll have a prolonged period of chaos, during which double-processing will be more likely than not.
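For reference, both settings under discussion live in Racecar's configuration file. A minimal sketch (the broker address is a placeholder, and the values shown are the defaults mentioned in this thread, not your production values):

```ruby
# config/racecar.rb -- minimal sketch; broker address is a placeholder
Racecar.configure do |config|
  config.brokers = ["localhost:9092"]

  # How long the broker waits without a heartbeat before evicting the
  # consumer and triggering a rebalance. The thread discusses a 5-minute
  # override vs the 30-second default (value is in seconds).
  config.session_timeout = 30

  # How often processed offsets are committed back to Kafka.
  # Racecar's default is 10 seconds.
  config.offset_commit_interval = 10
end
```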
> If your session timeout is 5 minutes, that means that a consumer can theoretically process messages for 5 minutes before figuring out that the group has been re-synchronized. At that point, the consumer cannot commit its offsets, so the messages will be re-processed.
The odd thing is that the consumer reverted back to an offset that it successfully processed 17 hours ago. This large change in offset is outside standard re-processing during rebalancing. I wonder if the consumer is going back to the beginning of the partition. Is there a bug or flow that could cause this? Could you point me to the relevant rebalancing logic?
Note that I don't have a good way of identifying the first offset for this partition, at the time of the incident. Technically, our retention policy is 3 days, so if this went back to the beginning, it should have gone even further back.
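The retention argument can be made concrete with a quick arithmetic sketch (the 17-hour and 3-day figures are taken from the report above):

```ruby
# Ages taken from the report: the reverted-to offset was ~17 hours old,
# while the topic's retention window is 3 days.
retention_hours    = 3 * 24
reverted_age_hours = 17

# If the consumer had truly reset to the beginning of the partition, the
# oldest retained offset would sit near the retention boundary, so the
# reverted-to offset's age should be close to 72 hours, not 17.
reset_to_beginning = reverted_age_hours >= retention_hours
# => false, which argues against a simple reset-to-earliest
```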
> What is the offset commit config for your consumers?
We don't have a custom offset commit config; we use Racecar's default `offset_commit_interval` of 10 seconds. Is that the info you were looking for?
> session timeout is 5 minutes
I'll remove our override and change this back to the default of 30 seconds. We added this config for Racecar versions < 2.0.0 to avoid cases where the whole batch could not be processed in time.
> by opting to use rolling updates you'll get these sorts of issues. I'm also not quite sure why you'd get more availability – the whole group needs to re-sync every time a member comes or goes, so you'll have a prolonged period of chaos, during which double-processing will be more likely than not.
We were expecting that some of the consumers would keep working correctly and committing their offsets, and that the slight progress would be worth it. If you think that this makes it more likely to cause bigger issues, then I think it makes sense to recreate pods instead.
Try to simplify your config and see if that doesn't resolve the issue. It's tricky to debug these things, unfortunately. The problem could be in Racecar, rdkafka-ruby, or in librdkafka.
With the rebalancing during a rolling restart, could something like this be the issue: https://github.com/edenhill/librdkafka/issues/2782
I have not run the rdkafka-based Racecar v2, but from looking at the source it appears that offsets are manually committed as in the steps described in the issue above.
The issue above was fixed in librdkafka v1.4.2, but the current rdkafka gem uses v1.4.0.
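The version gap can be sanity-checked with RubyGems version objects (versions are the ones cited in the comment above; at runtime you could also inspect `Rdkafka::LIBRDKAFKA_VERSION`, assuming that constant is exposed by the installed gem):

```ruby
require "rubygems" # Gem::Version ships with Ruby's bundled RubyGems

fix_released_in = Gem::Version.new("1.4.2") # librdkafka release with the fix for issue #2782
bundled_version = Gem::Version.new("1.4.0") # librdkafka bundled by the rdkafka gem at the time

# True while the gem still bundles a librdkafka older than the fix.
affected = bundled_version < fix_released_in
```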
Great find @tjwp! Looking through https://github.com/edenhill/librdkafka/issues/2782, https://github.com/edenhill/librdkafka/issues/1767, and https://github.com/confluentinc/confluent-kafka-go/issues/209, this does seem like the same issue. I guess now we wait for rdkafka-ruby to update to librdkafka v1.4.2 before this is resolved in Racecar as well.
I have removed the `session_timeout` override in our config (we haven't needed it since Racecar version > 2.0.0).
Thanks for looking into it, and please close this unless you want to keep it open as a reference.
We had an incident yesterday with this, but without any deployment. The code was running for weeks (with perhaps the occasional dyno restart).
A single partition was reset and suddenly reported a lag of 1.6M messages. I checked and we didn't have a spike of messages being produced, so it was definitely the offset for that partition somehow getting reset to a very old offset.
Given there was no deployment event I suspect it might indeed be https://github.com/edenhill/librdkafka/issues/2782 or a similar issue caused by rebalancing.
So there's no recourse to this aside from manually resetting the offset back, until we get the upstream bug fix into rdkafka-ruby and finally into Racecar?
@Draiken but Racecar uses 1.9.0, which has offset reset capabilities: https://github.com/edenhill/librdkafka/releases/tag/v1.9.0 and as far as I remember there is an error when an offset cannot be stored.
Oh damn, that was my bad. I looked at the commit that closed the parent issue, saw v1.9.2 and assumed it was there only on that release... You are correct and it's been released in 1.9.0, so I can update Racecar and get this fix. TYVM @mensfeld
> Oh damn, that was my bad. I looked at the commit that closed the parent issue, saw v1.9.2 and assumed it was there only on that release...
I made the same mistake and started updating rdkafka to 1.9.2 :roll_eyes:
The patch for this is already in place in rdkafka, so after the upgrade you should see it go away.
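For anyone following along, picking the fix up comes down to bumping the gems; a Gemfile sketch (version constraints are deliberately omitted — check which rdkafka release bundles a librdkafka >= 1.9.0 before pinning anything):

```ruby
# Gemfile -- sketch only; no version pins are taken from this thread
source "https://rubygems.org"

gem "racecar"
# Pick an rdkafka release that bundles librdkafka >= 1.9.0, which includes
# the offset-reset error handling discussed above.
gem "rdkafka"
```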
We had an incident where 2 consumers in the same consumer group reprocessed messages with older offsets (they moved back around 500,000 messages each). This happened shortly after a deploy. I am posting about this here in case someone has ideas on why this could have happened, or if this could point to a bug in Racecar / rdkafka-ruby. This could be a bug with how rebalancing is handled, but I haven't been able to identify the code path that could cause this.
Our Racecar setup

- Racecar version 2.1.0
- `config.session_timeout` is `5.minutes` (default is 30 seconds)
- We use `RollingUpdate` on deploys instead of `Recreate`. We know that the docs suggest recreate, but we have opted for higher availability at the cost of double processing during rebalancing. This may be the root cause, but it could also be a red herring.

Details of what happened
Last week, we saw that the consumer lag spiked by 1 million shortly after a deploy. We noticed that consumers working on `partition 3` and `partition 6` were re-processing old messages.

Partition 3 behaviour

- 7:18:00 AM: Pod A was working on offset 166981170
- 7:18:00 AM and 7:18:02 AM: Pod B worked on offsets 166429818 - 166429915
- 7:18:02 AM: Pod B logged that it was gracefully shutting down

Partition 6 behaviour

- 7:17:49 AM
- 7:18:30 AM and 7:18:31 AM: Pod B worked on offsets 167276532 - 167276679
For partition 3, the offset issue happened when another consumer took over. For partition 6, the offset issue happened after a proper shutdown. We have not noticed any broker issues that would cause this, so the current theory is that this is related to our Kafka consumer. Would love some thoughts / ideas.
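The offsets reported in the partition 3 timeline let you put a number on how far the consumer jumped back; a small sketch:

```ruby
# Offsets reported for partition 3 in the timeline above.
pod_a_offset         = 166_981_170 # where Pod A had progressed to
pod_b_restart_offset = 166_429_818 # where Pod B resumed processing

# Size of the backward jump -- consistent with the "moved back around
# 500,000 messages each" figure in the incident summary.
rollback = pod_a_offset - pod_b_restart_offset
# => 551352
```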