zendesk / racecar

Racecar: a simple framework for Kafka consumers in Ruby
Apache License 2.0

Some consumers in a group reverted to older offsets shortly after deploy #198

Open Hammadk opened 3 years ago

Hammadk commented 3 years ago

We had an incident where two consumers in the same consumer group reprocessed messages from older offsets (they each moved back around 500,000 messages). This happened shortly after a deploy. I am posting about this here in case someone has ideas on why this could have happened, or in case it points to a bug in racecar / rdkafka-ruby. This could be a bug in how rebalancing is handled, but I haven't been able to identify a code path that would cause this.

Our Racecar setup

Details of what happened

Last week, we saw that the consumer lag spiked by 1 million shortly after a deploy. We noticed that consumers working on partition 3 and partition 6 were re-processing old messages.

Partition 3 behaviour

Partition 6 behaviour

For partition 3, the offset issue happened when another consumer took over. For partition 6, it happened after a clean shutdown. We have not noticed any broker issues that would explain this, so the current theory is that it is related to our Kafka consumer. Would love some thoughts / ideas.

dasch commented 3 years ago

If your session timeout is 5 minutes, that means that a consumer can theoretically process messages for 5 minutes before figuring out that the group has been re-synchronized. At that point, the consumer cannot commit its offsets, so the messages will be re-processed.

What is the offset commit config for your consumers?
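
For reference, the two settings being discussed here are ordinary Racecar config options. A minimal sketch, assuming a standard `Racecar.configure` block; the values mirror the ones mentioned in this thread and are not recommendations:

```ruby
Racecar.configure do |config|
  # How long the broker waits without a heartbeat before evicting a member
  # and triggering a rebalance. The override discussed in this thread was
  # 5 minutes; Racecar's default is 30 seconds.
  config.session_timeout = 300

  # How often processed offsets are committed back to the group coordinator.
  # Racecar's default is 10 seconds, so up to ~10 seconds of already-processed
  # messages can be replayed after a rebalance even when everything works.
  config.offset_commit_interval = 10
end
```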

dasch commented 3 years ago

Also: by opting to use rolling updates you'll get these sorts of issues. I'm also not quite sure why you'd get more availability – the whole group needs to re-sync every time a member comes or goes, so you'll have a prolonged period of chaos, during which double-processing will be more likely than not.

Hammadk commented 3 years ago

If your session timeout is 5 minutes that means that a consumer can theoretically process messages for 5 minutes before figuring out that the group has been re-synchronized. At that point, the consumer cannot commit its offsets, so the messages will be re-processed.

The odd thing is that the consumer reverted back to an offset that it successfully processed 17 hours ago. This large change in offset is outside standard re-processing during rebalancing. I wonder if the consumer is going back to the beginning of the partition. Is there a bug or flow that could cause this? Could you point me to the relevant rebalancing logic?

Note that I don't have a good way of identifying what the first offset for this partition was at the time of the incident. Technically, our retention policy is 3 days, so if the consumer had gone back to the beginning of the partition, it should have gone even further back.
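
As an aside, the earliest offset still retained for a partition can be read back from the brokers at any time, which would help rule the went-back-to-the-beginning theory in or out. A minimal sketch using the rdkafka gem directly; the broker address, group id, topic, and partition are placeholders:

```ruby
require "rdkafka"

# Placeholders: point these at the real cluster, topic, and partition.
config = Rdkafka::Config.new(
  "bootstrap.servers" => "localhost:9092",
  "group.id"          => "offset-inspector" # needed to build a consumer; no offsets are committed
)
consumer = config.consumer

# Low watermark = earliest offset still retained; high watermark = next offset to be written.
low, high = consumer.query_watermark_offsets("my-topic", 3)
puts "partition 3: earliest=#{low}, latest=#{high}"

consumer.close
```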

What is the offset commit config for your consumers?

We don't have a custom offset commit config; we use Racecar's default offset_commit_interval of 10 seconds. Is that the info you were looking for?

session timeout is 5 minutes

I'll remove our override and change this back to the default of 30 seconds. We added this config on Racecar versions < 2.0.0 to avoid cases where the whole batch could not be processed in time.
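
To double-check what is actually in effect after removing the override, the resolved values can be printed from the running configuration. A small sketch, assuming access to `Racecar.config` in an application console:

```ruby
# Run inside the application after the Racecar config has been loaded.
puts "session_timeout:        #{Racecar.config.session_timeout}s"
puts "offset_commit_interval: #{Racecar.config.offset_commit_interval}s"
```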

by opting to use rolling updates you'll get these sorts of issues. I'm also not quite sure why you'd get more availability – the whole group needs to re-sync every time a member comes or goes, so you'll have a prolonged period of chaos, during which double-processing will be more likely than not.

We were expecting that having some of the consumers keep working correctly and committing their offsets, and making slight progress, would be worth it. If you think this makes bigger issues more likely, then I think it makes sense to recreate the pods instead.

dasch commented 3 years ago

Try to simplify your config and see if that doesn't resolve the issue. It's tricky to debug these things, unfortunately. The problem could be in Racecar, rdkafka-ruby, or in librdkafka.

tjwp commented 3 years ago

With the rebalancing during a rolling restart, could something like this be the issue: https://github.com/edenhill/librdkafka/issues/2782

I have not run the rdkafka-based Racecar v2, but from looking at the source it appears that offsets are manually committed, as in the steps described in the issue above.

The issue above was fixed in librdkafka v1.4.2, but the current rdkafka gem uses v1.4.0.
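
One quick way to confirm whether a given environment is affected is to check which librdkafka the installed rdkafka gem actually bundles. A small sketch, assuming the version constants exposed by the gem:

```ruby
require "rdkafka"

puts "rdkafka gem: #{Rdkafka::VERSION}"
puts "librdkafka:  #{Rdkafka::LIBRDKAFKA_VERSION}"

# The fix for librdkafka#2782 landed in librdkafka 1.4.2, so anything older
# is potentially affected by the offset-rewind behaviour described above.
if Gem::Version.new(Rdkafka::LIBRDKAFKA_VERSION) < Gem::Version.new("1.4.2")
  puts "potentially affected (pre-1.4.2 librdkafka)"
else
  puts "includes the 1.4.2 fix"
end
```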

Hammadk commented 3 years ago

Great find @tjwp! Looking through https://github.com/edenhill/librdkafka/issues/2782 https://github.com/edenhill/librdkafka/issues/1767 and https://github.com/confluentinc/confluent-kafka-go/issues/209, this does seem like the same issue. I guess now we wait for rdkafka-ruby to update to librdkafka v1.4.2 before this is resolved on racecar as well.

I have removed the session_timeout override in our config (we haven't needed it since upgrading to Racecar 2.0.0).

Thanks for looking into it, and please close this unless you want to keep it open as a reference.

Draiken commented 2 years ago

We had an incident yesterday with this, but without any deployment. The code was running for weeks (with perhaps the occasional dyno restart).

A single partition was reset and suddenly reported a 1.6M message lag. I checked and we didn't have a spike of messages being produced, so it was definitely the offset for that partition somehow getting reset to a very old value.

Given there was no deployment event I suspect it might indeed be https://github.com/edenhill/librdkafka/issues/2782 or a similar issue caused by rebalancing.

So is there no recourse for this aside from manually resetting the offset, until we get the upstream bug fix into rdkafka-ruby and finally into racecar?
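
For the manual reset itself, one option is to commit a known-good offset for the affected partition with the rdkafka gem, after stopping the consumer group so nothing overwrites the commit. A hedged sketch; the broker address, group id, topic, partition, and target offset are all placeholders:

```ruby
require "rdkafka"

# Stop all members of the group first, otherwise a running consumer may
# commit over this and undo the reset.
config = Rdkafka::Config.new(
  "bootstrap.servers" => "localhost:9092",
  "group.id"          => "my-racecar-group" # must match the affected consumer group
)
consumer = config.consumer

# Commit the offset the group should resume from on the affected partition.
list = Rdkafka::Consumer::TopicPartitionList.new
list.add_topic_and_partitions_with_offsets("my-topic", 3 => 1_234_567)
consumer.commit(list) # synchronous by default

consumer.close
```

Kafka's own kafka-consumer-groups tooling can do the same thing; either way it is a one-off intervention, not a fix for the underlying rebalance bug.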

mensfeld commented 2 years ago

@Draiken but racecar uses librdkafka 1.9.0, which has offset reset capabilities: https://github.com/edenhill/librdkafka/releases/tag/v1.9.0 and, as far as I remember, an error is raised when an offset cannot be stored.

Draiken commented 2 years ago

Oh damn, that was my bad. I looked at the commit that closed the parent issue, saw v1.9.2 and assumed it was there only on that release... You are correct, it was released in 1.9.0, so I can update racecar and get this fix.

TYVM @mensfeld

mensfeld commented 2 years ago

Oh damn, that was my bad. I looked at the commit that closed the parent issue, saw v1.9.2 and assumed it was there only on that release...

I made the same mistake and started updating rdkafka to 1.9.2 :roll_eyes:

The patch for this is already in place in rdkafka, so after the upgrade you should see it go away.