wvanbergen / kafka

Load-balancing, resuming Kafka consumer for go, backed by Zookeeper.
MIT License

some partitions have no owner #102

Open Huang-lin opened 8 years ago

Huang-lin commented 8 years ago

Hi, I found a problem: when I restart my consumers, a partition is left with no owner.

The consumer group triggers a rebalance when I stop a consumer. The timeout in the finalizePartition function is cg.config.Offsets.ProcessingTimeout, and the retry window during rebalancing is also cg.config.Offsets.ProcessingTimeout.

Before a consumer stops, it has to finish whatever processing it was doing and then run finalizePartition. That can take longer than cg.config.Offsets.ProcessingTimeout, so the rebalance may fail and some partitions are left with no owner.
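For reference, a minimal sketch of where that timeout is configured, following the JoinConsumerGroup example from this repo's README (the group name, topic, and Zookeeper address are placeholders):

```go
package main

import (
	"log"
	"time"

	"github.com/wvanbergen/kafka/consumergroup"
)

func main() {
	config := consumergroup.NewConfig()
	// This single value bounds both how long finalizePartition waits for
	// in-flight messages and how long a rebalance keeps retrying a claim.
	config.Offsets.ProcessingTimeout = 10 * time.Second

	consumer, err := consumergroup.JoinConsumerGroup(
		"example-group",            // placeholder consumer group name
		[]string{"example-topic"},  // placeholder topic
		[]string{"localhost:2181"}, // placeholder Zookeeper node
		config)
	if err != nil {
		log.Fatalln(err)
	}
	defer consumer.Close()
}
```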

To solve this problem, maybe we can add a goroutine that watches each partition's owner. That would also help when the number of partitions changes, for example after a Kafka broker capacity expansion.
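A hypothetical sketch of such a watcher using the go-zookeeper client; the owner path and the reclaim callback are placeholders, not this library's actual internals:

```go
package main

import "github.com/samuel/go-zookeeper/zk"

// watchPartitionOwner re-checks a partition's ephemeral owner node whenever a
// Zookeeper event fires on it, and calls reclaim when nobody owns the partition.
// ownerPath and reclaim are illustrative placeholders.
func watchPartitionOwner(conn *zk.Conn, ownerPath string, reclaim func()) {
	for {
		exists, _, events, err := conn.ExistsW(ownerPath)
		if err != nil {
			return
		}
		if !exists {
			reclaim() // the partition has no owner: try to claim it again
		}
		<-events // block until the node is created, deleted, or changed
	}
}
```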

jimmy1911 commented 8 years ago

@wvanbergen I am running into the same problem.

wvanbergen commented 8 years ago

In my experience, the most common reason for partitions not being consumed is locks still being present in Zookeeper because a previous consumer was not shut down properly. After the Zookeeper connection timeout, these locks should be garbage collected and the consumer group should pick up the partitions. However, it's possible there are bugs in this.
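To reduce the chance of stale locks, a hedged sketch of shutting the consumer down on interrupt so its Zookeeper claims are released immediately rather than lingering until the session timeout; closeOnInterrupt is a hypothetical helper built around the Close() call shown in this repo's README:

```go
package main

import (
	"log"
	"os"
	"os/signal"

	"github.com/wvanbergen/kafka/consumergroup"
)

// closeOnInterrupt closes the consumer group when the process receives SIGINT,
// so its Zookeeper claims are released right away instead of waiting for the
// session timeout to garbage-collect them. Hypothetical helper, not library API.
func closeOnInterrupt(consumer *consumergroup.ConsumerGroup) {
	c := make(chan os.Signal, 1)
	signal.Notify(c, os.Interrupt)
	go func() {
		<-c
		if err := consumer.Close(); err != nil {
			log.Println("error closing the consumer:", err)
		}
	}()
}
```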

It's unlikely that I'll be working on this myself any time soon. I am happy to accept patches for this though.

Huang-lin commented 8 years ago

I think watching the partition owner is difficult because we don't know when a consumer finishes registering its ownership in Zookeeper. A better way to solve this seems to be triggering a rebalance after claiming a partition fails. I have finished this patch and created a pull request.
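A rough sketch of the idea behind that patch; claim and triggerRebalance are hypothetical callbacks standing in for the library's real internals:

```go
package main

import (
	"errors"
	"time"
)

// claimWithRebalance retries claiming a partition for up to timeout. If the
// claim still fails (for example because a dying instance still holds the
// lock), it forces another rebalance instead of silently leaving the
// partition unowned. Illustrative only, not the library's actual code.
func claimWithRebalance(claim func() error, triggerRebalance func(), timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if err := claim(); err == nil {
			return nil
		}
		time.Sleep(interval)
	}
	triggerRebalance() // give up on this attempt and redistribute partitions
	return errors.New("failed to claim partition before timeout")
}
```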

rengawm commented 8 years ago

There are now three PRs open which attempt to address roughly the same problem with varying amounts of change:

https://github.com/wvanbergen/kafka/pull/88
https://github.com/wvanbergen/kafka/pull/101
https://github.com/wvanbergen/kafka/pull/103

@wvanbergen At least in my experience with this particular issue, I've seen two race conditions. I've documented them in the comment on https://github.com/wvanbergen/kafka/pull/101.