wvanbergen / kafka

Load-balancing, resuming Kafka consumer for go, backed by Zookeeper.
MIT License

Kafkaconsumer stops consuming after ZK connection lost/timeout #80

josselin-c closed this 8 years ago

josselin-c commented 9 years ago

Using the refactor branch (#72), and after applying PR samuel/go-zookeeper#84 to my code tree, my kafkaconsumer still fails to consume data after an I/O timeout occurs on the connection to Zookeeper.

To reproduce, prevent the consumer from reaching Zookeeper for a few seconds:

root@root# iptables -A  OUTPUT -p tcp -m tcp --dport 2181 -j DROP # Add rule to block outgoing traffic to zookeeper
... Wait a few seconds...
root@root# iptables -D  OUTPUT -p tcp -m tcp --dport 2181 -j DROP # Remove rule

Log output (I changed consumerManager.run() so it runs every 10s):

15:19:51 Recv loop terminated: err=read tcp 10.10.12.62:2181: i/o timeout
15:19:51 Send loop terminated: err=<nil>
15:19:51 [instance=f8565f67ec4a] Failed to watch subscription: zk: connection closed. Trying again in 1 second...
15:19:53 Failed to connect to 10.10.12.62:2181: dial tcp 10.10.12.62:2181: i/o timeout
15:19:53 [instance=f8565f67ec4a] Failed to watch subscription: zk: could not connect to a server. Trying again in 1 second...
..
15:20:26 Connected to 10.10.12.62:2181
15:20:26 Authenticated: id=94641789739226871, timeout=4000
15:20:26 [instance=f8565f67ec4a] Currently, 0 instances are registered, to consume 2 partitions in total.
15:20:26 [instance=f8565f67ec4a] This instance is assigned to consume 0 partitions, and is currently consuming 2 partitions.
[Sarama] 2015/10/08 15:20:26 consumer/broker/0 closed dead subscription to rqueue.out.bs_msg_in/0
[Sarama] 2015/10/08 15:20:26 consumer/broker/0 closed dead subscription to rqueue.out.bs_msg_in/1
15:20:26 [instance=f8565f67ec4a partition=rqueue.out.bs_msg_in/0] Offset 579 has been processed. Continuing shutdown...
15:20:26 [instance=f8565f67ec4a partition=rqueue.out.bs_msg_in/1] FAILED to release partition: Cannot release partition: it is not claimed by this instance
15:20:26 [instance=f8565f67ec4a partition=rqueue.out.bs_msg_in/0] FAILED to release partition: Cannot release partition: it is not claimed by this instance
15:20:36 [instance=f8565f67ec4a] Currently, 0 instances are registered, to consume 2 partitions in total.
15:20:36 [instance=f8565f67ec4a] This instance is assigned to consume 0 partitions, and is currently consuming 0 partitions.
15:20:46 [instance=f8565f67ec4a] Currently, 0 instances are registered, to consume 2 partitions in total.
15:20:46 [instance=f8565f67ec4a] This instance is assigned to consume 0 partitions, and is currently consuming 0 partitions.

yejingx commented 8 years ago

I have the same problem.

jee1mr commented 8 years ago

@yejingx Does your commit fix the issue? @wvanbergen Is his PR going to be merged?

yejingx commented 8 years ago

@jee1mr yes, my commit fixed the issue on my side.

issac-lim commented 8 years ago

@wvanbergen Please check @yejingx's PR to fix the i/o timeout issue.

jee1mr commented 8 years ago

@yejingx @wvanbergen Yes, that commit fixed it for me too. Thanks.

heipacker commented 7 years ago

Is this bug fixed? Why does this problem occur?

chennqqi commented 6 years ago

I still have this problem.

XuejiaoZhang commented 6 years ago

In our case, after a Zookeeper i/o timeout, ConsumerGroup.Errors() yields the error "error while consuming ${topic}/${partition}: Cannot release partition: it is not claimed by this instance", and the consumer reconnects to Zookeeper. Sometimes consuming stops afterwards, but not always; in other cases it continues to consume.

@wvanbergen Hi. Any thoughts on this?

ShawnHsiung commented 5 years ago

@XuejiaoZhang I'm hitting the same issue as you. Did you solve it?

XuejiaoZhang commented 5 years ago

Unfortunately, no.

isogram commented 5 years ago

Any updates on this? I have the same problem here. My temporary workaround is to re-run or rebuild the application.

wvanbergen commented 5 years ago

Hello. I am no longer actively working on this library myself, but if somebody has a fix I will gladly merge a PR.