zendesk / ruby-kafka

A Ruby client library for Apache Kafka
http://www.rubydoc.info/gems/ruby-kafka
Apache License 2.0

Unexpected failure during shutdown after disconnect from a restarting Broker #960

Closed oriahu closed 1 year ago

oriahu commented 2 years ago

If this is a bug report, please fill out the following:

Please verify that the problem you're seeing hasn't been fixed by the current master of ruby-kafka.

As far as I can tell, 1.5 does not introduce any fixes related to this.

Steps to reproduce

A broker restart seemed to cause this issue.

We did not see this problem with our previous vendor, where we were also not using SASL credentials.

Now we see this issue, and we are using SASL PLAIN with TLS enabled.

We have been unable to reproduce it locally, and do not want to attempt to reproduce it in the live environment at this time.

Expected outcome

The client shuts down cleanly (basically, letting the pod restart the process on its own). Alternatively, the client recovers cleanly and reconnects to the broker.

Actual outcome

A rolling restart of the brokers (per their upgrade policy) caused the following sequence of events in all of our ruby-kafka clients:

Error committing offsets: Kafka::NotCoordinatorForGroup
Error sending heartbeat: Kafka::RebalanceInProgress.
Failed to fetch from events/1: Kafka::NotLeaderForPartition
Error committing offsets: Kafka::NotCoordinatorForGroup
Error committing offsets: Connection error EOFError: end of file reached
ruby-kafka-1.4.0/lib/kafka/ssl_socket_with_timeout.rb:69:in `connect_nonblock': SSL_connect SYSCALL returned=5 errno=0 state=SSLv3/TLS write client hello (OpenSSL::SSL::SSLError)
(the Ruby consumer must have attempted to shut down after some retries)

undefined method `join' for nil:NilClass ...

We think a simple fix could be this: https://github.com/zendesk/ruby-kafka/pull/959

However, we are not entirely familiar with the ramifications of this change, or even with how @thread could have been nil in the fetcher to begin with. We think it has something to do with using SASL with TLS (SSL cert from the system).

In any event, perhaps skipping thread join when thread doesn't exist can let the process finish shutting down.
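
To illustrate the idea, here is a minimal sketch of a nil-guarded join. `SketchFetcher` is a hypothetical stand-in written for this issue, not ruby-kafka's actual fetcher class, and its `start`/`stop`/`running?` methods are assumptions made only to show the pattern:

```ruby
# Hypothetical stand-in for a background fetcher; NOT ruby-kafka's real class.
class SketchFetcher
  def initialize
    @thread = nil      # stays nil until start is called
    @running = false
  end

  # Spawn the background thread once.
  def start
    return if @running

    @running = true
    @thread = Thread.new do
      sleep 0.01 while @running   # placeholder for the real fetch loop
    end
  end

  def running?
    @running
  end

  # Stop the loop and wait for the thread, if one was ever created.
  def stop
    @running = false
    @thread&.join      # safe navigation: skips join when @thread is nil
    @thread = nil
    :stopped
  end
end

fetcher = SketchFetcher.new
fetcher.stop           # never started, so there is no thread to join
fetcher.start
fetcher.stop           # joins the running thread, then returns
```

With a plain `@thread.join`, the first `stop` call above would raise the same `NoMethodError` on nil that we saw. Whether skipping the join hides a deeper problem (such as why the thread was never started) is something we cannot judge from this sketch.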

github-actions[bot] commented 1 year ago

Issue has been marked as stale due to a lack of activity.