Unexpected failure during shutdown after disconnect from a restarting Broker

If this is a bug report, please fill out the following:

Version of Ruby: 2.6.x, 2.7.x, possibly 2.x
Version of Kafka: Confluent Cloud Kafka; as far as we can tell, this is 3.2.
Version of ruby-kafka: 1.4.0

Please verify that the problem you're seeing hasn't been fixed by the current `master` of ruby-kafka.

As far as we can tell, 1.5 does not introduce any fixes related to this.

Steps to reproduce

Broker restart seemed to cause this issue.

We did not see this problem with our previous vendor, where we were also not using SASL credentials.

Now we see this issue and we are using SASL PLAIN with TLS enabled.

We have been unable to reproduce this locally, and we do not want to attempt to reproduce it in the live environment at this time.

Expected outcome

Client shuts down cleanly (basically, letting the pod restart the process on its own). Alternatively, the client recovers cleanly and reconnects to the broker.

Actual outcome

A rolling restart of the brokers (per the vendor's upgrade policy) caused the following sequence of events in all of our ruby-kafka clients:

Error committing offsets: Kafka::NotCoordinatorForGroup
Error sending heartbeat: Kafka::RebalanceInProgress.
Failed to fetch from events/1: Kafka::NotLeaderForPartition
Error committing offsets: Kafka::NotCoordinatorForGroup
Error committing offsets: Connection error EOFError: end of file reached
ruby-kafka-1.4.0/lib/kafka/ssl_socket_with_timeout.rb:69:in `connect_nonblock': SSL_connect SYSCALL returned=5 errno=0 state=SSLv3/TLS write client hello (OpenSSL::SSL::SSLError)
(the Ruby consumer must have attempted to shut down after some retries)

undefined method `join' for nil:NilClass ...

We think a simple fix could be this: https://github.com/zendesk/ruby-kafka/pull/959

However, we are not entirely familiar with the ramifications of this change, or even with how @thread could have been nil in the fetcher to begin with. We suspect it has something to do with using SASL with TLS (SSL certificates from the system).

In any event, skipping the thread join when the thread does not exist might let the process finish shutting down.
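
To illustrate the idea, here is a minimal sketch of a nil-guarded join. `SketchFetcher` is a hypothetical stand-in written for this report, not ruby-kafka's actual fetcher, and it only assumes that the failure comes from calling `join` on an `@thread` that was never assigned:

```ruby
# Hypothetical stand-in for a background fetcher; NOT ruby-kafka's real class.
class SketchFetcher
  def initialize
    @thread = nil     # stays nil until #start has run
    @running = false
  end

  def start
    @running = true
    # Background loop that exits once @running is cleared.
    @thread = Thread.new { sleep 0.01 while @running }
  end

  def stop
    @running = false
    # Safe navigation: skip the join when no thread was ever created,
    # instead of raising NoMethodError on nil.
    @thread&.join
    @thread = nil
    :stopped
  end
end

# Stopping a fetcher that never started no longer raises.
never_started = SketchFetcher.new
never_started.stop
```

With an unguarded `@thread.join`, the last line would raise the same kind of NoMethodError shown in the log above. Whether the guard is the right fix, or only hides the reason `@thread` was nil, is something we cannot judge.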
