Before 2.7.0, a lost partition was treated as a revoked partition. Because a lost partition is already assigned to another node, this could lead to duplicate processing of records.
Zio-kafka 2.7.0 assumes that a lost partition is a fatal event. It leads to an interrupt in the stream that handles the partition. The other streams are ended, and the consumer closes with an error. Usually, a full program restart is needed to resume consuming.
Note that stream processing is not interrupted immediately: the interrupt is observed only when the stream requests new records. Unfortunately, we have not found a clean way to interrupt the stream consumer directly.
Meanwhile, from bug reports (#1233, #1250), we understand that partitions are usually lost when no records have been received for a long time.
In conclusion, 1) it is not possible to immediately interrupt user stream processing, and 2) it is most likely not needed anyway because the stream is already done processing and awaiting more records.
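The reasoning above can be illustrated with a small model. This is only a sketch in plain Scala and not zio-kafka code: `PartitionStreamModel`, `markLost`, and `pull` are hypothetical names that stand in for a pull-based partition stream. It shows that a loss signalled during processing is only seen at the next request for records.

```scala
import scala.collection.mutable

// Hypothetical model, not zio-kafka code: a partition stream that hands out
// records on request and reports a lost partition only at the next request.
final class PartitionStreamModel(records: Seq[String]) {
  private val pending = mutable.Queue.from(records)
  private var lost    = false

  // In this model, called when the partition is reported as lost.
  def markLost(): Unit = lost = true

  // The stream consumer calls this to request the next record.
  // Left = interrupted because the partition was lost,
  // Right = the next record, or None when nothing is pending.
  def pull(): Either[String, Option[String]] =
    if (lost) Left("partition lost")
    else Right(if (pending.isEmpty) None else Some(pending.dequeue()))
}

object PullModelDemo {
  def run(): List[String] = {
    val stream = new PartitionStreamModel(Seq("r1", "r2"))
    val log    = mutable.ListBuffer.empty[String]

    stream.pull() match {
      case Right(Some(record)) =>
        // The partition is lost while user code is still working on the record.
        stream.markLost()
        log += s"processed $record" // processing still runs to completion
      case other =>
        log += s"unexpected: $other"
    }

    // Only the next request for records observes the interrupt.
    stream.pull() match {
      case Left(reason) => log += s"interrupted: $reason"
      case Right(next)  => log += s"got: $next"
    }

    log.toList
  }
}
```

In this model, `PullModelDemo.run()` returns `List("processed r1", "interrupted: partition lost")`: the record that was in flight is fully processed, and `r2` is never handed out.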
With this change, a lost partition no longer leads to an interrupt. Instead, we first drain the stream's internal queue (just to be sure; it is probably already empty), and then we end the stream gracefully (that is, without error, as we do with revoked partitions). Other streams are not affected, and the consumer continues to work.
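A minimal sketch of the new handling, again as a plain Scala model rather than the actual implementation. `PartitionQueueModel` and its methods are hypothetical, and the sketch assumes that "drain" means discarding any records still queued.

```scala
import scala.collection.mutable

// Hypothetical model of one partition stream's internal queue; not zio-kafka code.
final class PartitionQueueModel {
  private val queue = mutable.Queue.empty[String]
  private var ended = false

  // Records are only accepted while the stream has not ended.
  def offer(record: String): Unit = if (!ended) queue.enqueue(record)

  // Lost partition: drain the queue (assumption: queued records are discarded),
  // then end the stream without an error. Returns the number of drained records.
  def lost(): Int = {
    val drained = queue.size
    queue.clear()
    ended = true
    drained
  }

  // Next record for the stream, or None when the queue is empty.
  def poll(): Option[String] = if (queue.isEmpty) None else Some(queue.dequeue())

  def isEnded: Boolean = ended
}

object LostPartitionDemo {
  def run(): (Int, Boolean, Boolean, Option[String]) = {
    val p0 = new PartitionQueueModel
    val p1 = new PartitionQueueModel
    p0.offer("a")
    p1.offer("b")

    // Only partition 0 is lost; partition 1 is left untouched.
    val drained = p0.lost()
    (drained, p0.isEnded, p1.isEnded, p1.poll())
  }
}
```

In this model, `LostPartitionDemo.run()` returns `(1, true, false, Some("b"))`: the lost partition's queue is emptied and its stream ends, while the other partition keeps delivering records.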
Lost partitions do not affect the features rebalanceSafeCommits and restartStreamsOnRebalancing; they do not hold up a rebalance waiting for commits to complete, and they do not lead to restarts of other streams.
Since we currently have no way to test lost partitions, there is no change to the tests.
Fixes #1233 and #1250.