Open josdirksen opened 4 weeks ago
Thanks @josdirksen, that is some awesome spelunking there. I think you have found all the right places in the code, and your analysis also seems correct.
We can fix `requestAndAwaitData` by racing together with `interruptionPromise.await`:
    requestAndAwaitData =
      for {
        _     <- commandQueue.offer(RunloopCommand.Request(tp))
        _     <- diagnostics.emit(DiagnosticEvent.Request(tp))
        taken <- dataQueue
                   .takeBetween(1, Int.MaxValue)
                   .race(interruptionPromise.await) // <-- added race here
      } yield taken
should do it.
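To make the idea concrete, here is a minimal, self-contained sketch of the racing approach, assuming ZIO 2 on the classpath. The names `LostPartitionSketch`, `awaitData` and `demo` are made up for illustration and are not zio-kafka code, and the queue holds plain `Int`s instead of records. One deliberate difference from the snippet above: the sketch uses `raceFirst` rather than `race`. As far as I understand ZIO's semantics, `race` yields the first side to *succeed*, so with a promise that only ever fails it would keep waiting on the queue, whereas `raceFirst` yields the first side to complete either way. Treat that as something to verify, not as the final fix.

```scala
import zio._

object LostPartitionSketch extends ZIOAppDefault {

  // Hypothetical stand-in for requestAndAwaitData: block until at least one
  // element arrives, unless the interruption promise fails first.
  def awaitData(
    dataQueue: Queue[Int],
    interruption: Promise[Throwable, Nothing]
  ): Task[Chunk[Int]] =
    dataQueue
      .takeBetween(1, Int.MaxValue)
      .raceFirst(interruption.await)

  // Start waiting on an empty queue, then fail the promise the way a lost
  // partition would, and report how the waiting fiber ended.
  val demo: UIO[Exit[Throwable, Chunk[Int]]] =
    for {
      dataQueue    <- Queue.unbounded[Int]
      interruption <- Promise.make[Throwable, Nothing]
      waiter       <- awaitData(dataQueue, interruption).fork
      _            <- interruption.fail(new RuntimeException("partition lost"))
      exit         <- waiter.await
    } yield exit

  def run =
    demo.flatMap(exit => Console.printLine(s"awaitData ended with: $exit"))
}
```

In this sketch the waiting fiber ends with a failure carrying the "partition lost" exception even though nothing was ever offered to the queue, which is the behaviour we want from the lost flow.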
However, I am beginning to wonder if we should fail the stream like this at all! It seems that lost partitions are more common than we thought. I was always working under the assumption that lost partitions are 'end of the world' type of situations, e.g. network splits, or the network being lost for a long time, where any processing that is still going on should be aborted ASAP.
Perhaps we should return to the situation we had before, where we treated a lost partition the same as a revoked partition. OR, we could treat it as a revoked partition when the internal queues are empty anyway... 🤔
This might be related to the https://github.com/zio/zio-kafka/issues/1233 issue, but over the last couple of weeks / months we have seen issues where, after a partition is lost, it isn't recovering correctly. We've tried to analyze and debug it, but it occurs so infrequently that we haven't been able to isolate it.
By analyzing the code we might have identified the reason, but there is so much async stuff happening there, that we might be interpreting stuff wrongly.
What happens in our case is the following:
For partitions that are revoked everything seems to be working correctly though.
What we see as a possible cause for this is the following. In the `Runloop` this happens for lost partitions:
Resulting in this call in the `PartitionStreamControl`:
Looking at the way the `interruptionPromise` is handled, this doesn't seem to work correctly when there are no records to be processed. In `PartitionStreamControl` we've got this repeating effect:
And here the `interruptionPromise` is checked to see if we need to interrupt this effect. But how would this work if there are no active chunks to process? The `requestAndAwaitData` function:
Blocks the current fiber until at least 1 element is taken. So when the `lost` function fails the promise, that promise is never checked, since there are no records coming in on the `dataQueue` (or I'm reading stuff wrong here, which is of course also possible).
For the revoke flow, the `dataQueue` gets an additional `Take.end`, to get out of the `requestAndAwaitData` wait state. But that doesn't happen for the `lost` scenario.
So, shouldn't the code for lost also make sure the `dataQueue` at least gets some value, since it seems to be stuck in the `requestAndAwaitData` loop indefinitely?
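To illustrate the difference between the two flows, here is a small sketch of the wake-up mechanism described above, assuming ZIO 2 with zio-streams on the classpath. It is not the zio-kafka code: `RevokeWakeUpSketch` and `demo` are hypothetical names, and the queue carries `Take[Throwable, Int]` instead of real records. A fiber blocks in `takeBetween(1, Int.MaxValue)` on an empty queue and only gets out once an end marker is offered, which is what the revoke flow does and the lost flow (as I read it) does not.

```scala
import zio._
import zio.stream.Take

object RevokeWakeUpSketch extends ZIOAppDefault {

  // Yields a pair:
  //   _1: was the waiting fiber still blocked before the end marker was offered?
  //   _2: did the waiting fiber receive the end marker afterwards?
  val demo: UIO[(Boolean, Boolean)] =
    for {
      dataQueue <- Queue.unbounded[Take[Throwable, Int]]
      // Same blocking call as in requestAndAwaitData, on an empty queue.
      waiter    <- dataQueue.takeBetween(1, Int.MaxValue).fork
      _         <- ZIO.sleep(100.millis)
      // poll returns None while the fiber has not finished yet.
      before    <- waiter.poll
      // The revoke flow offers an end marker, which unblocks the waiter.
      _         <- dataQueue.offer(Take.end)
      taken     <- waiter.join
    } yield (before.isEmpty, taken.exists(_.isDone))

  def run =
    demo.flatMap { case (blocked, gotEnd) =>
      Console.printLine(s"blocked before Take.end: $blocked, received end marker: $gotEnd")
    }
}
```

Without the `offer(Take.end)` line the `join` would never return, which matches the stuck state we seem to end up in after a lost partition.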