Open myazinn opened 1 week ago
Hi @myazinn . Thanks for the extensive report! Very useful.
This is complicated stuff; I have to think about it more before I can give a good answer. For now: because zio-kafka doesn't run the streams (the user does), zio-kafka does not have full control over them. In particular, we cannot interrupt the stream at any moment, only when it needs to fetch more records. In the example this happens every 5 records, which explains why offset 0 is committed even though processing took 10 seconds. Unfortunately, the commit does not complete, otherwise the retry would skip offset 0. I have to think a bit more about why the first consumer does not close completely and even tries to complete the commit (and eventually fails).
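The chunk-boundary limitation can be modeled outside zio-kafka. In this plain-Scala sketch (all names illustrative, not zio-kafka internals), the interruption flag is checked only when the next batch is fetched, so an interruption requested while record 0 is being handled still lets the whole 5-record batch finish and commit:

```scala
object ChunkBoundaryInterrupt {
  // Processes batches, checking `interrupted` ONLY between batches -- like a
  // stream that can be stopped only when it needs to fetch more records.
  def run(batches: Iterator[Seq[Int]], interrupted: () => Boolean)(
      handle: Int => Unit): List[Int] = {
    val committed = List.newBuilder[Int]
    while (!interrupted() && batches.hasNext) {
      batches.next().foreach { offset =>
        handle(offset)      // may take arbitrarily long...
        committed += offset // ...and its offset is still committed
      }
    }
    committed.result()
  }

  def main(args: Array[String]): Unit = {
    var interruptRequested = false
    val committed = run(
      Iterator(Seq(0, 1, 2, 3, 4), Seq(5, 6, 7, 8, 9)),
      () => interruptRequested
    ) { offset =>
      // Pretend max.poll.interval.ms is exceeded while handling record 0.
      if (offset == 0) interruptRequested = true
    }
    println(committed) // List(0, 1, 2, 3, 4): the in-flight batch finishes anyway
  }
}
```

The second batch is never fetched, but everything already in flight runs to completion, mirroring why offset 0 gets committed despite the slow handler.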
As a quick fix: it is perfectly fine to raise maxPollInterval. Set it to a couple of hours if you need to, a day, 2 days, all fine. The only downside is that the broker will take longer to detect a deadlocked consumer. But hopefully you already have other guards against that.
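For illustration, that tweak might look like this (a sketch assuming a recent zio-kafka 2.x ConsumerSettings API; the bootstrap servers, group id, and the exact withMaxPollInterval method name are assumptions):

```scala
import zio.durationInt
import zio.kafka.consumer.ConsumerSettings

// Sketch: raise max.poll.interval.ms well above the slowest expected
// handler. The only cost is that the broker takes longer to detect a
// genuinely stuck consumer.
val settings: ConsumerSettings =
  ConsumerSettings(List("localhost:9092")) // illustrative bootstrap servers
    .withGroupId("my-group")               // illustrative group id
    .withMaxPollInterval(2.hours)          // broker default is 5 minutes
```

On versions without such a helper, setting the raw property (e.g. via withProperty with key "max.poll.interval.ms") should have the same effect.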
Thanks @erikvanoosten! Yeah, that's what we've decided to do as well (increasing maxPollInterval, and also removing .retry on the stream just in case).
I understand that it's quite tricky to interrupt a stream you don't fully control. One solution I can think of is to expose some Promise that the user can race against to interrupt their own code immediately, but it seems like an awkward API and it's unlikely anyone would even know about it. So I just hope that eventually someone will find a solution that works automatically
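For what it's worth, the Promise idea could be sketched like this (a purely hypothetical API, modeled with scala.concurrent primitives instead of ZIO's Promise and race so the example stays self-contained; none of these names are real zio-kafka API):

```scala
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Hypothetical: suppose the library exposed a "shutdown signal" that user
// code can race against, so a long-running handler stops promptly when the
// consumer is being torn down.
object ShutdownRaceSketch {
  // The signal the library would complete when tearing the consumer down.
  val shutdownSignal: Promise[Unit] = Promise[Unit]()

  // User handler: races possibly-slow work against the shutdown signal.
  def handle(record: String): Future[String] =
    Future.firstCompletedOf(Seq(
      Future { Thread.sleep(2000); s"processed $record" }, // slow work
      shutdownSignal.future.map(_ => s"aborted $record")   // shutdown wins
    ))

  def main(args: Array[String]): Unit = {
    val result = handle("offset-0")
    shutdownSignal.trySuccess(()) // the library signals shutdown
    println(Await.result(result, 10.seconds)) // prints: aborted offset-0
  }
}
```

The awkwardness is visible even in the sketch: every handler has to remember to race against the signal, which is exactly why it's unlikely anyone would discover or use such an API.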
Here's a self-contained test that reproduces the issue
It models a scenario in which event handling takes more time than the max.poll.interval.ms value (which happens to us from time to time :( ). The expected behaviour seems to be that once the poll interval is exceeded, the stream is interrupted along with all child fibers. So each time we "retry" the Kafka stream, it should behave as if it starts from scratch and nothing happened. But that's not what we see here. Once max.poll.interval.ms is exceeded, zio-kafka "forgets" the stream and re-subscribes to Kafka. But the "forgotten" stream is not actually dead and messes with the RunLoop state. Here's what you'll get if you run the test

Note that:
1) We commit a record with offset 0 twice, even though the first iteration should never have gotten that far.
2) That commit from the first iteration eventually breaks the working stream with a mysterious CommitTimeout exception. The failure itself could be OK, but the exact exception is extremely misleading.

And there's more. In a more "real-world" scenario it can lead to the stream hanging indefinitely. Here's how (leaving only the important part)
Here we have two topics: one is perfectly fine, and the second fails on the first iteration. Eventually that "broken" iteration fails both streams, and the test never completes. Here's what you'll get when you run the test
And that's what we've actually encountered :( It seems to be caused by a workaround for this issue, but I'm not sure. Let me know if you need anything else from my side