Closed PetrGlad closed 6 years ago
If there are no objections I probably can implement this myself.
Thank you @PetrGlad for your suggestion.
We reviewed it internally, but decided that it would be better not to implement this request. Here are the reasons that motivate this decision:
starting_offset is 1 position earlier than oldest_available_offset
BEGIN is already available, which has the added benefit of never being out of date. By using a concrete offset, you always run the risk that it will expire between the moment you get the offset from Nakadi and the moment you request events from this offset.
Nevertheless, your proposal highlights a rough edge in Nakadi, and we will aim to improve our API and/or documentation to make its usage easier. If you have questions and/or alternative suggestions, we will of course be glad to help and to consider alternatives carefully. Feel free to re-open this issue if you would like to discuss it further.
I think Nakadi should take responsibility for isolating clients from backend details and present a consistent view of the event stream in any case.
Yes, as I said in my description, the new attribute does risk confusing users. It is only one of the possible backwards-compatible API changes, and I think there could be alternative solutions.
In our case, BEGIN always being up to date is a disadvantage. We do want to fail fast if we miss data, and to see exactly how many events we have lost.
I hope you're OK with us doing things like
    if (oldestCursor.offset().endsWith("--1"))
        oldestCursor
    else
        subscriptionManager.shiftOffsetBackByOne(oldestCursor)
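For readers who want to try the same workaround, here is a minimal Python sketch of the snippet above. The `Cursor` type and the `shift` callable are hypothetical: `shift` stands in for whatever performs the actual shift (for example a call to shifted-cursors), and the `"--1"` suffix check is copied from the snippet rather than taken from any specification.

```python
from typing import Callable, NamedTuple


class Cursor(NamedTuple):
    """Hypothetical cursor type: a partition id plus an opaque offset string."""
    partition: str
    offset: str


# Hypothetical stand-in for a remote call that shifts a cursor by n positions.
ShiftFn = Callable[[Cursor, int], Cursor]


def starting_cursor(oldest: Cursor, shift: ShiftFn) -> Cursor:
    """Turn an inclusive oldest cursor into a 'read everything after this' cursor."""
    # Assumption taken from the snippet above: an offset ending in "--1"
    # already points before the first event, so it must not be shifted again.
    if oldest.offset.endswith("--1"):
        return oldest
    return shift(oldest, -1)
```

Passing the shift operation in as a parameter keeps the offset opaque to the client; the client never parses or decrements the offset string itself.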
Now that we're working on a new version of Diga, I think that we're going to either use the lower-level API or create subscriptions every time we start a stream. That would allow us to avoid offset resets altogether.
This is an API improvement. Our Nakadi client archives incoming event streams in external persistent storage. Our goal is to store all events without data loss.
For our use cases we would like to get a starting offset for an event partition that is consistent with other parts of the Nakadi API. Namely, in other parts of the API the starting offset is "offset of first event - 1". But the information about an event type partition from partition_get provides oldest_available_offset, which is an offset that points exactly to the first available event. So to get a starting stream offset that is consistent with other parts of the API, we would need to use cursor arithmetic to subtract "1" from it.
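To make the off-by-one concrete, here is a toy Python model of the two conventions. It is only an illustration: offsets are modelled as plain integers, which is an assumption (real Nakadi offsets are opaque strings), and `read_after` is a hypothetical helper, not part of any client library.

```python
from typing import List


def read_after(offsets: List[int], cursor: int) -> List[int]:
    """Toy stream read: return the events strictly after the given cursor."""
    return [o for o in offsets if o > cursor]


available = [5, 6, 7]                    # offsets still retained in the partition
oldest_available_offset = available[0]   # inclusive: points at the first event

# Using the inclusive offset as a stream cursor skips the oldest event.
print(read_after(available, oldest_available_offset))      # [6, 7]

# Subtracting 1 gives the exclusive starting offset and returns everything.
print(read_after(available, oldest_available_offset - 1))  # [5, 6, 7]
```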
We know that there is a placeholder offset "begin" that can be used to specify the oldest offset of the stream, whatever it is. But in our case we would like to
In the latter case we can work around this by subtracting 1 from the distance after comparing cursors with cursor-distances. But for checkpointing we use the starting offset, for instance, as an offset marker for stored data. Normally we use the offset of the last event in the last received batch for this. But after a restart we might find that older events have already been discarded, and then we would still like to have the actual value of the "begin" offset for the same purpose.
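The subtract-one adjustment mentioned above can be sketched as follows. This is a sketch under assumptions: the distance is taken to be the number of positions from our checkpoint (the last stored event) to oldest_available_offset, as a cursor-distances comparison would report it, and `lost_events` is a hypothetical helper name.

```python
def lost_events(distance_checkpoint_to_oldest: int) -> int:
    """Number of events that expired between our checkpoint and the oldest event.

    The checkpoint is the offset of the last event we stored, and
    oldest_available_offset points at the first event still available.
    A distance of 1 means the oldest event directly follows the checkpoint,
    so nothing was lost; hence the "- 1".
    """
    # A zero or negative distance means the checkpoint is still inside the
    # retained range, so nothing has been lost.
    return max(distance_checkpoint_to_oldest - 1, 0)
```

For example, with a checkpoint at position 10 and the oldest available event at position 14, the distance is 4 and events 11, 12 and 13 are lost, so the function returns 3. With an exclusive starting offset, the distance itself would already be the number of lost events, and the special case would disappear.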
We can use Partition's oldest_available_offset as the stream start, but then we would lose the oldest event if we reset to this position, and this would introduce inconsistencies in our completeness checks, where we detect data loss. This problem is especially visible for infrequent events that are generated at intervals comparable to or longer than the retention time. For such events the available event range normally consists of 1 or 2 events, or is empty.
In all our cases, the inclusive beginning offset is an additional special case that has to be handled separately in a Nakadi client's code, e.g. in starting offset comparison, in the empty or 1-event stream case, and so on. It also requires us to use cursor arithmetic with shifted-cursors to work around this inconsistency.
To make the change backwards compatible, I suggest adding a new attribute to the Partition description that would point to "oldest_available_offset - 1". The attribute name could be, say, starting_offset.
To clarify: "begin" value at the moment