Jira Link: DB-10827
In some cases, the client's RetryFunc keeps retrying RPCs that keep returning NotFound. For example, YBClient::Data::IsCreateTableInProgress may get NotFound from master for either of two reasons:
1. the table hasn't been created yet
2. the table failed to create and was deleted
In the first case, it is good to keep retrying on NotFound because we expect the table to eventually get created. In the second case, retrying is useless because the table is gone for good, yet the client still waits out the full deadline (120 seconds by default) before giving up.
This is particularly painful for index backfill, in YBClient::Data::WaitUntilIndexPermissionsAtLeast. If we wait on a permission such as READ_WRITE_AND_DELETE, the index can be in any of these states:
- the index isn't even created yet: NotFound that should be retried
- the index is at an earlier permission: retry
- the index is at that permission: done
- the index is past that permission: done
- the index is deleted: NotFound that shouldn't be retried
Distinguishing between the two NotFound cases is hard, especially since the permission changes can complete almost instantly.
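To make the ambiguity concrete, the five outcomes above can be sketched as a small classification function. This is only an illustration, not the client's actual code: `Permission`, `PollResult`, `Outcome`, and `ClassifyIndexPoll` are hypothetical names, and the permission values are placeholders that only need to be ordered.

```cpp
#include <cassert>

// Hypothetical, simplified permission levels. Only their relative order
// matters here; these are not the values YugabyteDB uses.
enum Permission : int {
  kDeleteOnly = 0,
  kWriteAndDelete = 1,
  kReadWriteAndDelete = 2,
  kLaterState = 3,  // placeholder for any permission past the target
};

// Result of one poll. `found` is false when master answered NotFound,
// in which case `permission` is meaningless.
struct PollResult {
  bool found;
  Permission permission;
};

enum class Outcome { kRetry, kDone, kAmbiguousNotFound };

inline Outcome ClassifyIndexPoll(PollResult poll, Permission target) {
  if (!poll.found) {
    // Either not created yet (should retry) or already deleted (should stop).
    // A single response carries no information to tell these apart.
    return Outcome::kAmbiguousNotFound;
  }
  // At or past the target permission: done. Before it: keep polling.
  return poll.permission >= target ? Outcome::kDone : Outcome::kRetry;
}

// Usage example covering the three outcomes.
inline void ExampleClassifyIndexPoll() {
  assert(ClassifyIndexPoll({true, kWriteAndDelete}, kReadWriteAndDelete) == Outcome::kRetry);
  assert(ClassifyIndexPoll({true, kLaterState}, kReadWriteAndDelete) == Outcome::kDone);
  assert(ClassifyIndexPoll({false, kDeleteOnly}, kReadWriteAndDelete) ==
         Outcome::kAmbiguousNotFound);
}
```

The three found cases resolve from the response alone. Both NotFound cases collapse into `kAmbiguousNotFound`, which is why extra state or a richer status from master is needed.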
I think this can be improved in the following ways:
1. Keep track of whether we ever saw the table live so that if we get a NotFound, we can safely say it was deleted rather than not yet created (there will still be cases where things happen so quickly that we never see the table before it gets deleted)
2. Check the status for select fatal errors like we do in src/yb/master/async_rpc_tasks.cc, and don't retry when we see one of those (this is a generic enhancement)
3. (Stretch) Make master aware of waiters (e.g. a client calling IsCreateTableInProgress) and don't delete the table until the waiters are done, instead marking the table as deleted. That way, we can distinguish the NotFound from before creation from some other status/message for deletion. (This should solve the problem, but it opens up other handling concerns.)
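As a rough sketch of how the first two items above could combine on the client side, the tracker below remembers whether any earlier poll found the table and stops retrying on a small set of fatal codes. Everything in it is an assumption made for illustration: `Code`, `Decision`, `IsFatal`, and `TableWaitTracker` are hypothetical names, and the choice of which codes count as fatal is a placeholder, not the list used in src/yb/master/async_rpc_tasks.cc.

```cpp
#include <cassert>

// Hypothetical status codes standing in for what a master RPC may return.
enum class Code { kOk, kNotFound, kTimedOut, kInvalidArgument, kNotAuthorized };

enum class Decision { kDone, kRetry, kStop };

// Item 2: codes that should never be retried. The membership of this set is
// an assumption for the sketch.
inline bool IsFatal(Code code) {
  return code == Code::kInvalidArgument || code == Code::kNotAuthorized;
}

// Item 1: remembers whether an earlier poll found the table, so that a later
// NotFound can be read as "deleted" instead of "not created yet".
class TableWaitTracker {
 public:
  // `create_done` is only meaningful when `code` is kOk.
  Decision OnResponse(Code code, bool create_done) {
    if (IsFatal(code)) return Decision::kStop;
    if (code == Code::kNotFound) {
      // Seen before and now missing: treat as deleted and stop retrying.
      return seen_live_ ? Decision::kStop : Decision::kRetry;
    }
    if (code == Code::kOk) {
      seen_live_ = true;
      return create_done ? Decision::kDone : Decision::kRetry;
    }
    return Decision::kRetry;  // other errors are treated as transient
  }

 private:
  bool seen_live_ = false;
};

// Usage example: the table shows up once, then disappears.
inline void ExampleCreatedThenDeleted() {
  TableWaitTracker tracker;
  assert(tracker.OnResponse(Code::kNotFound, false) == Decision::kRetry);
  assert(tracker.OnResponse(Code::kOk, false) == Decision::kRetry);
  assert(tracker.OnResponse(Code::kNotFound, false) == Decision::kStop);
}
```

As the first item already notes, this does not help when the table is created and deleted between two polls: `seen_live_` is never set, so the tracker keeps retrying until the deadline.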
To avoid making the issue too large, I say it can be closed when two of the three items above are done. It can be for just one client function. Anything that isn't covered can get its own smaller issue.