yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[DocDb] client: reduce excessive retries on NotFound errors #5932

Open jaki opened 3 years ago

jaki commented 3 years ago

Jira Link: DB-10827

There are some cases where client RetryFunc keeps retrying RPCs that keep returning NotFound. For example, for YBClient::Data::IsCreateTableInProgress, it may get NotFound from master because

In the first case, it is good to keep retrying while receiving NotFound because we expect the table to eventually get created. In the second case, it is useless to keep retrying because the table is dead. The second case wastes the entire deadline, which defaults to 120 seconds.
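A minimal sketch of the retry behavior we would want, assuming the client could somehow tell the two NotFound cases apart. None of these names (`PollResult`, `WaitOutcome`, `RetryUntil`) exist in YugabyteDB; they are made up for illustration, and the real RetryFunc has a different signature.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical result of one poll. These are NOT YugabyteDB types.
enum class PollResult {
  kDone,             // the table exists and the operation finished
  kNotFoundYet,      // NotFound, but the table may still appear: keep retrying
  kNotFoundForever,  // NotFound because the table is dead: retrying is useless
};

enum class WaitOutcome { kSucceeded, kGaveUpEarly, kTimedOut };

// Calls poll() until it gives a final answer or the deadline would be exceeded.
// A terminal NotFound returns right away instead of burning the whole deadline.
WaitOutcome RetryUntil(std::chrono::steady_clock::time_point deadline,
                       std::chrono::milliseconds delay,
                       const std::function<PollResult()>& poll) {
  for (;;) {
    switch (poll()) {
      case PollResult::kDone:
        return WaitOutcome::kSucceeded;
      case PollResult::kNotFoundForever:
        return WaitOutcome::kGaveUpEarly;
      case PollResult::kNotFoundYet:
        break;  // fall out of the switch and retry after a delay
    }
    if (std::chrono::steady_clock::now() + delay >= deadline) {
      return WaitOutcome::kTimedOut;
    }
    std::this_thread::sleep_for(delay);
  }
}
```

With a 120 second deadline, a poll that reports `kNotFoundForever` on its third attempt returns after roughly two delays rather than after the full deadline. The hard part, as noted below, is producing that distinction in the first place.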

This is particularly a pain for index backfill YBClient::Data::WaitUntilIndexPermissionsAtLeast. If we wait on some permission like READ_WRITE_AND_DELETE, we could have

Distinguishing between the two NotFound cases is hard, especially when the permission changes can just finish in a snap.

I think this can be improved:

To avoid making the issue too large, I say it should be good to close when 2 of 3 above items are done. It can just be for one client function. Things that aren't covered can get separate smaller issues created for them.

tedyu commented 3 years ago

> don't delete the table until the waiters are done

Perhaps this should be bounded by a certain time limit. Otherwise the table might not be deleted for an extended period of time.
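A rough sketch of what such a bound could look like, assuming deletion is deferred behind a simple waiter count. `WaiterGate` and its methods are hypothetical names, not anything in the YugabyteDB code base, and how the master actually tracks waiters is not covered here.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical helper: tracks clients waiting on a table so that deletion
// can be deferred for them, but only up to a time limit.
class WaiterGate {
 public:
  void Enter() {
    std::lock_guard<std::mutex> lock(mu_);
    ++waiters_;
  }

  void Leave() {
    std::lock_guard<std::mutex> lock(mu_);
    if (--waiters_ == 0) cv_.notify_all();
  }

  // Blocks until there are no waiters or `limit` elapses.
  // Returns true if all waiters finished, false if the limit was hit.
  // The caller deletes the table in either case.
  bool WaitForDrain(std::chrono::milliseconds limit) {
    std::unique_lock<std::mutex> lock(mu_);
    return cv_.wait_for(lock, limit, [this] { return waiters_ == 0; });
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  int waiters_ = 0;
};
```

The return value only says whether the wait was cut short, so a stuck or slow waiter delays the delete by at most `limit`.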