Open hulien22 opened 1 year ago
I searched our code base for load_balancer_initial_delay_secs
, and found a declared Google flag but no actual uses as far as I can tell except in tests.
Thus, I think this flag does nothing and the issue is already "done".
Shall I deprecate this flag using?
// In order to mark a flag as deprecated, use this macro:
// DEPRECATE_FLAG(int32, foo_flag, "10_2022")
// This will print a warning at startup if a process has been configured with the specified flag or
// any time the flags are updated at runtime with this flag.
//
// The third argument is a date in the format of "MM_YYYY" to make it easy to track when a flag was
// deprecated, so we may fully remove declarations in future releases.
// Flags are set to a dummy value
#define DEPRECATE_FLAG(type, name, date_mm_yyyy) \
@mdbridge see src/yb/master/catalog_manager_bg_tasks.cc
:
// Do the LB enabling check
if (!processed_tablets) {
if (catalog_manager_->TimeSinceElectedLeader() >
MonoDelta::FromSeconds(FLAGS_load_balancer_initial_delay_secs)) {
catalog_manager_->load_balance_policy_->RunLoadBalancer();
}
}
opps. A voice-o caused my search to be restricted to word boundaries. :-( Thanks, @hulien22.
One problem to consider: It is possible that for a given tablet, that all its peers have heartbeated in. The LB will work on this tablet, even if the other tservers haven't heartbeated in as yet.
To address this, TabletLoader logic can loop through the each tablet's committed consensus and register tservers in the consensus as dead. As each tserver comes up, catalog manager will mark it as alive. In the bg thread, check the liveness of all tservers. There are two situations:
Some experiments/metrics will need to be checked to ensure this doesn't slow down the loading phase too much.
(the discussion thread that led to Rahul's proposal discussion thread that led to Rahul's proposal: https://yugabyte.slack.com/archives/C03ULJYE0KG/p1680627506605999)
@rahuldesirazu can you please create a GitHub issue to implement your improved load balancer defense and make this issue depend on that one? Probably assign to @Sandeep for triaging.
@Sandeep I'm assigning this to you for triaging as it's no longer an eligible starter task.
Jira Link: DB-5713
Description
Currently we have
load_balancer_initial_delay_secs
(default of 120s) which delays the LB from starting upon a new master being elected, but this should not be needed anymore with recent LB improvements/hardening.From @bmatican (https://yugabyte.slack.com/archives/C4K1838GL/p1677788431192569?thread_ts=1677714909.572269&cid=C4K1838GL):