yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[YSQL] tserver unable to start if DNS does not resolve 1 of 3 master servers #24395

Open hmacias-avaya opened 1 month ago

hmacias-avaya commented 1 month ago

Jira Link: DB-13306

Description

I asked about this in the Slack channel: https://yugabyte-db.slack.com/archives/CG0KQF0GG/p1728633440104989

I have a 3-node cluster with YugabyteDB deployed through Helm. Each node runs a master and a tserver, and everything works fine.

Then I simulate a total node failure by powering off one of the nodes. Its tserver and master show up as Terminating, and the masters and tservers on the other 2 nodes continue to work correctly.

If I then restart one of the remaining 2 tservers, it fails to start because it cannot resolve the address of one of the masters (the master that is down due to the total node failure).

This may be related: https://github.com/yugabyte/yugabyte-db/issues/696 but the solution there was to increase the delay and retry for 1h.

If being unable to reach 1 out of 3 masters is "ok" and does not prevent the tserver from starting properly, shouldn't a DNS resolution failure (which is caused by Kubernetes not seeing the tserver as fully up yet) also be "ok"? After manually adding an entry in /etc/hosts, the tserver appears to start successfully (even though the master listed in /etc/hosts is not reachable). For example:

1.1.1.1 yb-master-2.yb-masters.default.svc.cluster.local
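
To make the requested behavior concrete, here is a minimal sketch of "tolerant" master address resolution. It is not how YugabyteDB resolves master addresses; the function names `resolve_host` and `resolve_masters` and the majority threshold are assumptions made purely for illustration. The idea is that an unresolvable master is recorded as missing, and startup only fails when fewer than a majority of the configured masters resolve.

```python
import socket
from typing import Callable, Dict, List, Optional


def resolve_host(host: str, port: int) -> Optional[str]:
    """Return the first resolved IP for host, or None if the DNS lookup fails."""
    try:
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return None
    return infos[0][4][0] if infos else None


def resolve_masters(
    addrs: List[str],
    resolver: Callable[[str, int], Optional[str]] = resolve_host,
) -> Dict[str, Optional[str]]:
    """Resolve every 'host:port' entry; raise only if no majority resolves.

    Hypothetical policy: an unresolved master maps to None instead of
    aborting startup, as long as more than half of the masters resolve.
    """
    resolved: Dict[str, Optional[str]] = {}
    for addr in addrs:
        host, port = addr.rsplit(":", 1)
        resolved[addr] = resolver(host, int(port))

    ok = sum(1 for ip in resolved.values() if ip is not None)
    majority = len(addrs) // 2 + 1
    if ok < majority:
        raise RuntimeError(
            f"only {ok} of {len(addrs)} masters resolved, need {majority}"
        )
    return resolved
```

With the three master addresses from this cluster, a lookup failure for yb-master-2 alone would leave 2 of 3 masters resolved, so this sketch would let startup continue, which is what the /etc/hosts workaround achieves by hand.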

Issue Type

kind/bug


hmacias-avaya commented 4 weeks ago

These are the logs from the restarted tserver (1 of the 2 that remained alive after simulating a node failure that terminated the 3rd tserver):

I1015 06:39:46.145413 1 server_main_util.cc:68] NumCPUs determined to be: 1
I1015 06:39:46.145702 1 mem_tracker.cc:255] Creating root MemTracker with garbage collection threshold 36490444 bytes
I1015 06:39:46.145720 1 mem_tracker.cc:259] Root memory limit is 3649044480
I1015 06:39:46.145740 1 tcmalloc_util.cc:231] Setting tcmalloc max thread cache bytes to: 91226112
I1015 06:39:46.145761 1 tcmalloc_util.cc:264] Setting TCMalloc profiler sampling frequency to 1048576 bytes
I1015 06:39:46.145776 1 mem_tracker.cc:213] TCMalloc per cpu caches active: 1
I1015 06:39:46.145789 1 mem_tracker.cc:215] TCMalloc max per cpu cache size: 1572864
I1015 06:39:46.145802 1 mem_tracker.cc:217] TCMalloc max total thread cache bytes: 91226112
I1015 06:39:46.145833 1 tablet_server_main_impl.cc:157] Using parsed rpc = 0.0.0.0:9100
I1015 06:39:46.145866 1 tablet_server_main_impl.cc:150] Reset YEDIS bind address to 0.0.0.0:6379
I1015 06:39:46.145928 1 server_base_options.cc:176] Updating master addrs to {yb-master-0.yb-masters.default.svc.cluster.local:7100},{yb-master-1.yb-masters.default.svc.cluster.local:7100},{yb-master-2.yb-masters.default.svc.cluster.local:7100}
I1015 06:39:46.145957 1 server_base_options.cc:176] Updating master addrs to {yb-master-0.yb-masters.default.svc.cluster.local:7100},{yb-master-1.yb-masters.default.svc.cluster.local:7100},{yb-master-2.yb-masters.default.svc.cluster.local:7100}
I1015 06:39:46.145988 1 server_base_options.cc:176] Updating master addrs to {yb-master-0.yb-masters.default.svc.cluster.local:7100},{yb-master-1.yb-masters.default.svc.cluster.local:7100},{yb-master-2.yb-masters.default.svc.cluster.local:7100}
I1015 06:39:46.146400 1 shared_mem.cc:139] Using memfd_create as a shared memory provider
I1015 06:39:46.146554 1 mem_tracker.cc:795] MemTracker: hard memory limit is 3.398438 GB
I1015 06:39:46.146579 1 mem_tracker.cc:797] MemTracker: soft memory limit is 2.888672 GB
I1015 06:39:46.146595 1 server_base_options.cc:176] Updating master addrs to {yb-master-0.yb-masters.default.svc.cluster.local:7100},{yb-master-1.yb-masters.default.svc.cluster.local:7100},{yb-master-2.yb-masters.default.svc.cluster.local:7100}
I1015 06:39:46.147065 1 thread_pool.cc:165] Starting thread pool { name: raft_notifications max_workers: 18446744073709551615 }
I1015 06:39:46.148972 1 docdb_rocksdb_util.cc:586] FLAGS_priority_thread_pool_size was not set, automatically configuring to 1.
I1015 06:39:46.149250 1 full_compaction_manager.cc:98] Initialized full compaction manager check_interval_sec: 60 window_size_sec: 300 scheduled_compaction_frequency: 0.000s scheduled_jitter_factor: 33
I1015 06:39:46.149555 1 rpc_server.cc:84] yb::server::RpcServer created at 0x26183fc43c20
I1015 06:39:46.149583 1 tablet_server.cc:333] yb::tserver::TabletServer created at 0x26183fc4a000
I1015 06:39:46.149596 1 tablet_server.cc:334] yb::tserver::TSTabletManager created at 0x26183f9bcc00
I1015 06:39:46.149609 1 tablet_server_main_impl.cc:240] Initializing tablet server...
F1015 07:39:58.655723 1 tablet_server_main_impl.cc:241] Configuration error (yb/server/server_base_options.cc:305): Couldn't resolve master service address 'yb-master-2.yb-masters.default.svc.cluster.local:7100'
Fatal failure details written to /mnt/disk0/yb-data/tserver/logs/yb-tserver.FATAL.details.2024-10-15T07_39_58.pid1.txt
F20241015 07:39:58 ../../src/yb/tserver/tablet_server_main_impl.cc:241] Configuration error (yb/server/server_base_options.cc:305): Couldn't resolve master service address 'yb-master-2.yb-masters.default.svc.cluster.local:7100'
    @ 0x55b8ce6a9b67 google::LogMessage::SendToLog()
    @ 0x55b8ce6aaacd google::LogMessage::Flush()
    @ 0x55b8ce6ab149 google::LogMessageFatal::~LogMessageFatal()
    @ 0x55b8cfd8a8f1 yb::tserver::TabletServerMain()
    @ 0x55b8ce65a6a9 main
    @ 0x7f2916523825 __libc_start_main
    @ 0x55b8ce2c902e _start
Check failure stack trace:
    @ 0x55b8ce6aa3e5 google::LogMessage::SendToLog()
    @ 0x55b8ce6aaacd google::LogMessage::Flush()
    @ 0x55b8ce6ab149 google::LogMessageFatal::~LogMessageFatal()
    @ 0x55b8cfd8a8f1 yb::tserver::TabletServerMain()
    @ 0x55b8ce65a6a9 main
    @ 0x7f2916523825 __libc_start_main
    @ 0x55b8ce2c902e _start
Aborted at 1728977998 (unix time) try "date -d @1728977998" if you are using GNU date
PC: @ 0x0 (unknown)
SIGSEGV (@0x0) received by PID 1 (TID 0x7f29177e5180) from PID 0; stack trace:
    @ 0x7f29165375a6 __GI_abort
    @ 0x55b8d0034a03 yb::(anonymous namespace)::DumpStackTraceAndExit()
    @ 0x55b8ce6aa3e5 google::LogMessage::SendToLog()
    @ 0x55b8ce6aaacd google::LogMessage::Flush()
    @ 0x55b8ce6ab149 google::LogMessageFatal::~LogMessageFatal()
    @ 0x55b8cfd8a8f1 yb::tserver::TabletServerMain()
    @ 0x55b8ce65a6a9 main
    @ 0x7f2916523825 __libc_start_main
    @ 0x55b8ce2c902e _start
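
One detail worth pulling out of these logs is the gap between "Initializing tablet server..." and the fatal error. The small sketch below computes it from the two glog timestamps. The helper name `glog_time` is hypothetical, and the year 2024 is taken from the `F20241015` line since the short glog prefix carries no year.

```python
from datetime import datetime


def glog_time(line: str, year: int = 2024) -> datetime:
    """Parse the timestamp of a glog line such as 'I1015 06:39:46.149609 ...'.

    The first token is a severity letter followed by MMDD; the second token
    is HH:MM:SS.microseconds. The year is an assumption supplied by the caller.
    """
    level_date, clock = line.split()[:2]
    return datetime.strptime(f"{year}{level_date[1:]} {clock}", "%Y%m%d %H:%M:%S.%f")


start = glog_time(
    "I1015 06:39:46.149609 1 tablet_server_main_impl.cc:240] Initializing tablet server..."
)
fatal = glog_time(
    "F1015 07:39:58.655723 1 tablet_server_main_impl.cc:241] Configuration error"
)
gap = fatal - start
print(gap)  # 1:00:12.506114
```

The tserver therefore spent just over one hour in initialization before aborting. That would be consistent with the 1h retry window mentioned in issue #696, although the logs alone do not confirm that this is the cause.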

The message Couldn't resolve master service address 'yb-master-2.yb-masters.default.svc.cluster.local:7100' refers to the master that is currently down. Its DNS name does not resolve because that master is not even scheduled in Kubernetes.