yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com
Other
8.69k stars 1.04k forks source link

[DocDB] Improve the signal to noise ratio in the database log messages #23030

Open ssherwood opened 1 month ago

ssherwood commented 1 month ago

Jira Link: DB-11958

Description

Both the Master and the TServers generate a lot of excessive log messages, particularly at the WARN and ERROR level that are generally confusing to users. We should either reword or suppress many of these messages if they are not actually problems and, if they are, be more clear to what the problem is and how to address it.

Master examples:

W0624 12:32:01.351621   102 catalog_manager.cc:1705] Failed to get current config: Illegal state (yb/master/catalog_manager.cc:12201): Node 112c59c51c92441085d892b472dee3d9 peer not initialized.

W0624 12:32:01.357802   108 catalog_manager.cc:1705] Failed to get current config: Illegal state (yb/master/catalog_manager.cc:12201): Node 112c59c51c92441085d892b472dee3d9 peer not initialized.

W0624 12:32:01.457957   107 log_util.cc:240] Could not read footer for segment: /root/var/data/yb-data/master/wals/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/wal-000000002: Not found (yb/consensus/log_util.cc:580): Footer not found. Footer magic doesn't match

W0624 12:32:01.458201   107 log_reader.cc:217] T 00000000000000000000000000000000 P 112c59c51c92441085d892b472dee3d9: Log segment /root/var/data/yb-data/master/wals/table-sys.catalog.uuid/tablet-00000000000000000000000000000000/wal-000000002 was likely left in-progressafter a previous crash. Will try to reuse this segment as writable active segment

W0624 12:32:01.472800   107 remove_intents_task.cc:58] Remove intents task failed: Aborted (yb/tablet/tablet_peer.cc:1638): Thread pool not ready

W0624 12:32:01.540099   112 client_master_rpc.cc:92] LookupByIdRpc(tablet: f6c2566958b04249aca736077a45de29, num_attempts: 1): Leader Master has changed (172.17.0.2:7100 is no longer the leader), re-trying...

W0624 12:32:01.564503   130 catalog_manager_bg_tasks.cc:173] Catalog manager background task thread going to sleep: Service unavailable (yb/master/scoped_leader_shared_lock.cc:91): Catalog manager is not initialized. State: 1

W0624 12:32:02.050343   112 client_master_rpc.cc:92] LookupByIdRpc(tablet: 6aa36c3ba501483eaf6aae660be2daaf, num_attempts: 1): Leader Master has changed (172.17.0.2:7100 is no longer the leader), re-trying...

W0624 12:32:05.341504   112 catalog_manager.cc:8490] T 00000000000000000000000000000000 P 112c59c51c92441085d892b472dee3d9: Invalid tablet report from b374cab38d4b4f9e8bc8db468909dcc4: Received an incremental tablet report when a full one was needed

TServer examples

W0509 16:35:49.340444   164 log_util.cc:240] Could not read footer for segment: /root/var/data/yb-data/tserver/wals/table-64163067f97a436d86418519d192c7bc/tablet-dbb3fa76724948edb162b7ba10fbc2f6/wal-000000001: Not found (yb/consensus/log_util.cc:580): Footer not found. Footer magic doesn't match

W0620 13:46:40.996364  8930 log_reader.cc:217] T a3db3f9f042e45ca912dc787bba25432 P 77d0d4de0ac744faa96c088e25747afa: Log segment /mnt/d0/yb-data/tserver/wals/table-1e8f4b7533124956ad28ebd7362ad144/tablet-a3db3f9f042e45ca912dc787bba25432/wal-000000001 was likely left in-progressafter a previous crash. Will try to reuse this segment as writable active segment

W0618 19:03:34.701647 107020 raft_consensus.cc:1395] Leader lease expiration was not set: Illegal state (yb/consensus/replica_state.cc:1415): NoOp of current term is not committed (majority_replicated_lease_expiration_ = -inf, majority_replicated_ht_lease_expiration_ = 0)

W0618 19:03:37.359730 107064 cql_processor.cc:387] Unsupported driver option DRIVER_NAME = DataStax Java Driver
W0618 19:05:04.232563 107064 cql_processor.cc:387] Unsupported driver option DRIVER_VERSION = 3.25.0.2

There are also several coming from CDC via AsyncYBClient that I will have to sanitize before I can share them, e.g.

Some clients are left in the client cache and haven't been cleaned up

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

eugeneckim commented 4 weeks ago
I0405 03:23:19.764725 14875 reactor.cc:459] Master_R014: DEBUG:
Closing idle connection: Connection (0x0000000022ad2eb0) server
AAA.BBB.CCC.DDD:PPPP => EEE.FFF.GGG.HHH:7100 - it has been idle for
65.0009s

W0405 04:05:27.798019 14857 connection.cc:294] Connection
(0x0000000023810eb0) server AAA.BBB.CCC.DDD:PPPPP => EEE.FFF.GGG.HHH:7100:
Command sequence failure: Network error (yb/rpc/yb_rpc.cc:148):
Invalid connection header: <Very long string, for example: 16030002D7010002D303003B19F0CE...>