fritshoogland-yugabyte opened 1 year ago
@fritshoogland-yugabyte can you share a bit more about how you got into this state? It's really strange that the first 2 tservers have no hostname, but the last one, which is also dead, does show the hostname.
The main difference seems to be that it still exists in some quorums (has 4 peers on it). Did it also stop showing the hostname after it was finally kicked out of its quorums?
It's been a while since I encountered this state, but I found a scenario that reproduces it: the master/tablet-servers
page shows a number of DEAD tablet servers without their HTTP address name, which causes master/api/v1/tablet-servers
to fail to show all of the DEAD tablet servers, because the entries are keyed by HTTP address name, so only one entry with an empty name can be shown.
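The collapse is simply how JSON objects behave once entries are keyed by a non-unique value. A minimal sketch of the mechanism; the payload shape below is abbreviated and hypothetical, not the exact endpoint schema:

```python
import json

# Hypothetical, abbreviated shape of master/api/v1/tablet-servers:
# tablet servers form a map keyed by their HTTP address name. When two
# dead tablet servers have both lost their name, both end up under "".
payload = '''
{
  "": {"status": "DEAD", "placement": "local.local.local2"},
  "": {"status": "DEAD", "placement": "local.local.local3"},
  "yb-4.local:9000": {"status": "ALIVE", "placement": "local.local.local4"}
}
'''

servers = json.loads(payload)
# JSON parsers typically keep only the last value for a duplicate key,
# so one of the two DEAD tablet servers silently disappears.
print(len(servers))              # 2, not 3
print(servers[""]["placement"])  # local.local.local3
```

Any consumer that parses the endpoint into a map sees at most one of the unnamed servers, regardless of how many are dead.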
➜ yb_stats --print-tablet-servers
yb-1.local:9000 ALIVE Placement: local.local.local1
HB time: 0.4s, Uptime: 170, Ram 57.90 MB
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 12, user (leader/total): 0/0, system (leader/total): 4/12
Path: /mnt/d0, total: 10724835328, used: 191152128 (1.78%)
yb-2.local:9000 ALIVE Placement: local.local.local2
HB time: 0.5s, Uptime: 146, Ram 56.62 MB
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 12, user (leader/total): 0/0, system (leader/total): 4/12
Path: /mnt/d0, total: 10724835328, used: 201392128 (1.88%)
yb-3.local:9000 ALIVE Placement: local.local.local3
HB time: 0.4s, Uptime: 122, Ram 40.83 MB
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 12, user (leader/total): 0/0, system (leader/total): 4/12
Path: /mnt/d0, total: 10724835328, used: 197783552 (1.84%)
yb-4.local:9000 ALIVE Placement: local.local.local4
HB time: 0.5s, Uptime: 97, Ram 51.38 MB
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 12, user (leader/total): 0/0, system (leader/total): 4/12
Path: /mnt/d0, total: 10724835328, used: 170307584 (1.59%)
yb-5.local:9000 ALIVE Placement: local.local.local5
HB time: 0.7s, Uptime: 73, Ram 51.38 MB
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 12, user (leader/total): 0/0, system (leader/total): 4/12
Path: /mnt/d0, total: 10724835328, used: 169508864 (1.58%)
yb-6.local:9000 ALIVE Placement: local.local.local6
HB time: 0.5s, Uptime: 47, Ram 58.72 MB
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 12, user (leader/total): 0/0, system (leader/total): 4/12
Path: /mnt/d0, total: 10724835328, used: 158294016 (1.48%)
➜ yb_stats --print-tablet-servers
yb-1.local:9000 DEAD Placement: local.local.local1
HB time: 86.7s, Uptime: 0, Ram 0 B
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 12, user (leader/total): 0/0, system (leader/total): 0/12
yb-2.local:9000 DEAD Placement: local.local.local2
HB time: 74.7s, Uptime: 0, Ram 0 B
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 12, user (leader/total): 0/0, system (leader/total): 5/12
yb-3.local:9000 DEAD Placement: local.local.local3
HB time: 72.7s, Uptime: 0, Ram 0 B
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 12, user (leader/total): 0/0, system (leader/total): 5/12
yb-4.local:9000 ALIVE Placement: local.local.local4
HB time: 0.5s, Uptime: 242, Ram 35.01 MB
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 12, user (leader/total): 0/0, system (leader/total): 5/12
Path: /mnt/d0, total: 10724835328, used: 170307584 (1.59%)
yb-5.local:9000 ALIVE Placement: local.local.local5
HB time: 0.4s, Uptime: 218, Ram 34.04 MB
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 12, user (leader/total): 0/0, system (leader/total): 4/12
Path: /mnt/d0, total: 10724835328, used: 169508864 (1.58%)
yb-6.local:9000 ALIVE Placement: local.local.local6
HB time: 0.5s, Uptime: 193, Ram 36.84 MB
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 12, user (leader/total): 0/0, system (leader/total): 5/12
Path: /mnt/d0, total: 10724835328, used: 158089216 (1.47%)
➜ yb_stats --print-masters
52cfc19881954ca0b866ea3571abe726 LEADER Placement: local.local.local
Seqno: 1674563401730224 Start time: 1674563401730224
RPC addresses: ( yb-1.local:7100 )
HTTP addresses: ( yb-1.local:7000 )
6e82dcf19d3a4435b9fe27b8c5a97b1a FOLLOWER Placement: local.local.local
Seqno: 1674563426056547 Start time: 1674563426056547
RPC addresses: ( yb-2.local:7100 )
HTTP addresses: ( yb-2.local:7000 )
22e36d9b728841a485d395dedcbaa0ec FOLLOWER Placement: local.local.local
Seqno: 1674563451490385 Start time: 1674563451490385
RPC addresses: ( yb-3.local:7100 )
HTTP addresses: ( yb-3.local:7000 )
Current leader is yb-1.local / 52cfc19881954ca0b866ea3571abe726
Let's first put a watch on the cluster to see what changes, using yb_stats --adhoc-nonmetrics-diff (show a diff, but not for the metrics: we don't care about those here, this is not about performance).
➜ yb_stats --adhoc-nonmetrics-diff
Begin ad-hoc in-memory snapshot created, press enter to create end snapshot for difference calculation.
Then make the master leader step down to a follower:
yb-admin -init_master_addrs localhost:7100 master_leader_stepdown 6e82dcf19d3a4435b9fe27b8c5a97b1a
Then press enter in yb_stats to see what happened:
Time between snapshots: 124.432 seconds
= Masters: 52cfc19881954ca0b866ea3571abe726 Role: LEADER->FOLLOWER Placement: local.local.local
Seq#: 1674563401730224 Start time: 1674563401730224
RPC: yb-1.local:7100,
HTTP: yb-1.local:7000,
= Masters: 6e82dcf19d3a4435b9fe27b8c5a97b1a Role: FOLLOWER->LEADER Placement: local.local.local
Seq#: 1674563426056547 Start time: 1674563426056547
RPC: yb-2.local:7100,
HTTP: yb-2.local:7000,
+ Tserver: , status: DEAD, uptime: 0 s
- Tserver: yb-1.local:9000, status: DEAD, uptime: 0 s
- Tserver: yb-2.local:9000, status: DEAD, uptime: 0 s
- Tserver: yb-3.local:9000, status: DEAD, uptime: 0 s
Master yb-1 changed from LEADER to FOLLOWER, and master yb-2 changed from FOLLOWER to LEADER, as expected. For the tablet servers, three entries have gone away ('-'): tablet server nodes yb-1, yb-2 and yb-3. And one tserver was "added" ('+'), with an empty name.
We can guess what the tablet servers view now looks like:
➜ yb_stats --print-tablet-servers
DEAD Placement: local.local.local3
HB time: 837.8s, Uptime: 0, Ram 0 B
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 0, user (leader/total): 0/0, system (leader/total): 0/4
yb-4.local:9000 ALIVE Placement: local.local.local4
HB time: 0.2s, Uptime: 779, Ram 35.84 MB
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 12, user (leader/total): 0/0, system (leader/total): 5/12
Path: /mnt/d0, total: 10724835328, used: 161931264 (1.51%)
yb-5.local:9000 ALIVE Placement: local.local.local5
HB time: 0.3s, Uptime: 750, Ram 36.20 MB
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 12, user (leader/total): 0/0, system (leader/total): 4/12
Path: /mnt/d0, total: 10724835328, used: 161136640 (1.50%)
yb-6.local:9000 ALIVE Placement: local.local.local6
HB time: 0.2s, Uptime: 724, Ram 38.85 MB
SST files: nr: 0, size: 0 B, uncompressed: 0 B
ops read: 0, write: 0
tablets: active: 12, user (leader/total): 0/0, system (leader/total): 5/12
Path: /mnt/d0, total: 10724835328, used: 149716992 (1.40%)
The previously unavailable/DEAD tablet servers are removed from the view, and one entry is added with an empty name. Based on the placement, which is unique for each tablet server in my (specific!) configuration, we can see it's yb-3. Without that, there is no way to identify which node the unnamed entry is, because the only way to truly identify a tablet server is by its UUID.
The huge issue is that if you rely on this view, you get incomplete information, because these now-unnamed nodes might still host system or user tablets. All in all, these tablet servers must be shown to provide accurate information; and they are shown in the master/tablet-servers page.
Tablet servers are identified internally by their UUID. The masters are simply listed without a name in /api/v1/masters, but each listed master does contain its unique UUID. It's confusing that the equivalent tablet servers endpoint is formatted radically differently, and does not contain the vital unique UUID and sequence number.
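The change argued for here, keying the tablet servers by UUID as the masters are, can be sketched as follows. Both shapes and the UUIDs are made up for illustration, not the actual schemas:

```python
import json

# Current shape (assumed): keyed by HTTP name. Two dead servers with an
# empty name collapse into a single entry when the JSON is parsed.
by_name = json.loads('{"": {"status": "DEAD"}, "": {"status": "DEAD"}}')

# Proposed shape: keyed by the tablet server UUID (hypothetical UUIDs),
# with the HTTP name demoted to a regular, possibly empty, field.
by_uuid = json.loads('''
{
  "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa": {"http_address": "", "status": "DEAD"},
  "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb": {"http_address": "", "status": "DEAD"}
}
''')

print(len(by_name))  # 1: one dead server has been lost
print(len(by_uuid))  # 2: both survive, even with empty names
```

Since the UUID is unique and never empty, the UUID-keyed map can never drop an entry, no matter how many servers lose their HTTP name.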
The view master/api/v1/health-check shows dead nodes, which are actually dead TABLET SERVERS listed by their UUID. Yet there is no view that tells what the properties of these tablet servers are.
(master/dump-entities actually does show the server UUID, together with the "addr", which is the HTTP name. That is an odd place to show it: if the UUID can be resolved elsewhere, all that is needed there is the UUID.)
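As a stopgap, the dead UUIDs from master/api/v1/health-check can be joined against the UUID-to-addr pairs that master/dump-entities exposes. A sketch with hard-coded sample data; the field names server_uuid and addr follow the dump-entities output mentioned above, everything else (the UUIDs, the addresses) is hypothetical:

```python
# Dead tablet server UUIDs, as listed by master/api/v1/health-check
# (hypothetical UUID).
dead_uuids = {"cccccccccccccccccccccccccccccccc"}

# server_uuid/addr pairs as shown by master/dump-entities (sample data).
replicas = [
    {"server_uuid": "cccccccccccccccccccccccccccccccc", "addr": "yb-3.local:9100"},
    {"server_uuid": "dddddddddddddddddddddddddddddddd", "addr": "yb-4.local:9100"},
]

# Build a lookup table and resolve each dead UUID to an address.
addr_by_uuid = {r["server_uuid"]: r["addr"] for r in replicas}
for uuid in sorted(dead_uuids):
    print(uuid, "->", addr_by_uuid.get(uuid, "unknown"))
```

This only works while the dead server still appears in some tablet's replica list, which is exactly the fragile dependency the issue is about.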
Besides the UUID missing from the /api/v1/tablet-servers view, several fields are missing that are very helpful for determining the state of the tablet servers.
I think a more appropriate name for this endpoint would be /api/v1/tablet-server-performance, because it combines status and performance data per tablet server name.
Jira Link: DB-5024
Description
YugabyteDB 2.17.0.0b24 Linux Alma8.7
When multiple tablet servers are stopped, the HTTP endpoint name seems to be removed from the metadata the master keeps, and the tablet server UUID is used instead.
However, the endpoint /api/v1/tablet-servers uses a different construction than /api/v1/masters: it shows a map of objects (the tablet servers) inside an unnamed object. That map is keyed by the HTTP endpoint name. If the HTTP endpoint name is removed (as can be seen in the screenshot above), the key becomes empty, and if multiple keys become empty, still only one entry is shown. This leads to incomplete data being shown.
Please list the tablet servers by their UUID, as is done for the masters, so /api/v1/tablet-servers can show the truth.
Data in /api/v1/tablet-servers for the above screenshot situation: