yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[yugabyted] ERROR: A node is already running on #21687

Open onyn opened 6 months ago

onyn commented 6 months ago

Jira Link: DB-10571

Description

I was playing around with YugabyteDB and created a 3-node universe, with the nodes running in Docker:

node1.yml

```yaml
version: '3.7'
services:
  yugabyte:
    image: yugabytedb/yugabyte:2.20.2.0-b145
    network_mode: host
    volumes:
      - /var/yugabyte:/home/yugabyte/yb_data
    command:
      - "bin/yugabyted"
      - "start"
      - "--callhome=false"
      - "--background=false"
      - "--base_dir=/home/yugabyte/yb_data"
      - "--advertise_address=192.168.0.1"
      - "--ysql_enable_auth=true"
      - "--use_cassandra_authentication=true"
      - "--cloud_location=metal.de.rack1"
      - "--tserver_flags=ysql_pg_conf_csv={password_encryption=scram-sha-256}"
      - "--join=192.168.0.2"
```

node2.yml

```yaml
version: '3.7'
services:
  yugabyte:
    image: yugabytedb/yugabyte:2.20.2.0-b145
    network_mode: host
    volumes:
      - /var/yugabyte:/home/yugabyte/yb_data
    command:
      - "bin/yugabyted"
      - "start"
      - "--callhome=false"
      - "--background=false"
      - "--base_dir=/home/yugabyte/yb_data"
      - "--advertise_address=192.168.0.2"
      - "--ysql_enable_auth=true"
      - "--use_cassandra_authentication=true"
      - "--cloud_location=metal.de.rack1"
      - "--tserver_flags=ysql_pg_conf_csv={password_encryption=scram-sha-256}"
```

node3.yml

```yaml
version: '3.7'
services:
  yugabyte:
    image: yugabytedb/yugabyte:2.20.2.0-b145
    network_mode: host
    volumes:
      - /var/yugabyte:/home/yugabyte/yb_data
    command:
      - "bin/yugabyted"
      - "start"
      - "--callhome=false"
      - "--background=false"
      - "--base_dir=/home/yugabyte/yb_data"
      - "--advertise_address=192.168.0.3"
      - "--ysql_enable_auth=true"
      - "--use_cassandra_authentication=true"
      - "--cloud_location=metal.de.rack1"
      - "--tserver_flags=ysql_pg_conf_csv={password_encryption=scram-sha-256}"
      - "--join=192.168.0.2"
```

The universe formed successfully. Several days later I noticed a new version of the yugabyte Docker image (2.20.2.1-b3) and tried to upgrade the universe to it, starting with node1 (a follower):

node1.yml

```diff
 services:
   yugabyte:
-    image: yugabytedb/yugabyte:2.20.2.0-b145
+    image: yugabytedb/yugabyte:2.20.2.1-b3
     network_mode: host
     volumes:
```

After upgrading, node1 failed to start with the cryptic message `ERROR: A node is already running on 192.168.0.2, please specify a valid address.`:

full log

```
[yugabyted start] 2024-03-26 11:39:28,448 INFO: | 0.0s | Running yugabyted command: 'bin/yugabyted start --callhome=false --background=false --base_dir=/home/yugabyte/yb_data --advertise_address=192.168.0.1 --ysql_enable_auth=true --use_cassandra_authentication=true --cloud_location=metal.de.rack1 --tserver_flags=ysql_pg_conf_csv={password_encryption=scram-sha-256} --join=192.168.0.2'
[yugabyted start] 2024-03-26 11:39:28,448 INFO: | 0.0s | cmd = start using config file: /home/yugabyte/yb_data/conf/yugabyted.conf (args.config=None)
[yugabyted start] 2024-03-26 11:39:28,448 INFO: | 0.0s | Found directory /home/yugabyte/bin for file openssl_proxy.sh
[yugabyted start] 2024-03-26 11:39:28,449 INFO: | 0.0s | Found directory /home/yugabyte/bin for file yb-admin
[yugabyted start] 2024-03-26 11:39:28,451 INFO: | 0.0s | Fetching configs from join IP...
[yugabyted start] 2024-03-26 11:39:28,451 INFO: | 0.0s | Trying to get masters information from http://192.168.0.2:9000/api/v1/masters (Timeout=60)
[yugabyted start] 2024-03-26 11:39:28,458 DEBUG: | 0.0s | Tserver 192.168.0.2 returned the following master leader 192.168.0.2.
[yugabyted start] 2024-03-26 11:39:28,461 ERROR: | 0.0s | ERROR: A node is already running on 192.168.0.2, please specify a valid address. For more information, check the logs in /home/yugabyte/yb_data/logs
```
curl http://192.168.0.2:9000/api/v1/masters

```json
{
  "master_server_and_type": [
    {
      "master_server": "192.168.0.1:7100",
      "is_leader": false
    },
    {
      "master_server": "192.168.0.2:7100",
      "is_leader": true
    },
    {
      "master_server": "192.168.0.3:7100",
      "is_leader": false
    }
  ]
}
```
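
For context, the log above suggests what the failing start-up check does: yugabyted queries the masters endpoint at the `--join` address and, on seeing a master leader there, refuses to start. A rough bash approximation of that check (this is my reconstruction from the log, not yugabyted's actual code; it needs `curl` and `jq`):

```bash
#!/bin/sh
# Reconstruction of the pre-flight check implied by the log above.
# NOT yugabyted's real implementation -- only what the log suggests it does.
JOIN_IP=192.168.0.2

# yugabyted asks the tserver at the join address for the current masters:
leader=$(curl -s "http://${JOIN_IP}:9000/api/v1/masters" \
  | jq -r '.master_server_and_type[] | select(.is_leader) | .master_server')
echo "master leader reported by ${JOIN_IP}: ${leader}"

# Seeing a leader at the join address, yugabyted apparently concludes
# "A node is already running on ${JOIN_IP}" and aborts -- even though a
# running node at the join IP is exactly what --join is supposed to point at.
```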

Rolling back to 2.20.2.0-b145 does not fix the problem; it still fails with the same error message.

What's wrong with my setup?


ddorian commented 6 months ago

Hi @onyn

> After upgrading, node1 failed to start with the cryptic message `ERROR: A node is already running on 192.168.0.2, please specify a valid address.`:

onyn commented 6 months ago

> Which process printed this error and what command triggered it?

The command was `bin/yugabyted start --callhome=false --background=false --base_dir=/home/yugabyte/yb_data --advertise_address=192.168.0.1 --ysql_enable_auth=true --use_cassandra_authentication=true --cloud_location=metal.de.rack1 --tserver_flags=ysql_pg_conf_csv={password_encryption=scram-sha-256} --join=192.168.0.2`, so I think the error was printed by yugabyted itself. You can find details in the "full log" spoiler above.

> Are you using the same volume for all 3 nodes? I would expect to have different volumes/paths.

Every yugabyte instance runs on its own bare-metal server, so their data do not clash.

> What did you do by "upgrade" here? Meaning, just restart the container?

I mean running `docker compose up -d` after setting the new image version in the compose file. This stops the running container, removes it, and starts a new container with the newer yugabyte binaries.
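
Concretely, the upgrade amounts to something like this (the compose file location here is an assumption; node1.yml is the file shown above):

```bash
# Hypothetical directory holding node1.yml:
cd /etc/yugabyte
# Bump the image tag in the compose file:
sed -i 's|yugabyte:2.20.2.0-b145|yugabyte:2.20.2.1-b3|' node1.yml
# Recreate the container with the new image:
docker compose -f node1.yml up -d
```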

> Can you repeat the same thing in normal processes?

Yes, this is reproducible even if I download the binaries from downloads.yugabyte.com and run yugabyted by hand from the command line. I also found that upgrading the binaries is not necessary to trigger the error; a simple restart is enough. See the reproducer below:

reproducer

```bash
# On every node:
wget https://downloads.yugabyte.com/releases/2.20.2.1/yugabyte-2.20.2.1-b3-linux-x86_64.tar.gz
tar zxvf yugabyte-2.20.2.1-b3-linux-x86_64.tar.gz
cd yugabyte-2.20.2.1
bin/post_install.sh

# On node2
bin/yugabyted start --callhome=false --background=false --base_dir=/home/onyn/yugabyte_data --advertise_address=192.168.0.2 --ysql_enable_auth=true --use_cassandra_authentication=true --cloud_location=metal.de.rack1 --tserver_flags=ysql_pg_conf_csv={password_encryption=scram-sha-256}

# On node1
bin/yugabyted start --callhome=false --background=false --base_dir=/home/onyn/yugabyte_data --advertise_address=192.168.0.1 --ysql_enable_auth=true --use_cassandra_authentication=true --cloud_location=metal.de.rack1 --tserver_flags=ysql_pg_conf_csv={password_encryption=scram-sha-256} --join=192.168.0.2

# On node3
bin/yugabyted start --callhome=false --background=false --base_dir=/home/onyn/yugabyte_data --advertise_address=192.168.0.3 --ysql_enable_auth=true --use_cassandra_authentication=true --cloud_location=metal.de.rack1 --tserver_flags=ysql_pg_conf_csv={password_encryption=scram-sha-256} --join=192.168.0.2
```

Then on node1 I press CTRL+C to stop yugabyted, verify that no yugabyte processes are left, and run exactly the same command as before:

```bash
$ bin/yugabyted start --callhome=false --background=false --base_dir=/home/onyn/yugabyte_data --advertise_address=192.168.0.1 --ysql_enable_auth=true --use_cassandra_authentication=true --cloud_location=metal.de.rack1 --tserver_flags=ysql_pg_conf_csv={password_encryption=scram-sha-256} --join=192.168.0.2
ERROR: A node is already running on 192.168.0.2, please specify a valid address.
For more information, check the logs in /home/onyn/yugabyte_data/logs
```

Also note that the error message points at node2's IP, even though I restarted only node1.

> Maybe the old container is still running or smth?

No. I double checked this.
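
For the record, this is how one can check that nothing is left behind (the port list assumes YugabyteDB defaults):

```bash
# No yugabyte processes should survive the shutdown:
pgrep -af 'yugabyted|yb-master|yb-tserver' || echo "no yugabyte processes"

# Nothing should be listening on the default YugabyteDB ports
# (7100 master RPC, 9000 tserver HTTP, 9100 tserver RPC, 5433 YSQL, 9042 YCQL):
ss -ltnp | grep -E ':(7100|9000|9100|5433|9042)\b' || echo "no yugabyte ports in use"
```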

ddorian commented 6 months ago

Can you run `yugabyted collect_logs --base_dir=...` on the node that failed to start and upload the logs here?

onyn commented 6 months ago

```
$ bin/yugabyted collect_logs --base_dir=/home/onyn/yugabyte_data
ERROR: No YugabyteDB node is running in the data_dir /home/onyn/yugabyte_data/data
For more information, check the logs in /home/onyn/yugabyte_data/logs
```
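
Since collect_logs itself refuses to run without a live node, I can archive the raw log directory by hand if that helps (path taken from the error message above):

```bash
# Fallback when `yugabyted collect_logs` won't run because no node is up:
tar czf yugabyte_logs.tar.gz -C /home/onyn/yugabyte_data logs
```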
ddorian commented 6 months ago

Maybe you need to run it inside the docker container?

onyn commented 6 months ago

> Can you repeat the same thing in normal processes?

As you suggested, I continued the experiments with a normal process, not Docker. The container is down because of `ERROR: A node is already running on 192.168.0.2, please specify a valid address.`; the normal process is down for the same reason. Why it goes down is the main question of this ticket.

ddorian commented 6 months ago

@onyn is the yugabyted process maybe still running when you get that error?

onyn commented 6 months ago

No. I triple-checked this. As I said before, the yugabyte instance on node1 refers to node2 in its error message. That's what confuses me.