yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[yugabyted] ERROR: A node is already running on #21687

Open onyn opened 6 months ago

onyn commented 6 months ago

Jira Link: DB-10571

Description

I was playing around with YugabyteDB and created a 3-node universe, with the nodes running in Docker:

node1.yml

```yaml
version: '3.7'
services:
  yugabyte:
    image: yugabytedb/yugabyte:2.20.2.0-b145
    network_mode: host
    volumes:
      - /var/yugabyte:/home/yugabyte/yb_data
    command:
      - "bin/yugabyted"
      - "start"
      - "--callhome=false"
      - "--background=false"
      - "--base_dir=/home/yugabyte/yb_data"
      - "--advertise_address=192.168.0.1"
      - "--ysql_enable_auth=true"
      - "--use_cassandra_authentication=true"
      - "--cloud_location=metal.de.rack1"
      - "--tserver_flags=ysql_pg_conf_csv={password_encryption=scram-sha-256}"
      - "--join=192.168.0.2"
```

node2.yml

```yaml
version: '3.7'
services:
  yugabyte:
    image: yugabytedb/yugabyte:2.20.2.0-b145
    network_mode: host
    volumes:
      - /var/yugabyte:/home/yugabyte/yb_data
    command:
      - "bin/yugabyted"
      - "start"
      - "--callhome=false"
      - "--background=false"
      - "--base_dir=/home/yugabyte/yb_data"
      - "--advertise_address=192.168.0.2"
      - "--ysql_enable_auth=true"
      - "--use_cassandra_authentication=true"
      - "--cloud_location=metal.de.rack1"
      - "--tserver_flags=ysql_pg_conf_csv={password_encryption=scram-sha-256}"
```

node3.yml

```yaml
version: '3.7'
services:
  yugabyte:
    image: yugabytedb/yugabyte:2.20.2.0-b145
    network_mode: host
    volumes:
      - /var/yugabyte:/home/yugabyte/yb_data
    command:
      - "bin/yugabyted"
      - "start"
      - "--callhome=false"
      - "--background=false"
      - "--base_dir=/home/yugabyte/yb_data"
      - "--advertise_address=192.168.0.3"
      - "--ysql_enable_auth=true"
      - "--use_cassandra_authentication=true"
      - "--cloud_location=metal.de.rack1"
      - "--tserver_flags=ysql_pg_conf_csv={password_encryption=scram-sha-256}"
      - "--join=192.168.0.2"
```

The universe formed successfully. Several days later I noticed a new version of the yugabyte Docker image (2.20.2.1-b3) and tried to upgrade the universe to it, starting with node1 (a follower):

node1.yml

```diff
 services:
   yugabyte:
-    image: yugabytedb/yugabyte:2.20.2.0-b145
+    image: yugabytedb/yugabyte:2.20.2.1-b3
     network_mode: host
     volumes:
```

After upgrading, node1 failed to start with the cryptic message `ERROR: A node is already running on 192.168.0.2, please specify a valid address.`:

full log

```
[yugabyted start] 2024-03-26 11:39:28,448 INFO: | 0.0s | Running yugabyted command: 'bin/yugabyted start --callhome=false --background=false --base_dir=/home/yugabyte/yb_data --advertise_address=192.168.0.1 --ysql_enable_auth=true --use_cassandra_authentication=true --cloud_location=metal.de.rack1 --tserver_flags=ysql_pg_conf_csv={password_encryption=scram-sha-256} --join=192.168.0.2'
[yugabyted start] 2024-03-26 11:39:28,448 INFO: | 0.0s | cmd = start using config file: /home/yugabyte/yb_data/conf/yugabyted.conf (args.config=None)
[yugabyted start] 2024-03-26 11:39:28,448 INFO: | 0.0s | Found directory /home/yugabyte/bin for file openssl_proxy.sh
[yugabyted start] 2024-03-26 11:39:28,449 INFO: | 0.0s | Found directory /home/yugabyte/bin for file yb-admin
[yugabyted start] 2024-03-26 11:39:28,451 INFO: | 0.0s | Fetching configs from join IP...
[yugabyted start] 2024-03-26 11:39:28,451 INFO: | 0.0s | Trying to get masters information from http://192.168.0.2:9000/api/v1/masters (Timeout=60)
[yugabyted start] 2024-03-26 11:39:28,458 DEBUG: | 0.0s | Tserver 192.168.0.2 returned the following master leader 192.168.0.2.
[yugabyted start] 2024-03-26 11:39:28,461 ERROR: | 0.0s | ERROR: A node is already running on 192.168.0.2, please specify a valid address. For more information, check the logs in /home/yugabyte/yb_data/logs
```
curl http://192.168.0.2:9000/api/v1/masters

```json
{
  "master_server_and_type": [
    {
      "master_server": "192.168.0.1:7100",
      "is_leader": false
    },
    {
      "master_server": "192.168.0.2:7100",
      "is_leader": true
    },
    {
      "master_server": "192.168.0.3:7100",
      "is_leader": false
    }
  ]
}
```
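
For context, the log above suggests what the failing start-up check does: yugabyted queries the masters endpoint at the `--join` address and, on seeing a master leader there, refuses to start. A rough bash approximation of that check (this is my reconstruction from the log, not yugabyted's actual code; it needs `curl` and `jq`):

```bash
#!/bin/sh
# Reconstruction of the pre-flight check implied by the log above.
# NOT yugabyted's real implementation -- only what the log suggests it does.
JOIN_IP=192.168.0.2

# yugabyted asks the tserver at the join address for the current masters:
leader=$(curl -s "http://${JOIN_IP}:9000/api/v1/masters" \
  | jq -r '.master_server_and_type[] | select(.is_leader) | .master_server')
echo "master leader reported by ${JOIN_IP}: ${leader}"

# Seeing a leader at the join address, yugabyted apparently concludes
# "A node is already running on ${JOIN_IP}" and aborts -- even though a
# running node at the join IP is exactly what --join is supposed to point at.
```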

Rolling back to 2.20.2.0-b145 does not fix the problem; it still fails with the same error message.

What's wrong with my setup?


ddorian commented 6 months ago

Hi @onyn

> After upgrading, node1 failed to start with the cryptic message `ERROR: A node is already running on 192.168.0.2, please specify a valid address.`:

onyn commented 6 months ago

> Which process printed this error and what command triggered it?

The command was `bin/yugabyted start --callhome=false --background=false --base_dir=/home/yugabyte/yb_data --advertise_address=192.168.0.1 --ysql_enable_auth=true --use_cassandra_authentication=true --cloud_location=metal.de.rack1 --tserver_flags=ysql_pg_conf_csv={password_encryption=scram-sha-256} --join=192.168.0.2`, so I think the error was printed by yugabyted itself. You can find details in the "full log" spoiler above.

> Are you using the same volume for all 3 nodes? I would expect to have different volumes/paths.

Every yugabyte instance runs on its own bare-metal server, so their data do not clash.

> What did you do by "upgrade" here? Meaning, just restart the container?

I mean running `docker compose up -d` after setting the new image version in the compose file. This stops the running container, removes it, and starts a new container with the newer yugabyte binaries.
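
Concretely, the upgrade amounts to something like this (the compose file location here is an assumption; node1.yml is the file shown above):

```bash
# Hypothetical directory holding node1.yml:
cd /etc/yugabyte
# Bump the image tag in the compose file:
sed -i 's|yugabyte:2.20.2.0-b145|yugabyte:2.20.2.1-b3|' node1.yml
# Recreate the container with the new image:
docker compose -f node1.yml up -d
```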

> Can you repeat the same thing in normal processes?

Yes, this is reproducible even if I download the binaries from downloads.yugabyte.com and run yugabyted by hand from the command line. I also found that upgrading the binaries is not necessary to trigger the error; a simple restart is enough. See the reproducer below:

reproducer

```bash
# On every node:
wget https://downloads.yugabyte.com/releases/2.20.2.1/yugabyte-2.20.2.1-b3-linux-x86_64.tar.gz
tar zxvf yugabyte-2.20.2.1-b3-linux-x86_64.tar.gz
cd yugabyte-2.20.2.1
bin/post_install.sh

# On node2
bin/yugabyted start --callhome=false --background=false --base_dir=/home/onyn/yugabyte_data --advertise_address=192.168.0.2 --ysql_enable_auth=true --use_cassandra_authentication=true --cloud_location=metal.de.rack1 --tserver_flags=ysql_pg_conf_csv={password_encryption=scram-sha-256}

# On node1
bin/yugabyted start --callhome=false --background=false --base_dir=/home/onyn/yugabyte_data --advertise_address=192.168.0.1 --ysql_enable_auth=true --use_cassandra_authentication=true --cloud_location=metal.de.rack1 --tserver_flags=ysql_pg_conf_csv={password_encryption=scram-sha-256} --join=192.168.0.2

# On node3
bin/yugabyted start --callhome=false --background=false --base_dir=/home/onyn/yugabyte_data --advertise_address=192.168.0.3 --ysql_enable_auth=true --use_cassandra_authentication=true --cloud_location=metal.de.rack1 --tserver_flags=ysql_pg_conf_csv={password_encryption=scram-sha-256} --join=192.168.0.2
```

Then on node1 I press CTRL+C to stop yugabyted, verify that no yugabyte processes are left, and run exactly the same command as before:

```bash
$ bin/yugabyted start --callhome=false --background=false --base_dir=/home/onyn/yugabyte_data --advertise_address=192.168.0.1 --ysql_enable_auth=true --use_cassandra_authentication=true --cloud_location=metal.de.rack1 --tserver_flags=ysql_pg_conf_csv={password_encryption=scram-sha-256} --join=192.168.0.2
ERROR: A node is already running on 192.168.0.2, please specify a valid address.
For more information, check the logs in /home/onyn/yugabyte_data/logs
```

Also note that the error message points at node2's IP, even though I restarted only node1.

> Maybe the old container is still running or smth?

No. I double checked this.
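
For the record, this is how one can check that nothing is left behind (the port list assumes YugabyteDB defaults):

```bash
# No yugabyte processes should survive the shutdown:
pgrep -af 'yugabyted|yb-master|yb-tserver' || echo "no yugabyte processes"

# Nothing should be listening on the default YugabyteDB ports
# (7100 master RPC, 9000 tserver HTTP, 9100 tserver RPC, 5433 YSQL, 9042 YCQL):
ss -ltnp | grep -E ':(7100|9000|9100|5433|9042)\b' || echo "no yugabyte ports in use"
```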

ddorian commented 6 months ago

Can you run `yugabyted collect_logs --base_dir=...` on the node that failed to start and upload the logs here?

onyn commented 6 months ago

```
$ bin/yugabyted collect_logs --base_dir=/home/onyn/yugabyte_data
ERROR: No YugabyteDB node is running in the data_dir /home/onyn/yugabyte_data/data
For more information, check the logs in /home/onyn/yugabyte_data/logs
```
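
Since collect_logs itself refuses to run without a live node, I can archive the raw log directory by hand if that helps (path taken from the error message above):

```bash
# Fallback when `yugabyted collect_logs` won't run because no node is up:
tar czf yugabyte_logs.tar.gz -C /home/onyn/yugabyte_data logs
```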
ddorian commented 6 months ago

Maybe you need to run it inside the docker container?

onyn commented 6 months ago

> Can you repeat the same thing in normal processes?

As you suggested, I continued the experiments with a normal process, not Docker. The container is down because of `ERROR: A node is already running on 192.168.0.2, please specify a valid address.`; the normal process is down for the same reason. Why it goes down is the main question of this ticket.

ddorian commented 6 months ago

@onyn is the yugabyted process maybe still running when you get that error?

onyn commented 6 months ago

No. I triple-checked this. As I said before, the yugabyte instance on node1 refers to node2 in its error message. That's what confuses me.