onyn opened this issue 6 months ago
Hi @onyn
After upgrading, node1 failed to start with the cryptic message `ERROR: A node is already running on 192.168.0.2, please specify a valid address.`:
Which process printed this error and what command triggered it?
The command was:

```
bin/yugabyted start --callhome=false --background=false --base_dir=/home/yugabyte/yb_data --advertise_address=192.168.0.1 --ysql_enable_auth=true --use_cassandra_authentication=true --cloud_location=metal.de.rack1 --tserver_flags=ysql_pg_conf_csv={password_encryption=scram-sha-256} --join=192.168.0.2
```

So I think this is triggered by yugabyted. You may find details in the full log spoiler.
Are you using the same volume for all 3 nodes? I would expect each node to have a different volume/path.
Every yugabyte instance runs on its own bare metal server, so their data does not clash.
What did you do to "upgrade" here? Meaning, did you just restart the container?
I mean running `docker compose up -d` after specifying the new image version in the docker compose config. This command stops the running container, removes it, and runs a new container with the newer yugabyte binaries.
Can you repeat the same thing with normal processes, outside docker?
Yes, this is reproducible even if I download the binaries from downloads.yugabyte.com and run yugabyted by hand from the command line. I also found that upgrading the binaries is not necessary to trigger the error; a simple restart was enough. See the reproducer below:
Also note that the error mentions node2's IP, but I restarted only node1.
Maybe the old container is still running or something?
No. I double checked this.
Can you run `yugabyted collect_logs --base_dir=...` on the node that failed to start and upload the logs here?
```
$ bin/yugabyted collect_logs --base_dir=/home/onyn/yugabyte_data
ERROR: No YugabyteDB node is running in the data_dir /home/onyn/yugabyte_data/data
For more information, check the logs in /home/onyn/yugabyte_data/logs
```
Maybe you need to run it inside the docker container?
> Can you repeat the same thing in normal processes?
As you suggested, I continued the experiments using a normal process, not docker. The container is down because of `ERROR: A node is already running on 192.168.0.2, please specify a valid address.` The normal process is down for the same reason. Why it is down is the main question of this ticket.
@onyn is the yugabyted process maybe still running when you get that error?
No. I triple checked this. As I said before, the yugabyte instance on node1 refers to node2 in its error message. That confuses me.
Jira Link: DB-10571
Description
Playing around with yugabytedb, I created a 3-node universe with the nodes running in docker:
node1.yml

```yaml
version: '3.7'
services:
  yugabyte:
    image: yugabytedb/yugabyte:2.20.2.0-b145
    network_mode: host
    volumes:
      - /var/yugabyte:/home/yugabyte/yb_data
    command:
      - "bin/yugabyted"
      - "start"
      - "--callhome=false"
      - "--background=false"
      - "--base_dir=/home/yugabyte/yb_data"
      - "--advertise_address=192.168.0.1"
      - "--ysql_enable_auth=true"
      - "--use_cassandra_authentication=true"
      - "--cloud_location=metal.de.rack1"
      - "--tserver_flags=ysql_pg_conf_csv={password_encryption=scram-sha-256}"
      - "--join=192.168.0.2"
```

node2.yml

```yaml
version: '3.7'
services:
  yugabyte:
    image: yugabytedb/yugabyte:2.20.2.0-b145
    network_mode: host
    volumes:
      - /var/yugabyte:/home/yugabyte/yb_data
    command:
      - "bin/yugabyted"
      - "start"
      - "--callhome=false"
      - "--background=false"
      - "--base_dir=/home/yugabyte/yb_data"
      - "--advertise_address=192.168.0.2"
      - "--ysql_enable_auth=true"
      - "--use_cassandra_authentication=true"
      - "--cloud_location=metal.de.rack1"
      - "--tserver_flags=ysql_pg_conf_csv={password_encryption=scram-sha-256}"
```

node3.yml

```yaml
version: '3.7'
services:
  yugabyte:
    image: yugabytedb/yugabyte:2.20.2.0-b145
    network_mode: host
    volumes:
      - /var/yugabyte:/home/yugabyte/yb_data
    command:
      - "bin/yugabyted"
      - "start"
      - "--callhome=false"
      - "--background=false"
      - "--base_dir=/home/yugabyte/yb_data"
      - "--advertise_address=192.168.0.3"
      - "--ysql_enable_auth=true"
      - "--use_cassandra_authentication=true"
      - "--cloud_location=metal.de.rack1"
      - "--tserver_flags=ysql_pg_conf_csv={password_encryption=scram-sha-256}"
      - "--join=192.168.0.2"
```

The universe formed successfully. Several days later I noticed a new version of the yugabyte docker image (2.20.2.1-b3) and tried to upgrade the universe to it, starting from node1 (a follower):
node1.yml

```diff
 services:
   yugabyte:
-    image: yugabytedb/yugabyte:2.20.2.0-b145
+    image: yugabytedb/yugabyte:2.20.2.1-b3
     network_mode: host
     volumes:
```

After upgrading, node1 failed to start with the cryptic message `ERROR: A node is already running on 192.168.0.2, please specify a valid address.`

full log:

```
[yugabyted start] 2024-03-26 11:39:28,448 INFO: | 0.0s | Running yugabyted command: 'bin/yugabyted start --callhome=false --background=false --base_dir=/home/yugabyte/yb_data --advertise_address=192.168.0.1 --ysql_enable_auth=true --use_cassandra_authentication=true --cloud_location=metal.de.rack1 --tserver_flags=ysql_pg_conf_csv={password_encryption=scram-sha-256} --join=192.168.0.2'
[yugabyted start] 2024-03-26 11:39:28,448 INFO: | 0.0s | cmd = start using config file: /home/yugabyte/yb_data/conf/yugabyted.conf (args.config=None)
[yugabyted start] 2024-03-26 11:39:28,448 INFO: | 0.0s | Found directory /home/yugabyte/bin for file openssl_proxy.sh
[yugabyted start] 2024-03-26 11:39:28,449 INFO: | 0.0s | Found directory /home/yugabyte/bin for file yb-admin
[yugabyted start] 2024-03-26 11:39:28,451 INFO: | 0.0s | Fetching configs from join IP...
[yugabyted start] 2024-03-26 11:39:28,451 INFO: | 0.0s | Trying to get masters information from http://192.168.0.2:9000/api/v1/masters (Timeout=60)
[yugabyted start] 2024-03-26 11:39:28,458 DEBUG: | 0.0s | Tserver 192.168.0.2 returned the following master leader 192.168.0.2.
[yugabyted start] 2024-03-26 11:39:28,461 ERROR: | 0.0s | ERROR: A node is already running on 192.168.0.2, please specify a valid address.
For more information, check the logs in /home/yugabyte/yb_data/logs
```

curl http://192.168.0.2:9000/api/v1/masters
```json
{
  "master_server_and_type": [
    {
      "master_server": "192.168.0.1:7100",
      "is_leader": false
    },
    {
      "master_server": "192.168.0.2:7100",
      "is_leader": true
    },
    {
      "master_server": "192.168.0.3:7100",
      "is_leader": false
    }
  ]
}
```

Rolling back to 2.20.2.0-b145 does not fix the problem; it still fails with the same error message.
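For what it's worth, the yugabyted log shows it fetching this `/api/v1/masters` response from the `--join` address before starting, and the error names exactly the address that the response reports as leader. Below is a minimal sketch of that leader lookup, using the JSON returned above; `find_leader` is a hypothetical helper for illustration, not yugabyted's actual code:

```python
import json

# Response from `curl http://192.168.0.2:9000/api/v1/masters`, copied verbatim from above.
masters_json = """
{
  "master_server_and_type": [
    { "master_server": "192.168.0.1:7100", "is_leader": false },
    { "master_server": "192.168.0.2:7100", "is_leader": true },
    { "master_server": "192.168.0.3:7100", "is_leader": false }
  ]
}
"""

def find_leader(payload: str):
    """Return the host:port of the master marked as leader, or None."""
    data = json.loads(payload)
    for master in data.get("master_server_and_type", []):
        if master.get("is_leader"):
            return master["master_server"]
    return None

print(find_leader(masters_json))  # → 192.168.0.2:7100
```

The leader here is 192.168.0.2, which is the `--join` address and the address in the error message, matching the DEBUG line "Tserver 192.168.0.2 returned the following master leader 192.168.0.2."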
What's wrong with my setup?