xapi-project / xen-api

The Xapi Project's XenAPI Server
http://xenproject.org/developers/teams/xapi.html
Other
346 stars 284 forks source link

xapi not coming back on new master after `pool.designate_new_master`, old master loses TLS verification #6056

Open ydirson opened 2 weeks ago

ydirson commented 2 weeks ago

In this XCP-ng forum subthread, a user reports a pool breakage, which I was able to replicate as follows (not 100% repro, I got it on 2nd attempt after joining a 3rd host):

[15:19 xcp-ng-hqerhcgv ~]# for i in $(seq 3); do time xe pool-designate-new-master host-uuid=a46b65ee-6b9f-4b73-9792-7c968c0c7f20; done

real    0m24.878s
user    0m0.010s
sys 0m0.019s

real    0m4.389s
user    0m0.000s
sys 0m0.034s
Lost connection to the server.

real    0m2.931s
user    0m0.004s
sys 0m0.030s
[15:20 xcp-ng-hqerhcgv ~]# 

The Lost connection to the server. is likely key, as I saw it first when I got the issue.

Forum post has user's full logs, the following are selected from my reproduction.

daemon.log on the expected-new-master shows a failure to restart xapi successfully, looping a few times on:

Oct 14 15:20:59 xcp-ng-hqerhcgv xapi-init[1244028]: Stopping xapi: [  OK  ]
Oct 14 15:20:59 xcp-ng-hqerhcgv systemd[1]: Unit xapi.service entered failed state.
Oct 14 15:20:59 xcp-ng-hqerhcgv systemd[1]: xapi.service failed.
Oct 14 15:20:59 xcp-ng-hqerhcgv systemd[1]: xapi.service holdoff time over, scheduling restart.
Oct 14 15:20:59 xcp-ng-hqerhcgv systemd[1]: Cannot add dependency job for unit lvm2-activation.service, ignoring: Unit is masked.
Oct 14 15:20:59 xcp-ng-hqerhcgv systemd[1]: Cannot add dependency job for unit lvm2-activation-early.service, ignoring: Unit is masked.
Oct 14 15:20:59 xcp-ng-hqerhcgv systemd[1]: Starting XenAPI server (XAPI)...
Oct 14 15:20:59 xcp-ng-hqerhcgv systemd[1]: Started XenAPI server (XAPI).
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi-init[1244047]: Starting xapi:
Oct 14 15:21:00 xcp-ng-hqerhcgv systemd[1]: xapi.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Oct 14 15:21:00 xcp-ng-hqerhcgv xapi-init[1244078]: Stopping xapi: [  OK  ]
...
Oct 14 15:21:04 xcp-ng-hqerhcgv systemd[1]: start request repeated too quickly for xapi.service
Oct 14 15:21:04 xcp-ng-hqerhcgv systemd[1]: Failed to start XenAPI server (XAPI).
Oct 14 15:21:04 xcp-ng-hqerhcgv systemd[1]: Unit xapi.service entered failed state.
Oct 14 15:21:04 xcp-ng-hqerhcgv systemd[1]: xapi.service failed.
Oct 14 15:21:18 xcp-ng-hqerhcgv message-switch[641]: main: [ info|message-switch] Session xapi:1240697 cleaning up
Oct 14 15:21:30 xcp-ng-hqerhcgv message-switch[641]: main: [ info|message-switch] Session xapi:1244047 cleaning up
Oct 14 15:21:31 xcp-ng-hqerhcgv message-switch[641]: main: [ info|message-switch] Session xapi:1244092 cleaning up
Oct 14 15:21:32 xcp-ng-hqerhcgv message-switch[641]: main: [ info|message-switch] Session xapi:1244140 cleaning up
Oct 14 15:21:33 xcp-ng-hqerhcgv message-switch[641]: main: [ info|message-switch] Session xapi:1244193 cleaning up
Oct 14 15:21:34 xcp-ng-hqerhcgv message-switch[641]: main: [ info|message-switch] Session xapi:1244247 cleaning up

with xensource.log showing:

Oct 14 15:20:57 xcp-ng-hqerhcgv xapi: [debug||225 ||xapi_pool_transition] The old master has not restarted yet. Sleep for 3 seconds
Oct 14 15:20:57 xcp-ng-hqerhcgv xapi: [debug||115 HTTPS 10.1.132.1->:::80|host.request_backup D:1d75ded0bc49|xapi] About to flush database: /var/lib/xcp/state.db
Oct 14 15:20:57 xcp-ng-hqerhcgv xapi: [ info||115 HTTPS 10.1.132.1->:::80|host.request_backup D:1d75ded0bc49|redo_debug] Flushing database to all active redo-logs
Oct 14 15:20:57 xcp-ng-hqerhcgv xapi: [debug||115 HTTPS 10.1.132.1->:::80|host.request_backup D:1d75ded0bc49|sql] XML backend [/var/lib/xcp/state.db] -- Write buffer flushed. Time: 0.093666
Oct 14 15:20:57 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|xapi_pool_transition] attempting manual two-phase commit of new master. My address = 10.1.132.2; 
peer addresses = [ 10.1.132.1; 10.1.132.3 ]
Oct 14 15:20:57 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|xapi_pool_transition] Phase 1: proposing myself as new master
Oct 14 15:20:57 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|xapi_pool_transition] Proposing myself as a new master to host address: 10.1.132.1
Oct 14 15:20:57 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|stunnel] client cert verification 10.1.132.1:443: SNI=pool path=/etc/stunnel/xapi-pool-ca-bundle.
pem
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|stunnel] stunnel start
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|stunnel] Started a client (pid:1243969): -> 10.1.132.1:443
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|xmlrpc_client] stunnel pid: 1243969 (not cached) connected to 10.1.132.1:443
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|stunnel] client cert verification 10.1.132.1:443: SNI=pool path=/etc/stunnel/xapi-pool-ca-bundle.
pem
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|stunnel] stunnel start
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|stunnel] Started a client (pid:1243973): -> 10.1.132.1:443
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|xmlrpc_client] stunnel pid: 1243973 (not cached) connected to 10.1.132.1:443
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|stunnel] client cert verification 10.1.132.1:443: SNI=pool path=/etc/stunnel/xapi-pool-ca-bundle.pem
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|stunnel] stunnel start
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|stunnel] Started a client (pid:1243977): -> 10.1.132.1:443
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|xmlrpc_client] stunnel pid: 1243977 (not cached) connected to 10.1.132.1:443
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|xapi_pool_transition] Proposing myself as a new master to host address: 10.1.132.3
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|stunnel] client cert verification 10.1.132.3:443: SNI=pool path=/etc/stunnel/xapi-pool-ca-bundle.pem
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|stunnel] stunnel start
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|stunnel] Started a client (pid:1243981): -> 10.1.132.3:443
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|xmlrpc_client] stunnel pid: 1243981 (not cached) connected to 10.1.132.3:443
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|stunnel] client cert verification 10.1.132.3:443: SNI=pool path=/etc/stunnel/xapi-pool-ca-bundle.pem
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|stunnel] stunnel start
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|stunnel] Started a client (pid:1243985): -> 10.1.132.3:443
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|xmlrpc_client] stunnel pid: 1243985 (not cached) connected to 10.1.132.3:443
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|stunnel] client cert verification 10.1.132.3:443: SNI=pool path=/etc/stunnel/xapi-pool-ca-bundle.pem
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|stunnel] stunnel start
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|stunnel] Started a client (pid:1243989): -> 10.1.132.3:443
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|xmlrpc_client] stunnel pid: 1243989 (not cached) connected to 10.1.132.3:443
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|xapi_pool_transition] Proposing myself as a new master to host address: 10.1.132.2
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|stunnel] client cert verification 10.1.132.2:443: SNI=pool path=/etc/stunnel/xapi-pool-ca-bundle.pem
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|stunnel] stunnel start
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|stunnel] Started a client (pid:1243993): -> 10.1.132.2:443
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|xmlrpc_client] stunnel pid: 1244005 (not cached) connected to 10.1.132.2:443
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|xapi_pool_transition] No-one objected to the proposal. Any errors from here on will result in thi
s node entering the 'broken' state
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|xapi_pool_transition] Phase 2: committing transaction
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|xapi_pool_transition] Phase 2.1: telling everyone but me to commit
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|xapi_pool_transition] Signalling commit to host address: 10.1.132.1
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|stunnel] client cert verification 10.1.132.1:443: SNI=pool path=/etc/stunnel/xapi-pool-ca-bundle.pem
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|stunnel] stunnel start
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|stunnel] Started a client (pid:1244011): -> 10.1.132.1:443
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|xmlrpc_client] stunnel pid: 1244011 (not cached) connected to 10.1.132.1:443
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|stunnel] client cert verification 10.1.132.1:443: SNI=pool path=/etc/stunnel/xapi-pool-ca-bundle.pem
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|stunnel] stunnel start
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|stunnel] Started a client (pid:1244015): -> 10.1.132.1:443
Oct 14 15:20:58 xcp-ng-hqerhcgv xapi: [debug||79 HTTPS 10.1.132.1->:::80|pool.designate_new_master R:f29c1d69d97b|xmlrpc_client] stunnel pid: 1244015 (not cached) connected to 10.1.132.1:443
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [ warn||180 |watching networks for NBD-related changes D:5228a098d303|xmlrpc_client] stunnel pid: 1242711 caught Http_client.Http_request_rejected("{ frame = false; method = POST; uri = /; query = [  ]; content_length = [  ]; transfer encoding = ; version = 1.1; cookie = [ (value filtered) ]; task = ; subtask_of = DummyRef:|5228a098-d303-4215-d007-1262b74182b6|watching networks for NBD-related changes; content-type = ; host = ; user_agent = xen-api-libs/24.19.2; traceparent =  }")
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [ warn||179 |timeboxed_rpc D:12ea59e27c8b|xmlrpc_client] stunnel pid: 1241311 caught Http_client.Http_request_rejected("{ frame = false; method = POST; uri = /; query = [  ]; content_length = [  ]; transfer encoding = ; version = 1.1; cookie = [ (value filtered) ]; task = ; subtask_of = DummyRef:|221fabee-3cce-498d-5d77-1a2a0ddc0e41|xapi events; content-type = ; host = ; user_agent = xen-api-libs/24.19.2; traceparent =  }")
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [debug||180 |watching networks for NBD-related changes D:5228a098d303|stunnel] check_reusable: caught exception Http_client.Http_request_rejected("{ frame = false; method = POST; uri = /; query = [  ]; content_length = [  ]; transfer encoding = ; version = 1.1; cookie = [ (value filtered) ]; task = ; subtask_of = ; content-type = ; host = ; user_agent = xen-api-libs/24.19.2; traceparent =  }"); assuming not reusable
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [debug||180 |watching networks for NBD-related changes D:5228a098d303|stunnel] get_reusable_stunnel: Found non-reusable stunnel in the cache. disconnecting from 10.1.132.1:443
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [error||179 |xapi events D:221fabee3cce|backtrace] timeboxed_rpc D:12ea59e27c8b failed with exception Http_client.Http_request_rejected("{ frame = false; method = POST; uri = /; query = [  ]; content_length = [  ]; transfer encoding = ; version = 1.1; cookie = [ (value filtered) ]; task = ; subtask_of = DummyRef:|221fabee-3cce-498d-5d77-1a2a0ddc0e41|xapi events; content-type = ; host = ; user_agent = xen-api-libs/24.19.2; traceparent =  }")
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [error||179 |xapi events D:221fabee3cce|backtrace] Raised Http_client.Http_request_rejected("{ frame = false; method = POST; uri = /; query = [  ]; content_length = [  ]; transfer encoding = ; version = 1.1; cookie = [ (value filtered) ]; task = ; subtask_of = DummyRef:|221fabee-3cce-498d-5d77-1a2a0ddc0e41|xapi events; content-type = ; host = ; user_agent = xen-api-libs/24.19.2; traceparent =  }")
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [error||179 |xapi events D:221fabee3cce|backtrace] 1/20 xapi Raised at file ocaml/libs/http-lib/http_client.ml, line 199
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [error||179 |xapi events D:221fabee3cce|backtrace] 2/20 xapi Called from file ocaml/libs/http-lib/http_client.ml, line 219
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [error||179 |xapi events D:221fabee3cce|backtrace] 3/20 xapi Called from file ocaml/libs/http-lib/xmlrpc_client.ml, line 384
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [error||179 |xapi events D:221fabee3cce|backtrace] 4/20 xapi Called from file ocaml/libs/http-lib/xmlrpc_client.ml, line 365
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [error||179 |xapi events D:221fabee3cce|backtrace] 5/20 xapi Called from file ocaml/libs/http-lib/xmlrpc_client.ml, line 371
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [error||179 |xapi events D:221fabee3cce|backtrace] 6/20 xapi Called from file ocaml/libs/xapi-stdext/lib/xapi-stdext-pervasives/pervasiveext.ml, line 24
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [error||179 |xapi events D:221fabee3cce|backtrace] 7/20 xapi Called from file ocaml/libs/xapi-stdext/lib/xapi-stdext-pervasives/pervasiveext.ml, line 39
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [error||179 |xapi events D:221fabee3cce|backtrace] 8/20 xapi Called from file ocaml/libs/xapi-stdext/lib/xapi-stdext-pervasives/pervasiveext.ml, line 24
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [error||179 |xapi events D:221fabee3cce|backtrace] 9/20 xapi Called from file ocaml/libs/xapi-stdext/lib/xapi-stdext-pervasives/pervasiveext.ml, line 39
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [error||179 |xapi events D:221fabee3cce|backtrace] 10/20 xapi Called from file ocaml/libs/http-lib/xmlrpc_client.ml, line 194
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [error||179 |xapi events D:221fabee3cce|backtrace] 11/20 xapi Called from file ocaml/libs/xapi-stdext/lib/xapi-stdext-pervasives/pervasiveext.ml, line 24
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [error||179 |xapi events D:221fabee3cce|backtrace] 12/20 xapi Called from file ocaml/libs/xapi-stdext/lib/xapi-stdext-pervasives/pervasiveext.ml, line 39
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [error||179 |xapi events D:221fabee3cce|backtrace] 13/20 xapi Called from file option.ml, line 24
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [error||179 |xapi events D:221fabee3cce|backtrace] 14/20 xapi Called from file ocaml/libs/http-lib/xmlrpc_client.ml, line 192
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [error||179 |xapi events D:221fabee3cce|backtrace] 15/20 xapi Called from file ocaml/xapi/helpers.ml, line 449
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [error||179 |xapi events D:221fabee3cce|backtrace] 16/20 xapi Called from file ocaml/xapi/server_helpers.ml, line 67
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [error||179 |xapi events D:221fabee3cce|backtrace] 17/20 xapi Called from file ocaml/xapi/server_helpers.ml, line 94
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [error||179 |xapi events D:221fabee3cce|backtrace] 18/20 xapi Called from file ocaml/libs/xapi-stdext/lib/xapi-stdext-pervasives/pervasiveext.ml, line 24
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [error||179 |xapi events D:221fabee3cce|backtrace] 19/20 xapi Called from file ocaml/libs/xapi-stdext/lib/xapi-stdext-pervasives/pervasiveext.ml, line 39
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [error||179 |xapi events D:221fabee3cce|backtrace] 20/20 xapi Called from file ocaml/libs/log/debug.ml, line 250
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [error||179 |xapi events D:221fabee3cce|backtrace]
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [error||179 |xapi events D:221fabee3cce|backtrace] Raised Http_client.Http_request_rejected("{ frame = false; method = POST; uri = /; query = [  ]; content_length = [  ]; transfer encoding = ; version = 1.1; cookie = [ (value filtered) ]; task = ; subtask_of = DummyRef:|221fabee-3cce-498d-5d77-1a2a0ddc0e41|xapi events; content-type = ; host = ; user_agent = xen-api-libs/24.19.2; traceparent =  }")
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [error||179 |xapi events D:221fabee3cce|backtrace] 1/6 xapi Raised at file ocaml/libs/log/debug.ml, line 267
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [error||179 |xapi events D:221fabee3cce|backtrace] 2/6 xapi Called from file ocaml/libs/xapi-stdext/lib/xapi-stdext-pervasives/pervasiveext.ml, line 24
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [error||179 |xapi events D:221fabee3cce|backtrace] 3/6 xapi Called from file ocaml/libs/xapi-stdext/lib/xapi-stdext-pervasives/pervasiveext.ml, line 39
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [error||179 |xapi events D:221fabee3cce|backtrace] 4/6 xapi Called from file ocaml/xapi-client/client.ml, line 14
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [error||179 |xapi events D:221fabee3cce|backtrace] 5/6 xapi Called from file ocaml/xapi-client/client.ml, line 3804
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [error||179 |xapi events D:221fabee3cce|backtrace] 6/6 xapi Called from file ocaml/xapi/xapi_xenops.ml, line 3290
Oct 14 15:20:59 xcp-ng-hqerhcgv xapi: [error||179 |xapi events D:221fabee3cce|backtrace]

Meanwhile, the old-master while restarting xapi reveals a strange situation, where it has lost its tls-enablement status:

Oct 14 15:20:45 xcpng83-bzkcpvhy xsh: [ warn||0 ||xsh] TLS verification is disabled on this host: /var/xapi/verify-certificates is absent
Oct 14 15:20:45 xcpng83-bzkcpvhy xsh: [ info||0 ||Stunnel_client] disabling default tls verification
Oct 14 15:20:45 xcpng83-bzkcpvhy xsh: [debug||0 ||stunnel] client cert verification 10.1.132.2:443: None
Oct 14 15:20:45 xcpng83-bzkcpvhy xsh: [debug||0 ||stunnel] Started a client (pid:1124004): -> 10.1.132.2:443

Digging the log reveals that this first occurred last friday, when I was (without success) trying to reproduce the issue with only 2 hosts in the pool.

ydirson commented 2 weeks ago

Checked on a brand new 8.2.1 pool, similar but not identical behavior: from the master I can launch a quick sequence of designate-new-master without even seeing a second run being logged (where on 8.3 I was getting a second run):

[18:11 host2 ~]# grep -i designate.*phase /var/log/xensource.log 
Oct 16 18:09:29 host2 xapi: [debug||207 HTTPS 10.1.132.31->:::80|pool.designate_new_master R:6cc538e4abe2|xapi_pool_transition] attempting manual two-phase commit of new master. My address = 10.1.132.32; peer addresses = [ 10.1.132.33; 10.1.132.31 ]
Oct 16 18:09:29 host2 xapi: [debug||207 HTTPS 10.1.132.31->:::80|pool.designate_new_master R:6cc538e4abe2|xapi_pool_transition] Phase 1: proposing myself as new master
Oct 16 18:09:30 host2 xapi: [debug||207 HTTPS 10.1.132.31->:::80|pool.designate_new_master R:6cc538e4abe2|xapi_pool_transition] Phase 2: committing transaction
Oct 16 18:09:30 host2 xapi: [debug||207 HTTPS 10.1.132.31->:::80|pool.designate_new_master R:6cc538e4abe2|xapi_pool_transition] Phase 2.1: telling everyone but me to commit
Oct 16 18:09:42 host2 xapi: [debug||207 HTTPS 10.1.132.31->:::80|pool.designate_new_master R:6cc538e4abe2|xapi_pool_transition] Phase 2.2: setting flag to make us restart when the connection to the master dies
[18:12 host2 ~]# 

And running a new sequence from the same machine (now a slave) to switch it back to master, I get the same result:

[18:12 host1 ~]# time xe pool-designate-new-master host-uuid=7978ca38-0c46-47f2-845b-1f0038c69fd9 ; time xe pool-designate-new-master host-uuid=7978ca38-0c46-47f2-845b-1f0038c69fd9 ; time xe pool-designate-new-master host-uuid=7978ca38-0c46-47f2-845b-1f0038c69fd9

real    0m11.634s
user    0m0.004s
sys 0m0.028s
Lost connection to the server.

real    0m3.486s
user    0m0.012s
sys 0m0.035s
Lost connection to the server.

real    0m0.144s
user    0m0.001s
sys 0m0.028s
[18:12 host1 ~]# grep -i designate.*phase /var/log/xensource.log 
Oct 16 18:12:46 host1 xapi: [debug||61 HTTPS 10.1.132.32->:::80|pool.designate_new_master R:5a0d04895229|xapi_pool_transition] attempting manual two-phase commit of new master. My address = 10.1.132.31; peer addresses = [ 10.1.132.32; 10.1.132.33 ]
Oct 16 18:12:46 host1 xapi: [debug||61 HTTPS 10.1.132.32->:::80|pool.designate_new_master R:5a0d04895229|xapi_pool_transition] Phase 1: proposing myself as new master
Oct 16 18:12:47 host1 xapi: [debug||61 HTTPS 10.1.132.32->:::80|pool.designate_new_master R:5a0d04895229|xapi_pool_transition] Phase 2: committing transaction
Oct 16 18:12:47 host1 xapi: [debug||61 HTTPS 10.1.132.32->:::80|pool.designate_new_master R:5a0d04895229|xapi_pool_transition] Phase 2.1: telling everyone but me to commit
Oct 16 18:12:56 host1 xapi: [debug||61 HTTPS 10.1.132.32->:::80|pool.designate_new_master R:5a0d04895229|xapi_pool_transition] Phase 2.2: setting flag to make us restart when the connection to the master dies
Oct 16 18:12:57 host1 xapi: [debug||122 HTTPS 10.1.132.32->:::80|pool.designate_new_master R:aafe3f5d52a5|xapi_pool_transition] attempting manual two-phase commit of new master. My address = 10.1.132.31; peer addresses = [ 10.1.132.32; 10.1.132.33 ]
Oct 16 18:12:57 host1 xapi: [debug||122 HTTPS 10.1.132.32->:::80|pool.designate_new_master R:aafe3f5d52a5|xapi_pool_transition] Phase 1: proposing myself as new master
Oct 16 18:12:59 host1 xapi: [debug||122 HTTPS 10.1.132.32->:::80|pool.designate_new_master R:aafe3f5d52a5|xapi_pool_transition] Phase 2: committing transaction
Oct 16 18:12:59 host1 xapi: [debug||122 HTTPS 10.1.132.32->:::80|pool.designate_new_master R:aafe3f5d52a5|xapi_pool_transition] Phase 2.1: telling everyone but me to commit
[18:16 host1 ~]# pidof xapi
[18:16 host1 ~]# 

daemon.log:

Oct 16 18:13:06 host1 systemd[1]: Starting XenAPI server (XAPI)...
Oct 16 18:13:06 host1 systemd[1]: Started XenAPI server (XAPI).
Oct 16 18:13:06 host1 xapi-init[12845]: Starting xapi:
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.863Z||584|About to bind to /var/run/nonpersistent/forkexecd//fd_34399add-58e8-4ee5-9fe2-b253d584dc3e\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.863Z||584|bound, listening\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.874Z||12885|Child here!\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.890Z||12886|Grandchild here!\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.890Z||12886|Started: state.cmdargs = [/usr/bin/systemctl;restart;stunnel@xapi]\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.890Z||12886|Started: state.env = [PATH=/sbin:/usr/sbin:/bin:/usr/bin]\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.891Z||12886|Selecting in handle_comms_no_fd_sock2\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.891Z||12886|Done\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.891Z||12886|fd sock\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.891Z||12886|Selecting in handle_comms_with_fd_sock2\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.891Z||12886|Done\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.892Z||12886|fd sock2\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.892Z||12886|Received fd named: 7db44ef4-5f1c-49cc-a1fa-21690803867a - duping to 1 (from 10)\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.892Z||12886|Selecting in handle_comms_with_fd_sock2\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.892Z||12886|Done\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.892Z||12886|fd sock2\x0A
Oct 16 18:13:06 host1 systemd[1]: xapi.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.892Z||12886|Received fd named: 42d68842-3582-455e-9e2b-eaedd538ecea - duping to 2 (from 10)\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.892Z||12886|Selecting in handle_comms_with_fd_sock2\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.892Z||12886|Done\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.892Z||12886|comms sock\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.893Z||12886|Exec\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.893Z||12886|Finished...\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.893Z||12886|Args after replacement = [/usr/bin/systemctl;restart;stunnel@xapi]\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.893Z||12886|I've received the following fds: [2;1]\x0A\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.930Z||12886|Caught unexpected exception: End_of_file\x0A
Oct 16 18:13:06 host1 systemd[1]: Cannot add dependency job for unit lvm2-activation.service, ignoring: Unit is masked.
Oct 16 18:13:06 host1 systemd[1]: Cannot add dependency job for unit lvm2-activation-early.service, ignoring: Unit is masked.
Oct 16 18:13:06 host1 systemd[1]: Stopping TLS tunnel for xapi...
Oct 16 18:13:06 host1 systemd[1]: Starting TLS tunnel for xapi...
Oct 16 18:13:07 host1 systemd[1]: Started TLS tunnel for xapi.
Oct 16 18:13:07 host1 xapi-init[12891]: Stopping xapi: [  OK  ]
Oct 16 18:13:07 host1 systemd[1]: Unit xapi.service entered failed state.
Oct 16 18:13:07 host1 systemd[1]: xapi.service failed.