Open ydirson opened 1 month ago
Checked on a brand-new 8.2.1 pool: similar but not identical behavior. From the master I can launch a quick sequence of designate-new-master calls
without a second run even being logged (whereas on 8.3 I was getting a second run):
[18:11 host2 ~]# grep -i designate.*phase /var/log/xensource.log
Oct 16 18:09:29 host2 xapi: [debug||207 HTTPS 10.1.132.31->:::80|pool.designate_new_master R:6cc538e4abe2|xapi_pool_transition] attempting manual two-phase commit of new master. My address = 10.1.132.32; peer addresses = [ 10.1.132.33; 10.1.132.31 ]
Oct 16 18:09:29 host2 xapi: [debug||207 HTTPS 10.1.132.31->:::80|pool.designate_new_master R:6cc538e4abe2|xapi_pool_transition] Phase 1: proposing myself as new master
Oct 16 18:09:30 host2 xapi: [debug||207 HTTPS 10.1.132.31->:::80|pool.designate_new_master R:6cc538e4abe2|xapi_pool_transition] Phase 2: committing transaction
Oct 16 18:09:30 host2 xapi: [debug||207 HTTPS 10.1.132.31->:::80|pool.designate_new_master R:6cc538e4abe2|xapi_pool_transition] Phase 2.1: telling everyone but me to commit
Oct 16 18:09:42 host2 xapi: [debug||207 HTTPS 10.1.132.31->:::80|pool.designate_new_master R:6cc538e4abe2|xapi_pool_transition] Phase 2.2: setting flag to make us restart when the connection to the master dies
[18:12 host2 ~]#
And running a new sequence from the same machine (now a slave) to switch it back to master, I get the same result:
[18:12 host1 ~]# time xe pool-designate-new-master host-uuid=7978ca38-0c46-47f2-845b-1f0038c69fd9 ; time xe pool-designate-new-master host-uuid=7978ca38-0c46-47f2-845b-1f0038c69fd9 ; time xe pool-designate-new-master host-uuid=7978ca38-0c46-47f2-845b-1f0038c69fd9
real 0m11.634s
user 0m0.004s
sys 0m0.028s
Lost connection to the server.
real 0m3.486s
user 0m0.012s
sys 0m0.035s
Lost connection to the server.
real 0m0.144s
user 0m0.001s
sys 0m0.028s
[18:12 host1 ~]# grep -i designate.*phase /var/log/xensource.log
Oct 16 18:12:46 host1 xapi: [debug||61 HTTPS 10.1.132.32->:::80|pool.designate_new_master R:5a0d04895229|xapi_pool_transition] attempting manual two-phase commit of new master. My address = 10.1.132.31; peer addresses = [ 10.1.132.32; 10.1.132.33 ]
Oct 16 18:12:46 host1 xapi: [debug||61 HTTPS 10.1.132.32->:::80|pool.designate_new_master R:5a0d04895229|xapi_pool_transition] Phase 1: proposing myself as new master
Oct 16 18:12:47 host1 xapi: [debug||61 HTTPS 10.1.132.32->:::80|pool.designate_new_master R:5a0d04895229|xapi_pool_transition] Phase 2: committing transaction
Oct 16 18:12:47 host1 xapi: [debug||61 HTTPS 10.1.132.32->:::80|pool.designate_new_master R:5a0d04895229|xapi_pool_transition] Phase 2.1: telling everyone but me to commit
Oct 16 18:12:56 host1 xapi: [debug||61 HTTPS 10.1.132.32->:::80|pool.designate_new_master R:5a0d04895229|xapi_pool_transition] Phase 2.2: setting flag to make us restart when the connection to the master dies
Oct 16 18:12:57 host1 xapi: [debug||122 HTTPS 10.1.132.32->:::80|pool.designate_new_master R:aafe3f5d52a5|xapi_pool_transition] attempting manual two-phase commit of new master. My address = 10.1.132.31; peer addresses = [ 10.1.132.32; 10.1.132.33 ]
Oct 16 18:12:57 host1 xapi: [debug||122 HTTPS 10.1.132.32->:::80|pool.designate_new_master R:aafe3f5d52a5|xapi_pool_transition] Phase 1: proposing myself as new master
Oct 16 18:12:59 host1 xapi: [debug||122 HTTPS 10.1.132.32->:::80|pool.designate_new_master R:aafe3f5d52a5|xapi_pool_transition] Phase 2: committing transaction
Oct 16 18:12:59 host1 xapi: [debug||122 HTTPS 10.1.132.32->:::80|pool.designate_new_master R:aafe3f5d52a5|xapi_pool_transition] Phase 2.1: telling everyone but me to commit
[18:16 host1 ~]# pidof xapi
[18:16 host1 ~]#
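In the host1 excerpt above, the second run (R:aafe3f5d52a5) logs Phase 2.1 but never reaches Phase 2.2, unlike the first one. As a rough aid for spotting such runs in a longer log, here is a minimal sketch that lists the task refs of pool.designate_new_master runs that never logged "Phase 2.2". The function name incomplete_designate_runs is made up for illustration, and the matching assumes the log line format shown in this comment; a run that is still in progress, or whose later lines were rotated out, would also be listed.

```shell
# Hypothetical helper, not part of xapi: print task refs (R:...) of
# pool.designate_new_master runs that never logged "Phase 2.2".
# Reads the files given as arguments, or stdin if none are given.
incomplete_designate_runs() {
    awk '
        /pool\.designate_new_master/ && match($0, /R:[0-9a-f]+/) {
            ref = substr($0, RSTART, RLENGTH)
            # remember each task ref once, in order of first appearance
            if (!(ref in seen)) { seen[ref] = 1; order[++n] = ref }
            # mark the run as complete once Phase 2.2 shows up
            if ($0 ~ /Phase 2\.2:/) complete[ref] = 1
        }
        END {
            for (i = 1; i <= n; i++)
                if (!(order[i] in complete)) print order[i]
        }
    ' "$@"
}
```

It could be run as `incomplete_designate_runs /var/log/xensource.log` on each host after a reproduction attempt.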
daemon.log:
Oct 16 18:13:06 host1 systemd[1]: Starting XenAPI server (XAPI)...
Oct 16 18:13:06 host1 systemd[1]: Started XenAPI server (XAPI).
Oct 16 18:13:06 host1 xapi-init[12845]: Starting xapi:
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.863Z||584|About to bind to /var/run/nonpersistent/forkexecd//fd_34399add-58e8-4ee5-9fe2-b253d584dc3e\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.863Z||584|bound, listening\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.874Z||12885|Child here!\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.890Z||12886|Grandchild here!\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.890Z||12886|Started: state.cmdargs = [/usr/bin/systemctl;restart;stunnel@xapi]\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.890Z||12886|Started: state.env = [PATH=/sbin:/usr/sbin:/bin:/usr/bin]\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.891Z||12886|Selecting in handle_comms_no_fd_sock2\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.891Z||12886|Done\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.891Z||12886|fd sock\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.891Z||12886|Selecting in handle_comms_with_fd_sock2\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.891Z||12886|Done\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.892Z||12886|fd sock2\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.892Z||12886|Received fd named: 7db44ef4-5f1c-49cc-a1fa-21690803867a - duping to 1 (from 10)\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.892Z||12886|Selecting in handle_comms_with_fd_sock2\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.892Z||12886|Done\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.892Z||12886|fd sock2\x0A
Oct 16 18:13:06 host1 systemd[1]: xapi.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.892Z||12886|Received fd named: 42d68842-3582-455e-9e2b-eaedd538ecea - duping to 2 (from 10)\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.892Z||12886|Selecting in handle_comms_with_fd_sock2\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.892Z||12886|Done\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.892Z||12886|comms sock\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.893Z||12886|Exec\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.893Z||12886|Finished...\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.893Z||12886|Args after replacement = [/usr/bin/systemctl;restart;stunnel@xapi]\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.893Z||12886|I've received the following fds: [2;1]\x0A\x0A
Oct 16 18:13:06 host1 forkexecd: [ warn||0 ||forkexecd] 20241016T16:13:06.930Z||12886|Caught unexpected exception: End_of_file\x0A
Oct 16 18:13:06 host1 systemd[1]: Cannot add dependency job for unit lvm2-activation.service, ignoring: Unit is masked.
Oct 16 18:13:06 host1 systemd[1]: Cannot add dependency job for unit lvm2-activation-early.service, ignoring: Unit is masked.
Oct 16 18:13:06 host1 systemd[1]: Stopping TLS tunnel for xapi...
Oct 16 18:13:06 host1 systemd[1]: Starting TLS tunnel for xapi...
Oct 16 18:13:07 host1 systemd[1]: Started TLS tunnel for xapi.
Oct 16 18:13:07 host1 xapi-init[12891]: Stopping xapi: [ OK ]
Oct 16 18:13:07 host1 systemd[1]: Unit xapi.service entered failed state.
Oct 16 18:13:07 host1 systemd[1]: xapi.service failed.
In this XCP-ng forum subthread, a user reports a pool breakage, which I was able to replicate as follows (not a 100% repro; I got it on the 2nd attempt after joining a 3rd host):
run xe pool-designate-new-master against the same host-uuid several times, rapidly enough that it runs after DESIGNATE_NEW_MASTER_IN_PROGRESS stops being returned but apparently before xapi realizes that the host is already the master.
The "Lost connection to the server." message is likely key, as I first saw it when I got the issue.
Forum post has the user's full logs; the following are selected from my reproduction.
daemon.log on the expected-new-master shows a failure to restart xapi successfully, looping a few times on:
with xensource.log showing:
Meanwhile, the old-master, while restarting xapi, reveals a strange situation, where it has lost its tls-enablement status:
Digging the log reveals that this first occurred last Friday, when I was (without success) trying to reproduce the issue with only 2 hosts in the pool.