xapi-project / xapi-nbd

Expose XenServer disks conveniently over NBD
Other
4 stars 11 forks source link

Fails to communicate with xapi at startup #19

Closed gaborigloi closed 7 years ago

gaborigloi commented 7 years ago

From https://github.com/xapi-project/xapi-nbd/pull/18 :

When I rebooted the machine, I got ECONNREFUSED, probably when xapi-nbd tried to connect to xapi's Unix domain socket:

Sep 28 18:30:33 localhost xapi-nbd[2843]: Caught: Unix.Unix_error(Unix.ECONNREFUSED, "connect", "")
Sep 28 18:30:33 localhost xapi-nbd[2843]: main: Caught unexpected exception: Unix.Unix_error(Unix.ECONNREFUSED, "connect", "")
Sep 28 18:30:33 localhost xapi-nbd[2843]: xapi-nbd: internal error, uncaught exception:
Sep 28 18:30:33 localhost xapi-nbd[2843]: Unix.Unix_error(Unix.ECONNREFUSED, "connect", "")
Sep 28 18:30:33 localhost xapi-nbd[2843]: Raised at file "src/core/lwt.ml", line 805, characters 16-23
Sep 28 18:30:33 localhost xapi-nbd[2843]: Called from file "src/unix/lwt_main.ml", line 34, characters 8-18
Sep 28 18:30:33 localhost xapi-nbd[2843]: Called from file "src/main.ml", line 135, characters 2-181
Sep 28 18:30:33 localhost xapi-nbd[2843]: Called from file "src/cmdliner_term.ml", line 27, characters 19-24
Sep 28 18:30:33 localhost xapi-nbd[2843]: Called from file "src/cmdliner.ml", line 27, characters 27-34
Sep 28 18:30:33 localhost xapi-nbd[2843]: Called from file "src/cmdliner.ml", line 106, characters 32-39
Sep 28 18:30:33 localhost systemd[1]: xapi-nbd.service: main process exited, code=exited, status=1/FAILURE
Sep 28 18:30:33 localhost systemd[1]: Unit xapi-nbd.service entered failed state.
Sep 28 18:30:33 localhost systemd[1]: xapi-nbd.service failed.
Sep 28 18:30:34 localhost systemd[1]: xapi-nbd.service holdoff time over, scheduling restart.

When I started it manually later, it worked, and it cleaned up the leaked VDIs properly. This issue happens consistently at startup: the NBD server will try to make XenAPI calls via the Unix domain socket and will fail.

Even if we make the server path-activated by both the unix domain socket of xapi and the .pem file, there are still transient errors:

Sep 28 21:42:41 localhost xapi-nbd[2605]: Caught: Unix.Unix_error(Unix.ECONNREFUSED, "connect", "")
Sep 28 21:42:41 localhost xapi-nbd[2605]: main: Caught unexpected exception: Unix.Unix_error(Unix.ECONNREFUSED, "connect", "")
Sep 28 21:42:41 localhost xapi-nbd[2605]: xapi-nbd: internal error, uncaught exception:
Sep 28 21:42:41 localhost xapi-nbd[2605]: Unix.Unix_error(Unix.ECONNREFUSED, "connect", "")
Sep 28 21:42:41 localhost xapi-nbd[2605]: Raised at file "src/core/lwt.ml", line 805, characters 16-23
Sep 28 21:42:41 localhost xapi-nbd[2605]: Called from file "src/unix/lwt_main.ml", line 34, characters 8-18
Sep 28 21:42:41 localhost xapi-nbd[2605]: Called from file "src/main.ml", line 135, characters 2-181
Sep 28 21:42:41 localhost xapi-nbd[2605]: Called from file "src/cmdliner_term.ml", line 27, characters 19-24
Sep 28 21:42:41 localhost xapi-nbd[2605]: Called from file "src/cmdliner.ml", line 27, characters 27-34
Sep 28 21:42:41 localhost xapi-nbd[2605]: Called from file "src/cmdliner.ml", line 106, characters 32-39
Sep 28 21:42:41 localhost systemd[1]: xapi-nbd.service: main process exited, code=exited, status=1/FAILURE
Sep 28 21:42:41 localhost systemd[1]: Unit xapi-nbd.service entered failed state.
Sep 28 21:42:41 localhost systemd[1]: xapi-nbd.service failed.
Sep 28 21:42:41 localhost systemd[1]: xapi-nbd.service holdoff time over, scheduling restart.
Sep 28 21:42:41 localhost systemd[1]: Cannot add dependency job for unit lvm2-activation.service, ignoring: Unit lvm2-activation.service is masked.
Sep 28 21:42:41 localhost systemd[1]: Cannot add dependency job for unit lvm2-activation-early.service, ignoring: Unit lvm2-activation-early.service is
masked.
Sep 28 21:42:41 localhost systemd[1]: start request repeated too quickly for xapi-nbd.service
Sep 28 21:42:41 localhost systemd[1]: Failed to start NBD server that exposes XenServer disks.
Sep 28 21:42:41 localhost systemd[1]: Unit xapi-nbd.service entered failed state.
Sep 28 21:42:41 localhost systemd[1]: xapi-nbd.service failed.
Sep 28 21:42:41 localhost ntpd[2475]: 0.0.0.0 c615 05 clock_sync
Sep 28 21:42:45 localhost systemd[1]: Cannot add dependency job for unit lvm2-activation.service, ignoring: Unit lvm2-activation.service is masked.
Sep 28 21:42:45 localhost systemd[1]: Cannot add dependency job for unit lvm2-activation-early.service, ignoring: Unit lvm2-activation-early.service is
masked.
Sep 28 21:42:45 localhost systemd[1]: Started NBD server that exposes XenServer disks.
Sep 28 21:42:45 localhost systemd[1]: Starting NBD server that exposes XenServer disks...
Sep 28 21:42:45 localhost xapi-nbd[2778]: main: Starting xapi-nbd: port = '10809'; certfile = '/etc/xensource/xapi-ssl.pem'; ciphersuites = '!EXPORT:RSA
+AES128-SHA256' no_tls = 'false'
Sep 28 21:42:46 localhost xapi-nbd[2778]: main: Caught unexpected exception: Server_error(INTERNAL_ERROR, [ missing table; host;  ])
Sep 28 21:42:46 localhost xapi-nbd[2778]: xapi-nbd: internal error, uncaught exception:
Sep 28 21:42:46 localhost xapi-nbd[2778]: Server_error(INTERNAL_ERROR, [ missing table; host;  ])
Sep 28 21:42:46 localhost xapi-nbd[2778]: Raised at file "src/core/lwt.ml", line 805, characters 16-23
Sep 28 21:42:46 localhost xapi-nbd[2778]: Called from file "src/unix/lwt_main.ml", line 34, characters 8-18
Sep 28 21:42:46 localhost xapi-nbd[2778]: Called from file "src/main.ml", line 135, characters 2-181
Sep 28 21:42:46 localhost xapi-nbd[2778]: Called from file "src/cmdliner_term.ml", line 27, characters 19-24
Sep 28 21:42:46 localhost xapi-nbd[2778]: Called from file "src/cmdliner.ml", line 27, characters 27-34
Sep 28 21:42:46 localhost xapi-nbd[2778]: Called from file "src/cmdliner.ml", line 106, characters 32-39
Sep 28 21:42:46 localhost systemd[1]: xapi-nbd.service: main process exited, code=exited, status=1/FAILURE
Sep 28 21:42:46 localhost systemd[1]: Unit xapi-nbd.service entered failed state.
Sep 28 21:42:46 localhost systemd[1]: xapi-nbd.service failed.
Sep 28 21:42:46 localhost systemd[1]: xapi-nbd.service holdoff time over, scheduling restart.
Sep 28 21:42:46 localhost systemd[1]: Started NBD server that exposes XenServer disks.
Sep 28 21:42:46 localhost systemd[1]: Starting NBD server that exposes XenServer disks...
Sep 28 21:42:46 localhost xapi-nbd[2805]: main: Starting xapi-nbd: port = '10809'; certfile = '/etc/xensource/xapi-ssl.pem'; ciphersuites = '!EXPORT:RSA+AES128-SHA256' no_tls = 'false'
Sep 28 21:42:46 localhost xapi-nbd[2805]: main: Checking if there are any VBDs to clean up that leaked during the previous run
Sep 28 21:42:46 localhost xapi-nbd[2805]: main: Initialising TLS
Sep 28 21:42:46 localhost xapi-nbd[2805]: main: Setting up server socket
Sep 28 21:42:46 localhost xapi-nbd[2805]: main: Listening for incoming connections

So it seems that xapi is not yet completely ready when its Unix domain socket is created (Sep 28 21:42:46 localhost xapi-nbd[2778]: Server_error(INTERNAL_ERROR, [ missing table; host; ])). So I'll just add a loop to the startup of xapi-nbd that will try logging in until it succeeds, up to a timeout, to ensure that xapi is up when we need it.

gaborigloi commented 7 years ago

I have observed the exact same transient issues as above when I made the service path-activated by all these files:

[Path]
PathExists=/var/xapi/xapi
PathExists=/etc/xensource/xapi-ssl.pem
PathExists=/var/run/xapi_init_complete.cookie

Probably because these path conditions are ORed, not ANDed together.

gaborigloi commented 7 years ago

Indeed, it seems that the path conditions are ORed together :( : With this xapi-nbd.path file, it started up without errors after reboot:

[Path]
PathExists=/var/run/xapi_init_complete.cookie

However, we cannot use this solution, because we need multiple paths to exist, we also need a path condition for the .pem file, because that's created independently of xapi's startup sequence. And it's not safe to combine them, since they are ORed together. For some reason it worked, even when I removed the .pem file and rebooted. However, I think it's not safe to assume that we have a management network & IP & have generated the .pem file that xapi-nbd needs by the time this file appears, because I've observed that these things happen a few seconds after xapi's startup sequence.