xapi-project / xen-api

The Xapi Project's XenAPI Server
http://xenproject.org/developers/teams/xapi.html

endless hang on dbsync (update_env) #3228

Closed leewin12 closed 3 years ago

leewin12 commented 7 years ago

Every VM on this server (single pool, single master) works fine, but I cannot access management via HTTP (port 80) or XenCenter.

I tried restarting the xapi service with xe-toolstack-restart, but the xapi service never starts listening.

According to xensource.log, the startup procedure gets stuck after starting up the database engine, at "Performing initial DB GC": https://github.com/xapi-project/xen-api/blob/9324f70524ce69824fcab26bda8bb20ef5084e7c/ocaml/xapi/xapi.ml#L102

xensource.log

.......
Sep 26 16:57:24 mxen04a xapi: [debug|mxen04a|0 |server_init D:d8beed11d60d|startup] task [Listening unix socket]
Sep 26 16:57:24 mxen04a xapi: [debug|mxen04a|0 |Listening unix socket D:601e9a596ce8|http] Establishing Unix domain server on path: /var/lib/xcp/xapi
Sep 26 16:57:24 mxen04a xapi: [ info|mxen04a|0 |Listening unix socket D:601e9a596ce8|xapi] Successfully bound socket to: UNIX /var/lib/xcp/xapi
Sep 26 16:57:24 mxen04a xapi: [debug|mxen04a|0 |server_init D:d8beed11d60d|startup] task [starting thread Metadata VDI liveness monitor]
Sep 26 16:57:24 mxen04a xapi: [debug|mxen04a|0 |server_init D:d8beed11d60d|startup] task [Checking for non-HA redo-log]
.......
Sep 26 16:57:24 mxen04a xapi: [debug|mxen04a|0 |starting up database engine D:6ce164d4694b|xapi] About to flush database: /var/lib/xcp/state.db
Sep 26 16:57:24 mxen04a xapi: [debug|mxen04a|19 dbflush [/var/lib/xcp/state.db]||sql] In memory DB flushing thread created [/var/lib/xcp/state.db].
Sep 26 16:57:24 mxen04a xapi: [debug|mxen04a|0 |starting up database engine D:6ce164d4694b|sql] XML backend [/var/lib/xcp/state.db] -- Write buffer flushed. Time: 0.034829
Sep 26 16:57:24 mxen04a xapi: [debug|mxen04a|0 |starting up database engine D:6ce164d4694b|xapi] Performing initial DB GC
Sep 26 16:57:24 mxen04a xapi: [debug|mxen04a|0 |DB GC D:cdc4d773a4e6|db_gc] session_log: active_sessions=0 (0 pool, 0 anon, 0 named - 0 groups)
Sep 26 16:57:24 mxen04a xenopsd-xc: [debug|mxen04a|37 |events|hotplug] Checking to see whether /xapi/9d6380f3-6eac-5f34-4e78-d206525fee91/hotplug/36/vif/0/hotplug
......

The task list shows that it hangs somewhere in dbsync.update_env: https://github.com/xapi-project/xen-api/blob/fe3dacf19fa179dbe4aae57c1b4e83e8dfab52bc/ocaml/xapi/dbsync.ml#L49

[root@mxen04a log]# xe task-list
uuid ( RO)                : 10ef57cf-3b62-3174-9ef4-4592713acdfe
          name-label ( RO): dbsync (update_env)
          name-description ( RO):
          status ( RO): pending
          progress ( RO): 0.000

Here is the output of netstat:

tcp6       0      0 :::40769                :::*                    LISTEN      30517/rpc.statd
tcp6       0      0 :::111                  :::*                    LISTEN      30520/rpcbind
tcp6       0      0 :::22                   :::*                    LISTEN      27950/sshd
tcp6       0      0 :::37207                :::*                    LISTEN      -
tcp6       0      0 :::443                  :::*                    LISTEN      3242/stunnel

Please let me know if you need further information.

I always appreciate your contributions to this awesome project.

cahl commented 5 years ago

This is quite an old ticket, but did you ever find a solution? We're experiencing the exact same issue on XenServer 7.2 at the moment.

robhoes commented 5 years ago

My guess would be that the message-switch service is not running for some reason.

Try systemctl status message-switch to check this, and if not running, do systemctl start message-switch followed by xe-toolstack-restart.
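
The check-and-restart sequence above could be scripted roughly as follows. This is only a sketch, not anything shipped with xapi: the function name recover_message_switch and its injectable run parameter are hypothetical, it uses systemctl is-active --quiet in place of systemctl status so that the result can be read from the exit code, and it assumes systemctl and xe-toolstack-restart are on the PATH and that it runs as root on the affected host.

```python
import subprocess
from typing import Callable, List, Sequence


def _run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit status (no exception on failure)."""
    return subprocess.run(list(cmd), check=False).returncode


def recover_message_switch(
    run: Callable[[Sequence[str]], int] = _run,
) -> List[List[str]]:
    """Start message-switch if it is not active, then restart the toolstack.

    Hypothetical helper. Returns the list of commands that were executed,
    in order, so the caller can log what was done.
    """
    executed: List[List[str]] = []

    def step(cmd: Sequence[str]) -> int:
        executed.append(list(cmd))
        return run(cmd)

    # `systemctl is-active --quiet UNIT` exits 0 only when the unit is active.
    if step(["systemctl", "is-active", "--quiet", "message-switch"]) == 0:
        return executed  # already running; the hang has some other cause

    if step(["systemctl", "start", "message-switch"]) != 0:
        raise RuntimeError("could not start message-switch")

    # Restart the toolstack so xapi reconnects to the switch.
    step(["xe-toolstack-restart"])
    return executed
```

Calling recover_message_switch() with no arguments runs the real commands; passing a fake run callable allows a dry run that only records what would be executed.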

cahl commented 5 years ago

@robhoes Yup, you're right. I ended up stracing the xapi process and it kept spamming the following:

select(0, 0x7ffefeeb29b0, 0x7ffefeeb2a30, 0x7ffefeeb2ab0, {5, 0}) = 0 (Timeout)
rt_sigprocmask(SIG_BLOCK, [VTALRM], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
socket(PF_LOCAL, SOCK_STREAM, 0)        = 13
connect(13, {sa_family=AF_LOCAL, sun_path="/var/run/message-switch/sock"}, 30) = -1 ECONNREFUSED (Connection refused)
close(13)

Restarting the message-switch and toolstack resolved the issue.
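
For anyone who wants to confirm the same symptom without strace, a small probe along these lines may help. It is a sketch under assumptions: the function name probe_unix_socket is hypothetical, and the socket path is simply the one that appears in the strace output above, which may differ on other versions.

```python
import socket


def probe_unix_socket(path: str, timeout: float = 5.0) -> str:
    """Try to connect to a Unix stream socket and classify the outcome.

    Returns "listening" if the connect succeeds, "refused" if the socket
    file exists but nothing accepts on it (ECONNREFUSED, as in the strace
    output), or "missing" if the path does not exist.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        return "listening"
    except ConnectionRefusedError:
        return "refused"
    except FileNotFoundError:
        return "missing"
    finally:
        sock.close()
```

On an affected host, probe_unix_socket("/var/run/message-switch/sock") would be expected to return "refused" or "missing" while message-switch is down, and "listening" once it has been started again.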