xapi-project / xen-api

The Xapi Project's XenAPI Server
http://xenproject.org/developers/teams/xapi.html

`vm-checkpoint` task indefinitely stuck while interacting with GC #6032

Open ydirson opened 4 days ago

ydirson commented 4 days ago

During a `vm-checkpoint` on XCP-ng 8.3 (so using xcp-emu-manager), I hit a case where `xe vm-checkpoint` never returned. According to the logs, xenopsd became unresponsive, but we fail to see why. The logs show an SR GC running between the start of the checkpoint and its failure; the GC reports errors of its own, involving the VDI holding the VM we are attempting to checkpoint.

[10:13 host1 ~]# date
Wed Oct  2 10:13:41 CEST 2024
[10:13 host1 ~]# xe task-list uuid=42646262-60ba-6896-c759-e86f9551ee83 params=name-label,status,progress,created
name-label ( RO)    : VM.checkpoint
        status ( RO): pending
      progress ( RO): 0.056
       created ( RO): 20241001T11:11:07Z
[10:25 host1 ~]# xe vm-list  uuid=069bf5db-5e87-0f51-322d-901f8a01a742 params=VBDs
VBDs (SRO)    : 800c783f-5511-00a9-804e-76de14e89bcf; 0047698a-8f07-6366-7d30-ccaf7f2b5293

[10:25 host1 ~]# xe vbd-list uuid=0047698a-8f07-6366-7d30-ccaf7f2b5293 params=vdi-uuid 
vdi-uuid ( RO)    : 371e7067-d032-49ec-9dd1-552e0c5c68a9
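
To script the lookup chain above (VM to VBDs to VDI), a minimal sketch of parsing `xe ... params=` output could look like the following. The function names `parse_xe_params` and `split_set` are hypothetical, and the `key ( RO): value` line format is assumed from the output shown above rather than taken from any `xe` documentation.

```python
import re

# Assumed line shape, based only on the output pasted above:
#   "<key> (<access>)<spaces>: <value>", with optional leading indentation.
_PARAM_RE = re.compile(r"^\s*(\S+)\s*\(\s*\w+\)\s*:\s*(.*)$")


def parse_xe_params(output):
    """Return a dict mapping parameter names to their raw string values."""
    params = {}
    for line in output.splitlines():
        match = _PARAM_RE.match(line)
        if match:
            params[match.group(1)] = match.group(2).strip()
    return params


def split_set(value):
    """Split a set-valued field such as VBDs ("a; b") into a list."""
    return [item.strip() for item in value.split(";") if item.strip()]


if __name__ == "__main__":
    sample = (
        "VBDs (SRO)    : 800c783f-5511-00a9-804e-76de14e89bcf; "
        "0047698a-8f07-6366-7d30-ccaf7f2b5293\n"
    )
    for vbd in split_set(parse_xe_params(sample)["VBDs"]):
        print(vbd)
```

Each VBD UUID obtained this way could then be passed to `xe vbd-list uuid=... params=vdi-uuid`, as done by hand above.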

The problem seems to be manifold:

xensource.log daemon.log SMlog

edwintorok commented 4 days ago

SMGC claims to have finished at 13:11:27, so I'm not sure what it was doing until 13:13, when emu-manager timed out:

Oct  1 13:11:27 host1 systemd[1]: Started Garbage Collector for SR e6e40ee6-0491-0c3f-186c-db3d00a623a7.
Oct  1 13:13:14 host1 emu-manager-4[16196]: Failed to read from xenopsd because timeout reached.

Do you have more logs for that period? The SM log seems to be truncated, or was there really no more activity there?
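
To quantify the window in question, here is a rough sketch that computes the gap between two syslog-style lines such as the ones quoted above. The helper names are hypothetical, the year is an assumption (syslog lines carry none; 2024 is taken from the `date` output earlier in the thread), and month-name parsing assumes an English/C locale.

```python
from datetime import datetime


def syslog_time(line, year=2024):
    """Parse the leading "Mon D HH:MM:SS" stamp of a syslog line.

    The year is not present in the line, so the caller must supply it
    (assumption: 2024, per the `date` output in this thread).
    """
    month, day, clock = line.split(None, 3)[:3]
    return datetime.strptime(
        f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S"
    )


def gap_seconds(first_line, second_line, year=2024):
    """Seconds elapsed between the timestamps of two syslog lines."""
    delta = syslog_time(second_line, year) - syslog_time(first_line, year)
    return delta.total_seconds()


if __name__ == "__main__":
    gc_line = (
        "Oct  1 13:11:27 host1 systemd[1]: Started Garbage Collector "
        "for SR e6e40ee6-0491-0c3f-186c-db3d00a623a7."
    )
    timeout_line = (
        "Oct  1 13:13:14 host1 emu-manager-4[16196]: Failed to read "
        "from xenopsd because timeout reached."
    )
    print(gap_seconds(gc_line, timeout_line))
```

For the two lines quoted above this gives 107 seconds, which is the stretch for which more SMlog and xenopsd output would be useful.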

ydirson commented 4 days ago

Hm, I cannot rule out a copy-paste mistake. I will check whether I still have the full log files; otherwise I will upload the logs from the next occurrence (I already saw this 3 times in 2 days, so I'm pretty confident I can reproduce it).