xapi-project / xen-api

The Xapi Project's XenAPI Server
http://xenproject.org/developers/teams/xapi.html

Mysterious failure in HA with HOST_NOT_ENOUGH_FREE_MEMORY #4323

Open olivierlambert opened 3 years ago

olivierlambert commented 3 years ago

Hi there!

So I'm reporting an issue I can easily reproduce in XCP-ng and CH 8.2 [1]. It's pretty easy to trigger, but the error message is completely… unrelated.

To trigger the issue, be sure to have:

  1. HA enabled at pool level
  2. One VM not HA protected on a host

Now try to do a host.evacuate. It will fail with HOST_NOT_ENOUGH_FREE_MEMORY, regardless of the amount of memory available. This does not happen if the VM is HA protected.

Obviously, we did the test on a 1GiB VM while having 100GiB+ free memory on all hosts.
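To check whether a host matches the setup above before evacuating it, something like the following could work. It is only a sketch: `unprotected_vms_on_host` is a hypothetical helper name, it expects a dict of VM records keyed by ref (the shape returned by `VM.get_all_records` in the XenAPI Python bindings), and it assumes that only `ha_restart_priority == "restart"` counts as "HA protected".

```python
def unprotected_vms_on_host(vm_records, host_ref):
    """List name labels of VMs resident on host_ref that are not HA protected.

    vm_records: dict mapping VM ref -> VM record (dict of fields).
    Assumption: a VM is "HA protected" only when its
    ha_restart_priority field equals "restart".
    """
    names = []
    for ref, rec in vm_records.items():
        # Skip dom0 and templates: they are never migrated.
        if rec.get("is_control_domain") or rec.get("is_a_template"):
            continue
        # Only VMs currently running on the host being evacuated matter.
        if rec.get("resident_on") != host_ref:
            continue
        if rec.get("ha_restart_priority", "") != "restart":
            names.append(rec.get("name_label", ref))
    return sorted(names)
```

If this returns a non-empty list while HA is enabled on the pool, the host is in the configuration described above.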

There are public reports about this (https://xcp-ng.org/forum/post/36829) as well as private tickets reporting the same behaviour.

Here is a trace as an example (host.restart is doing host.evacuate anyway):

host.restart
{
  "id": "df35f1fe-ecba-4fb0-b4c6-1a9db0efac2b",
  "force": false
}
{
  "code": "HOST_NOT_ENOUGH_FREE_MEMORY",
  "params": [
    "OpaqueRef:a399f13f-ca17-44c7-a2de-4c25d0319293"
  ],
  "task": {
    "uuid": "48724829-015b-3de1-4542-5fb21b56fd12",
    "name_label": "Async.host.evacuate",
    "name_description": "",
    "allowed_operations": [],
    "current_operations": {},
    "created": "20210219T23:36:59Z",
    "finished": "20210219T23:36:59Z",
    "status": "failure",
    "resident_on": "OpaqueRef:f6b6a08e-f323-4883-82a4-7418a4780633",
    "progress": 1,
    "type": "<none/>",
    "result": "",
    "error_info": [
      "HOST_NOT_ENOUGH_FREE_MEMORY",
      "OpaqueRef:a399f13f-ca17-44c7-a2de-4c25d0319293"
    ],
    "other_config": {},
    "subtask_of": "OpaqueRef:NULL",
    "subtasks": [],
    "backtrace": "(((process xapi)(filename ocaml/xapi/xapi_host.ml)(line 560))((process xapi)(filename hashtbl.ml)(line 266))((process xapi)(filename hashtbl.ml)(line 272))((process xapi)(filename hashtbl.ml)(line 277))((process xapi)(filename ocaml/xapi/xapi_host.ml)(line 556))((process xapi)(filename lib/xapi-stdext-pervasives/pervasiveext.ml)(line 24))((process xapi)(filename ocaml/xapi/rbac.ml)(line 231))((process xapi)(filename ocaml/xapi/server_helpers.ml)(line 103)))"
  },
  "message": "HOST_NOT_ENOUGH_FREE_MEMORY(OpaqueRef:a399f13f-ca17-44c7-a2de-4c25d0319293)",
  "name": "XapiError",
  "stack": "XapiError: HOST_NOT_ENOUGH_FREE_MEMORY(OpaqueRef:a399f13f-ca17-44c7-a2de-4c25d0319293)
    at Function.wrap (/usr/local/lib/node_modules/xo-server/node_modules/xen-api/src/_XapiError.js:16:12)
    at _default (/usr/local/lib/node_modules/xo-server/node_modules/xen-api/src/_getTaskResult.js:11:29)
    at Xapi._addRecordToCache (/usr/local/lib/node_modules/xo-server/node_modules/xen-api/src/index.js:866:24)
    at forEach (/usr/local/lib/node_modules/xo-server/node_modules/xen-api/src/index.js:887:14)
    at Array.forEach (<anonymous>)
    at Xapi._processEvents (/usr/local/lib/node_modules/xo-server/node_modules/xen-api/src/index.js:877:12)
    at Xapi._watchEvents (/usr/local/lib/node_modules/xo-server/node_modules/xen-api/src/index.js:1038:14)"
}

[1]: The issue likely affects far older versions of XCP-ng and XenServer too, as if it had always been there. See this old issue, which reported the exact same symptom 5 years ago: https://github.com/vatesfr/xen-orchestra/issues/1351

benjamreis commented 3 years ago

So by design XAPI won't try to evacuate non-HA-protected VMs.

But the error message is quite misleading: HOST_NOT_ENOUGH_FREE_MEMORY. IMHO a specific error message would be much better and more understandable from a user POV. Something like VM_NOT_HA_PROTECTED_CANT_EVACUATE.

What do you think? I think I talked about it with @psafont

psafont commented 3 years ago

I think having a specific message makes sense. Flipping the what and the why in the name makes sense to me: VM_CANT_EVACUATE_NOT_PROTECTED. @robhoes, any better ideas on the name of the error?
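Until the server returns a dedicated code, a client could add the missing context itself. The sketch below is purely illustrative: `explain_evacuate_error` is a hypothetical helper, not part of XAPI or xo-server. It takes the `error_info` list from the failed task, plus two facts the caller is assumed to have looked up already (whether HA is enabled on the pool, and which VMs on the host are not HA protected).

```python
def explain_evacuate_error(error_info, ha_enabled, unprotected_vm_names):
    """Return a human-readable hint for a failed host.evacuate call.

    error_info: list such as ["HOST_NOT_ENOUGH_FREE_MEMORY", "OpaqueRef:..."].
    ha_enabled: True if HA is enabled on the pool (looked up by the caller).
    unprotected_vm_names: names of VMs on the host that are not HA protected.
    """
    code = error_info[0] if error_info else "UNKNOWN"
    if (code == "HOST_NOT_ENOUGH_FREE_MEMORY"
            and ha_enabled
            and unprotected_vm_names):
        names = ", ".join(unprotected_vm_names)
        # Point the user at the likely cause instead of free memory.
        return (code + ": HA is enabled and these VMs are not HA protected: "
                + names)
    # Any other situation: pass the original code through unchanged.
    return code
```

This only rewords the message on the client side; the error code coming from the server is unchanged.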

robhoes commented 3 years ago

I think that we first need to understand where the error comes from. I wouldn't expect an error at all. A request to evacuate a host should just do that, regardless of the HA protection status of the VMs on that host.

benjamreis commented 3 years ago

I think the reasoning is that when HA is enabled, you need to save RAM for HA-protected VMs when evacuating. From my understanding it was a conscious choice, even though IMHO you should still try to migrate the other VMs after the protected ones.

I'll try to point at the code where the error is thrown, but I need to search a bit as it was not easy to figure out.

VGerris commented 1 year ago

I still get this message on a recent 8.3 installation. It surprises me, because there is plenty of memory for all VMs on either host. The error also does not show the host, so it's cryptic in that way. Perhaps an improvement could be that when HA is not enabled the memory would be calculated, or the error could be improved to indicate that HA is not enabled for a host (and which one).