xapi-project / xen-api

The Xapi Project's XenAPI Server
http://xenproject.org/developers/teams/xapi.html

CP-49368 introduces a flag to xenstore before suspend Linux guest #5727

Closed LunfanZhang closed 3 months ago

LunfanZhang commented 3 months ago

Systems under heavy disk I/O load may encounter suspend failures because the default process freeze timeout is too short. This PR introduces a xenstore flag, /control/suspend_initiate, that lets the guest agent dynamically adjust the freeze timeout (or other configuration) before a suspend, making the suspend process more robust and potentially reducing migration failures.
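For context, the guest-side knob involved is the kernel's /sys/power/pm_freeze_timeout (a value in milliseconds). A minimal sketch of what the guest agent could do when it sees the flag; the function names and timeout values are illustrative, not from the PR, and the sysfs path is parameterised so the sketch can be exercised against a scratch file:

```python
# Sketch only: save, extend, and later restore the kernel's process
# freezer timeout. The real sysfs node on Linux is
# /sys/power/pm_freeze_timeout (milliseconds).

def extend_freeze_timeout(new_ms, sysfs_path="/sys/power/pm_freeze_timeout"):
    """Save the current freeze timeout and write an extended one.

    Returns the previous value so the caller can restore it after resume.
    """
    with open(sysfs_path, "r") as f:
        old_ms = int(f.read().strip())
    with open(sysfs_path, "w") as f:
        f.write(str(new_ms))
    return old_ms

def restore_freeze_timeout(old_ms, sysfs_path="/sys/power/pm_freeze_timeout"):
    """Put the saved timeout back once the guest has resumed."""
    with open(sysfs_path, "w") as f:
        f.write(str(old_ms))
```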

edwintorok commented 3 months ago

Would be useful if the guest agent was also able to report back failure to suspend, e.g. on another xenstore key. Otherwise we need to wait for the very long xenopsd timeout, even though the suspend would've failed almost immediately. Increasing the timeout may be useful, but is not guaranteed to succeed.

psafont commented 3 months ago

What's the lifecycle of the entry, what's the protocol to clear the flag when the suspend works, and when there's a failure?

LunfanZhang commented 3 months ago

Would be useful if the guest agent was also able to report back failure to suspend, e.g. on another xenstore key. Otherwise we need to wait for the very long xenopsd timeout, even though the suspend would've failed almost immediately. Increasing the timeout may be useful, but is not guaranteed to succeed.

Reporting the failure of a suspend back is a good suggestion, but the challenge is that suspend freezes all user-space processes, including the guest agent, so it is hard to detect the failure from the guest agent's perspective: if the suspend is blocked, we cannot be sure the guest agent is even running. Increasing the timeout has been shown to help; currently some XenRT test-case logic does something similar, but with a hard-coded extended timeout.

LunfanZhang commented 3 months ago

What's the lifecycle of the entry, what's the protocol to clear the flag when the suspend works, and when there's a failure?

The entry can be reset or removed by the guest after resume, but I am not sure that is necessary. For this use case (reducing migration failures), the guest agent simply needs a notification that the kernel is about to freeze processes. Sorry, I do not get the point of "when there's a failure". If it refers to this xenstore write operation, then the guest will try to freeze processes with the default timeout value, which in most cases does not pose a problem.

psafont commented 3 months ago

sorry I do not get the point of "when there's a failure",

I just want to know the usecase where the flag is involved and how the communication between xenopsd and the guest happen in your design.

LunfanZhang commented 3 months ago

sorry I do not get the point of "when there's a failure",

I just want to know the usecase where the flag is involved and how the communication between xenopsd and the guest happen in your design.

The design is straightforward: the guest agent starts by watching this flag. Upon detecting a change triggered by a write operation, the guest agent receives an event indicating an imminent suspend, and then does something like extending the process freeze timeout. After the suspend is done and the guest resumes, the guest agent receives a resume signal, resets the process freeze timeout to its default value, and maybe resets the flag (from the guest's perspective, the flag's actual value is less important than receiving timely notifications to adjust the timeout, which is why I said above that I am not sure resetting is necessary).
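The flow described above can be sketched with an in-memory stand-in for xenstore. Only the key name /control/suspend_initiate comes from the PR; the class names, callback mechanism, and timeout values are illustrative assumptions:

```python
# Mock of the proposed flow: xapi writes /control/suspend_initiate, the
# watching guest agent extends the freeze timeout, and on resume the agent
# restores the default. A dict with watch callbacks stands in for xenstore.

DEFAULT_TIMEOUT_MS = 20000   # assumed kernel default, for the sketch
EXTENDED_TIMEOUT_MS = 60000  # assumed extended value

class MockXenstore:
    def __init__(self):
        self.store = {}
        self.watches = {}  # path -> list of callbacks

    def watch(self, path, callback):
        self.watches.setdefault(path, []).append(callback)

    def write(self, path, value):
        self.store[path] = value
        # Any write fires the watch, even if the value is unchanged.
        for cb in self.watches.get(path, []):
            cb(path, value)

class GuestAgent:
    def __init__(self, xs):
        self.freeze_timeout_ms = DEFAULT_TIMEOUT_MS
        xs.watch("/control/suspend_initiate", self.on_suspend_initiate)

    def on_suspend_initiate(self, path, value):
        # The value itself is not significant; the notification is.
        self.freeze_timeout_ms = EXTENDED_TIMEOUT_MS

    def on_resume(self):
        # Resume notification from the kernel/PV drivers (not modelled here).
        self.freeze_timeout_ms = DEFAULT_TIMEOUT_MS
```

Note that because any write to the key fires the watch, the design works whether or not the flag is reset after resume, matching the discussion below.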

gangj commented 3 months ago

The script coverage failed: https://app.codecov.io/gh/xapi-project/xen-api/pull/5727?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=checks&utm_campaign=pr+comments&utm_term=xapi-project :

Coverage data is based on HEAD bab8a79 compared to BASE c0e5dc4. The BASE commit is 174 commits behind HEAD on master (231b94d).

@LunfanZhang I think you can first rebase your commit onto the latest master to see the result.

gangj commented 3 months ago

Would be better to add the context in the commit message.

gangj commented 3 months ago

the guest agent will receive a resume signal and reset the processes freeze timeout to the default value and reset the flag here maybe

If we do not reset the flag, I think the design here will not work for the next time the guest is suspended?

codecov[bot] commented 3 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##           master   #5727    +/-  ##
========================================
- Coverage    45.5%   44.8%   -0.7%
========================================
  Files          13      16      +3
  Lines        1624    2210    +586
========================================
+ Hits          739     992    +253
- Misses        885    1218    +333
```

[see 8 files with indirect coverage changes](https://app.codecov.io/gh/xapi-project/xen-api/pull/5727/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=xapi-project)

| [Flag](https://app.codecov.io/gh/xapi-project/xen-api/pull/5727/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=xapi-project) | Coverage Δ | |
|---|---|---|
| [python2.7](https://app.codecov.io/gh/xapi-project/xen-api/pull/5727/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=xapi-project) | `45.5% <ø> (ø)` | |
| [python3.11](https://app.codecov.io/gh/xapi-project/xen-api/pull/5727/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=xapi-project) | `51.4% <ø> (?)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=xapi-project#carryforward-flags-in-the-pull-request-comment) to find out more.

LunfanZhang commented 3 months ago

the guest agent will receive a resume signal and reset the processes freeze timeout to the default value and reset the flag here maybe

If we do not reset the flag, I think the design here will not work for the next time the guest is suspended?

No, it will still work whether the flag is reset or not.

gangj commented 3 months ago

I think the change here does not only apply to Linux guests (the suspend path is called for all guests?), although only a Linux guest will handle the write on the flag. If so, please update the commit message, thanks~

LunfanZhang commented 3 months ago

I think the change here does not only apply to Linux guests (the suspend path is called for all guests?), although only a Linux guest will handle the write on the flag. If so, please update the commit message, thanks~

sure, I have updated the commit message.

psafont commented 3 months ago

The design is straightforward, the guest agent will start with watching this flag, Upon detecting a change triggered by a write operation here, guest agent will receive an event to indicate a suspending, then the guest agent will do something like extend processes freeze timeout, after the suspend is done, and the guest resume again, the guest agent will receive a resume signal and reset the processes freeze timeout to the default value and reset the flag here maybe( from perspective of guest, the flag's actual value is less important than receiving timely notifications to adjust the timeout, that`s why I said I am not sure if it is necessary above).

It's not straightforward nor obvious without the proper context, which is why I'm asking about it.

So the key is written by xapi to signal the guest to get ready, the guest reads the value and enters a special state, and at some point xapi pauses the domain. How does xapi know when the domain is ready to be paused? In other words, how does the guest let xapi know that it has prepared for suspension? What happens if the guest fails to acknowledge the request? Does xapi proceed after a timeout is reached? What is a reasonable amount of time to react?

After xapi resumes the guest, does it write to xenstore again (and what is written exactly: is the key removed, only blanked out, or set to a different value)? Does xapi have to wait for an acknowledgement from the guest before another suspension can proceed?

This looks complicated because it is: there are two endpoints communicating, and one of them might even be malicious. This needs proper thought and consideration, which I find lacking.

edwintorok commented 3 months ago

Reporting the failure of a suspend back is a good suggestion, but the challenge is that suspend freezes all user-space processes, including the guest agent, so it is hard to detect the failure from the guest agent's perspective: if the suspend is blocked, we cannot be sure the guest agent is even running.

If the freeze fails then the kernel will unfreeze processes, and then the guest agent will run again. But what we need to detect in that case is whether we resumed because of a freeze failure, or we resumed because Dom0 asked us to.

The difference could currently be detected by checking whether the domid has changed when the guest agent resumes. If the domid is the same then we know it hasn't been suspended. But in the future we might use deterministic domids (or even just call everything domid 1), so in that case this detection won't work anymore.

But we don't even need to report failures; we could just report on a specific xenstore key whenever the guest agent has detected that it was suspended and resumed (I assume there'd be a way to detect this somehow with a Linux syscall or ioctl, or by listening for some event). Then xenopsd could watch for that event while waiting for the VM to actually suspend. If it gets the 'guest agent resumed' event then it can cancel the suspend, knowing that the suspend failed inside the guest. No need to look at the domid in this case.

The only question then is how to detect freezing and unfreezing from the guest agent, and that probably needs checking the code that runs the freezer.
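The host-side idea sketched above can be illustrated as follows. This assumes a hypothetical "guest agent resumed" event delivered alongside the "suspended" event; the event names and queue-based delivery are stand-ins for xenstore watches, not real xenopsd code:

```python
# Sketch: while waiting for the VM to suspend, also listen for a
# (hypothetical) notification that the guest agent is running again.
# If that arrives first, the in-guest freeze failed and the suspend can
# be cancelled early instead of waiting for the long xenopsd timeout.

import queue

def wait_for_suspend(events, timeout_s):
    """events yields ('suspended', _) or ('agent-resumed', _) tuples."""
    try:
        kind, _ = events.get(timeout=timeout_s)
    except queue.Empty:
        return "timeout"        # fell back to the long xenopsd timeout
    if kind == "agent-resumed":
        return "cancelled"      # guest-side freeze failed; cancel early
    return "suspended"
```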

LunfanZhang commented 3 months ago

It's not straightforward nor obvious without the proper context, which is why I'm asking about it.

So the key is written by xapi to signal the guest to get ready, the guest reads the value and enters a special state, and at some point xapi pauses the domain. How does xapi know when the domain is ready to be paused? In other words, how does the guest let xapi know that it has prepared for suspension? What happens if the guest fails to acknowledge the request? Does xapi proceed after a timeout is reached? What is a reasonable amount of time to react?

After xapi resumes the guest, does it write to xenstore again (and what is written exactly: is the key removed, only blanked out, or set to a different value)? Does xapi have to wait for an acknowledgement from the guest before another suspension can proceed?

This looks complicated because it is: there are two endpoints communicating, and one of them might even be malicious. This needs proper thought and consideration, which I find lacking. I want to know what the protocol is, who reads and who writes; it's still not clear to me. It's important to understand that we aren't the only ones creating clients here.

Ok, let me clarify more. The key, written by xapi, signals the guest. Upon receiving this signal, the guest adjusts its configuration, such as extending the freeze timeout. My lab tests indicate a consistent gap of over 10 seconds between updating the freeze timeout and the kernel initiating process freezing. Therefore, in most cases xapi doesn't need to verify that the guest is prepared for suspension; by the time freezing starts, the value has already been set. Even in rare instances where the configuration is applied after the suspend has started, the kernel's default timeout is sufficient to continue the suspension without interruption. While adding a verification step could ensure the configuration has been updated, it would also delay the suspension and introduce complexity; it's a balance, and I would like to keep it simple.

After the guest resumes, the guest agent resets the kernel's freeze timeout. To keep things simple, resetting the xenstore key isn't necessary, though it's an option: any update to the xenstore key, even with the same value, triggers a notification to the guest agent. The essential requirement is for the guest agent to adjust the configuration at the optimal moment, before the kernel begins freezing processes. I do not want the whole design to be complicated.

psafont commented 3 months ago

I do not want whole design to be complicated.

I understand this concern, but using xenstore means the host and guest need to communicate, which needs a well-designed protocol, which is complex. The simplest design is to not create a new protocol.

Therefore, in most cases, xapi doesn't need to verify if the guest is prepared for suspension

Caring about most cases instead of all cases is conducive to brittle design; I don't think it's good enough.

LunfanZhang commented 3 months ago

Caring about most cases instead of all is conducive to brittle design, I don't think it's good enough

In the worst case, the kernel does not update the freeze timeout and the default timeout still applies, so it is not brittle; there is no functional error here. And as above, suppose we really did want to check that the update has taken effect: then too many scenarios would need consideration. Is the guest agent alive or not? What is a reasonable amount of time to react, as you mentioned? If the guest fails to update, should we retry, delaying the suspension even more?
Again, I think it's a balance, and I prefer the simple option because a failed update of the value is rare and its consequences are acceptable.

liulinC commented 3 months ago

As described, the background of this PR is just to give extra time before suspending the guest, and to restore the original value afterwards to reduce the impact.

@edwintorok Can you explain why "Otherwise we need to wait for the very long xenopsd timeout, even though the suspend would've failed almost immediately"? Is this some existing issue rather than one introduced by this PR, or some kind of new improvement?

Besides, my reading (and I may be wrong): to request an acknowledged shutdown/suspend/poweroff (https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/xenops_server_xen.ml#L1194), xenopsd writes to xenstore (https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L591)

and waits for the key to be set to "" (https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L652).

The Linux PV driver acks this request before performing the action: https://github.com/torvalds/linux/blob/master/drivers/xen/manage.c#L259. In that case, whether the suspend fails or not, xenopsd gets notified immediately.

What am I missing?

Edit: after more reading, the above is the process for requesting a shutdown. Xenopsd then waits for the shutdown, and currently it only waits for the shutdown event: https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/xenops_server_xen.ml#L2377

We could probably improve this by checking some xenstore key as well. Again, this is a pre-existing issue, NOT caused by this PR, but still good to improve.
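The existing request/ack handshake described above can be shown in miniature. A plain dict stands in for xenstore; the key name control/shutdown and the blank-to-ack convention match the protocol described, while the function names are invented for the sketch:

```python
# Sketch of the xenopsd <-> PV-driver shutdown handshake:
# the host writes the request to control/shutdown, and the guest's PV
# drivers acknowledge it by blanking the key before acting on it.

def request_suspend(store):
    """Host side: write the shutdown request key."""
    store["control/shutdown"] = "suspend"

def guest_ack(store):
    """Guest side: read the request and acknowledge by clearing the key."""
    request = store.get("control/shutdown", "")
    if request:
        store["control/shutdown"] = ""
    return request

def host_saw_ack(store):
    """Host side: the request is acknowledged once the key reads ''."""
    return store.get("control/shutdown") == ""
```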

liulinC commented 3 months ago

If the freeze fails then the kernel will unfreeze processes, and then the guest agent will run again.

Yes, as long as the freeze fails, the kernel thaws all user-space processes and kernel threads, so the guest agent gets a chance to run. However, it does not know whether the suspend succeeded or not. If the Linux guest has to report back the result, there may be several options.

@LunfanZhang It is always good to draw a diagram for the protocol between the guest and dom0.

minglumlu commented 3 months ago

a consistent gap of over 10 seconds between updating the freeze timeout and the kernel initiating process freezing

Why is there a 10-second gap?

minglumlu commented 3 months ago

Is the process/kernel task that refuses to freeze always a particular one? Did this case start to fail only recently? Can the Linux guest agent set "/sys/power/pm_freeze_timeout" on its own, as a common or guest-specific change? That wouldn't require any toolstack change at all. It also seems this change doesn't need to make the timeout variable.

And there is a timer in xenopsd as well. If the timer in the guest kernel is larger than the timer in xenopsd, xenopsd will time out before the guest responds.
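The interaction between the two timers can be stated as a simple invariant; both values in the example below are illustrative, not the real defaults:

```python
# minglumlu's point as a sanity check: the in-guest freeze timeout must
# stay below xenopsd's own suspend timeout (with some margin), or xenopsd
# gives up before the guest's freeze attempt has even finished.

def timeouts_compatible(guest_freeze_ms, xenopsd_suspend_ms, margin_ms=5000):
    """True if xenopsd will still be waiting when the guest freeze expires."""
    return guest_freeze_ms + margin_ms <= xenopsd_suspend_ms
```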

liulinC commented 3 months ago

pm_freeze_timeout

Only asking the guest agent to update this configuration was an option at the very beginning.

I believe @LunfanZhang had some thinking and discussion with the CoP. The major reason, I believe, is to set and restore the configuration as close to the suspend as possible, to reduce the impact on the guest, since pm_freeze_timeout is a global configuration.

The guest agent just wants to set this configuration before the VM suspends and restore it after the VM resumes. The reason to introduce a xapi change is that the guest agent does not know when xapi will suspend the VM. So a xenstore key is introduced here to let xapi send the guest agent this control message.

I understand the concern from @psafont: xapi sets the trigger key in xenstore and does not get any response from the guest agent (i.e. whether the guest agent has already set the configuration), so we do not know whether xapi should wait for the guest agent to get ready.

Well, all the guest agent does is set the configuration to extend the freeze timeout. That means even if it is not set/ready and xapi does not wait, there is no bad effect (things just behave as they do now). If we treat this as a best-effort timeout extension, then it is fine.

@LunfanZhang There is another option, e.g. having the PV driver update this configuration, which would need no guest agent or xapi change. (But this would require sending a Linux kernel patch upstream for review and waiting for guest distros to pick it up: a very, very long path.)

lindig commented 3 months ago

Does this affect other implementations of a guest agent that would not pick up this protocol immediately?