xcp-ng / xcp

Entry point for issues and wiki. Also contains some scripts and sources.
https://xcp-ng.org

Raise default values for LIVE_LEAF_COALESCE_TIMEOUT and ..._MAX_SIZE? #323

Open stormi opened 4 years ago

stormi commented 4 years ago

Feedback in #298 showed that in some cases raising the default values for LIVE_LEAF_COALESCE_TIMEOUT and LIVE_LEAF_COALESCE_MAX_SIZE solved coalesce issues.

We may want to consider those changes, if we can evaluate the global impact of changing those values. I don't want to change them blindly.

Or maybe the new coalesce algorithm from https://github.com/xapi-project/sm/commit/b43525cb49047f712d221ea289623b942802627a makes it unnecessary?
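For context, these tunables are module-level constants in sm's `cleanup.py`. A minimal sketch of the gating logic they control, assuming the commonly reported defaults (20 MiB and 10 seconds — verify against your installed sm version; `can_live_coalesce` is an illustrative helper, not sm's actual function):

```python
# Sketch of the live leaf-coalesce gate, assuming the defaults commonly
# reported for sm's cleanup.py (verify against your installed version).
LIVE_LEAF_COALESCE_MAX_SIZE = 20 * 1024 * 1024  # 20 MiB of allowed leaf delta
LIVE_LEAF_COALESCE_TIMEOUT = 10                 # seconds tapdisk may be paused

def can_live_coalesce(delta_bytes, est_pause_seconds):
    """Illustrative check: a leaf is coalesced live only if the remaining
    delta is small enough and the expected pause fits in the timeout."""
    return (delta_bytes <= LIVE_LEAF_COALESCE_MAX_SIZE
            and est_pause_seconds <= LIVE_LEAF_COALESCE_TIMEOUT)

print(can_live_coalesce(15 * 1024 * 1024, 5))   # small delta, short pause
print(can_live_coalesce(1024 ** 3, 5))          # a 1 GiB delta exceeds the cap
```

Raising the two constants widens this gate: larger deltas qualify for live coalesce, at the price of longer potential VM pauses.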

Wescoeur commented 4 years ago

Well, it doesn't seem like a good idea to increase these variables as done here: https://github.com/xcp-ng/xcp/issues/298#issuecomment-557805054

As stated in this Citrix article about the default values: https://support.citrix.com/article/CTX201296

In most cases, it is the active VM's snapshot coalesce that fails, rather than the offline coalesce, which normally succeeds. When the VM is active and needs to be coalesced, we need to hold the current writes to the VM. This is handled by single-snapshotting the VM.

The single-snapshotting continues until the delta between the leaf and the parent VHD is less than 20 MB. If the delta is more than 20 MB, it keeps single-snapshotting and deleting the snapshot. When the loop exits, a leaf coalesce (reducing to a single disk) is performed in any case. The factors that keep the loop going are an I/O-intensive VM and a required pause time that exceeds the default threshold for pausing the tapdisk process. The default is 10 seconds.

I don't think it's a good thing to pause tapdisk and copy 1GB of delta in one go. What's the cost? Is it acceptable to freeze the VM for several seconds?

But we can try to increase the default values for specific devices like SSDs in a future commit. Raising the default values is fine for specific use cases; maybe we could add options to configure each storage individually.
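The loop the article describes can be sketched as a toy simulation. The shrink factor and starting delta below are hypothetical illustration values (how much delta remains after each snapshot round depends on the VM's write rate), not measured figures from sm:

```python
# Toy simulation of the single-snapshot loop from the Citrix description:
# keep snapshotting (and deleting the snapshot) until the leaf/parent delta
# drops below the threshold, then the final live leaf coalesce runs.
MAX_SIZE = 20 * 1024 * 1024  # 20 MiB threshold

def single_snapshot_loop(delta, shrink_factor=0.25, max_rounds=20):
    """Each round copies the current delta into the parent; new writes
    accumulate meanwhile, leaving shrink_factor * delta behind."""
    rounds = 0
    while delta > MAX_SIZE and rounds < max_rounds:
        delta = int(delta * shrink_factor)  # snapshot, copy, delete snapshot
        rounds += 1
    converged = delta <= MAX_SIZE
    return rounds, converged

rounds, ok = single_snapshot_loop(1024 ** 3)  # start from a 1 GiB delta
print(rounds, ok)  # converges in 3 rounds with these assumptions
```

The point of the 20 MiB cap is visible here: the loop only hands off to the pausing leaf coalesce once the remaining copy is small. On a VM that writes faster than the copy shrinks the delta (shrink_factor near or above 1), the loop never converges, which is the failure mode being discussed.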

danieldemoraisgurgel commented 4 years ago

So far, the change has not caused problems for me. And I consider my environment relatively large, with 350 VMs and 82 TB (hosting services, databases, applications, etc...).

I remember that before the patch there was a load limiter, so that the coalesce process wouldn't slow down the system... and that the fix removed/changed it.

The new patch seems welcome; however, Citrix seems to be steering the VDI solution toward a closed world with specific use cases... we would have to test the patch to know whether it actually works in mixed environments with multiple workloads.

stormi commented 4 years ago

In your tests, were both changes to LIVE_LEAF_COALESCE_MAX_SIZE and LIVE_LEAF_COALESCE_TIMEOUT necessary? The change to max size seems huge to me and maybe a smaller value for timeout would still have worked?

danieldemoraisgurgel commented 4 years ago

Several steps happened before coalesce worked:

  1. After Citrix "broke" coalesce in 7.1 CU2, we were forced to migrate to CH8 and tried to fix it in that release. Before that, we had already changed the values and the coalesce still did not work.

  2. We upgraded to CH8 and the coalesce did not work completely. We changed the values and the coalesce still didn't work.

  3. We migrated from CH8 to XCP8. In the default installation, it had the same behavior as step 2.

  4. After the patch provided by XCP-ng, the problems decreased considerably. 90% of the disks managed to coalesce, and problems with zombie processes on LVMoISCSI didn't occur either.

  5. After changing the parameters in XCP-ng, 100% of my disks coalesced online - no need to shut down or pause the VM.

In our environment we have disks from 1GB up to 1TB, across multiple Dell EqualLogic 6200 and 6500, Compellent SC5020 and SC5020F, Lenovo SC22xx, IBM V7000, etc. storage arrays.

As I reported earlier in other threads, I don't know how much changing these values can affect our environment, but since then we have had no problems and everything has run smoothly - in fact, we have adopted this change as a standard step in new installations.

Because our SANs operate at different speeds, I think the new patch can handle this more intelligently; the question, however, is whether it will be too sensitive or adopt an aggressive consolidation policy. As I see it, healthy storage is storage with no disks waiting to coalesce, even though having them is supported behavior.
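One way to see why the array's speed matters: the final leaf coalesce pauses tapdisk while the remaining delta is copied, so whether it fits in the timeout depends directly on throughput. A back-of-the-envelope sketch, with hypothetical throughput figures (not benchmarks of the arrays above):

```python
# Back-of-the-envelope: seconds of tapdisk pause needed to copy a given
# delta at a given sequential throughput. The 100 MiB/s and 500 MiB/s
# figures are hypothetical illustration values, not measured numbers.
TIMEOUT = 10  # default LIVE_LEAF_COALESCE_TIMEOUT, in seconds

def pause_seconds(delta_bytes, throughput_bytes_per_s):
    return delta_bytes / throughput_bytes_per_s

# A 1 GiB delta on a ~100 MiB/s spinning array vs a ~500 MiB/s SSD array
slow = pause_seconds(1024 ** 3, 100 * 1024 ** 2)
fast = pause_seconds(1024 ** 3, 500 * 1024 ** 2)
print(round(slow, 2), slow <= TIMEOUT)  # just misses the 10 s default
print(round(fast, 2), fast <= TIMEOUT)  # fits comfortably
```

This is the rationale behind Wescoeur's suggestion above of per-device tuning: the same delta size that is safe to coalesce live on a fast SSD array can blow well past the timeout on slower storage.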

stormi commented 4 years ago

Patch for new logic available in the updates for XCP-ng 8.1 beta: https://xcp-ng.org/forum/post/22794

Feedback highly welcome!

stormi commented 4 years ago

XCP-ng 8.1 was released with the new coalesce logic. I'm leaving this issue open in case tweaking the default values remains necessary in some situations.