Because the lock is protecting the LVM system metadata, and that is global. We have had numerous bugs as a result of an LVM operation on one LV corrupting the metadata of an entirely unrelated volume group.
Thanks, but just to be sure: is it a pure LVM problem, not directly related to the smapi?
Yes, it is a pure LVM2 issue (although in large part due to the way it is used in the hypervisor). The upstream LVM system expects you to use lvmetad, which the hypervisor can't do because it runs LVM on multiple hosts without clustered LVM (clvm). This means the SM code diverges quite a lot from upstream expectations. Ultimately the answer is to deprecate the LVMo{ISCSI|HBA} SR types, but that will take a long time. Migrating to clustered LVM (which is what Red Hat would advise) is not possible without incurring a complete Pool shutdown, and so is not politically feasible.
Okay, that makes sense! Thank you. Indeed, in that case I guess a good way to use the driver is to avoid running too many requests in parallel, or to move to another, clustered driver as you said.
@MarkSymsCtx Just another question: is it possible to always keep LVM volumes activated, to avoid running many commands like lvchange -ay? What is the purpose of deactivating them when they are not in use?
Because they can only be active on one host at a time to prevent data corruption.
I'm not sure I understand: isn't the lock /var/lock/sm/.nil/lvm enough to prevent corruption?
Two distinctly different things.
/var/lock/sm/.nil/lvm protects the host-local LVM metadata from corruption by concurrent LVM control operations on the same host. Deactivating LVs when they are not in use on a given host (and only ever having an LV active on one or zero hosts) prevents corruption of the LV contents when a different host is writing into the LV.
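To make the distinction concrete, here is a minimal sketch of the two mechanisms, assuming a plain flock on the lock file and the standard lvchange activation flags; the paths and helper names are illustrative only, not the actual SM implementation:

```python
# Illustrative sketch only: host-local lock for LVM control operations,
# plus exclusive activation of the LV on the host that uses it.
import fcntl
import subprocess

LVM_LOCK = "/var/lock/sm/.nil/lvm"

def run_lvm(args):
    """Run an LVM command while holding the host-local lock.

    This serialises LVM control operations on *this* host so that
    concurrent tool invocations cannot corrupt the LVM metadata.
    """
    with open(LVM_LOCK, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # released when the file closes
        subprocess.check_call(["/sbin/lvm"] + args)

def attach(vg, lv):
    # Activate the LV on this host only; the SM control plane has to ensure
    # no other host has it active, otherwise the LV contents can be corrupted.
    run_lvm(["lvchange", "-ay", "%s/%s" % (vg, lv)])

def detach(vg, lv):
    # Deactivate when no longer in use so another host may activate it.
    run_lvm(["lvchange", "-an", "%s/%s" % (vg, lv)])
```

The file lock only addresses the first concern (metadata operations on one host); it does nothing to stop two hosts writing into the same LV, which is why activation is restricted to at most one host at a time.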
So when a different host writes into the LV, the smapi locks are not sufficient to prevent corruption? I don't see exactly where the problem is; could you be more precise, please?
To give more context: we have a pool of 3 hosts with ~1800 VDIs using the LVMoiSCSI SR driver (~22 TiB SSD). When many snapshots are requested in parallel, a vdi_snapshot command takes 3 min 30 s on a chain of 5+ VHDs, whereas it takes only ~3 s when a single VDI is present in the SR. So it's a big difference. As I said previously, at least 1 min 50 s of that time is lost waiting on this lock. Is that a surprise to you in this context? We would like to improve the LVM driver to handle this kind of case, if possible.
> So when a different host writes into the LV, the smapi locks are not sufficient to prevent corruption? I don't see exactly where the problem is; could you be more precise, please?
Correct, which is why locks are not used to protect against this. Instead, the SM control plane ensures that LVs are only ever active on one host at a time.
We already wrap the snapshot operation in a higher-level LockContext: https://github.com/xapi-project/sm/blob/master/drivers/LVHDSR.py#L1775. After much experimentation this was about as optimal as we could get it without introducing problems with either LV or metadata corruption.
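As a rough sketch of the shape such a higher-level lock can take, here is a per-SR lock context; this is illustrative only, the real LockContext in LVHDSR.py is more involved and this is not its code:

```python
# Hedged sketch of a per-SR lock context; illustrative only.
import fcntl
import os
from contextlib import contextmanager

@contextmanager
def sr_lock(sr_uuid, name="sr"):
    """Hold an exclusive lock scoped to one SR for the duration of an operation."""
    lock_dir = os.path.join("/var/lock/sm", sr_uuid)  # illustrative path
    os.makedirs(lock_dir, exist_ok=True)
    with open(os.path.join(lock_dir, name), "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)

# e.g. wrap the whole snapshot sequence:
# with sr_lock("e9722947-a01a-8417-1edf-e015693bb7c9"):
#     snapshot_vdi(...)   # hypothetical helper
```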
So we did a test with the modification, and it was only marginally better (1 min 32 s without the modification, 1 min 29 s with).
I wonder, @MarkSymsCtx, is that a "possible"/expected time for a single snapshot (even with 3 SRs connected to these hosts and 1800+ VDIs on LVMoiSCSI)?
Expected? No. Possible? Yes. The issue is not the number of SRs or the number of VDIs but the number of links in the snapshot chain, because of the sequence of operations required to take a snapshot of a running VM.
Add to this that the LVM userspace tools (lvchange, lvcreate, etc.) are themselves very slow, and the time for a snapshot becomes significant.
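If you want to quantify that per-invocation overhead on a given host, a simple measurement like the following is enough; lvs and vgs are read-only reporting commands, and the numbers will of course vary with the setup:

```python
# Measure the per-invocation overhead of the LVM userspace tools.
import subprocess
import time

def time_cmd(args, runs=5):
    """Return the best wall-clock time over a few runs of a command."""
    best = float("inf")
    for _ in range(runs):
        start = time.monotonic()
        subprocess.run(args, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        best = min(best, time.monotonic() - start)
    return best

# Read-only commands: they report state and do not modify anything.
print("lvs: %.3fs" % time_cmd(["lvs"]))
print("vgs: %.3fs" % time_cmd(["vgs"]))
```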
All of this is technical debt in the design and implementation of the LVMo{ISCSI|HBA} SRs, dating back over a decade and now so tightly ingrained that it is practically impossible to address, especially with the limited investment opportunities available to do so.
In an ideal world with no resourcing restrictions, a new SR based on SMAPIv3 would replace the LVM-based SRs (including, at minimum, inbound storage migration) and remove the layers of technical debt, layering violations, and architectural dead-ends, but it is unlikely to happen.
This is clearly something we are interested in doing (or at least in investing resources into). It would be great to have a discussion at some point on the big "steps", so we can be more productive than going in blind without your experience of the previous drawbacks of SMAPIv1 (and taking inspiration from what's done elsewhere).
I completely agree with the minimal requirements you mention before getting there.
There are design sessions at the next Xen Summit where we could discuss that, but we could also plan a session before then, outside of it. Would you be interested?
Hi!
I have several questions about this lock: https://github.com/xapi-project/sm/blob/master/drivers/lvutil.py#L183-L186
Why is there a single global lock (/var/lock/sm/.nil/lvm) for all volumes and LVM SRs? For example, when an LVM command is executed, would it be possible to use a lock path like /var/lock/sm/lvm/e9722947-a01a-8417-1edf-e015693bb7c9/cd9d03a6-ac19-42cf-9e8b-8625c0fa029b instead?

I ask these questions because in some cases, for example when several hundred snapshot commands are executed, the performance quickly becomes disastrous. Looking at the execution of a single snapshot, it can take 3 min 30 s, including 1 min 50 s lost because of this lock (since many parallel commands are executed). Would it be complicated or feasible to have a more specific lock for each VDI/SR?
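To illustrate the contention I mean, here is a toy model (purely illustrative, not SM code, and the timings are invented): with one global lock, N parallel operations that each hold the lock for the duration of an LVM call serialise completely, whereas per-SR locks only serialise operations on the same SR.

```python
# Toy model of lock contention: not the SM code, numbers are invented.
import threading
import time

CALL_TIME = 0.2       # pretend each LVM command takes 200 ms under the lock
WORKERS_PER_SR = 8
SRS = 4

def worker(lock):
    with lock:
        time.sleep(CALL_TIME)   # stand-in for lvcreate/lvchange/etc.

def run(locks):
    threads = [threading.Thread(target=worker, args=(locks[i % SRS],))
               for i in range(WORKERS_PER_SR * SRS)]
    start = time.monotonic()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.monotonic() - start

global_lock = threading.Lock()
print("single global lock: %.1fs" % run([global_lock] * SRS))   # ~N * CALL_TIME
print("one lock per SR:    %.1fs" % run([threading.Lock() for _ in range(SRS)]))
```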
Thank you!