Because the lock is protecting the LVM system metadata, and that is global. We have had numerous bugs as a result of an LVM operation on one LV corrupting the metadata of an entirely unrelated volume group.
Thanks, but just to be sure: is it a pure LVM problem, not directly related to the smapi?
Yes, it is a pure LVM2 issue (although in large part due to the way it is used in the hypervisor). The upstream LVM system expects you to use lvmetad, which the hypervisor can't do because it runs LVM on multiple hosts without clustered LVM (clvm). This means the SM code diverges quite a lot from upstream expectations. Ultimately the answer is to deprecate the LVMo{ISCSI|HBA} SR types, but that will take a long time. Migrating to clustered LVM (which is what Red Hat would advise) is not possible without incurring a complete Pool shutdown, and so is not politically feasible.
Okay, that makes sense! Thank you. Indeed, in that case I guess a good way to use the driver is to avoid running too many requests in parallel, or to move to another, clustered driver as you said.
@MarkSymsCtx Just another question: is it possible to always keep LVM volumes activated, to avoid running many commands like lvchange -ay? What is the purpose of deactivating them when they are not in use?
Because they can only be active on one host at a time to prevent data corruption.
I'm not sure I understand: isn't the lock /var/lock/sm/.nil/lvm enough to prevent corruption?
Two distinctly different things.
/var/lock/sm/.nil/lvm protects the host-local LVM metadata from corruption by concurrent LVM control operations on the same host. Deactivating LVs when they are not in use on a given host (and only ever having an LV active on one or zero hosts) prevents corruption of the LV contents when a different host is writing into the LV.
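To make the distinction concrete, here is a minimal sketch of the two mechanisms, assuming a plain flock on the lock file and the standard lvchange activation flags; the paths and helper names are illustrative only, not the actual SM implementation:

```python
# Illustrative sketch only: host-local lock for LVM control operations,
# plus exclusive activation of the LV on the host that uses it.
import fcntl
import subprocess

LVM_LOCK = "/var/lock/sm/.nil/lvm"

def run_lvm(args):
    """Run an LVM command while holding the host-local lock.

    This serialises LVM control operations on *this* host so that
    concurrent tool invocations cannot corrupt the LVM metadata.
    """
    with open(LVM_LOCK, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # released when the file closes
        subprocess.check_call(["/sbin/lvm"] + args)

def attach(vg, lv):
    # Activate the LV on this host only; the SM control plane has to ensure
    # no other host has it active, otherwise the LV contents can be corrupted.
    run_lvm(["lvchange", "-ay", "%s/%s" % (vg, lv)])

def detach(vg, lv):
    # Deactivate when no longer in use so another host may activate it.
    run_lvm(["lvchange", "-an", "%s/%s" % (vg, lv)])
```

The file lock only addresses the first concern (metadata operations on one host); it does nothing to stop two hosts writing into the same LV, which is why activation is restricted to at most one host at a time.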
So when a different host writes into the LV, the smapi locks are not sufficient to prevent corruption? I don't see exactly where the problem is; could you be more precise, please?
To give more context: we have a pool of 3 hosts with ~1800 VDIs using the LVMoiSCSI SR driver (~22 TiB SSD). When many snapshots are requested in parallel, a vdi_snapshot command takes 3 min 30 s on a chain of 5+ VHDs, whereas it takes only ~3 s when a single VDI is present in the SR. So it's a big difference. As I said previously, at least 1 min 50 s of that time is lost waiting on this lock. Is that a surprise to you in this context? We would like to improve the LVM driver to handle this kind of case, if possible.
> So when a different host writes into the LV, the smapi locks are not sufficient to prevent corruption? I don't see exactly where the problem is; could you be more precise, please?
Correct, which is why locks are not used to protect against this. Instead, the SM control plane ensures that LVs are only ever active on one host at a time.
We already wrap the snapshot operation in a higher-level LockContext: https://github.com/xapi-project/sm/blob/master/drivers/LVHDSR.py#L1775. After much experimentation this was about as optimal as we could get it without introducing problems with either LV or metadata corruption.
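As a rough sketch of the shape such a higher-level lock can take, here is a per-SR lock context; this is illustrative only, the real LockContext in LVHDSR.py is more involved and this is not its code:

```python
# Hedged sketch of a per-SR lock context; illustrative only.
import fcntl
import os
from contextlib import contextmanager

@contextmanager
def sr_lock(sr_uuid, name="sr"):
    """Hold an exclusive lock scoped to one SR for the duration of an operation."""
    lock_dir = os.path.join("/var/lock/sm", sr_uuid)  # illustrative path
    os.makedirs(lock_dir, exist_ok=True)
    with open(os.path.join(lock_dir, name), "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)

# e.g. wrap the whole snapshot sequence:
# with sr_lock("e9722947-a01a-8417-1edf-e015693bb7c9"):
#     snapshot_vdi(...)   # hypothetical helper
```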
So we did a test with the modification, and it was only marginally better (1 min 32 s without the modification, 1 min 29 s with).
I wonder, @MarkSymsCtx, is that a "possible"/expected time for a single snapshot (even with 3 SRs connected to these hosts and 1800+ VDIs on LVMoiSCSI)?
Expected? No. Possible? Yes. The issue is not the number of SRs or the number of VDIs but the number of links in the snapshot chain, because of the sequence of operations required to take a snapshot of a running VM.
Add to this that the LVM userspace tools (lvchange, lvcreate, etc.) are themselves very slow, and the time for a snapshot becomes significant.
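If you want to quantify that per-invocation overhead on a given host, a simple measurement like the following is enough; lvs and vgs are read-only reporting commands, and the numbers will of course vary with the setup:

```python
# Measure the per-invocation overhead of the LVM userspace tools.
import subprocess
import time

def time_cmd(args, runs=5):
    """Return the best wall-clock time over a few runs of a command."""
    best = float("inf")
    for _ in range(runs):
        start = time.monotonic()
        subprocess.run(args, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        best = min(best, time.monotonic() - start)
    return best

# Read-only commands: they report state and do not modify anything.
print("lvs: %.3fs" % time_cmd(["lvs"]))
print("vgs: %.3fs" % time_cmd(["vgs"]))
```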
All of this is technical debt in the design and implementation of the LVMo{ISCSI|HBA} SRs, dating back over a decade and now so tightly ingrained that it is practically impossible to address, especially with the limited investment opportunities available to do so.
In an ideal world with no resourcing restrictions, a new SR based on SMAPIv3 would replace the LVM-based SRs (including, at minimum, inbound storage migration) and remove the layers of technical debt, layering violations, and architectural dead-ends, but it is unlikely to happen.
This is clearly something we are interested in doing (or at least in investing resources into). It would be great to have a discussion at some point on the big "steps", so we can be more productive than going in blind without your experience of the previous drawbacks of SMAPIv1 (and taking inspiration from what's done elsewhere).
I completely agree with the minimal requirements you mention before getting there.
There are design sessions at the next Xen Summit where we could discuss that, but we could also plan a session before then, outside of it. Would you be interested?
Hi!
I have several questions about this lock: https://github.com/xapi-project/sm/blob/master/drivers/lvutil.py#L183-L186
Why is there a single global lock (/var/lock/sm/.nil/lvm) for all volumes and LVM SRs? For example, when an LVM command is executed, would it be possible to use a lock path like /var/lock/sm/lvm/e9722947-a01a-8417-1edf-e015693bb7c9/cd9d03a6-ac19-42cf-9e8b-8625c0fa029b instead?

I ask these questions because in some cases, for example when several hundred snapshot commands are executed, the performance quickly becomes disastrous. Looking at the execution of a single snapshot, it can take 3 min 30 s, including 1 min 50 s lost because of this lock (since many parallel commands are executed). Would it be complicated or feasible to have a more specific lock for each VDI/SR?
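To illustrate the contention I mean, here is a toy model (purely illustrative, not SM code, and the timings are invented): with one global lock, N parallel operations that each hold the lock for the duration of an LVM call serialise completely, whereas per-SR locks only serialise operations on the same SR.

```python
# Toy model of lock contention: not the SM code, numbers are invented.
import threading
import time

CALL_TIME = 0.2       # pretend each LVM command takes 200 ms under the lock
WORKERS_PER_SR = 8
SRS = 4

def worker(lock):
    with lock:
        time.sleep(CALL_TIME)   # stand-in for lvcreate/lvchange/etc.

def run(locks):
    threads = [threading.Thread(target=worker, args=(locks[i % SRS],))
               for i in range(WORKERS_PER_SR * SRS)]
    start = time.monotonic()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.monotonic() - start

global_lock = threading.Lock()
print("single global lock: %.1fs" % run([global_lock] * SRS))   # ~N * CALL_TIME
print("one lock per SR:    %.1fs" % run([threading.Lock() for _ in range(SRS)]))
```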
Thank you!