xapi-project / blktap

blktap, vhd stuff
Other
43 stars 81 forks source link

Team work on getting blktap in recent kernel #325

Open olivierlambert opened 4 years ago

olivierlambert commented 4 years ago

Hi there!

We started to think about various changes on blktap, especially to achieve these goals:

  1. IO_uring support for faster IOPS (we already have a PoC)
  2. Faster integration/merge in recent Linux kernels
  3. Ideally, having it upstream Linux if possible at some point

Obviously, doing this together and going into the right direction after some architecture decision might be faster for everyone, deliver better perfs for both Citrix Hypervisor and XCP-ng. What do you think? Where you'd like to start?

edwintorok commented 4 years ago

For 2/3: I think the goal is to use NBD instead of the block device, so we do not require a kernel patch anymore. io_uring would be interesting, but it is not available in the 4.19 kernel that we use.

olivierlambert commented 4 years ago

I'm not thinking about 4.19 but more recent kernel (cf the title of the issue). Something "future proof". So far, it's really hard to merge current blktap code in 5.x kernels.

Regarding 2 and 3, if you use NBD, what about VHD format? Is it mutually exclusive with it? So in short, you'll use NBD directly between the guest and dom0?

MarkSymsCtx commented 4 years ago

No, we'll use NBD from qemu to tapdisk without going via /dev/tdX (symlinked to /dev/sm/backend/XX) which will be deprecated and removed as the patch in the kernel to provide those devices cannot be forward ported to any 5.x or later kernel. Tapdisk will still write the data to VHD files/volumes.

MarkSymsCtx commented 4 years ago

We know that the current blktap patch will not be accepted into the upstream kernel, it's been asked previously, so the correct way forward is to not use it at all.

BenSimsCitrix commented 4 years ago

The nbd refit is almost complete with only a few bugs to fix in the toolstack.

Regarding the use of io_uring, If was hoping to extend the work done in https://github.com/xapi-project/blktap/pull/319/commits/ba79b736c3f6b2073fe94f50b07aca70249fa010. But if you have a similar plugin architecture that's fine.

We need some thought about howto fully systemd tapdisk.

olivierlambert commented 4 years ago

We'll be happy to help! Do you have any preliminary perfs results from switching from blktap to NBD? (if I'm correct with the way to make a recap of what you say here).

We can also help on the testing side of things, as long we know what branches to build together to generate RPM packages and start to do the bench work (we have some automated scripts to do nice fio graphs).

About your systemd questions, feel free to share your questions/blockers, more brains could mean faster thinking sometimes!

Wescoeur commented 3 years ago

Hi everyone, I would like to have more info concerning your message @MarkSymsCtx:

No, we'll use NBD from qemu to tapdisk without going via /dev/tdX (symlinked to /dev/sm/backend/XX) which will be deprecated and removed as the patch in the kernel to provide those devices cannot be forward ported to any 5.x or later kernel. Tapdisk will still write the data to VHD files/volumes.

I'm not sure to understand. Is this diagram up to date? https://wiki.xenproject.org/images/0/06/Blktap%24blktap_diagram_differentSymbols.png

What's the usage of qemu here? If I understand correctly the goal is to use a qemu instance in the host user space to write/read from the VM disk using blkback instead of the blktap patch or another driver? The data is then forwarded to tapdisk using NBD, isn'it? So to use a VM disk, we must have a qemu process + tapdisk to just read/write using NBD. Why not use qemu directly without tapdisk, like qemu-dp in SMAPIv3 but with VHD support?

Thank you in advance for your clarification!

MarkSymsCtx commented 3 years ago

HVM guests have a qemu process which handles the "emulated" devices required at boot. This is the primary thing requiring the /dev/blktap/blktapX devices. Once the PV drivers have loaded in guest all disk IO is performed via the PV ring direct to the tapdisk user space process. The diagram above is incorrect and shows blkfrnt communicating with in-kernel blktap. This doesn't happen, and hasn't been the case since blktap3. As the qemu "device model" process is also capable of taking its boot disk parameter as an NBD url (as is used in the case of the Citrix GFS2 SR as it doesn't provide a Dom0 block device interface) and the userspace tapdisk process also has an NBD server in it we can change the consumers of the block device to use NBD and thus no longer require the unsupportable in-kernel device, at which point the kernel patch can be dropped entirely. This will need to happen before moving to a 5.x kernel.

Wescoeur commented 3 years ago

Oh! I see, it's more clear now! I had trouble understanding the architecture, so you were talking about qemu-dm! :slightly_smiling_face: So NBD or the blktap patch is only used for HVM or PVM during the boot phase. Thank you so much!

MarkSymsCtx commented 3 years ago

Boot or before PV drivers are installed in the case of Window, yes.

jandryuk commented 3 years ago

I ported blktap2 to 5.4: https://github.com/OpenXT/xenclient-oe/blob/master/recipes-kernel/linux/5.4/patches/blktap2.patch

OpenXT uses tap-ctl to extract kernels and hash disk images during pre-boot.

Having said that, I wish for it to go away. Will tapdisk work today without the blktap2 module?

olivierlambert commented 3 years ago

@jandryuk we did it too IIRC (@Wescoeur did). But yes, we want to move forward too and make everything work without it. Maybe we should discuss that more in depth somewhere?

jandryuk commented 3 years ago

@olivierlambert do you have a pointer to your blktap2 patch? I had to fix some bugs, so I'm interested to compare.

OpenXT is tightly bound to tap-ctl and blktap2, so it's not something I am looking to tackle any time soon, sorry.

MarkSymsCtx commented 3 years ago

It won't currently work without as all the tap-ctl commands require the device minor id as a parameter in order to identify which disk the command refers to as tapdisk is able to handle multiple VBDs in a single tapdisk process (not that it gets used this way, certainly not in Citrix Hypervisor and I don't think in either xcp-ng or OpenXT either). This will require some careful thought, design and architectural oversight to address likely as a several phase operation, i.e. add a new parameter that can optionally (and preferentially) be used in place of the device minor id, update the control plane code and xenopsd to use this identifier and track down any places that fail or continue to use the "wrong" thing. Then finally remove the old parameter and the kernel patch.

olivierlambert commented 3 years ago

@Wescoeur do you remember where we stored your draft work on getting recent kernel working?

@MarkSymsCtx the best course will be obviously (for XCP-ng and likely Citrix Hypervisor) to get rid entirely on this legacy stuff. As we said earlier, we'll be happy to spend some time and efforts on this :+1:

MarkSymsCtx commented 3 years ago

A large chunk of it is sort of done, basically replacing use of /dev/tdX in Dom0 with NBD to the NBD socket provided by the server in Tapdisk and this completes end to end testing. But the next bits are to update what goes into XenStore as backend configuration keys, hint physical-device is no longer helpful, and make sure that tapback does the right thing with that. Then the designing of the updated commandlines needs to happen to allow the control plane to interact with tap-ctl without referring to the identity of the block device that we want to remove. (Oh, and all of this without breaking Storage XenMotion of course, it being the only current usage of the NBD server).

Wescoeur commented 3 years ago

@olivierlambert @jandryuk We have it in an internal gitlab. It's a quick and dirty poc but I can create a public repo.

@MarkSymsCtx Thank you for the details! :+1:

Wescoeur commented 3 years ago

Hi @MarkSymsCtx,

We are very interested in speeding up the removal of this patch and contributing on this repo. Could we have information about which projects use the "physical-device" property? I know xenopsd writes this value, tapdisk reads it, but are there other places where it’s used?

If I don't write the "physical-device" property I can start a VM using a nbd+unix URI with the socket path (I already patched my local smapi to do that), but I don’t understand where the change between emulation and PV driver occurs. What piece of code is in charge of this please? Is there a piece of code to modify in tapdisk or elsewhere to finalize this switch? Thank you!

Wescoeur commented 3 years ago

@MarkSymsCtx Do you have more info please? :slightly_smiling_face:

MarkSymsCtx commented 3 years ago

Not at this time, no I don't