Take-away lessons of a volume plugin (PoC) for zfs

MatiasVara commented 2 years ago

Hello everyone,

In this issue, I would like to share my experience writing a volume plugin for the zfs filesystem for SMAPIv3. In this implementation, the volume plugin represents Storage Repositories (SRs) as pools and Volumes as zfs volumes. A zfs volume is a dataset that represents a block device and is created in the context of a filesystem. Once created, a volume can be accessed as a raw block device, e.g., /dev/zvol/pepe. Zfs supports operations on volumes such as snapshot(), clone() and promote(). This simplifies the driver but, in some cases, the actual implementation becomes a bit tricky because XAPI performs some operations in an order that zfs does not allow. I thought this PoC could help to better understand why those tasks are tricky. This is a summary, and there are still many things that I am not sure about.

First of all, the xe sr-create command ends up invoking zpool create [name] to create a pool. The only required parameter is the list of block devices on which the zfs filesystem will be installed. The xe vdi-create command ends up invoking zfs create -V [size] [pool]/[name]. This creates a new volume in the pool. The volume can then be accessed as a raw block device at /dev/zvol/[pool]/[name].
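To make the mapping from xe commands to zfs commands concrete, here is a minimal sketch in Python that only builds the command lines and does not execute them. The function names are hypothetical and are not taken from the PoC; the argument order follows the standard `zpool create` and `zfs create -V` syntax, and the device path assumes the usual Linux layout under /dev/zvol.

```python
from typing import List


def zpool_create_cmd(pool: str, devices: List[str]) -> List[str]:
    """Build the argv for creating the pool that backs an SR (hypothetical helper)."""
    if not devices:
        # The block devices are the only required parameter of sr-create.
        raise ValueError("at least one block device is required")
    return ["zpool", "create", pool, *devices]


def zvol_create_cmd(pool: str, name: str, size_bytes: int) -> List[str]:
    """Build the argv for creating the zfs volume that backs a VDI (hypothetical helper)."""
    # zfs expects the size right after -V, followed by the dataset name pool/name.
    return ["zfs", "create", "-V", str(size_bytes), f"{pool}/{name}"]


def zvol_device_path(pool: str, name: str) -> str:
    """Path of the raw block device exposed for a zfs volume (assumed Linux layout)."""
    return f"/dev/zvol/{pool}/{name}"
```

A caller could pass the returned lists to `subprocess.run(cmd, check=True)`; keeping command construction separate from execution makes it possible to test the plugin logic without a zfs installation.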
In zfs, snapshots are taken with the zfs snapshot [volume]@[name] command. When a new snapshot is created, the new volume.id is used as the name of the snapshot. Snapshots are read-only. To access the content of a snapshot, we have to either mount it or create a clone from it. Otherwise, the snapshot is not accessible. The current PoC always creates a clone when a snapshot is taken. The clone has the same name as the snapshot, but it belongs to the pool in which it is created. You can see this in the following output: when the snapshot @2 is created, the cloned volume 2 is created too.

$ zfs list
NAME       USED  AVAIL     REFER  MOUNTPOINT
hola      10,3G  8,58G     48,5K  /hola
hola/1    10,3G  18,9G       12K  -
hola/1@2     0B      -       12K  -
hola/2       0B      -       12K  -
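The listing above could be produced by a pair of commands like the ones built in this sketch. It is an illustration only: the function name is hypothetical, and the sketch assumes the clone is placed directly under the pool and named after the snapshot id, as described in the text.

```python
from typing import List


def snapshot_with_clone_cmds(pool: str, volume: str, snap_id: str) -> List[List[str]]:
    """Build the commands that snapshot a volume and then clone the snapshot."""
    snapshot = f"{pool}/{volume}@{snap_id}"  # e.g. hola/1@2
    clone = f"{pool}/{snap_id}"              # e.g. hola/2, a writable copy of the snapshot
    return [
        ["zfs", "snapshot", snapshot],
        ["zfs", "clone", snapshot, clone],
    ]
```

For the pool in the listing, `snapshot_with_clone_cmds("hola", "1", "2")` yields the snapshot `hola/1@2` and the clone `hola/2`.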

This cloned volume is important when issuing xe vm-copy, i.e., create-vm-from-snapshot, since the command needs to access the volume to copy its content into the new VM's VDI. This is not possible if the volume is a snapshot.

Another example is the command xe snapshot-revert, which reverts the state of a VM to a snapshot. The first step tries to destroy the current VDI. However, this is not possible since the current VDI has children, e.g., the snapshot. The correct way to do it is to clone directly from the snapshot, promote the new volume and finally destroy the main VDI. The current PoC only accepts snapshots as the source of the clone method, and it works around the problem by destroying the parent VDI just after the new volume is promoted. The current implementation of xe snapshot-revert is as follows:

Destroy the parent VDI, which fails to remove it from the db and from zfs
Clone from snapshot
- A volume is created from the snapshot
- The volume is promoted
- The parent volume is destroyed
- The children of the main volume are promoted to the clone.
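The clone, promote and destroy steps above can be sketched as a list of zfs commands. This is a sketch under stated assumptions, not the PoC's code: the function name, the `new_id` parameter and the `snapshots_oldest_first` list are hypothetical, and the check only models the case where the revert targets the latest snapshot of the parent. It does not handle the helper clone that the PoC creates for each snapshot.

```python
from typing import List


def revert_cmds(pool: str, parent: str, snap_id: str, new_id: str,
                snapshots_oldest_first: List[str]) -> List[List[str]]:
    """Build the commands that replace a volume with a clone of one of its snapshots."""
    # Assumption: only the most recent snapshot is supported. Newer snapshots
    # would stay on the parent after the promote and block its destruction.
    if not snapshots_oldest_first or snapshots_oldest_first[-1] != snap_id:
        raise ValueError("revert is only supported from the latest snapshot")
    snapshot = f"{pool}/{parent}@{snap_id}"
    new_volume = f"{pool}/{new_id}"
    return [
        ["zfs", "clone", snapshot, new_volume],  # a volume is created from the snapshot
        ["zfs", "promote", new_volume],          # the snapshot now belongs to the new volume
        ["zfs", "destroy", f"{pool}/{parent}"],  # the old parent no longer owns any snapshot
    ]
```

After `zfs promote`, the origin snapshot and any earlier snapshots move to the promoted clone, which is why the old parent can then be destroyed.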

Note that this works only when reverting to the latest snapshot. Otherwise, the main volume can't be destroyed because there are still newer snapshots that can't be promoted to the new clone.

The current implementation of volume destroy relies on the zfs destroy command. The method checks whether the target is a snapshot or a volume. If it is a snapshot, the method first builds the correct path to the snapshot and then destroys it. The method also checks whether there is a clone with the same name and destroys it. This removes the clone that the snapshot command creates. Note that when trying to destroy a volume with children, zfs destroy fails but vdi-destroy succeeds.

This is an overall summary of the current PoC. I may be missing some pieces. I may release a design document soon that explains all the implementation details and decisions.
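As an illustration of the destroy logic described above, here is a minimal sketch that decides which zfs destroy commands to issue. The names are hypothetical: `snapshot_parent` stands in for whatever lookup the plugin uses to tell snapshots from volumes, and `datasets` for the set of existing dataset names. The sketch destroys the helper clone before the snapshot, because zfs refuses to destroy a snapshot that still has dependent clones; whether the PoC uses the same order is an assumption.

```python
from typing import Dict, List, Set


def destroy_cmds(pool: str, vol_id: str, snapshot_parent: Dict[str, str],
                 datasets: Set[str]) -> List[List[str]]:
    """Build the zfs destroy commands for a volume id (hypothetical helper)."""
    cmds: List[List[str]] = []
    if vol_id in snapshot_parent:
        # The id refers to a snapshot: remove the clone with the same name
        # first, if it exists, then the snapshot itself.
        clone = f"{pool}/{vol_id}"
        if clone in datasets:
            cmds.append(["zfs", "destroy", clone])
        cmds.append(["zfs", "destroy", f"{pool}/{snapshot_parent[vol_id]}@{vol_id}"])
    else:
        # Plain volume: zfs fails here if the volume still has children.
        cmds.append(["zfs", "destroy", f"{pool}/{vol_id}"])
    return cmds
```

A caller that runs these commands should check their exit status, so that a failing zfs destroy is reported instead of letting vdi-destroy succeed silently.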

psafont commented 2 years ago

The code here was merged into xapi-project:xen-api. We can set up a session this week, probably with @edwintorok, to talk about smapi v3 and future development.

MatiasVara commented 2 years ago

I set up a design session already if you would like to talk about this topic.

