zfsonlinux / pkg-zfs

Native ZFS packaging for Debian and Ubuntu
https://launchpad.net/~zfs-native/+archive/daily

boot from snapshot broken #201

Closed cfellinger closed 8 years ago

cfellinger commented 8 years ago

Hi, first of all thanks for all the work to get ZFS on Linux, and in particular into Debian.

Unfortunately I can't get booting from snapshot working. After inspecting the code [from zfs-linux (0.6.5.6-3) jessie] I think I spotted some errors, but still no success.

In /usr/share/initramfs-tools/scripts/zfs, at line 281, ZFS_BOOTFS gets bluntly overwritten with the bootfs property from rpool. I thought the kernel command line was supposed to override this property. So probably something like the following would do:

if import_pool "${ZFS_RPOOL}" && [ -z "${ZFS_BOOTFS}" ]
then
    rfs="$(find_rootfs "${ZFS_RPOOL}")" && ZFS_BOOTFS="${rfs}"
fi

And then in /etc/zfs/zfs-functions, at lines 1084/85, what looks like good code doesn't do what's expected. The list of snapshots sticks together, so a multiline value ends up being used as the snapshot name. Probably IFS got set somewhere, but here my knowledge of bash falls short.
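The sticking-together effect is easy to reproduce outside the boot scripts. A small demo (the dataset names are made up, and the assumption is that IFS got reduced to a single space somewhere early in the boot environment):

```shell
# With IFS set to a lone space, the newline-separated output of a command
# substitution is no longer split into words, so a for loop sees the whole
# snapshot list as one item.
list=$(printf 'rpool/ROOT/system@ONE\nrpool/ROOT/system@TWO\n')

IFS=' '                # the hostile setting the boot scripts seem to inherit
space_count=0
for x in $list; do space_count=$((space_count + 1)); done

unset IFS              # back to the default of space, tab and newline
default_count=0
for x in $list; do default_count=$((default_count + 1)); done

echo "IFS=' ': ${space_count} word(s); default IFS: ${default_count} word(s)"
```

With the lone-space IFS the loop runs once over the entire multiline value; with the default IFS it runs once per snapshot.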

As an aside, as coded now the selection of names matching snapname is a little brittle: it will also match when the dataset name itself, and not merely the snapshot part, matches. More robust would be to restrict the match to just the snapshot suffix, like:

for x in $("${ZFS}" list -H -oname -tsnapshot -r "${rootfs}" | \
    grep "@${snapname}$")
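To see the difference the `$` anchor makes, here is a small illustration with made-up snapshot names:

```shell
# Unanchored, the pattern matches any snapshot whose name merely starts
# with the requested one; anchored with "$", only the exact snapshot
# matches. Dataset names below are hypothetical.
snapshots='rpool/ROOT/system@snap
rpool/ROOT/system@snap-old'

loose=$(printf '%s\n' "$snapshots" | grep -c "@snap")    # matches both
exact=$(printf '%s\n' "$snapshots" | grep -c "@snap$")   # matches one
echo "loose: ${loose}, anchored: ${exact}"
```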

cfellinger commented 8 years ago

Partial success: the IFS problem can be mitigated with the following:

for x in $(IFS= "${ZFS}" list -H -oname -tsnapshot -r "${rootfs}" | \
    grep "@${snapname}$")

Now snapshots are properly cloned and boot succeeds. Alas, booting from nested bootfs datasets still doesn't work.

FransUrbo commented 8 years ago

My time is quite limited, but I'll try to work on this this weekend. If anyone has a patch, I'll be a much happier camper :).

cfellinger commented 8 years ago

you want patch files? I've attached the two I need to get it working.

zfs-functions--patch.txt initramfs-tools_scripts_zfs--patch.txt

I'll be away most of Saturday during the day, but otherwise I'll be available for testing.

FransUrbo commented 8 years ago

You mentioned that it doesn't work with nested datasets.

cfellinger commented 8 years ago

yes, things like:

mounted datasets:
NAME                       CANMOUNT  MOUNTPOINT           MOUNTED
rpool/ROOT/system            noauto  /target                  yes
rpool/ROOT/system/var            on  /target/var              yes
rpool/ROOT/system/var/log        on  /target/var/log          yes

this will boot properly, but not from snapshot. Most likely I'm doing something unsupported; the snapshots look like:

$ zfs get mountpoint rpool -r
NAME                           PROPERTY    VALUE            SOURCE
rpool                          mountpoint  none             local
rpool/ROOT                     mountpoint  none             inherited from rpool
rpool/ROOT/system              mountpoint  /target          local
rpool/ROOT/system@ONE          mountpoint  -                -
rpool/ROOT/system@TWO          mountpoint  -                -
rpool/ROOT/system/var          mountpoint  /target/var      inherited from  rpool/ROOT/system
rpool/ROOT/system/var@ONE      mountpoint  -                -
rpool/ROOT/system/var@TWO      mountpoint  -                -
rpool/ROOT/system/var/log      mountpoint  /target/var/log  inherited from rpool/ROOT/system
rpool/ROOT/system/var/log@ONE  mountpoint  -                -
rpool/ROOT/system/var/log@TWO  mountpoint  -                -
rpool/SWAP                     mountpoint  -                -

The boot stops with the following error message:

 Command: mount_fs rpool/ROOT/system_ONE
 Message:
 Error: 1

 Error: Failed to mount root filesystem 'rpool/ROOT/system_ONE'.

 ..BusyBox v1.22...

and zfs list shows no clones, not even for the root. I will try echo statements to see where it loses track.

As an aside: booting from an unknown snapshot, or from a mere @, doesn't work either; I get an empty list to choose from even though I have two snapshots.

FransUrbo commented 8 years ago

Then that needs to be fixed too. Do you have a patch for that too?

cfellinger commented 8 years ago

Ah, that one (asking which snapshot to use) proved to be easy, once I looked in the proper place that is :) just add "-r" in ask_user_snap() to get them all:

--- /etc/zfs/zfs-functions      2016-04-06 17:52:08.000000000 +0200
+++ /target/etc/zfs/zfs-functions       2016-04-16 01:53:46.309200222 +0200
@@ -1026,7 +1026,7 @@
            eval `echo SNAP_${i}="${snap}"`
            i="$((i + 1))"
        done <<EOT
-$("${ZFS}" list -H -oname -tsnapshot "${fs}")
+$("${ZFS}" list -H -oname -tsnapshot -r "${fs}")
 EOT

    echo -n "  Snap nr [0-$((i-1))]? " > /dev/stderr

For your convenience, added here as a patch file: ask-snap-patch.txt
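The numbering loop the hunk touches can be sketched standalone, loosely modelled on the snippet in the diff, with a canned here-document standing in for the live `zfs list -r` output:

```shell
# Stand-in sketch of ask_user_snap()'s numbering loop; the snapshot names
# are canned so it runs without a live pool. Each line becomes SNAP_1,
# SNAP_2, ... via eval, exactly the naming scheme the script uses.
i=1
while read snap; do
    eval "SNAP_${i}=\"\${snap}\""
    i=$((i + 1))
done <<EOT
rpool/ROOT/system@ONE
rpool/ROOT/system@TWO
rpool/ROOT/system/var@ONE
EOT
echo "  Snap nr [1-$((i - 1))]?"
```

Without `-r`, only the snapshots of the top-level dataset would land in the list, which is why nested datasets produced an empty choice menu.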

The problem with nested datasets keeps bugging me. I've tried a lot today; it seems to be related to IFS magic, so I'm at a loss here :(

FransUrbo commented 8 years ago

I consider this fixed in Wheezy, v0.6.5.6-4-wheezy and Jessie, v0.6.5.6-4

cfellinger commented 8 years ago

Thanks for looking into it, but alas it still doesn't work, not over here at least. What does work is mounting the sub-datasets. But the snapshot part is cut off, no cloning is performed, and the base version is mounted.

Unfortunately your patches were quite different from mine, so I have to redo the testing.

FransUrbo commented 8 years ago

Did you make sure to install the new version of /etc/zfs/zfs-functions?

cfellinger commented 8 years ago

Yes, of course, but you omitted part of the proposed patches :) the one in /usr/share/initramfs-tools/scripts/zfs, and so the ZFS_BOOTFS from the command line gets bluntly overruled:

--- /usr/share/initramfs-tools/scripts/zfs      2016-04-17 16:15:31.000000000 +0200
+++ /target/usr/share/initramfs-tools/scripts/zfs       2016-04-17 18:23:35.217403997 +0200
@@ -277,6 +277,7 @@
        if [ -n "${ZFS_RPOOL}" -a -z "${POOL_IMPORTED}" ]
        then
                import_pool "${ZFS_RPOOL}" && \
+                   [ -z "${ZFS_BOOTFS}" ] && \
                        rfs="$(find_rootfs "${ZFS_RPOOL}")" && \
                                ZFS_BOOTFS="${rfs}"

At least now I get to boot from snapshot AND the sub-datasets get properly cloned AND mounted.
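The effect of the added `[ -z ... ]` guard can be sketched with stub functions (import_pool and find_rootfs here are hypothetical stand-ins, not the real initramfs helpers):

```shell
# Stub sketch of the patched short-circuit chain.
import_pool() { return 0; }                   # pretend the import works
find_rootfs() { echo "rpool/ROOT/system"; }   # the pool's bootfs property

# Case 1: ZFS_BOOTFS came from the kernel command line, so it survives.
ZFS_BOOTFS="rpool/ROOT/system@TWO"
import_pool "rpool" && \
    [ -z "${ZFS_BOOTFS}" ] && \
        rfs="$(find_rootfs "rpool")" && \
            ZFS_BOOTFS="${rfs}"
cmdline_result="${ZFS_BOOTFS}"

# Case 2: nothing on the command line, so the bootfs property is used.
ZFS_BOOTFS=""
import_pool "rpool" && \
    [ -z "${ZFS_BOOTFS}" ] && \
        rfs="$(find_rootfs "rpool")" && \
            ZFS_BOOTFS="${rfs}"
echo "${cmdline_result} / ${ZFS_BOOTFS}"
```

The pool is still imported in both cases; only the fallback assignment is skipped when a boot filesystem was already given.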

When running in zfs_debug mode with no quiet on the command line, I get failure messages, like:

Begin: CMD: '' ... [...SPL..using hostid...]
done.
Begin: CMD: '' ... Failure: 2
Begin: CMD: '' ... Failure: 1
Begin: CMD: '' ... Failure: 1
Begin: CMD: '' ... Failure: 1
Begin: CMD: '' ... done.
Begin: CMD: '' ... done.
Begin: CMD: '' ... done.
DEBUG: imported pools
FransUrbo commented 8 years ago

> import_pool "${ZFS_RPOOL}" && \
>     [ -z "${ZFS_BOOTFS}" ] && \
>         rfs="$(find_rootfs "${ZFS_RPOOL}")" && \
>             ZFS_BOOTFS="${rfs}"

Ok, I can see the point of doing that. However, I need to ponder the repercussions of this.

> At least now I get to boot from snapshot AND the sub-datasets get properly cloned AND mounted.

So what's the problem?

> When running in zfs_debug mode with no quiet on the command line, I get failure messages

I'm going to need more information: the full output.

FransUrbo commented 8 years ago

> Begin: CMD: '' ... Failure: 2

These lines also look suspicious. I can find NO reference to the > ' < chars anywhere in the code! Nor do I know where the 'Failure: [12]' comes from.

From what you're describing, I can only conclude that you're using old code somewhere! You need to go through everything again and make absolutely sure that you're using my code! And ONLY that!

The files that should be used in this are:

* from the zfsutils package
   /etc/zfs/zfs-functions

* from the zfs-initramfs package
   /usr/share/initramfs-tools/conf-hooks.d/zfs
   /usr/share/initramfs-tools/hooks/zfs
   /usr/share/initramfs-tools/scripts/zfs

cfellinger commented 8 years ago

I'm quite sure I'm using your files, but I've checked with md5sums anyway:

root@rescue ~ # apt-get download zfsutils
Get:1 http://archive.zfsonlinux.org/debian/ jessie/main zfsutils amd64 0.6.5.6-4 [385 kB]
Fetched 385 kB in 0s (529 kB/s)  
root@rescue ~ # apt-get download zfs-initramfs
Get:1 http://archive.zfsonlinux.org/debian/ jessie/main zfs-initramfs amd64 0.6.5.6-4 [23.1 kB]
Fetched 23.1 kB in 0s (68.2 kB/s)        

extracted the relevant files and ran md5sums, and they're all equal

cfellinger commented 8 years ago

On the command-line-overrule problem: it's a bit tricky for me to get a full error report, as I'm testing in a qemu-system session and I don't know how to cut and paste from there. I could upload screenshots if that's okay.

FransUrbo commented 8 years ago

> extracted the relevant files and ran md5sums, and they're all equal

You need to trace the build of an initrd. Because it sure looks like a file NOT part of the package (or an old version somehow) is included instead!

Because I can't reproduce this and from what you describe of the output, something looks "fishy".

FransUrbo commented 8 years ago

> You need to trace the build of an initrd.

.. and make sure that initrd is actually used!

cfellinger commented 8 years ago

Well, I'm testing on a freshly debootstrapped machine, so I don't see what can go wrong :( What do you want me to do? "Trace the build of an initrd" doesn't ring a bell, but if I change lines (add debug lines) in those files, I sure get the expected result when I start the machine. Maybe you overlooked the part where I state "running in zfs_debug mode and no quiet on the commandline"; both are needed to get the above failure codes.

cfellinger commented 8 years ago

You wanted extra input on "ZFS_BOOTFS from the command line gets bluntly overruled"? Here is a screenshot: screenshot - 04172016 - 08 46 58 pm

FransUrbo commented 8 years ago

Now I see it, thanx:

--- /etc/zfs/zfs-functions~     2016-04-17 14:47:00.048097033 +0100
+++ /etc/zfs/zfs-functions      2016-04-17 20:35:05.638951706 +0100
@@ -162,7 +163,7 @@
        check_boolean "${quiet}" && \
                zfs_log_begin_msg "${MSG} "
        check_zfs_debug && \
-               zfs_log_begin_msg "CMD: '${cmd}'"
+               zfs_log_begin_msg "CMD: '${CMD}'"

        zfs_run_cmd "${CMD}"
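The one-character fix above comes down to shell variable names being case-sensitive; a tiny sketch (the command string is made up):

```shell
# Sketch of the bug: the surrounding code sets CMD, but the debug line
# logged ${cmd}, a different and never-set variable, so every debug line
# showed an empty command. ${cmd:-} is used so the demo also runs under
# `set -u`; the original script simply expanded the unset name.
CMD="zfs destroy -r rpool/ROOT/system_ONE"
buggy="CMD: '${cmd:-}'"   # what the broken log line produced
fixed="CMD: '${CMD}'"     # what the patched line produces
echo "${buggy}"
echo "${fixed}"
```

That is exactly why the debug output was full of mysterious "CMD: ''" lines.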

FransUrbo commented 8 years ago

Btw, when booting with zfsdebug=1, then when the last question comes up (continue boot or spawn a shell) you can retrieve a full log from /var/log/zfs-boot.debug.

cfellinger commented 8 years ago

It occurred to me that those failure lines only show up the second time I boot from the same snapshot, so at destroy time, and lo and behold, when I applied your patch (cmd => CMD) it says so:

screenshot-d

FransUrbo commented 8 years ago

Yeah, I saw the destroy problem too. Fixed with:

--- /etc/zfs/zfs-functions~     2016-04-17 20:35:05.638951706 +0100
+++ /etc/zfs/zfs-functions      2016-04-17 20:49:18.226373449 +0100
@@ -1084,9 +1084,10 @@
                # If the destination dataset for the clone
                # already exists, destroy it. Recursivly
                if [ $(get_fs_value "${rootfs}_${snapname}" type) ]; then
-                       filesystems="$("${ZFS}" list -oname -tfilesystem -H \
-                           -r -Sname "${ZFS_BOOTFS}")"
-                       for fs in ${filesystems}; do
+                       "${ZFS}" list -oname -tfilesystem -H -r -Sname \
+                               "${ZFS_BOOTFS}" | \
+                       while read fs
+                       do
                                destroy_fs "${fs}"
                        done
                fi

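Why the `while read` form in the fix is immune to a mangled IFS: read consumes one full line per iteration regardless of how word splitting is configured. A sketch with canned dataset names (deepest first, as `-Sname` would sort them) and a here-document standing in for the `zfs list` pipe:

```shell
# Even with IFS reduced to a single space, read hands over each dataset
# name intact, one line at a time, so every clone gets destroyed.
IFS=' '
destroyed=0
while read fs
do
    echo "would destroy: ${fs}"
    destroyed=$((destroyed + 1))
done <<EOT
rpool/ROOT/system_ONE/var/log
rpool/ROOT/system_ONE/var
rpool/ROOT/system_ONE
EOT
unset IFS
```

The old `for fs in ${filesystems}` form, by contrast, relied on IFS containing a newline to split the list at all.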
I'm building new packages as we speak, with both these fixes.

cfellinger commented 8 years ago

Ah, yet again an IFS problem. For some strange reason it gets set to a single space (why not " \t\n"?) early on, and that becomes the default setting, which doesn't work out too well for such for loops.

The fix for the command-line-overrule problem is still postponed? It seems an essential fix to get booting from snapshot working, here at my site that is.

FransUrbo commented 8 years ago

No, that's coming in this version as well. It makes sense: "kernel command line options are always right".

cfellinger commented 8 years ago

great, thanks a lot. That leaves me with just some nitpicking.

As I stated in the original issue report, the selection of snapshots to clone is a little brittle; it simply matches too much. Better would be to anchor the grep, like:

grep "@${snapname}$"

otherwise both dataset@long-snap-name and dataset@snap will match when requesting @snap, or so it seems to me from looking at the code. If you want, I can do some testing once the latest and greatest package is installed here.

FransUrbo commented 8 years ago

Seems reasonable. Please do test. Create a new issue for that though; this is getting a little too long :)

cfellinger commented 8 years ago

I've run my test: boot from snapshot works as advertised, really awesome. For the overlapping snapshot names problem I refer to issue #206.