System backup fails with error E_COPY_DATA

tbertels commented 2 years ago

The btrfs-auto-snapshot.sh script (https://github.com/xbianonpi/xbian-package-config-shell/blob/master/content/usr/local/sbin/xbian-storager) stops with error E_COPY_DATA

Version: 5.10.82+ #1 SMP PREEMPT Wed Dec 1 19:48:08 CET 2021 armv7l XBian 20210911-0 - Bleeding Edge

This seems to be a known problem: http://forum.xbian.org/thread-4106-post-36879.html

Home backup works fine on the same NFS network share.

Log excerpt:

Create a readonly snapshot of '/run/xbian-storage/root/@' in '/run/xbian-storage/root/@.ro'
Cloning root/@, 241130903 of 1148242399, 21% done
Cloning root/@, 562638775 of 1148242399, 49% done
Cloning root/@, 918593919 of 1148242399, 80% done
Cloning root/@, 1136759975 of 1148242399, 99% done
Cloning subvolume root snapshot @ failed
Operation failed with error E_COPY_DATA!

I can see the backup file being created on the network share before the error.

Would any other log help?

tbertels commented 2 years ago

Would using the much simpler script at https://github.com/hunleyd/btrfs-auto-snapshot as a base make it any easier to debug?

mkreisl commented 2 years ago

I vaguely remember the problem, but it was more with backuphome at the time. That was about two and a half years ago, I still have a test procedure for it. The error comes from the fact that in very rare cases the procedure that is responsible for the progress display returns an error (why this is so I can unfortunately not say). in the backuphome function this was corrected, but unfortunately this seems to be forgotten in the xbiancopy function. With me this has never occurred all the years, neither with backuphome nor with xbiancopy.

Anyway you can try to bring the following change in xbian-storager and test it again

diff --git a/content/usr/local/sbin/xbian-storager b/content/usr/local/sbin/xbian-storager
index beda774..5eedb10 100755
--- a/content/usr/local/sbin/xbian-storager
+++ b/content/usr/local/sbin/xbian-storager
@@ -1403,7 +1403,7 @@ EOF
                                                fi
                                        }
                                        ar=0
-                                       ( copy_volume $opt_cpmode ) 2>&1 | \
+                                       ( copy_volume $opt_cpmode ) 2>&1 | { \
                                                while read a; do
                                                        if [ -n "${a##*[!0-9]*}" ]; then
                                                                (( $a < $ar )) && a=100
@@ -1414,7 +1414,8 @@ EOF
                                                                fi
                                                                ar=$a
                                                        fi
-                                               done
+                                               done || :
+                                       }
                                        [ "$?" -eq 0 ] || { [ "$BACKGROUND" == 1 ] && printf "%s\n" "$wmsg subvolume $v snapshot $s failed"; exit $E_COPY_DATA; }
                                        if [ "$opt_fstype" = btrfs ]; then
                                                if [ "$opt_cpmode" = tar ]; then

If this works for you without errors, i will apply the patch

tbertels commented 2 years ago

No luck with that change. :( What should I remove to try without that progress display procedure? Or is that change supposed to do that? By the way, the error happens when starting the system backup manually through the GUI. I've added a daily automatic system backup to see if that works.

mkreisl commented 2 years ago

No luck with that change. :(

Strange

What should I remove to try without that progress display procedure? Or is that change supposed to do that?

Something like this:

                                   ( copy_volume $opt_cpmode ) >/dev/null 2>&1
                                   [ "$?" -eq 0 ] || { [ "$BACKGROUND" == 1 ] && printf "%s\n" "$wmsg subvolume $v snapshot $s failed"; exit $E_COPY_DATA; }

By the way, the error happens when starting the system backup manually through the GUI. I've added a daily automatic system backup to see if that works.

This should make no difference, as it runs in the background in both cases

tbertels commented 2 years ago

So that confirms it: the progress display procedure isn't the cause. Without that change, no progress display is shown but still the same error:

Done processing subvolume modules
Create subvolume '/run/root-dest/root'
Create a readonly snapshot of '/run/xbian-storage/root/@' in '/run/xbian-storage/root/@.ro'
Cloning subvolume root snapshot @ failed
Operation failed with error E_COPY_DATA!

I tried a sudo btrfs scrub start / just in case, but it reported no error.

So I guess the error comes from the function copy_volume. Is there a command that I can try to check that the btrfs send | btrfs receive works as intended (and maybe get a more detailed error)?

mkreisl commented 2 years ago

Here is my next patch:

diff --git a/content/usr/local/sbin/xbian-storager b/content/usr/local/sbin/xbian-storager
index beda774..d0c9b66 100755
--- a/content/usr/local/sbin/xbian-storager
+++ b/content/usr/local/sbin/xbian-storager
@@ -1396,11 +1396,12 @@ EOF
                                                                to='--warning=none --acls --xattrs'
                                                                nice -n 10 tar c $to -C ${vs}/$s.ro . | pv -i $ival -ns $siz | eval $opt_remotecmd nice -n 10 tar x $to -C "${vd}/$s"
                                                        else
-                                                               nice -n 10 btrfs send "${vs}/$s.ro" | pv -i $ival -ns $siz | eval $opt_remotecmd nice -n 10 btrfs receive "${vd}"
+                                                               nice -n 10 btrfs send 2>/tmp/btrfs.send.log "${vs}/$s.ro" | pv -i $ival -ns $siz | eval $opt_remotecmd nice -n 10 btrfs receive 2>/tmp/btrfs.receive.log "${vd}"
                                                        fi
                                                else
                                                        nice -n 10 zfs send --props $v@ro | pv -i $ival -ns $siz | eval $opt_remotecmd nice -n 10 zfs receive -u "$opt_clonename/${v#*/}"
                                                fi
+                                               echo "ps=\"${PIPESTATUS[*]}\"" >/run/ps.xbiancopy
                                        }
                                        ar=0
                                        ( copy_volume $opt_cpmode ) 2>&1 | \
@@ -1415,7 +1416,14 @@ EOF
                                                                ar=$a
                                                        fi
                                                done
-                                       [ "$?" -eq 0 ] || { [ "$BACKGROUND" == 1 ] && printf "%s\n" "$wmsg subvolume $v snapshot $s failed"; exit $E_COPY_DATA; }
+                                       . /run/ps.xbiancopy && rm -f /run/ps.xbiancopy
+                                       for RC in $ps; do
+                                               if [ "$RC" -ne 0 ]; then
+                                                       print_log error "error(s) happened while copying data ($ps)"
+                                                       break
+                                               fi
+                                       done
+                                       [ "$RC" -eq 0 ] || { [ "$BACKGROUND" == 1 ] && printf "%s\n" "$wmsg subvolume $v snapshot $s failed (ps=$ps)"; exit $E_COPY_DATA; }
                                        if [ "$opt_fstype" = btrfs ]; then
                                                if [ "$opt_cpmode" = tar ]; then
                                                        eval $opt_remotecmd nice -n 10 btrfs fi sync "${vd}/$s" && sync && btrfs sub delete "${vs}/$s.ro" || exit $E_SUB_DELETE

Reverted the previous patch. Now, in case of error the pipe return values are in the error message (ps=X X X)

And, for debugging btrfs send and btrfs receive prints output into file /tmp/btrfs.send.log resp /tmp/btrfs.receive.log

But I can't say if there are reasonable error messages in case of an error. I never had any errors there before

tbertels commented 2 years ago

Nice! I think the problem seems obvious now:

xbiancopy.log:

Create subvolume '/run/root-dest/root'
Create a readonly snapshot of '/run/xbian-storage/root/@' in '/run/xbian-storage/root/@.ro'
Cloning root/@, 241138429 of 1148278236, 21% done
Cloning root/@, 528207988 of 1148278236, 46% done
Cloning root/@, 895657024 of 1148278236, 78% done
Cloning root/@, 1136795453 of 1148278236, 99% done
Error: error(s) happened while copying data (141 0 1)
Cloning subvolume root snapshot @ failed (ps=141 0 1)
Operation failed with error E_COPY_DATA!

btrfs.send.log:

At subvol /run/xbian-storage/root/@.ro

btrfs.receive.log:

At subvol @.ro
ERROR: writing to var/cache/apt/srcpkgcache.bin failed: No space left on device

Here's the result of df -h:

Filesystem      Size  Used Avail Use% Mounted on
udev            1.6G  4.0K  1.6G   1% /dev
tmpfs           360M  1.6M  358M   1% /run
/dev/mmcblk0p2   30G  2.3G   27G   8% /
/dev/mmcblk0p2   30G  2.3G   27G   8% /lib/modules
/dev/mmcblk0p2   30G  2.3G   27G   8% /home
/dev/mmcblk0p2   30G  2.3G   27G   8% /xbmc-backup
/dev/mmcblk0p1  100M   46M   55M  46% /boot
none            4.0K     0  4.0K   0% /sys/fs/cgroup

mkreisl commented 2 years ago

That is not yet obvious to me Either the image file is created too small or no space available on the NFS share

How big is the image file? Should have at least 2.3GB (minus used space by snapshots)

tbertels commented 2 years ago

Here's what df- h returns just before the error:

Filesystem                                                          Size  Used Avail Use% Mounted on
udev                                                                1.6G  4.0K  1.6G   1% /dev
tmpfs                                                               360M  1.6M  358M   1% /run
/dev/mmcblk0p2                                                       30G  2.3G   27G   8% /
/dev/mmcblk0p2                                                       30G  2.3G   27G   8% /lib/modules
/dev/mmcblk0p2                                                       30G  2.3G   27G   8% /home
/dev/mmcblk0p2                                                       30G  2.3G   27G   8% /xbmc-backup
/dev/mmcblk0p1                                                      100M   46M   55M  46% /boot
none                                                                4.0K     0  4.0K   0% /sys/fs/cgroup
/dev/mmcblk0p2                                                       30G  2.3G   27G   8% /run/xbian-storage
192.168.1.5:/volume1/3 Tera Backup/Backup/Raspberry-backups/Bureau  2.7T  2.5T  251G  91% /run/kodimap
/dev/mapper/loop0p2                                                 1.4G  1.3G  2.6M 100% /run/root-dest

And the usage of /dev/mapper/loop0p2 increases until the error.

tbertels commented 2 years ago

Here's the full xbiancopy.log

As I understand it, /dev/mapper/loop0p2 is the snapshot, so it's normal that it gets full. I still wonder why I get "No space left on device" type errors in btrfs.receive.log:

ERROR: writing to usr/include/linux/pkt_sched.h failed: No space left on device

or

ERROR: writing to usr/include/linux/sysctl.h failed: No space left on device

mkreisl commented 2 years ago

Yes, /dev/mapper/loop0p2 is the container for the image. But as you can see it is practically full. So btrfs is probably right with its error message. The question now is, why is the container so tight for you. For me this is never the case

btrfs receive writes this error message because this command writes into the container. That's correct

tbertels commented 2 years ago

Since it seems to need just a little more space, I tried enabling zlib compression, and it worked.

Here's the df -h just before the end:

Filesystem                                                          Size  Used Avail Use% Mounted on
/dev/mmcblk0p2                                                       30G  2.3G   27G   8% /
devtmpfs                                                            1.6G  4.0K  1.6G   1% /dev
none                                                                360M  1.5M  358M   1% /run
/dev/mmcblk0p2                                                       30G  2.3G   27G   8% /lib/modules
/dev/mmcblk0p2                                                       30G  2.3G   27G   8% /home
/dev/mmcblk0p1                                                      100M   46M   55M  46% /boot
none                                                                4.0K     0  4.0K   0% /sys/fs/cgroup
/dev/mmcblk0p2                                                       30G  2.3G   27G   8% /run/xbian-storage
192.168.1.5:/volume1/3 Tera Backup/Backup/Raspberry-backups/Bureau  2.7T  2.5T  251G  91% /run/kodimap
/dev/mapper/loop0p2                                                 1.4G  814M  500M  62% /run/root-dest

And the xbiancopy.log The btrfs.send.log and btrfs.receive.log contain no error.

mkreisl commented 2 years ago

The only question is how much, I checked today at noon, currently there is 20% more space than actually needed

https://github.com/xbianonpi/xbian-package-config-shell/blob/master/content/usr/local/sbin/xbian-storager#L1188

You could add a startvalue for variable size in bytes as shown below

                    size=<startvalueinbytes>
                    for v in $FS_LOCAL_LIST; do
                        [ -n "$v" ] || continue
                        for s in $snapextra; do
                            sv=$(btrfs-auto-snapshot volsize "$v/$s")
                            size=$((size+sv))
                        done
                    done

this is how it looks with a backup on one of my Pi's

/dev/mapper/loop0p2 5921792 5269068 536660 91% /tmp/m

tbertels commented 2 years ago

With size=51200000, I get

/dev/mapper/loop0p2 1515520 1378832 10992 100% /run/root-dest

and it works.

mkreisl commented 2 years ago

I have now made a kind of dynamic adjustment of the correction size. works here so far quite well

diff --git a/content/usr/local/sbin/xbian-storager b/content/usr/local/sbin/xbian-storager
index beda774..b2a5b13 100755
--- a/content/usr/local/sbin/xbian-storager
+++ b/content/usr/local/sbin/xbian-storager
@@ -45,6 +45,7 @@ opt_name=''
 opt_exclude="XXXXXXXX"
 opt_quiet=''
 opt_size=''
+opt_sizeadj='300M'

 # Default dynamcally set program options/variables for BTRFS/ZFS

@@ -1177,15 +1178,17 @@ EOF
                                        exit $E_SAME_FS
                                fi

+                               size=$(numfmt --from=iec $opt_sizeadj)
+                               size_fix=105
                                if [ -z "$opt_size" ]; then
                                        for v in $FS_LOCAL_LIST; do
                                                [ -n "$v" ] || continue
                                                for s in $snapextra; do
-                                                       sv=$(btrfs-auto-snapshot volsize "$v/$s")
+                                                       sv=$(xbian-storager volsize "$v/$s")
                                                        size=$((size+sv))
                                                done
                                        done
-                                       opt_size="$(($size*12/10/1024/1024))M"
+                                       opt_size="$(($size*$size_fix/100/1024/1024))M"
                                fi
                                size="$(numfmt --from=iec $opt_size)"

@@ -1365,6 +1368,33 @@ EOF
                                fi
                        fi

+                       optimize_size () {
+                               if [ "$opt_fstype" = btrfs ]; then
+                                       size=$(df -h | grep -m1 "$root_dest" | awk '{print $4}')
+                               else
+                                       size=$(zfs list -Ho space $opt_clonename | awk '{print $2}')
+                               fi
+                               [ "$BACKGROUND" == 1 ] && printf "Space remaining in %s: %s (%s)\n" "$(basename "$fn")" $size $opt_sizeadj
+
+                               size_avail=$(numfmt --from=iec $size)
+
+                               opt_sizeadj=$(numfmt --from=iec $opt_sizeadj)
+                               if [ "$size_avail" -gt "$(numfmt --from=iec 150M)" ]; then
+                                       opt_sizeadj=$((opt_sizeadj-50*1024*1024))
+                               elif [ "$size_avail" -lt "$(numfmt --from=iec 50M)" ]; then
+                                       opt_sizeadj=$((opt_sizeadj+50*1024*1024))
+                               else
+                                       opt_sizeadj=''
+                               fi
+                               if [ -n "$opt_sizeadj" ] && [ -e /etc/default/xbian-internals ]; then
+                                       if grep -q "OPT_SIZEADJ=" /etc/default/xbian-internals; then
+                                               sed -i "s/OPT_SIZEADJ=.*/OPT_SIZEADJ=$(numfmt --to=iec $opt_sizeadj)/g" /etc/default/xbian-internals
+                                       else
+                                               echo "OPT_SIZEADJ=$(numfmt --to=iec $opt_sizeadj)" >> /etc/default/xbian-internals
+                                       fi
+                               fi
+                       }
+
                        set -o pipefail
                        for v in $FS_LOCAL_LIST; do
                                if [ -z "$v" ] || echo "$v" | grep -qE "\.delete$|\.zlib$|\.lzo$|\.lz4$|\.zstd$"; then
@@ -1396,11 +1426,12 @@ EOF
                                                                to='--warning=none --acls --xattrs'
                                                                nice -n 10 tar c $to -C ${vs}/$s.ro . | pv -i $ival -ns $siz | eval $opt_remotecmd nice -n 10 tar x $to -C "${vd}/$s"
                                                        else
-                                                               nice -n 10 btrfs send "${vs}/$s.ro" | pv -i $ival -ns $siz | eval $opt_remotecmd nice -n 10 btrfs receive "${vd}"
+                                                               nice -n 10 btrfs send 2>/tmp/btrfs.send.log "${vs}/$s.ro" | pv -i $ival -ns $siz | eval $opt_remotecmd nice -n 10 btrfs receive 2>/tmp/btrfs.receive.log "${vd}"
                                                        fi
                                                else
                                                        nice -n 10 zfs send --props $v@ro | pv -i $ival -ns $siz | eval $opt_remotecmd nice -n 10 zfs receive -u "$opt_clonename/${v#*/}"
                                                fi
+                                               echo "ps=\"${PIPESTATUS[*]}\"" >/run/ps.xbiancopy
                                        }
                                        ar=0
                                        ( copy_volume $opt_cpmode ) 2>&1 | \
@@ -1415,7 +1446,13 @@ EOF
                                                                ar=$a
                                                        fi
                                                done
-                                       [ "$?" -eq 0 ] || { [ "$BACKGROUND" == 1 ] && printf "%s\n" "$wmsg subvolume $v snapshot $s failed"; exit $E_COPY_DATA; }
+                                       . /run/ps.xbiancopy && rm -f /run/ps.xbiancopy
+                                       for RC in $ps; do
+                                               if [ "$RC" -ne 0 ]; then
+                                                       optimize_size
+                                                       [ "$BACKGROUND" == 1 ] && printf "%s\n" "$wmsg subvolume $v snapshot $s failed (ps=$ps)"; exit $E_COPY_DATA
+                                               fi
+                                       done
                                        if [ "$opt_fstype" = btrfs ]; then
                                                if [ "$opt_cpmode" = tar ]; then
                                                        eval $opt_remotecmd nice -n 10 btrfs fi sync "${vd}/$s" && sync && btrfs sub delete "${vs}/$s.ro" || exit $E_SUB_DELETE
@@ -1435,12 +1472,13 @@ EOF
                                [ "$BACKGROUND" == 1 ] && printf "%s\n" "Done processing subvolume $v"
                        done

+                       optimize_size
+
                        if [ "$opt_fstype" = zfs ]; then
                                vdr="$(zpool get -H bootfs -o value "$opt_rpool" | sed "s%$opt_rpool%$opt_clonename%g")"
                                eval $opt_remotecmd zpool set bootfs=$vdr $opt_clonename
                                echo "vdr=$vdr" >>/run/vars.xbiancopy.$$
                        fi
-
                        if [ -n "$opt_recompress" ]; then
                                for v in $FS_LOCAL_LIST; do
                                        if [ -z "$v" ] || echo "$v" | grep -qE "\.delete$|\.zlib$|\.lzo$|\.lz4$|\.zstd$"; then
@@ -1813,7 +1851,8 @@ esac
 [ -e /etc/default/xbian-internals ] && . /etc/default/xbian-internals
 [ -z "$OPT_COPY_MODE" ] || opt_cpmode="$OPT_COPY_MODE"
 [ -z "$OPT_COMPRESS" ]  || opt_compress="$OPT_COMPRESS"
-[ -z "$OPT_REMOTECMD" ] || opt_remotecmd="OPT_REMOTECMD"
+[ -z "$OPT_REMOTECMD" ] || opt_remotecmd="$OPT_REMOTECMD"
+[ -z "$OPT_SIZEADJ" ]   || opt_sizeadj="$OPT_SIZEADJ"

 if [ "$1" != '//' -a "${1%%-*}" != '' ]; then
        opt_command="$1"
@@ -2165,7 +2204,7 @@ elif [ "$opt_command" = volsize ]; then
                do_run mount -t btrfs $opt_dev $md 2>/dev/null
                [ ! -d $md/$1 ] || printf "$(du -xsb $md/$1 | awk '{print $1}')\n"
        else
-               do_run zfs list -Hpo used ${1%%/@} 2>/dev/null
+               do_run zfs list -Hpo space ${1%%/@} 2>/dev/null | awk '{printf "%d\n", $5}'
        fi
        exit 0
 elif [ "$opt_command" = fstype ]; then

Since it is nearly impossible to determine the exact size of the required image due to metadata and other imponderables, a dynamic adjustment is used. The value is set for each run so that the free space in the image is around 100M

tbertels commented 2 years ago

I'm having a hard time merging those changes on Windows (I did it manually for the other ones). Could you attach the modified file?

mkreisl commented 2 years ago

This should be much easier:

1) download the patch 2) ssh into xbian 3) cd /usr/local/sbin 4) sudo patch -lp5 < /pathtopatch/downloadedpatch

Oh I forgot, of course restore the original state with sudo apt-get install --reinstall xbian-package-config-shell first

tbertels commented 2 years ago

Thanks, the patching works fine with that command, even on Windows (at least with the patch tool from Git, not the one from GnuWin32).

It works here too, with 77M (300M) remaining.

mkreisl commented 2 years ago

Great, so it seems that 300M start value will be a good choice.

tbertels commented 2 years ago

It works fine too on an other rpi4, with 62M remaining there.

mkreisl commented 2 years ago

After a longer test and some corrections, this should now be fixed with this commit: https://github.com/xbianonpi/xbian-package-config-shell/commit/97e54ab916f73880b15042d9a084cd4b7fad3bf4

tbertels commented 2 years ago

An adaptive size change, it should be future proof!

xbianonpi / xbian-package-config-shell

System backup fails with error E_COPY_DATA #38