Amanda estimating incrementals but dumping full backups #255

Closed · DuckCurious closed this 1 month ago

DuckCurious commented 1 month ago

I am running Amanda in a jail on a FreeBSD TrueNAS server. The files on the NAS are mounted in the jail on several mountpoints with paths of the type /mnt/PathToFiles. With those paths set up as DLEs in the disklist file, Amanda backs them up ok and works as expected... until the jail is restarted due to a server powercycle or a power outage. When this happens, Amanda gets completely messed up, and I have not been able to figure out why.
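
For reference, the disklist entries are of this general form (the paths and dumptype assignments here are placeholders, not my real DLEs):

localhost /mnt/PathToFiles  gui-base
localhost /mnt/OtherPath    gui-base-maxpromote-21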

The behaviour I am seeing is that after a jail restart I run a dump as normal, and Amanda calculates her estimates just fine - mostly level 1 or 2 incrementals of between 0 and 30GB, depending on the DLE. But when she goes ahead and starts dumping to the holding disk, every single file in each DLE is dumped, as though it were a level 0 backup - even though amstatus shows the same level 1 and 2 backups along with the expected, much smaller, estimated backup sizes.

It is as though whatever method she is using to dump the files is different to the one she uses to do the estimate, and she is just blindly dumping everything. If anyone knows what might be causing this behaviour I would be very interested to find out what is going on, as currently the only way I can get things back on track is by forcing full backups and starting from scratch.

I am using Amanda version 3.5.1.

konidev20 commented 1 month ago

Hey @DuckCurious, can you share your config file? It would also be great if you could share the amreport and amstatus output for these runs.

Refer to amstatus usage: https://wiki.zmanda.com/man/3.5.1/amstatus.8.html

Refer to amreport usage: https://wiki.zmanda.com/man/3.5.1/amreport.8.html
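
For example (CONFIG here stands for your Amanda configuration name):

% amstatus CONFIG
% amreport CONFIG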

If the TrueNAS is not restarted, do incremental backups work as expected?

Also, which incremental changes get picked up depends on the underlying tool configured for the dump process: https://wiki.zmanda.com/index.php/FAQ:How_are_backup_levels_defined_and_how_does_Amanda_use_them%3F

DuckCurious commented 1 month ago

Thanks for replying. Here is my amanda.conf (it's by no means ideal, as it has been used to experiment with a lot of different ideas over time, but it does seem to get the job done) -

org "nasbackup" 
dumpuser "amanda"
mailto "redacted@gmail.com"
tapedev "/dev/nsa0"

dumpcycle 28 days               
tapecycle 16
runspercycle 14  
device_property "LEOM" "TRUE"
eject-volume yes
debug_amrecover 5

dtimeout 43200  # number of idle seconds before a dump is aborted
ctimeout 30     # max number of seconds amcheck waits for each client
etimeout 43200  # number of seconds per filesystem for estimates

reserve 20      # Reserve at least 20% of holding disk for incremental (not level 0) backups.

define policy nasbackup {
    retention-tapes   5 tapes
    retention-days    0
    retention-recover 0
    retention-full    0
}

define taperscan lexical-scan {
    plugin "lexical"
}

define tapetype myUltriumTapeType {
    comment "Created by amtapetype; compression enabled"
    length 1391769312 kbytes # Reduced the length a bit; take off more if failures continue
    filemark 374 kbytes
    speed 121045 kps
    blocksize 32 kbytes
    part_size 150G # Split into 150GB parts
}

define storage nasbackup {
    policy "nasbackup" 
    tapepool "nasbackup"  
    runtapes 1          # number of tapes to be used in a single run of amdump (testing with 1)
    tapetype "myUltriumTapeType" # CHANGED TO USE THE TAPETYPE WE CREATED
    labelstr "^nasbackup-[0-9][0-9]*$"      # label constraint regex: all tapes must match
    taperscan "lexical-scan"
    taperalgo largestfit

    flush-threshold-scheduled 80    
    autoflush all   
}

holdingdisk hd1 {
    comment "Main holding disk"
    directory "/mnt/holdingdisk"
}

storage "nasbackup"

define dumptype global {
       comment "Global definitions"
       auth "bsdtcp"
       estimate calcsize
}

define dumptype gui-base {
       global
       program "GNUTAR"
       comment "gui base dumptype dumped with tar"
       compress none
       index yes
       maxpromoteday 0  # Disable backup promotion
       bumpsize 100 G   # Do +1 level backup if this amount will be saved
       estimate calcsize
}

define dumptype gui-base-maxpromote-21 {
       global
       program "GNUTAR"
       comment "gui base dumptype dumped with tar"
       compress none
       index yes
       maxpromoteday 21 
       bumpsize 100 G   # Do +1 level backup if this amount will be saved
       estimate calcsize 
}

define dumptype gui-base-noinc {
       global
       program "GNUTAR"
       comment "gui base dumptype dumped with tar"
       compress none
       index yes
       maxpromoteday 0
       strategy noinc   # Only do full backups
       bumpsize 10 G    # Do +1 level backup if this amount will be saved
       estimate calcsize
}

define dumptype gui-base-incronly {
       global
       program "GNUTAR"
       comment "gui base dumptype dumped with tar"
       compress none
       index yes
       maxpromoteday 0  
       strategy incronly    # Don't run full backups, only incrementals
       bumpsize 10 G    # Do +1 level backup if this amount will be saved
       estimate calcsize
}

define interactivity "bytty" {
       comment "Use terminal to display prompts"
       plugin "tty"
}

interactivity "bytty"

includefile "advanced.conf"
includefile "/usr/local/etc/amanda/template.d/dumptypes"
includefile "/usr/local/etc/amanda/template.d/tapetypes"

Here is the output from amstatus (using an amdump file that was affected by this issue). (DLE names have been redacted to protect the innocent)

% amstatus nasbackup --file amdump.20240601042428
Using: /usr/local/etc/amanda/nasbackup/amdump.20240601042428
From Sat Jun  1 04:24:28 ACST 2024

localhost:/mnt/xxxxxxxxx                   20240601042428 1        24g terminated while waiting for dumping
localhost:/mnt/xxxx/xxxxxxxxxx             20240601042428 1         0g terminated while waiting for dumping
localhost:/mnt/xxxx/xxxxxxx                20240601042428 1         0g terminated while waiting for dumping
localhost:/mnt/xxxx/xxxxxxxx               20240601042428 1         0g dump failed: dumper CRASH (15:04:43)
localhost:/mnt/yyyy/yyyyyyyy               20240601042428 1       710g dump done (12:25:24)
localhost:/mnt/xxxxx                       20240601042428 1         0g terminated while waiting for dumping
localhost:/mnt/xxxx                        20240601042428 1        32g terminated while waiting for dumping
localhost:/mnt/xxxxxxxxxxxxxx/xxxxxxxxxxxx 20240601042428 1       115g dump done ( 5:27:03)
localhost:/mnt/xxxxxxxxxxxxxx/xxxxxx       20240601042428 1        17g terminated while waiting for dumping
localhost:/mnt/xxxxxxxxxxxxxx/xxxxxxx      20240601042428 2       170g terminated while waiting for dumping
localhost:/mnt/xxxxxxxxxxxxxx/xxxxxxx      20240601042428 1        10g terminated while waiting for dumping
localhost:/mnt/xxxxxxxxxxx                 20240601042428 1         0g terminated while waiting for dumping

SUMMARY           dle       real  estimated
                            size       size
---------------- ----  ---------  ---------
disk            :  12
estimated       :  12                  255g
flush
dump failed     :   1                    0g           (  0.00%)
wait for dumping:   9                  255g           (  0.00%)
dumping to tape :   0         0g         0g (  0.00%) (  0.00%)
dumping         :   0         0g         0g (  0.00%) (  0.00%)
dumped          :   2       826g         0g (1019186541.18%) (323.60%)
wait for writing
wait to flush
writing to tape
dumping to tape
failed to tape
taped

4 dumpers idle  : no-bandwidth
network free kps: 7990
holding space   : 0g (100.00%)
chunker0 busy   :  7:59:45  ( 74.94%)
dumper0 busy    : 10:38:30  ( 99.73%)
 0 dumpers busy :  0:07:41  (  1.20%)        no-bandwidth:  0:06:33  ( 85.18%)
                                               no-dumpers:  0:00:05  (  1.14%)
                                                 not-idle:  0:01:03  ( 13.68%)
 1 dumper busy  : 10:32:31  ( 98.80%)        no-bandwidth: 10:32:31  (100.00%)

The DLE /mnt/yyyy/yyyyyyyy is the most obvious indication of what is happening - the total estimate for the whole backup was 255GB, and that DLE alone dumped far more than that. I know for certain that there were no changes whatsoever to the files in that directory, and the estimate given for that DLE was 0GB. And yet it has dumped every file in the directory.

I terminated the dump before it went any further, as otherwise it would have dumped every single file in every single DLE.


Here is the output from amreport -

% amreport nasbackup --log=log.20240601042428.0
Hostname: Amanda
Org     : nasbackup
Config  : nasbackup
Date    : June 1, 2024

The next tape Amanda expects to use is: nasbackup-2.

FAILURE DUMP SUMMARY:
  localhost /mnt/xxxx lev 1  FAILED [out of holding space in degraded mode]
  localhost /mnt/xxxxxxxxx lev 1  FAILED [out of holding space in degraded mode]
  localhost /mnt/xxxx/xxxxxxxxxx lev 1  FAILED [out of holding space in degraded mode]
  localhost /mnt/xxxx/xxxxxxx lev 1  FAILED [out of holding space in degraded mode]
  localhost /mnt/xxxxxxxxxxx lev 1  FAILED [out of holding space in degraded mode]
  localhost /mnt/xxxxxxxxxxxxxx/xxxxxxx lev 2  FAILED [out of holding space in degraded mode]
  localhost /mnt/xxxxxxxxxxxxxx/xxxxxx lev 1  FAILED [out of holding space in degraded mode]
  localhost /mnt/xxxxxxxxxxxxxx/xxxxxxx lev 1  FAILED [out of holding space in degraded mode]
  localhost /mnt/xxxxx lev 1  FAILED [out of holding space in degraded mode]
  localhost /mnt/xxxx/xxxxxxxx lev 1  FAILED [Not enough holding disk space]

STATISTICS:
                          Total       Full      Incr.   Level:#
                        --------   --------   --------  --------
Estimate Time (hrs:min)     0:01
Run Time (hrs:min)         10:42
Dump Time (hrs:min)         8:00       0:00       8:00
Output Size (meg)       846004.5        0.0   846004.5
Original Size (meg)     846004.5        0.0   846004.5
Avg Compressed Size (%)    100.0        --       100.0
DLEs Dumped                    2          0          2  1:2
Avg Dump Rate (k/s)      30101.6        --     30101.6

Tape Time (hrs:min)         0:00       0:00       0:00
Tape Size (meg)              0.0        0.0        0.0
Tape Used (%)                0.0        0.0        0.0
DLEs Taped                     0          0          0
Parts Taped                    0          0          0
Avg Tp Write Rate (k/s)      --         --         --

NOTES:
  planner: tapecycle (5) <= runspercycle (14)
  planner: Last full dump of localhost:/mnt/xxxx on tape nasbackup-6 overwritten in 5 runs.
  planner: Last full dump of localhost:/mnt/xxxxx on tape nasbackup-2 overwritten in 1 run.
  driver: dumper0 died while dumping localhost:/mnt/xxxx/xxxxxxxx lev 1.

DUMP SUMMARY:
                                                                        DUMPER STATS    TAPER STATS
HOSTNAME     DISK                             L ORIG-GB  OUT-GB  COMP%  MMM:SS    KB/s MMM:SS   KB/s
----------------------------------------------- ---------------------- --------------- -------------
localhost    /mnt/xxxxxxxxx                                                FAILED
localhost    /mnt/xxxx/xxxxxxxxxx                                          FAILED
localhost    /mnt/xxxx/xxxxxxx                                             FAILED
localhost    /mnt/xxxx/xxxxxxxx               1                    --      FAILED
localhost    /mnt/yyyy/yyyyyyyy               1     710     710    --   418:11 29685.4
localhost    /mnt/xxxxx                                                    FAILED
localhost    /mnt/xxxx                                                     FAILED
localhost    /mnt/xxxxxxxxxxxxxx/xxxxxxxxxxxx 1     116     116    --    61:28 32932.9
localhost    /mnt/xxxxxxxxxxxxxx/xxxxxx                                    FAILED
localhost    /mnt/xxxxxxxxxxxxxx/xxxxxxx                                   FAILED
localhost    /mnt/xxxxxxxxxxxxxx/xxxxxxx                                   FAILED
localhost    /mnt/xxxxxxxxxxx                                              FAILED

(brought to you by Amanda version 3.5.1)

The incrementals work perfectly as long as the jail stays up and running, but whenever the jail is restarted I get the behaviour shown above. I don't understand how it can get the estimates right on the one hand, but then do something completely different when it does the actual dump - it clearly can identify the files that need to be included in the incrementals, but seemingly ignores that when dumping to the holding disk.

Let me know if you would like any more information, I really would like to know what is going on here...

konidev20 commented 1 month ago

Hey @DuckCurious, looking at your config file, it seems you are using calcsize as the estimate method in each dumptype.

Refer: https://wiki.zmanda.com/man/3.5.1/amanda.conf.5.html

Can you set the estimate to client for your DLEs?

The estimates will be slower, but they will use the same logic as the dumper.
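
For example, in your gui-base dumptype it would be a one-line change (sketch based on your posted config):

define dumptype gui-base {
       global
       program "GNUTAR"
       comment "gui base dumptype dumped with tar"
       compress none
       index yes
       maxpromoteday 0  # Disable backup promotion
       bumpsize 100 G
       estimate client  # was: estimate calcsize
}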

I will check how calcsize works in more detail.

I am thinking it's something to do with GNUTAR's incremental backups, which might be getting incorrect information after a TrueNAS restart.

[UPDATE] Optional: try the test described in the next comment. [/UPDATE]

konidev20 commented 1 month ago

I reviewed how calcsize works, and also how sendbackup works for gnutar.

Since calcsize checks modification times against the last backups, I think it's able to estimate correctly. But during the actual backup with GNU tar, the folder with the previous backup info might be getting reset. If you could check this and get back, that would be great.
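
For context, this is roughly what the GNUTAR program does for an incremental dump (illustrative command only - the snapshot path is made up, and Amanda builds the real command itself):

# The snapshot (listed-incremental) file records inode and device metadata
# from the previous level. If that metadata no longer matches the live
# filesystem, tar treats the files as new and dumps them in full, even
# though the estimate said almost nothing had changed.
gtar --create --file=- \
     --listed-incremental=/path/to/gnutar-lists/host_mnt_disk_1 \
     /mnt/PathToFiles > /dev/null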

DuckCurious commented 1 month ago

Thanks so much for looking into this konidev20, that is some great information. I'll see what I can find and report back.

calhariz commented 1 month ago

I had the same problem in the past on Linux. When the system reboots, it can change the device number of the filesystem, and GNU tar looks at that number during incrementals.

In the Amanda configuration for GNUTAR, try adding the following:

property "CHECK-DEVICE" "NO"

konidev20 commented 1 month ago

@DuckCurious you might need to use the amgtar application to enable that property. Refer: https://wiki.zmanda.com/man/3.5.1/amgtar.8.html

@calhariz, is this property honored for the default GNUTAR application?

calhariz commented 1 month ago

I do not know, because in my head GNUTAR and amgtar are very similar in Amanda - both use gnutar. At some point I started using amgtar on all my installations of Amanda.

In my home installation I have this:

define application-tool amgtar {
    plugin "amgtar"
    property "CHECK-DEVICE" "NO"
}

konidev20 commented 1 month ago

amgtar is a wrapper around GNUTAR. It's treated as an application source rather than a filesystem source, I reckon, while GNUTAR backups use the default properties.

amgtar is more flexible and allows more tar parameters to be configured for each dump, so setting "CHECK-DEVICE" to "NO" might resolve the issue.

Thanks @calhariz

DuckCurious commented 1 month ago

I checked the gnutar-lists directory and it does seem to be maintaining a list of binary files as expected, with names relating to the DLEs and backup levels. It has references to DLEs I used for testing months ago which are no longer in the disklist file, so it does appear that gnutar has been keeping those lists even after multiple system restarts.
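
For example (the gnutar-lists path varies by build; these file names are genericized like the DLEs above):

% ls /usr/local/var/amanda/gnutar-lists
localhost_mnt_pathtofiles_0  localhost_mnt_pathtofiles_1
localhost_mnt_otherpath_0    localhost_mnt_otherpath_1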

In light of calhariz's post it does look as though the issue is probably related to gnutar, so once I have my backups working correctly again I will try using the amgtar wrapper and disabling the device number check. I'll leave the calcsize estimate option in for now, as the difference in estimation time between it and the client option is non-trivial (I tested with client estimates initially and switched to calcsize because calcsize took a few seconds compared to many hours!)

Thanks for your assistance, it seems there might be some light at the end of the tunnel. I'll post back with the results.

DuckCurious commented 1 month ago

There has been some progress with this issue. I ran a couple of dumps to get Amanda back to a stage where she was doing incremental backups, then restarted the TrueNAS server. I added this to amanda.conf -

define application-tool app_amgtar {
    plugin "amgtar"
    property "CHECK-DEVICE" "NO"
}

and changed all my dumptype definitions to use app_amgtar instead of GNUTAR, e.g. -

define dumptype gui-base {
       global
       #program "GNUTAR"
       program "APPLICATION"
       application "app_amgtar"
       comment "gui base dumptype dumped with tar"
       compress none
       index yes
       maxpromoteday 0  # Disable backup promotion
       bumpsize 100 G   # Do +1 level backup if this amount will be saved
       estimate calcsize
}

I then attempted to do an amdump, and got a bunch of errors in the amdump log that looked like this -

amgtar: 0: strange(?): /usr/local/bin/gtar: --atime-preserve='system' is not supported on this platform

I did some digging and discovered that while the version of gnutar on my TrueNAS server does support this option, it relies on the underlying OS for it to work as expected, and it seems that FreeBSD is not among the supported platforms (the link below is where I got this info from; it is old but apparently still applies in this case).

https://www.gnu.org/software/tar/manual/html_node/Option-Summary.html

From the FreeBSD man page for amgtar -

ATIME-PRESERVE

If "YES" (the default), amgtar use the --atime-preserve=system option of gnutar to not update the atime of all files accessed; if "NO", gnutar will updates the atime for all files accessed. This property works only if you have gnutar 1.15.90 or newer, if not, you must set ATIME_PRESERVE to "NO".

My version of gtar -

% gtar --version
tar (GNU tar) 1.35
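
Running gtar by hand reproduces the same complaint that turned up in the amdump log, which confirms it is the tar build itself rather than anything Amanda is doing:

% gtar --create --file=/dev/null --atime-preserve=system /tmp
gtar: --atime-preserve='system' is not supported on this platform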

I would have thought this option would work with this version of gtar, but it just doesn't. So, not seeing any other option, I changed my app_amgtar definition to -

define application-tool app_amgtar {
    plugin "amgtar"
    property "CHECK-DEVICE" "NO"
    property "ATIME_PRESERVE" "NO"
}

And sure enough, the backup ran just fine, and wrote the level 1 dumps to the holding disk just as Amanda calculated in her estimate!

It's a bit annoying that the default option doesn't work, and I guess the atimes of accessed files are now being updated when ideally they would not be. But this is still a massive improvement over what was happening originally, so I'm calling this a win.

Thanks @konidev20 and @calhariz for your help in resolving this.

konidev20 commented 1 month ago

Closing this since there are no pending actions from me.