zevv / duc

Dude, where are my bytes: Duc, a library and suite of tools for inspecting disk usage
GNU Lesser General Public License v3.0

Index not auto-updating #298

Closed drscotthawley closed 2 years ago

drscotthawley commented 2 years ago

Hi, sorry if I missed something basic in the documentation. I thought the point of duc was that, unlike du which has to be run each time you want to know disk usage, duc maintains an index of usage and runs fast, so that I run duc index once (which takes about as long as du) and then can get usage info really fast, much faster than du.

But I'm noticing that when my directories grow, duc ls keeps showing the same old size from when it was initially indexed, i.e. it is not updating to track changes.

How do we enable this?

(If I have to re-run duc index every time I want to see a valid usage list, I might as well just run du. Not interested in graphs, etc., just fast usage info.)

l8gravely commented 2 years ago

"Scott" == Scott H Hawley @.***> writes:

Scott> Hi, sorry if I missed something basic in the documentation. I
Scott> thought the point of duc was that, unlike du which has to be
Scott> run each time you want to know disk usage, duc maintains an
Scott> index of usage and runs fast, so that I run duc index once
Scott> (which takes about as long as du) and then can get usage info
Scott> really fast, much faster than du.

duc needs to re-index if you want to see any changes made since the last index was run. All duc does is query the index when you ask it questions, it doesn't re-index the disk(s).

Scott> But I'm noticing that when my directories grow, duc ls keeps
Scott> showing the same old size from when it was initially indexed,
Scott> i.e. it is not updating to track changes.

Correct.

Scott> How do we enable this?

You really don't want this to happen that often. Think how painful running 'du' all the time can be. What most people do is have duc run nightly (or weekly, or whatever schedule) to update the index.
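For example, a nightly rebuild can be a single crontab entry along these lines (the paths and the 3am schedule are hypothetical; pick whatever interval suits your churn rate):

```shell
# m h dom mon dow  command
# Rebuild the duc index for /home every night at 03:00 so daytime
# queries against the DB stay fast.
0 3 * * *  /usr/local/bin/duc index -d /var/cache/duc/home.db /home
```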

Scott> (If I have to re-run duc index every time I want to see a valid
Scott> usage list, I might as well just run du. Not interested in
Scott> graphs, etc., just fast usage info.)

Sure, that's what you can do.

Duc really isn't designed for a single user to use on their laptop; it's more for large filesystems on big systems where a 'du' scan would take hours or days to complete, and would put an unacceptable load on the server and/or storage subsystem.

By building the index, you can investigate the system and drill down into the details (hey, this directory is now 2tb in size, why?) without having to rerun lots of 'du' commands.

John

privnote42 commented 2 years ago

Hi, did I understand correctly that I cannot update the index with duc, but must scan all files and folders again and again, although they have not changed?

l8gravely commented 2 years ago

did I understand correctly that I cannot update the index with duc, but must scan all files and folders again and again, although they have not changed?

Yes, you have to re-scan all the files, because otherwise how will duc know when there have been changes? But yes, you can rebuild the index; I've found it simplest to have a cronjob which does something like:

    for f in /home /data; do                # hypothetical list of filesystems
        name=$(basename "$f")
        if duc index -d "/tmp/$name.db" "$f"; then
            mv "/tmp/$name.db" "/real/path/to/dbs/$name.db"
        else
            echo "error indexing $f, db not updated"
        fi
    done

This is just off the top of my head, and is probably wrong, but the idea is there. If the index builds properly, then move it over the old index. Otherwise bail out.

The idea behind duc is to amortise the cost of a single index run across many accesses to the DB, which is just so much faster. I have some 10TB filesystems with 30 million files. Not having to run 'du' all the time to see what changed is fantastic.

Cheers, John

luckycloud-GmbH commented 2 years ago

duc could check the timestamps of the subfolders, couldn't it?

stuartthebruce commented 2 years ago

duc could check the timestamps of the subfolders, couldn't it?

Only if it can also confirm that there are no sub-directories; cf. the -noleaf option to GNU find.

luckycloud-GmbH commented 2 years ago

ok, that's a valid point. But wouldn't it still be faster to recursively search for new files in sub-directories instead of indexing everything again and again?

l8gravely commented 2 years ago

"stuartthebruce" == stuartthebruce @.***> writes:

duc could check the timestamps of the subfolders, couldn't it?

Only if it can also confirm that there are no sub-directories, c.f.,-noleaf option to GNU find.

As Stuart says, there's no way to look at the directory timestamp to know if files/directories have changed more than one level below. Which is why you have to rescan.

And when you have 10TB of data with 3 million files or more, you don't want to scan very often; you just want to be able to target the low-hanging fruit.

Now what might be interesting is a way to find the top N largest files, since they give the most bang for the buck in terms of reducing filesystem usage. I've got a perl script I've used in the past for this, which would email my users. And another script which looked at Netapp quota reports and emailed users as well.
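Finding the top N largest files doesn't need duc at all; with standard GNU tools, something along these lines works (the /data path is a placeholder):

```shell
# List the 20 largest files under a path, biggest first.
# %s is the size in bytes, %p the path (GNU find -printf directives).
find /data -type f -printf '%s %p\n' 2>/dev/null | sort -rn | head -n 20
```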

It's a multi-faceted problem space: collecting the data is expensive, so it really cannot be done in real time.

John

l8gravely commented 2 years ago

"luckycloud-GmbH" == luckycloud-GmbH @.***> writes:

ok, that's a valid point.

But wouldn't it still be faster to recursively search for new files in sub-directories instead of indexing everything again and again?

Nope, because looking for new files (using find, say) is just like duc indexing. It's the same loop, with recursion:

    func findit(dir) {
        opendir(dir)
        while (entry = readdir(dir)) {
            if (entry is a directory) findit(entry)
            if (entry is a file)      add_to_index(entry)
        }
        closedir(dir)
    }

Scanning the filesystem is slow when you get to large filesystems, which is why duc only does it once, unless you ask it to re-index to find changes.

Now maybe there's an idea to somehow keep two copies so you can show the change between runs in the DB, but I've spent no more than two seconds thinking about the issues there.

drscotthawley commented 2 years ago

I feel like my question has been answered, given that I misunderstood the basic principles of duc's operation.
Might file this under a "Feature Request" for clarification added to the documentation.

Presumably I could do a cron job that reindexes, say, once daily. A related precedent is disk-usage utilities for Windows & Mac, where you need to manually re-scan for the information to be up to date.

So, I understand others may still have issues and questions, but since I'm the one who opened this issue and I consider it to be resolved, I'm closing it.

michaelfresco commented 1 year ago

Cron job that reindexes, say once daily.

Good question. The manual is not very explicit about this.

dantheperson commented 1 year ago

Yes you have to re-scan all the files, because otherwise how will duc know when there have been changes?

Couldn't you get file change events with inotify to identify when a file needs updating in the index?

l8gravely commented 1 year ago

"dantheperson" == dantheperson @.***> writes:

Yes you have to re-scan all the files, because otherwise how will duc know when there have
been changes?

Couldn't you get file change events with inotify to identify when a file needs updating in the index?

That would imply that the duc indexer is running all the time, and that we can insert changes into the middle of the index efficiently. I'd want to batch changes as well.
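As a sketch of what that batching might look like (this is not a duc feature; it assumes inotify-tools is installed for inotifywait, and the watch path and DB path are made up):

```shell
# Hypothetical sketch: a background watcher collapses every change event
# into one "dirty" flag file; a slow loop re-indexes only when the flag
# is set, so a burst of thousands of events still costs just one scan.

DIRTY=/tmp/duc-dirty
WATCH=/data
DB=/var/cache/duc/data.db

# Watcher: mark the flag on any create/delete/modify/move (run with &).
watch_changes() {
    inotifywait -m -r -e create,delete,modify,move "$WATCH" 2>/dev/null |
    while read -r _event; do
        touch "$DIRTY"
    done
}

# Re-indexer: call from a slow loop or cron; rebuilds only if dirty.
reindex_if_dirty() {
    if [ -e "$DIRTY" ]; then
        rm -f "$DIRTY"
        duc index -d "$DB" "$WATCH"
    fi
}
```

Usage would be `watch_changes &` plus something like `while sleep 3600; do reindex_if_dirty; done`. Note it still re-scans the whole tree when anything changed; the batching only skips scans when nothing changed at all, which is John's amortisation point again.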

But remember, duc isn't for real-time stats; it's for large volumes with lots and lots of files that take forever to search by hand. duc does all that work for you and lets you visually mine it for the problem spots.