zevv / duc

Dude, where are my bytes: Duc, a library and suite of tools for inspecting disk usage

RFE histogram #284

Open stuartthebruce opened 2 years ago

stuartthebruce commented 2 years ago

I think it would be helpful to have a histogram command to see the distribution of file and directory sizes. When migrating large collections of files between filesystems, it is helpful to have insight into these distributions to help tune the target filesystem (or contact the owner to clean up first if they have a problematic distribution, e.g., too many small files or too many files in some very large directories). Thanks.

l8gravely commented 2 years ago

"stuartthebruce" == stuartthebruce @.***> writes:

Sorry for the late reply, life got in the way...

stuartthebruce> I think it would be helpful to have a histogram command to see the distribution of file and directory sizes. When migrating large collections of files between filesystems, it is helpful to have insight into these distributions to help tune the target filesystem (or contact the owner to clean up first if they have a problematic distribution, e.g., too many small files or too many files in some very large directories). Thanks.

It's an interesting idea, but we don't currently have the infrastructure for this in duc, since we only keep size counts on a per-directory basis.

We could fairly easily add per-index data like this, where we keep running totals of files in various size buckets. I'm not sure how granular this should be. Should it be dynamic, so that we break it down into 5% chunks? But then a single huge file would skew the results and make comparisons harder.

Do we break it down by 512 byte blocks, where we report on files which are 1, 2, 4, 8, 16, 32, ... 1024 or larger in size? This also hides the fact that a single large file is one of those targets for cleanup, since it's generally either something you want, or something you should clean up because a single file makes such a large difference.

Maybe it should be based on the filesystem size? Scale the buckets as the filesystem gets bigger, so that a 10gb filesystem looks different than a 100gb, 1tb, 10tb or even bigger one.
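For illustration, the powers-of-two mapping described above could be as simple as the following sketch (the function name is hypothetical, not duc's actual code):

#include <stdint.h>

/* Map a file size to a power-of-two bucket: bucket 0 holds
 * zero-byte files, bucket n holds sizes in [2^(n-1), 2^n).
 * A sketch only, not duc's actual code. */
static int size_to_bucket(uint64_t size)
{
    int bucket = 0;
    while (size) {
        size >>= 1;
        bucket++;
    }
    return bucket;   /* e.g. 1 byte -> 1, 1024 bytes -> 11 */
}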

Now, going down and keeping per-directory histogram stats is going to be a much bigger issue. People have asked for stuff like this in the past, but we've resisted because it would explode the DB size.

I've got multiple filesystems with 30+ million files, and indexing all of those takes a long time to complete. And there are people using duc on even bigger data stores. One of duc's strengths, in my mind, is that the DB is built outside of the GUI, so you can explore the data without having to wait six, twelve or more hours for the data to be collected, the way windirstat and tools like that work.

Cheers, John

stuartthebruce commented 2 years ago

John, thanks for giving this some serious thought. I also use duc to regularly index a large number of files (5B nightly) and it does a fantastic job of that!


If you agree this is worth pursuing, I also agree that one design goal should be to not significantly increase the database size. And I suspect a reasonable place to start would be with dynamic histogram bin sizes on a logarithmic scale, to minimize how far an individual large file skews the bin sizes.

Another, perhaps simpler, idea would be to support a user setting for min and max file sizes in many (all?) of the database view commands, i.e., everything but index. This would let users make their own decision about which file size(s) to zoom in on, e.g., where are all my <1kByte files?

l8gravely commented 2 years ago

"stuartthebruce" == stuartthebruce @.***> writes:

stuartthebruce> John, thanks for giving this some serious thought. I also use duc to regularly index a large number of files (5B nightly) and it does a fantastic job of that!

We're glad to hear it; it's a fantastic tool. This is one of the areas where filesystems could use better support, but I also realize how expensive it would be to keep this info up to date.

stuartthebruce> If you agree this is worth pursuing, I also agree that one design goal should be to not significantly increase the database size. And I suspect a reasonable place to start would be with dynamic histogram bin sizes on a logarithmic scale, to minimize how far an individual large file skews the bin sizes.

Yeah, that might work. Now, were you thinking of keeping the histogram on a per-directory basis, or just at the top level of the entire DB? The latter would be the much simpler thing to do at first, then possibly extend it down to per-directory levels.

The hard part is how to display this information effectively, both in the web gui and in the CLI and ncurses interfaces. I haven't really gotten that far yet.

stuartthebruce> Another, perhaps simpler, idea would be to support a user setting for min and max file sizes in many (all?) of the database view commands, i.e., everything but index. This would let users make their own decision about which file size(s) to zoom in on, e.g., where are all my <1kByte files?

That would be hard to do, because then you would need to rescan the filesystem to get the proper data. With the goal of keeping the information overhead as small as possible, the bucket sizes would need to be picked ahead of time, so that the stored counts stay bounded.

As a first plan, I would just set up a top-level structure to hold the totals for the DB, and as directories are scanned, the size distribution they report would simply be summed into the top-level totals.
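Concretely, that first pass might look something like the sketch below (the struct and names are hypothetical, assuming the power-of-two buckets discussed earlier):

#include <stdint.h>

#define NBUCKETS 48   /* covers sizes up to ~2^47; arbitrary choice */

/* One histogram for the whole index, updated per scanned file.
 * Hypothetical sketch, not duc's actual data structures. */
struct histogram {
    uint64_t count[NBUCKETS];   /* files per size bucket */
    uint64_t bytes[NBUCKETS];   /* bytes per size bucket */
};

static void histogram_add(struct histogram *h, uint64_t size)
{
    int b = 0;                  /* bucket 0 = zero-byte files */
    for (uint64_t s = size; s > 0; s >>= 1)
        b++;
    if (b >= NBUCKETS)
        b = NBUCKETS - 1;
    h->count[b]++;
    h->bytes[b] += size;
}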

Once that is working, I think it would be pretty trivial to add a per-directory database entry to hold this structure. But this is where the DB size increase would happen in a major way, especially as the number of directories increases. I don't have the per-directory data structure size in my head, so I'd have to dig through the code to get a good idea of how this would work.

A packed data structure of some sort would be best, so as not to keep counters for buckets that aren't used. Of course, as you get higher up the tree, the chance of having empty buckets goes down quite a bit.
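Packed could be as simple as skipping empty buckets and storing counts as LEB128-style varints; a sketch of such an encoding (purely illustrative, not duc's actual DB format):

#include <stddef.h>
#include <stdint.h>

/* Encode v as a LEB128-style varint: 7 data bits per byte, high
 * bit set on all but the last byte. Returns bytes written (at
 * most 10 for a 64-bit value). Illustrative only. */
static size_t varint_encode(uint64_t v, uint8_t *buf)
{
    size_t n = 0;
    do {
        uint8_t byte = v & 0x7f;
        v >>= 7;
        if (v)
            byte |= 0x80;      /* continuation bit */
        buf[n++] = byte;
    } while (v);
    return n;
}

Small counts then cost one byte each, which is what makes per-directory storage at least conceivable.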

Something to look into when I get a chance.

John

stuartthebruce commented 2 years ago

Yeah, that might work. Now, were you thinking of keeping the histogram on a per-directory basis, or just at the top level of the entire DB? The latter would be the much simpler thing to do at first, then possibly extend it down to per-directory levels.

I was thinking of a per-directory basis; however, starting with just a top-level histogram sounds like a good way to expose many of the interesting data structure and user interface questions.

That would be hard to do, because then you would need to rescan the filesystem to get the proper data. With the goal of keeping the information overhead as small as possible, the bucket sizes would need to be picked ahead of time, so that the stored counts stay bounded.

But the rescan would be against the database, not the filesystem, so in some cases it may be worth waiting for that rescan.

A packed data structure of some sort would be best, so as not to keep counters for buckets that aren't used. Of course, as you get higher up the tree, the chance of having empty buckets goes down quite a bit.

Perhaps along the same lines, it might make sense to automatically roll up directory branches until a minimum number of files has been accumulated, i.e., do not dramatically increase individual directory DB sizes for directories that only have a few files in them (or below them).

l8gravely commented 4 months ago

This idea has languished for quite a while, and I have no plans to implement it because even now I don't have a good understanding of what you want. I guess I'd like to see some examples of what you're looking for.

Now, just keeping a running total of the number of files in size buckets wouldn't be too hard; choosing the bucket sizes would be. Something like 0, <1k, 2k, 4k, ..., 16gb+ might be the right answer here. It would certainly be an interesting project to play with. I'd have to think about how to add this into indexing and then how to display it. Hmm... now that I'm doing more with duc again, I'm a bit inspired to try this.

l8gravely commented 4 months ago

So I've just pushed a new branch called 'histogram', based on the 'tkrzw' work, which adds histograms to the DB and the ability to show them using the 'duc info -H' command. It's in an initial state and the docs still need to be updated. Here's an example:

$ ./duc info -d /tmp/tkrzw.db -H
Date       Time     Files   Dirs    Size   Path
2024-05-24 13:39:44 732.9K  102.5K  449.6G /scratch

Histogram:

2^0   24837
2^1     584
2^2    2976
2^3    4438
2^4    6334
2^5    5548
2^6   12586
2^7   32702
2^8   45131
2^9   65917
2^10  75004
2^11  78314
2^12  79386
2^13  60569
2^14  52032
2^15  48982
2^16  34777
2^17  20251
2^18  15619
2^19  12251
2^20  15874
2^21   9982
2^22  23961
2^23   2812
2^24   1680
2^25    143
2^26     69
2^27     48
2^28     13
2^29     19
2^30     13
2^31      4
2^33      3
2^36      1

  1. It currently has buckets done in powers of two, which gives us a breakdown of 0, 1, 2, 4, 8, etc. byte files until we get up to the higher numbers. I don't think this is ideal, but it works so far. I need to turn that 2^n into real human-readable numbers (see the sketch after this list).
  2. Make the bucket sizing more dynamic so we don't hit problems.
  3. A better bucket algorithm to spread things out.
  4. Better output of bucket sizes in a more human-readable format.
  5. Do we need to add JSON output support?
  6. Does it need to be graphed? Or touched in the ui/gui/web interfaces?
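For the human-readable labels mentioned in items 1 and 4, a minimal sketch might be (the helper name is hypothetical, not the actual duc code):

#include <stdio.h>

/* Turn a power-of-two exponent into a human-readable size label,
 * e.g. 10 -> "1K", 22 -> "4M", 36 -> "64G". Hypothetical helper. */
static void bucket_label(int exponent, char *buf, size_t len)
{
    static const char units[] = "BKMGTPE";
    snprintf(buf, len, "%lu%c",
             1UL << (exponent % 10), units[exponent / 10]);
}

So the 2^22 bucket from the listing above would print as 4M.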
stuartthebruce commented 4 months ago

Is that example a histogram of individual file sizes? If so, I think it would also be useful to be able to histogram directory sizes.

l8gravely commented 4 months ago

"stuartthebruce" == stuartthebruce @.***> writes:

Is that example a histogram of individual file sizes? If so, I think it would also be useful to be able to histogram directory sizes.

It's just a histogram of file sizes currently. I'm planning on adding a summation to the plot, so you can see what percentage of filesystem space usage falls into each bucket.

The real trick is coming up with bucket sizing that fits most people's needs. Zero-length files are the first bucket, then files up to 1k, 4k, 8k, 16k, 32k, 64k, 128k, 256k, 512k, 1g, 2g, 4g, 8g... does that work for people?

John

stuartthebruce commented 4 months ago

I like the idea of bucket boundaries being powers of 2, and if you want to add a user argument, it could be how many powers of 2 go into each bucket, with a default of 1 for single-octave resolution, or 10 for coarser-grained k/M/G/T.
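A minimal sketch of that grouping, assuming fine-grained power-of-two bins are already stored (the names are hypothetical, not an existing duc option):

#include <stdint.h>

/* Collapse fine-grained power-of-two bins into coarser display
 * bins, `stride` octaves per bin: stride 1 keeps single-octave
 * resolution, stride 10 yields k/M/G/T boundaries. `coarse` must
 * be zero-initialised with at least (nfine + stride - 1) / stride
 * slots. Illustrative only. */
static void merge_bins(const uint64_t *fine, int nfine,
                       uint64_t *coarse, int stride)
{
    for (int i = 0; i < nfine; i++)
        coarse[i / stride] += fine[i];
}

Storing at the finest resolution and merging at display time would keep the on-disk format independent of the user's choice.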

zevv commented 4 months ago

Hey folks, long time no talk. I must admit I've kind of ignored most of the duc-related traffic as of late; apologies for the lack of time and love spent on this project. @l8gravely, I really do appreciate your efforts at keeping things going, thank you for that!

I do like this histogram feature, though. I agree with the exponential bin size; base 2 and base 10 both make sense, so that would make a nice option.

I'd say we start with just a separate command for dumping histogram data to the console; adding fancy graphs can always be done at a later time. Start small.

@l8gravely Would you be comfortable writing the code for this? If not, I'd be happy to whip up a proof of concept to get this started.

l8gravely commented 4 months ago

"Ico" == Ico Doornekamp @.***> writes:

Hey folks, long time no talk. I must admit I've kind of ignored most of the duc-related traffic as of late; apologies for the lack of time and love spent on this project. @l8gravely, I really do appreciate your efforts at keeping things going, thank you for that!

I do like this histogram feature, though. I agree with the exponential bin size; base 2 and base 10 both make sense, so that would make a nice option.

I'd say we start with just a separate command for dumping histogram data to the console; adding fancy graphs can always be done at a later time. Start small.

@l8gravely Would you be comfortable writing the code for this? If not, I'd be happy to whip up a proof of concept to get this started.

I've already got some basic code for dumping histograms using the 'duc info ...' command with the -H switch. We could certainly make a new command such as:

duc histogram <opts>

and give various options there, but the collection part is where the choice of base 2 or base 10 would have to take place. Or we could just spend the memory/disk space and keep both! That wouldn't be hard to do really.

The big issue is bucket size and spacing. I think powers of 10 is too much, honestly; the gap between 1mb and 1gb is too big, though 1mb, 10mb, 100mb, 1000mb, 10000mb isn't terrible now that I think of it.

So my plan right now is:

  1. add in a new histogram command
  2. keep histograms in the 'info' command for now.
  3. tweak adding both base-2 and base-10 data buckets.

Ico, if you could work on the gui and web side, that would be awesome. I've got the 'histogram' branch going right now, so feel free to build on it, though I'll be pushing new changes there this weekend or next week.

zevv commented 4 months ago

Ah, you already got the work started, nice! I also did some experiments to see how to hack this in, and the implementation should not be too hard.

Having the powers configurable sounds OK to me, although a default of 2 feels natural and would probably fit most needs: it results in a histogram with 30-ish bins, which feels like a very manageable size.

Some questions will probably pop up about how (and whether) to handle recursion.

I'll have to play around a bit with the gui side to see how to properly represent this; there's the choice to either make this a separate thing, or combine the histogram with the normal pie chart into a single graphic. I guess that will just need some playing around and prototyping.

l8gravely commented 4 months ago

"Ico" == Ico Doornekamp @.***> writes:

Ah, you already got the work started, nice! I also did some experiments to see how to hack this in, and the implementation should not be too hard.

Great, let's play around with this. I've just implemented the 'histogram' command as well. I'll be pushing that in a bit.

Having the powers configurable sounds OK to me, although a default of 2 feels natural and would probably fit most needs: it results in a histogram with 30-ish bins, which feels like a very manageable size.

Mine goes up to 2^36 in my test on my home system, because I have a single 72gb file in there. So it needs to auto-scale properly. But the jump between 2^35 and 2^36 just seems too big to me... and the smaller buckets below 128 bytes seem way too close together.

Some questions will probably pop up about how (and whether) to handle recursion.

Nope, it's all just part of the data collection; look at what I did in the index.c stuff. It just fell in quite nicely. Now I do want to make the histogram auto-size nicely. Got some ideas there...

I'll have to play around a bit with the gui side to see how to properly represent this; there's the choice to either make this a separate thing, or combine the histogram with the normal pie chart into a single graphic. I guess that will just need some playing around and prototyping.

I was thinking that for the html version, just having the histogram below the ring chart might be nice, or maybe make it so you can toggle between them? It's not clear to me that there's any real need for interactivity.

One thing I am thinking of adding when I get a chance is a running total of file size in each bucket, and how that accumulates across the entire histogram, with the last bucket ending at 100%.
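A rough sketch of that running-total output (array layout as in the earlier histogram sketch; nothing here is duc's actual code):

#include <stdint.h>
#include <stdio.h>

/* Print each non-empty bucket's file count together with the
 * cumulative share of total bytes, so the last line reads 100%.
 * Hypothetical sketch only. */
static void print_cumulative(const uint64_t *count,
                             const uint64_t *bytes, int nbuckets)
{
    uint64_t total = 0, running = 0;
    for (int i = 0; i < nbuckets; i++)
        total += bytes[i];
    for (int i = 0; i < nbuckets; i++) {
        if (count[i] == 0)
            continue;
        running += bytes[i];
        printf("2^%-3d %10llu files %7.2f%%\n", i,
               (unsigned long long)count[i],
               total ? 100.0 * (double)running / (double)total : 0.0);
    }
}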

I'll try to push some thoughts later this weekend as I get chances.

John

zevv commented 4 months ago

Oh boy, it's happening again: I'm thinking of a new feature and tons of questions come up. I'll just rant a little here; not sure how much is really relevant, but we'll need to make some choices:

But first it's time for my Sunday hike. Priorities, right!

l8gravely commented 4 months ago

"Ico" == Ico Doornekamp @.***> writes:

Oh boy, it's happening again: I'm thinking of a new feature and tons of questions come up. I'll just rant a little here; not sure how much is really relevant, but we'll need to make some choices:

rant away!

  • Aggregation level, global or per-directory? Your current implementation is simple and concise, but it only aggregates the histogram data at the global index level, not for individual directories. The good thing about that is that it's cheap on storage; the bad thing is that there is no histogram data available for individual directories. Personally, I would like to have histograms available for the individual directories I'm currently inspecting. This would require duc to store the histogram for every directory, but that'll eat considerably more storage - at least DUC_HISTOGRAM_MAX varints for every dir.

I think it only needs to be global. Doing it at a per-directory level would be a nightmare. Look at all the issues I've been having with Stuart on his systems; doing that per-directory... that would be extreme.

Now, maybe we do want to do the same type of histogram for the number of files in a directory, since that's another interesting piece of data.

But I think the next feature I want to add is tracking the largest N files found in the system. That's a really useful feature to have. The histogram is a neat set of data, but it doesn't help you find where the disk space is being used. So having a cutoff of 1g at the lower end (but configurable) would be useful. This kind of data is exactly what I used to need to help track down issues, since I could email my users a list of the top N large files by user and ask them to clean up. Nice, easy (sometimes) low-hanging fruit to clean for maximum impact.

And also keeping a list of the top N directories by file count. And of course having a base number to not care about, say 1024 files or fewer.
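As a sketch of how cheap top-N tracking could be at index time (TOPN, the names, and the fixed path buffer are all hypothetical, not duc's implementation):

#include <stdint.h>
#include <stdio.h>

#define TOPN 100            /* arbitrary cap for illustration */

struct top_entry {
    uint64_t size;
    char     path[1024];
};

static struct top_entry topn[TOPN];   /* zero-initialised */

/* Offer one file to the top-N table: if it beats the smallest
 * entry kept so far, it replaces that entry. O(TOPN) per file,
 * which is cheap for TOPN = 100. Illustrative only. */
static void topn_offer(uint64_t size, const char *path)
{
    int min = 0;
    for (int i = 1; i < TOPN; i++)
        if (topn[i].size < topn[min].size)
            min = i;
    if (size <= topn[min].size)
        return;
    topn[min].size = size;
    snprintf(topn[min].path, sizeof(topn[min].path), "%s", path);
}

The configurable lower bound mentioned above would just be an early return before the scan.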

  • Aggregate at index time or at display time? Building the histogram at index time makes sense because all the data is there at the right time, and querying would be fast. The price we pay is storage. Alternatively, we could generate the histogram at display time and have it match the data being displayed: when we draw the pie chart we already recursively traverse the database N levels deep, so it would be trivial to accumulate the histogram bins at that time. The advantage is that the histogram actually matches what is displayed at that moment, and we can generate two different data sets: one for files, one for directories. The downside is that there would be no easy way to histogram the FS at large without traversing the whole database.

Again, we can only gather the data during index time; there's no way it would work during display time, and that's not duc's reason to exist. We handle the stupidly large systems that can't work well with simple 'du -sh *' type calls because those beat the crap out of the filesystem. Same with histograms.

  • Bin sizes: if storage is not an issue, we can just choose a larger number of bins (256 or so) and use an exponent smaller than 2 (sqrt(2), for example). If the user chooses to display the histogram with a lower number of bins, it'll be trivial to combine bins at display time. That would help with your problem of the useless small-size bins (<128) and the subjectively large gaps for huge files.

It's not so much the number of bins; they're actually cheap on a per-filesystem basis. It's the spacing that I'm trying to figure out. We don't really care about files below 512 bytes, but we do care about files between 1g and 10g, right? Or 10g to 100g in size. How fine-grained do we want to be here?

But first it's time for my Sunday hike. Priorities, right!

Sounds like an excellent priority!

John

zevv commented 4 months ago

Again, we can only gather the data during index time; there's no way it would work during display time, and that's not duc's reason to exist. We handle the stupidly large systems that can't work well with simple 'du -sh *' type calls because those beat the crap out of the filesystem. Same with histograms.

Yes, you're right, it's two different use cases. Your case is to get some global stats for a file system as a whole, so it would indeed make sense to do this at index time and just add this as metadata for that particular index.

My use case would be to draw a histogram of only the part of the graph that's being displayed: so if I get a pie chart of some directory 5 levels deep, the histogram would only show the files for those 5 levels, as these directories are already traversed when drawing the graph.

I did a little POC of this in the zevv-histogram branch; you might want to have a peek at it to see one possible way of displaying the histogram data. It's pretty basic, but the parts are there.

l8gravely commented 4 months ago

"Ico" == Ico Doornekamp @.***> writes:

Again, we can only gather the data during index time; there's no way it would work during display time, and that's not duc's reason to exist. We handle the stupidly large systems that can't work well with simple 'du -sh *' type calls because those beat the crap out of the filesystem. Same with histograms.

Yes, you're right, it's two different use cases. Your case is to get some global stats for a file system as a whole, so it would indeed make sense to do this at index time and just add this as metadata for that particular index.

Good we're in agreement.

My use case would be to draw a histogram of only the part of the graph that's being displayed: so if I get a pie chart of some directory 5 levels deep, the histogram would only show the files for those 5 levels, as these directories are already traversed when drawing the graph.

Hmm... an interesting idea, but it really breaks when the http server you're pulling the info from doesn't have access to the system where the indexes were generated. I had multiple data centers I ran 'duc' in, but I pushed all the DBs to a central host to make displaying the data easier.

So I don't think this is really going to work. It's just not part of the core design of duc, which is to separate the indexing and the data display into two totally separate processes.

Look at me lecturing the main author! LOL!

I did a little POC of this in the zevv-histogram branch; you might want to have a peek at it to see one possible way of displaying the histogram data. It's pretty basic, but the parts are there.

I'll take a look at some point later tonight.

For example, Stuart's test case was taking 9+ hours to run and grab the data. So when displaying data, how would you dynamically run through the filesystem to get histogram data?

I really want to also add a feature where I grab the top 100 largest files (or some percentage of the filesystem's number of files), with a lower bound of some sort. Mostly it's me trying to learn how to write this all in C again. LOL! I'm so rusty it's not funny.

zevv commented 4 months ago

Look at me lecturing the main author! LOL!

Yeah, well, someone needs to step up and teach the dick a lesson every now and then!

how would you dynamically run through the filesystem to get histogram data?

You don't, because all the data is in the duc db, right?

zevv commented 4 months ago

Mostly it's me trying to learn how to write this all in C again. LOL! I'm so rusty it's not funny.

Well, the hip thing to do here would be to rewrite the whole thing in Rust, eh!

l8gravely commented 4 months ago

"Ico" == Ico Doornekamp @.***> writes:

Look at me lecturing the main author! LOL!

Yeah, well, someone needs to step up and teach the dick a lesson every now and then!

how would you dynamically run through the filesystem to get histogram data?

You don't, because all the data is in the duc db, right?

That's one option, but I suspect it's going to blow up the size of the DB to keep that data on a per-directory basis. I don't know for sure. It's probably not a problem on smaller systems, but with lots and lots of directories... it could get expensive.

For now, I'm going to work on finding the top N files and keeping them in the DB as well. I think that will be a really useful feature to add to the indexing and data side.

Then I have to figure out how to report it. 'duc topn'? I'm not wild about the 'topn' name. "duc largest"? "duc files"? For now I'll just add it to 'duc info' like I did for my initial histogram implementation.

John

l8gravely commented 4 months ago

"Ico" == Ico Doornekamp @.***> writes:

Mostly it's me trying to learn how to write this all in C again. LOL! I'm so rusty it's not funny.

Well, the hip thing to do here would be to rewrite the whole thing in Rust, eh!

What about 'go' instead? :-)

l8gravely commented 3 months ago

Guys, I've just pushed my next set of patches to implement the topN files by size, added into duc on the 'histogram' branch. It still needs a bunch of work, but I'm on vacation, so I figured I'd let people look at it.

Please pull the 'histogram' branch and poke at it. I've got future work to do here though:

  1. remove my debugging printf() calls
  2. fix writing the topN data into the DB properly when indexing is done
  3. fix reading of the topN data (should work...) so the results can be displayed properly
  4. fix option passing when indexing, to adjust the minimum size of file to index and to change the number of files to index
  5. lots more testing of edge cases, especially when reading old-version files.

But I think I'm getting closer, and I hope to release v1.5.0a sometime in late July. So any patches, comments, or fixes are welcome!

John

l8gravely commented 2 weeks ago

So this initial idea has been implemented in v1.5.0-rc1, which I released earlier this week. I'd love to get some testing and feedback.