zevv / duc

Dude, where are my bytes: Duc, a library and suite of tools for inspecting disk usage
GNU Lesser General Public License v3.0
Feature request: atime/mtime/ctime metadata based color palette #186

Open krisp opened 7 years ago

krisp commented 7 years ago

Hi there! Great work on this project. It's a fantastic way to visualize storage utilization. I've been using it for about 3 or 4 months now to index a few PB of research data to show consumption to my users. One question I get a lot is "do these colors represent file age?" Wouldn't it be great if they did? I haven't looked much at the code yet, but from what I understand at a high level it crawls the filesystem, stats files to get their sizes, and records them in a Tokyo Cabinet db file.

Do you think it would be a huge undertaking to adjust the schema and record some additional folder metadata at index time to use to determine the color palette of the chart? If not, I think this would be a great addition to the tool.

zevv commented 7 years ago

Hi David,

Quoting David M (2017-10-07 02:25:40)

> Hi there! Great work on this project. It's a fantastic way to visualize storage utilization. I've been using it for about 3 or 4 months now to index a few PB's of research data to show consumption to my users. One question I get a lot is "do these colors represent file age?" Wouldn't it be great if they did? I haven't looked much at the code yet but from what I understand at a high level it is crawling the filesystem and statting files to get their sizes, and recording them in a tokyocabinet db file.
>
> Do you think it would be a huge undertaking to adjust the schema and record some additional file metadata at index time to use to determine the color palette of the chart? If not I think this would be a great addition to the tool.

That's a neat idea, funny that it never occurred to me to add this, given the absolute usefulness of the concept.

Adding the <c|m|a>time to the database should be trivial, as this data is already collected during indexing, just not stored.

The graph writing is a tad more complex, because the color should probably be relative to all the files in the graph - oldest file red, newest file green, something like that. This means the tree has to be traversed twice: once to find the oldest and newest time, and a second time for drawing.
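A minimal sketch of the two-pass idea, in Python rather than Duc's C, on a made-up in-memory tree. `Node`, `time_range`, `color_for` and `colorize` are hypothetical names for illustration, not Duc's API: the first pass finds the oldest and newest mtime, the second maps every entry onto a red-to-green scale relative to that range.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    mtime: int  # seconds since the epoch
    children: list = field(default_factory=list)


def time_range(node):
    """First pass: return (oldest, newest) mtime found in the subtree."""
    lo = hi = node.mtime
    for child in node.children:
        c_lo, c_hi = time_range(child)
        lo, hi = min(lo, c_lo), max(hi, c_hi)
    return lo, hi


def color_for(mtime, oldest, newest):
    """Map an mtime to (r, g, b): oldest is pure red, newest is pure green."""
    if newest == oldest:
        return (0, 255, 0)  # everything has the same age
    t = (mtime - oldest) / (newest - oldest)
    return (round(255 * (1 - t)), round(255 * t), 0)


def colorize(node, oldest, newest, prefix=""):
    """Second pass: yield (path, color) for every entry in the tree."""
    path = prefix + node.name
    yield path, color_for(node.mtime, oldest, newest)
    for child in node.children:
        yield from colorize(child, oldest, newest, path + "/")


# Made-up example data.
root = Node("data", 1_500_000_000, [
    Node("old.tar", 1_300_000_000),
    Node("new.csv", 1_507_000_000),
])
oldest, newest = time_range(root)
colors = dict(colorize(root, oldest, newest))
```

Here `colors["data/old.tar"]` comes out red and `colors["data/new.csv"]` green; everything else lands in between.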

I'll look into this when I find the time!

-- :wq ^X^Cy^K^X^C^C^C^C

l8gravely commented 7 years ago

"Ico" == Ico Doornekamp notifications@github.com writes:

Ico> That's a neat idea, funny that it never occurred to me to add this, given the absolute usefulness of the concept.

It is a neat idea, right up there with the file-count-per-directory that people ask for from time to time.

Ico> Adding the <c|m|a>time to the database should be trivial, as this data is already collected during indexing, just not stored.

I just worry that the DB size will end up bigger and bigger unless we can figure out some way to compress it. Maybe we just use a single byte, or maybe two, to store the file's mtime as an offset from the DB time. I think we only need one-week granularity, with files more than 250 weeks old all being considered stale. Does that make sense? Or do we change the granularity to a month?

Or some scheme where it's days, then weeks, then months in age crammed into a byte?
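One way such a "days, then weeks, then months" byte could look, as a sketch only. The band boundaries (100 days, 800 days) and the names `encode_age` / `decode_age` are assumptions picked for illustration; nothing here is an existing Duc format.

```python
DAY_CODES = 100   # codes 0..99: age in whole days
WEEK_LIMIT = 800  # codes 100..199: 7-day steps for ages 100..799 days
                  # codes 200..255: 30-day steps from 800 days on, saturating


def encode_age(days):
    """Pack an age in days into one byte (0..255), coarser as files get older."""
    if days < 0:
        raise ValueError("age must be non-negative")
    if days < DAY_CODES:
        return days
    if days < WEEK_LIMIT:
        return 100 + (days - DAY_CODES) // 7
    return min(255, 200 + (days - WEEK_LIMIT) // 30)


def decode_age(code):
    """Return the smallest age in days that maps to `code`."""
    if code < 100:
        return code
    if code < 200:
        return DAY_CODES + (code - 100) * 7
    return WEEK_LIMIT + (code - 200) * 30
```

With these made-up boundaries, anything older than about 6.7 years (2450 days) collapses into code 255, which would be the "stale" bucket.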

Ico> The graph writing is a tad more complex, because the color should probably be relative to all the files in the graph - oldest file red, newest file green, something like that. This means the tree has to be traversed twice: once to find the oldest and newest time, and the second time for drawing.

It should be simple enough to add two bytes to the directory info which list the youngest and oldest files found in the directory as the scan is done, so that generation stays reasonably quick.
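A rough sketch of recording the oldest and newest file per directory while scanning, so no extra pass is needed at draw time. It works on a plain list of `(path, mtime)` pairs instead of a real filesystem, and `rollup` is a hypothetical helper, not part of Duc. Each file updates every ancestor directory, which also gives the "inherit all the way up" behaviour.

```python
import posixpath


def rollup(files):
    """Return {directory: (oldest_mtime, newest_mtime)} for every ancestor
    directory of the given (path, mtime) pairs."""
    spans = {}
    for path, mtime in files:
        directory = posixpath.dirname(path)
        while True:
            lo, hi = spans.get(directory, (mtime, mtime))
            spans[directory] = (min(lo, mtime), max(hi, mtime))
            if directory in ("/", ""):
                break  # reached the top of the tree
            directory = posixpath.dirname(directory)
    return spans


# Made-up example: years stand in for real timestamps.
spans = rollup([
    ("/etc/file1", 2010),
    ("/etc/ssl/cert", 2005),
    ("/var/log", 2018),
])
```

In this example `/etc` ends up with the span `(2005, 2010)` even though its oldest file sits one level deeper.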

Ico> I'll look into this when I find the time!

As the days get darker, I'm doing more development again myself!

robina80 commented 6 years ago

any news on this as this would be a really cool feature to have, to show the a/c/m time

zevv commented 6 years ago

No news yet, but thanks for reminding me, I might find some time one of these days to look into this.

I just realized I missed John's post about the timestamps. Storing a relative time is a good idea indeed. Because Duc uses variable length integers in the database anyway, this will nicely reduce the size without needing other tricks.
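To illustrate why a relative time helps with variable-length integers, here is a small sketch. Duc's actual on-disk integer format is not described in this thread, so the LEB128-style encoding below (7 payload bits per byte) is an assumption, and `encode_varint` is a hypothetical name.

```python
def encode_varint(n):
    """Encode a non-negative integer with 7 payload bits per byte
    (LEB128 style); the high bit marks 'more bytes follow'."""
    if n < 0:
        raise ValueError("varint must be non-negative")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


# Made-up numbers: an absolute mtime versus the same file's age in days
# relative to the time the index was created.
index_time = 1_541_895_940
mtime = 1_507_335_940
age_days = (index_time - mtime) // 86_400

absolute_len = len(encode_varint(mtime))
relative_len = len(encode_varint(age_days))
```

Under this encoding the absolute timestamp needs 5 bytes, while the age in days fits in 2.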

l8gravely commented 6 years ago

"Ico" == Ico Doornekamp notifications@github.com writes:

Ico> During implementation I ran into some questions: what should be the time of a directory? The actual times reported by the file system, or should a directory inherit the oldest time stamp of all the files and subdirectories in it?

I would think the least surprise would be to inherit the st_mtime value you get when lstat() is done on a directory. It's defined as the time of last create/delete of files in that directory. Having it inherit an arbitrarily deep time doesn't seem to make sense.

So let's think about how this would be useful either way. If the time is left as is, you can't see at the top level if a directory has old contents when a single new file was created in that directory but all the rest are old.

But that would be visually represented, because the color of the directories underneath it would still be blue, since they have old files.

Oh yeah, I'm thinking we use dark blue for old, cold files, and bright red for new, hot files. With a gradation between of course.

Thinking some more, would it be better to return the average (or better yet, the median) age of the files/directories underneath instead? This feature is a way to help find old stuff, so if you have a directory with mostly old stuff, you don't want a single small new file to throw off the measurements.

It might also be nice to actually have the age returned be the median of age * size, so larger older files have more weight than younger, smaller files. But a new, large file would help skew the age back to newer.
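One possible reading of "median of age * size" is a size-weighted median: the age at which half of the directory's bytes are that young or younger. The sketch below assumes that reading; `weighted_median_age` is a made-up helper, not something Duc provides.

```python
def weighted_median_age(files):
    """files: list of (age_days, size_bytes).
    Return the age at which the cumulative size, counted from the youngest
    file, first reaches half of the total size (None if there are no bytes)."""
    total = sum(size for _, size in files)
    if total == 0:
        return None
    running = 0
    for age, size in sorted(files):
        running += size
        if running * 2 >= total:
            return age
    return None


# Mostly old data plus one tiny new file: the result stays old.
mostly_old = [(900, 500), (800, 400), (1, 1)]
# Adding one large new file pulls the result back to "new".
with_big_new = mostly_old + [(2, 2000)]
```

With these made-up numbers the first directory reports 900 days and the second 2 days, so a small new file does not hide old data, but a large new one does shift the result.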

Thoughts?

John

zevv commented 6 years ago

Quoting John (2018-04-03 16:45:37)

"Ico" == Ico Doornekamp notifications@github.com writes:

> Ico> During implementation I ran into some questions: what should be the time of a directory? The actual times reported by the file system, or should a directory inherit the oldest time stamp of all the files and subdirectories in it?
>
> I would think the least surprise would be to inherit the st_mtime value you get when lstat() is done on a directory. It's defined as the time of last create/delete of files in that directory. Having it inherit an arbitrarily deep time doesn't seem to make sense.

It does: if you are looking for the oldest files on your system, and your directories are deeper than the default 4 levels Duc displays, you will not be able to see where these files live. By inheriting the oldest entry all the way up from the bottom, it is obvious where you can find the oldest files.

> Thinking some more, would it be better to return the average (or better yet median age) of the files/directories underneath instead?

Yeah, I've been thinking of doing something weighted, but I think there is no one single best solution.

> It might also be nice to actually have the age returned be the median of age * size, so larger older files have more weight than younger, smaller files. But a new, large file would help skew the age back to newer.
>
> Thoughts?

Ideally, the user should be able to choose this in some way. The problem is that this will probably become a mess to use and configure.

I'm tinkering with some implementations, but I have not yet found something that results in an intuitive result...

robina80 commented 6 years ago

i was just thinking about this

like apache directory listing if you enable it in "httpd.conf" and when you go to URL link the options at the top is "last modified"

is this helpful

l8gravely commented 6 years ago

"robina80" == robina80 notifications@github.com writes:

robina80> i was just thinking about this

robina80> like apache directory listing if you enable it in "httpd.conf" and when you go to URL link the options at the top is "last modified"

robina80> is this helpful

Not really, since the display issue is really separate from the database format issue, which means there needs to be a format change to support newer modes.

And doing this in a backwards compatible way is going to be hard, but not impossible.

Maybe when the winter comes I'll have more time to hack on this. Or Ico, who is a better programmer than me for sure.

John

robina80 commented 6 years ago

im afraid all the programming i know is linux bash or windows batch scripting but im more than willing to lend a hand

robina80 commented 6 years ago

does 1.4.4 do this?

zevv commented 6 years ago

Quoting robina80 (2018-09-28 17:39:12)

> does 1.4.4 do this?

No, sorry.

robina80 commented 6 years ago

thank you zevv

mfalda commented 6 years ago

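I am using your beautiful tool for indexing a huge partition to free some space, and I think that modification dates would be useful not (just) for colors but as an additional filter and grouping criterion. Zooming would refer to years, trimesters, months and so on but in a more "static" way; the time depth could be specified when indexing. For example, in the sub-command ui there would be a list of dates (years or months according to the zoom level) on top or on the left (vertically), whose items could be selected using PgDown and PgUp; when a date is selected all files more recent than that date would be filtered out.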

Thank you in any case.

zevv commented 6 years ago

Hi Marco,

Quoting Marco Falda (2018-10-03 14:20:55)

> in the sub-command ui there would be a list of dates (years or months according to the zoom level) on top or on the left (vertically), whose items could be selected using PgDown and PgUp; when a date is selected all files more recent than that date would be filtered out;

Thanks for your suggestions; I've been looking into similar features in the past, but I ran into some structural issues that I have not resolved yet.

The problem has to do with the way totals for directories are calculated and handled: during indexing all directories are traversed, and the total size of each directory is stored in the database (this is what makes Duc so much faster than using 'du': there is no need to traverse the tree to get totals).

This mechanism will not work however when filtering on dates; consider the following tree for example:

    - /etc (size=300b, date=2010)
      |- file1 (size=100b date=2010)
      `- file2 (size=200b date=2011)

During indexing the size of the '/etc' directory will be calculated, being 300 bytes in this example - but what do I need to do when the user wants to inspect only a certain year, like 2010? In this case only '/etc/file1' is shown, but the total size of the '/etc/' directory is now wrong, since this still shows 300b.

It is generally hard to do any kind of filtering on a tree with pre-calculated totals. A solution could be to traverse the database and make new totals only matching the selected files on demand, but I'm afraid this will not scale well for large trees.
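The conflict can be sketched on a small made-up `/etc` tree with a 100-byte file from 2010 and a 200-byte file from 2011. This is an illustration in Python with hypothetical names (`Entry`, `index_totals`, `filtered_total`), not Duc's real data structures: the stored total is a single lookup, while a filtered total has to visit every entry below the directory again.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    name: str
    size: int = 0  # for a directory: the total stored at index time
    year: int = 0
    children: list = field(default_factory=list)


def index_totals(entry):
    """Index time: compute and store each directory's total once, bottom-up."""
    if entry.children:
        entry.size = sum(index_totals(child) for child in entry.children)
    return entry.size


def filtered_total(entry, year):
    """Query time: re-walk the whole subtree, counting only files from `year`.
    The cost grows with the number of entries below `entry`."""
    if not entry.children:
        return entry.size if entry.year == year else 0
    return sum(filtered_total(child, year) for child in entry.children)


etc = Entry("/etc", year=2010, children=[
    Entry("file1", size=100, year=2010),
    Entry("file2", size=200, year=2011),
])
index_totals(etc)
```

After indexing, `etc.size` is 300 regardless of any filter, whereas `filtered_total(etc, 2010)` gives 100 only by walking the children.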

mfalda commented 6 years ago

I understand that the problem is not easily solvable in an efficient way. For this purpose I used du and Pandas (Python dataframes that can be indexed). Perhaps the dates could be stored in the database and post-processed for a limited number of years and depth, creating additional databases with the same prefix (homes.db, homes-2015.db, ...) that would be selected when filtering. Alternatively, dynamic filters could be applied when exploring the hierarchy in ui or gui (always with a limited number of directories). I recognize that this would be cumbersome, though.

l8gravely commented 6 years ago

"Ico" == Ico Doornekamp notifications@github.com writes:

Ico> Hi Marco,
Ico> Quoting Marco Falda (2018-10-03 14:20:55)

Ico> > in the sub-command ui there would be a list of dates (years or months according to the zoom level) on top or on the left (vertically), whose items could be selected using PgDown and PgUp; when a date is selected all files more recent than that date would be filtered out;

Ico> Thanks for your suggestions; I've been looking in similar features in the past, but I run into some structural issues that I have not resolved yet.

Ico> The problem has to do with the way totals for directories are calculated and handled: during indexing all directories are traversed, and the total size of each directory is stored in the database (this is what makes Duc so much faster than using 'du', no use to traverse the tree to get totals).

Ico> This mechanism will not work however when filtering on dates; consider the following tree for example:

Ico> - /etc (size=300b, date=2010)
Ico>   |- file1 (size=100b date=2010)
Ico>   `- file2 (size=200b date=2011)

Maybe it makes sense instead to add a 10-entry field which holds totals of the data below for various age bands. It would require a new DB format, but maybe it could be made flexible enough.

Something like the following breakdown, and we just record the percentage of space/files in each bucket. It's not perfect, but might be an answer.

< 1 week, < 2 weeks, < 1 month, < 3 months, < 6 months, < 12 months, < 2 years, < 3 years, < 5 years, > 5 years

So it would expand the DB size by quite a bit, but we might be able to make it dynamic so that if all the files are less than a week old, we only keep one entry at 100%.
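A sketch of how such bands could be computed, storing only the non-empty ones. The day counts used for months and years (30 and 365) and the names `BAND_LIMITS_DAYS`, `band_index` and `band_shares` are assumptions for illustration.

```python
import bisect

# Upper limits in days for the first nine bands; the tenth band is "> 5 years".
BAND_LIMITS_DAYS = [7, 14, 30, 90, 180, 365, 730, 1095, 1825]


def band_index(age_days):
    """Return 0 for '< 1 week' up to 9 for the oldest band."""
    return bisect.bisect_right(BAND_LIMITS_DAYS, age_days)


def band_shares(files):
    """files: list of (age_days, size_bytes).
    Return {band_index: percent_of_bytes}, keeping only non-empty bands."""
    totals = {}
    for age_days, size in files:
        band = band_index(age_days)
        totals[band] = totals.get(band, 0) + size
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {band: 100 * size / grand_total for band, size in sorted(totals.items())}
```

A directory whose files are all a few days old then stores the single entry `{0: 100.0}`, which is the "one entry at 100%" case.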

Ico> during indexing the size of the '/etc' directory will be calculated, being 300 bytes in this example - but what do I need to do when the user wants to inspect only a certain year like 2010? In this case only '/etc/file1' is shown, but the total size of the '/etc/' directory is now wrong, since this still shows 300b.

I don't think we can provide that; it's more that we'd give them a heat map of where the oldest/youngest files are so they can investigate in a more targeted manner.

Ico> It is generally hard to do any kind of filtering on a tree with pre-calculated totals. A solution could be to traverse the database and make new totals only matching the selected files on demand, but I'm afraid this will not scale well for large trees.

Nah, we don't want to do this at all, we want to keep the data gathering all in one place, and just compute summaries as we go up the tree.

Maybe as a proof of concept we do < 1 month, < 1 year, > 2 years, which, with three tuples in the DB, gives us four bands of data.
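A proof-of-concept sketch of computing band summaries while going up the tree. It assumes three cut-offs at 1 month, 1 year and 2 years (30, 365 and 730 days), giving four bands, and a made-up nested-dict tree where a file is an `(age_days, size)` tuple; `summarize` is a hypothetical name.

```python
import bisect

THRESHOLDS_DAYS = [30, 365, 730]  # assumed cut-offs: 1 month, 1 year, 2 years


def summarize(tree, path="/", out=None):
    """Fill `out` with {directory_path: [bytes per age band]}, bottom-up.
    `tree` maps names to either a sub-dict (directory) or (age_days, size)."""
    if out is None:
        out = {}
    bands = [0] * (len(THRESHOLDS_DAYS) + 1)
    for name, child in tree.items():
        if isinstance(child, dict):
            child_path = path.rstrip("/") + "/" + name
            summarize(child, child_path, out)
            child_bands = out[child_path]
        else:
            age_days, size = child
            child_bands = [0] * len(bands)
            child_bands[bisect.bisect_right(THRESHOLDS_DAYS, age_days)] = size
        # A directory's summary is the sum of its children's summaries.
        bands = [a + b for a, b in zip(bands, child_bands)]
    out[path] = bands
    return out


# Made-up example tree.
summary = summarize({
    "etc": {"file1": (10, 100), "file2": (400, 200)},
    "old": {"dump": (2000, 1000)},
})
```

Each directory then carries its own per-band byte counts, so a heat map can be drawn from the stored summaries without re-walking the files.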

John