zevv / duc

Dude, where are my bytes: Duc, a library and suite of tools for inspecting disk usage
GNU Lesser General Public License v3.0

read a pipe/file for indexing #180

Open ajw1980 opened 7 years ago

ajw1980 commented 7 years ago

I'm testing out duc on a GPFS file system with about 80 million files. Building an index is going to take a while. I was wondering if it would be possible to add support for reading a pipe or a file to create the index. I can use a GPFS policy to create a complete list of files in about 20 minutes, so if there was a way to use that for duc it would vastly reduce index build time.

l8gravely commented 7 years ago


Can you give us more details on how GPFS would return such a list? Duc uses the regular POSIX (ok, really unix) filesystem calls to read and process the filesystem. Adding in another input method shouldn't be too hard actually...

ajw1980 commented 7 years ago

The native format for the generated list from a GPFS policy is described here:

https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_recordformat.htm

Possible file attributes that can be shown are here:

https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1adv_usngfileattrbts.htm

I made some output for this policy:

RULE 'listall' list 'all-files' DIRECTORIES_PLUS SHOW( varchar(kb_allocated) || ' ' || varchar(file_size) || ' ' || varchar(mode) )

Here is the output:

    5001:000fffffffffffff:0000000000072274:1001b:100:adc447:10003:200:20!share/man/man1/duc.1:12!chulocaldata;21!256 21756 -rw-r--r--
    5001:000fffffffffffff:0000000000072278:1001b:0:adc447:0:4000600:9!share/man:6!system;18!0 4096 drwxr-xr-x
    5001:000fffffffffffff:00000000000722b3:10009:0:adc447:0:4000600:14!share/man/man1:6!system;18!0 4096 drwxr-xr-x
    5001:000fffffffffffff:00000000000722c7:10008:0:adc447:0:4000600:5!share:6!system;18!0 4096 drwxr-xr-x
    5001:000fffffffffffff:00000000000722c8:10008:0:adc447:0:4000600:3!bin:6!system;18!0 4096 drwxr-xr-x
    5001:000fffffffffffff:00000000000722c9:10008:0:adc447:0:4000600:1!.:6!system;18!0 4096 drwxr-xr-x
    5001:000fffffffffffff:00000000000722ca:10008:200:adc447:10003:200:7!bin/duc:12!chulocaldata;22!512 431849 -rwxr-xr-x

The pathname is relative to /gpfs/fs3/admin/duc/ppc64 which is the path I used for the policy command. kb_allocated vs file_size is the same distinction as actual vs apparent in duc (the 0/4096 is because this file system uses 4k inodes so anything stored in an inode will report 0 data for kb_allocated).
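As a rough illustration of how such records could be turned into the values duc cares about, here is a minimal Python sketch. It is inferred only from the sample output in this thread, not from the IBM format documentation, so treat every structural assumption as unverified: the eight leading colon-separated fields are skipped as opaque, the pathname and the field after it are read as `<length>!<text>`, and the SHOW text after the last `!` is split on whitespace. The names `PolicyRecord`, `take_prefixed` and `parse_record` are hypothetical.

```python
from typing import NamedTuple, Tuple


class PolicyRecord(NamedTuple):
    path: str          # relative to the directory the policy was run on
    kb_allocated: int  # "actual" usage in duc terms
    file_size: int     # "apparent" size in duc terms
    mode: str          # e.g. "-rw-r--r--" or "drwxr-xr-x"


def take_prefixed(text: str, pos: int) -> Tuple[str, int]:
    """Read '<length>!<text>' starting at pos; return (text, next position)."""
    bang = text.index("!", pos)
    length = int(text[pos:bang])
    start = bang + 1
    return text[start:start + length], start + length


def parse_record(line: str) -> PolicyRecord:
    """Parse one record shaped like the sample output (assumed layout)."""
    line = line.rstrip("\n")
    pos = 0
    # Assumption: eight colon-separated fields precede the pathname.
    for _ in range(8):
        pos = line.index(":", pos) + 1
    # The length prefix lets the pathname itself contain ':' or '!'.
    path, pos = take_prefixed(line, pos)
    if line[pos:pos + 1] != ":":
        raise ValueError("expected ':' after the pathname")
    # Second length-prefixed field; its meaning is not stated in the thread.
    _, pos = take_prefixed(line, pos + 1)
    if line[pos:pos + 1] != ";":
        raise ValueError("expected ';' before the SHOW text")
    # The stated SHOW length is one more than the visible text in the
    # sample, so it is ignored here and the remainder is split on spaces.
    bang = line.index("!", pos)
    kb_allocated, file_size, mode = line[bang + 1:].split()
    return PolicyRecord(path, int(kb_allocated), int(file_size), mode)
```

Calling `parse_record()` on the first sample line gives the path `share/man/man1/duc.1` with `kb_allocated` 256 and `file_size` 21756. A real converter would need to be checked against the record format documentation linked above.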

There is an alternate policy method that executes an external script and gives the script a file name with a list of files in it. These file lists are usually not the entire set of files at once. The file lists are split into 100 files at a time by default. Here is the documentation for that format:

https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_filelistfile.htm

Alternatively, you could of course just specify an input format for a duc index command and I can write a script to mangle the file list however is required for that.


l8gravely commented 7 years ago


ajw1980> The pathname is relative to /gpfs/fs3/admin/duc/ppc64 which is the path I used for the policy command. kb_allocated vs file_size is the same distinction as actual vs apparent in duc (the 0/4096 is because this file system uses 4k inodes so anything stored in an inode will report 0 data for kb_allocated).

Ok, that's good to know.

ajw1980> There is an alternate policy method that executes an external script and gives the script a file name with a list of files in it. These file lists are usually not the entire set of files at once. The file lists are split into 100 files at a time by default.

This doesn't seem like a win, unless the script can push out blocks of 100 files faster than duc can traverse the filesystem.


ajw1980> Alternatively, you could of course just specify an input format for a duc index command and I can write a script to mangle the file list however is required for that.

I'd have to go back and look more closely at the duc code to remember how the traversal algorithm works, and if it's not the same as the GPFS method, then I suspect it's going to be problematic. Unfortunately, I don't have time to look at this until next week at the earliest.

But off the top of my head, you will need to do the following:

  1. update the 'duc index' command to support a new set of switches, something like:

    --pipe=gpfs

     to tell the indexer to read from stdin, using the GPFS parsing format. Can't hurt to leave some expandability in place.

  2. implement the GPFS filter to set up its own versions of the libduc/index.c functions scanner_new(), scanner_scan() and scanner_free() at a minimum to fill the DB.

  3. figure out how to plug this all into the machinery. It might be simpler to make a new program which calls into the libduc library to do the work outside of the 'duc' tool. Certainly for an implementation test that would be easier.

Then hopefully once the DB is created, the regular tools will be able to open it up and parse the data.
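To make the "new program outside of the duc tool" idea a bit more concrete, here is a small Python sketch of the front end only: a switch that selects the input format, plus a loop that reads records from stdin. Everything in it is hypothetical, including the program name `duc-pipe-index`, the `PARSERS` table and the placeholder `parse_gpfs_line()`. The `--pipe` switch is only the suggestion made above and does not exist in duc, and the sketch does not write a duc database at all.

```python
import argparse
import sys


def parse_gpfs_line(line):
    """Placeholder: a real parser would decode one GPFS policy record."""
    return line.rstrip("\n")


# One entry per supported input format, so other formats can be added later.
PARSERS = {"gpfs": parse_gpfs_line}


def build_arg_parser():
    ap = argparse.ArgumentParser(prog="duc-pipe-index")  # hypothetical tool
    ap.add_argument("--pipe", choices=sorted(PARSERS), required=True,
                    help="format of the file list read from stdin")
    return ap


def read_entries(stream, fmt):
    """Yield one parsed entry per non-empty input line."""
    parse = PARSERS[fmt]
    for line in stream:
        if line.strip():
            yield parse(line)


def main(argv=None, stream=None):
    args = build_arg_parser().parse_args(argv)
    stream = sys.stdin if stream is None else stream
    count = 0
    for _entry in read_entries(stream, args.pipe):
        count += 1  # a real tool would hand the entry to the index here
    print(f"read {count} entries", file=sys.stderr)  # diagnostics on stderr
    return count
```

A script would call `main()` and be fed the policy output on stdin. Keeping the format name as a lookup key is one simple way to leave the expandability mentioned in step 1.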

So how would you set up the pipe to export the data? I assume it would export via STDOUT, with errors to STDERR, etc?

From what I see, the tricky thing to confirm is whether or not the GPFS export script correctly escapes any filenames that contain the separators, though if you make the separator '|' it shouldn't be a problem. Just off the top of my head the fields would be:

    :::

which is really just the data found in the fstatat() call used by duc.

Does GPFS support hardlinks? Do you care about hardlinks?

From looking at it, it shouldn't be too hard to take the scanner_scan() core and turn it into filter_gpfs_scan() or something like that to parse the data and stuff it into the DUC DB. From what I see, if your filter returns stuff in DEPTH FIRST order, then it's going to be ok, but if not, then it will take more work and more memory. By depth first I mean that:

    scan(dir) {
        opendir(dir)
        while entry = readdir(dir)
            if (entry_type == DIR)
                scan(entry)
            else
                add to totals(entry)
            end
        done
    }

It's a recursive process, walking the tree all the way down before it returns. Your filter needs to return data like that.

John
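To illustrate the "more work and more memory" case, here is a small Python sketch of what a filter might have to do when records do not arrive in depth-first order: it keeps a running total for every directory seen so far and adds each file's size to all of its ancestors. This is not how duc's scanner works. `directory_totals()` is a hypothetical helper, and the input is assumed to be (path, size) pairs for files only.

```python
import posixpath
from collections import defaultdict


def directory_totals(records):
    """Sum file sizes into every ancestor directory, for any record order.

    `records` is an iterable of (path, size) pairs for files only.
    The whole directory table stays in memory until the input ends,
    which is the cost of not getting the records depth first.
    """
    totals = defaultdict(int)
    for path, size in records:
        parent = posixpath.dirname(posixpath.normpath(path)) or "."
        while True:
            totals[parent] += size
            up = posixpath.dirname(parent) or "."
            # Stop at the top of a relative tree (".") or at a root
            # whose parent is itself ("/").
            if parent == "." or up == parent:
                break
            parent = up
    return dict(totals)
```

With the two files from the sample output, `directory_totals([("share/man/man1/duc.1", 21756), ("bin/duc", 431849)])` attributes 21756 bytes to `share`, `share/man` and `share/man/man1`, 431849 to `bin`, and the sum of both to `.`. With depth-first input, a directory's total is complete as soon as its last entry has been read, so much less state has to be held at once.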