zevv / duc

Dude, where are my bytes: Duc, a library and suite of tools for inspecting disk usage
GNU Lesser General Public License v3.0

Documentation for DB format? #329

Open spikebike opened 1 month ago

spikebike commented 1 month ago

I wanted to add a scanner backed by the Cray scanner, which consumes Lustre changelogs. That way, instead of duc scanning, I could use the Cray API, which is kept up to date by consuming changelogs.

I made a test dir:

    /home/MyUser/tmp:
    total 12
    drwxrwxr-x 2 MyUser MyUser 4096 Aug 16 13:48 DirA
    drwxrwxr-x 2 MyUser MyUser 4096 Aug 16 13:48 DirB
    drwxrwxr-x 2 MyUser MyUser 4096 Aug 16 13:49 DirC

    /home/MyUser/tmp/DirA:
    total 4
    -rw-rw-r-- 1 MyUser MyUser 2 Aug 16 13:48 1

    /home/MyUser/tmp/DirB:
    total 8
    -rw-rw-r-- 1 MyUser MyUser 3 Aug 16 13:48 2
    -rw-rw-r-- 1 MyUser MyUser 4 Aug 16 13:48 3

    /home/MyUser/tmp/DirC:
    total 12
    -rw-rw-r-- 1 MyUser MyUser 5 Aug 16 13:49 4
    -rw-rw-r-- 1 MyUser MyUser 6 Aug 16 13:49 5
    -rw-rw-r-- 1 MyUser MyUser 6 Aug 16 13:49 6

I made the database with the SQLite backend; I figured it would be the easiest way to inspect the result. Here's the schema and dump:

    $ cat schema
    CREATE TABLE blobs(key unique primary key, value);
    CREATE INDEX keys on blobs(key);

    $ cat dump
    PRAGMA foreign_keys=OFF;
    BEGIN TRANSACTION;
    CREATE TABLE blobs(key unique primary key, value);
    INSERT INTO blobs VALUES('fc01/19c42d8',X'f9f311fb019c42d7fb66bfad15013102f907100105');
    INSERT INTO blobs VALUES('fc01/19c42d9',X'f9f311fb019c42d7fb66bfad28013304f907100105013203f907100105');
    INSERT INTO blobs VALUES('fc01/19c42da',X'f9f311fb019c42d7fb66bfad3b013506f907100105013405f907100105013606f907100105');
    INSERT INTO blobs VALUES('fc01/19c42d7',X'0000fb66bfacce044469724102f917100202f9f311fb019c42d8044469724207f927100302f9f311fb019c42d9044469724311f937100402f9f311fb019c42da');
    INSERT INTO blobs VALUES('duc_index_reports',X'2f686f6d652f6262726f61646c65792f746d7 [MANY ZEROS DELETED]');
    INSERT INTO blobs VALUES('/home/MyUser/tmp',X'132f686f6d652f6262726f61646c65792f746d70f9f311fb019c42d7fb66bfad7efa0e8e37fb66bfad7efa0e8f0b06041af997100a');
    CREATE INDEX keys on blobs(key);
    COMMIT;

What's the value for "/home/MyUser/tmp"?

How does DirA/1 map to 'fc01/19c42d8' or similar?

Is the value for each dir some encoding of filename + size? So 3 files = file1+size,file2+size,file3+size or something?

What's the value for duc_index_reports?

l8gravely commented 1 month ago

"Bill" == Bill Broadley @.***> writes:

> I wanted to add a scanner backed by the Cray scanner, which consumes Lustre changelogs. That way, instead of duc scanning, I could use the Cray API, which is kept up to date by consuming changelogs.

Awesome! Do you have any pointers to the documentation of this tool? Some questions I have:

  1. Can you import the full current filesystem info from the API?

  2. Or do you have to scan the filesystem using statx() and then attach to the API to get changes after a certain time?

At its heart, duc is pretty simple: it just reads the data and stuffs it into a DB using the directory path as the key, then putting more detail below that level. Look at src/libduc/index.c for a quick overview.
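
For a rough mental model only (a sketch inferred from the dump above, not authoritative documentation of the on-disk format; the real values are a compact binary encoding):

    // Conceptual sketch of one child entry inside a directory record.
    // In the DirA blob above, 0x31 is the filename "1" and 0x02 its size.
    type entry struct {
        name string // e.g. "DirA" or "1"
        size uint64 // apparent size; a subdir also points at its own record
    }

    // A single key-value table holds a few kinds of keys:
    //   "fc01/19c42d8"      -> serialized entry list for one directory
    //                          (the key looks like a device/inode pair in hex)
    //   "/home/MyUser/tmp"  -> record for an indexed top-level path
    //   "duc_index_reports" -> bookkeeping about the indexed paths
    var blobs map[string][]byte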

I'd also recommend looking at my 'tkrzw' branch, where I've been putting a bunch of new code that will hopefully become version 1.5 in the next few weeks.

Both of those newer features are just barebones and need more work, especially the reporting in the GUI, UI and CGI formats.

So this is a good time to add new features if possible.

John


spikebike commented 1 month ago

Here's an overview https://people.cs.vt.edu/~butta/docs/CCGrid2020-BRINDEXER.pdf

Here's some docs: https://support.hpe.com/hpesc/public/docDisplay?docId=sd00002255en_us&docLocale=en_US&page=GUID-4CDBFE00-A7D5-4A28-8CF6-DEC34E95DBA4.html

It seems like a pretty powerful and flexible tool. I'm using the brindexer/bin/query command, which I believe wraps an API call. I believe Cray's brindexer is usually configured to consume the Lustre change logs, so the DB should be near realtime.

Cray's brindexer is much faster than Robinhood when configured similarly (consuming changelogs), at least for our setup. But while brindexer is pretty fast, it's still pretty annoying for interactive use on large directories. Hence duc; not to mention our users are used to a dls command (a thin wrapper around duc ls).

So my plan was to periodically run this command instead of the normal duc index:

    time /opt/cray/brindexer/bin/query --json -q "select size,path from entries_0 join path on pmd5=pathmd5 WHERE type='f'"

Then pipe it through some code I wrote that sums the files and tracks the totals per dir. I was going to try that before trying to track per-file totals, which might run the scanner out of RAM. If RAM is not an issue, it might be possible to just add a JSON import into duc. Here's an example of the JSON to import:

   {"path":"duc/projects","size":549364}
   {"path":"duc/db/scratch","size":1189888}
   {"path":"duc/scratch","size":2083884716}
spikebike commented 1 month ago

Found a simpler example:

    $ /opt/cray/brindexer/bin/query --json -C path,name,size /kfs2/projects/MyProj/duc | head -10
    {"name":"duc.db","path":"duc/.duc-index","size":275712}
    {"name":"go1.17.2.linux-amd64.tar.gz","path":"duc/archive","size":0}
    {"name":"projects","path":"duc/db","size":20480}
    {"name":"scratch","path":"duc/db","size":4096}

To save RAM I'd been collapsing dirs to just show dir totals and not file totals, so the above would have 24576 (20480 + 4096) for "duc/db"; not sure that's needed. Our larger directories are north of 50M files.

There's also a fullname column, but we are trying to keep a database per /project directory so people can't see info about other groups' directories. So we put a duc.db in each user's dir with the same permissions as that user's directory, plus a wrapper to automatically pick the right database.

l8gravely commented 1 month ago

"Bill" == Bill Broadley @.***> writes:

> Found a simpler example: $ /opt/cray/brindexer/bin/query --json -C path,name,size /kfs2/projects/MyProj/duc | head -10 [...]

> To save RAM I'd been collapsing dirs to just show dir totals and not file totals, so the above would have 24576 (20480 + 4096) for "duc/db"; not sure that's needed. Our larger directories are north of 50M files.

So maybe we need to build an 'import' command, but... how to do just incremental updates is the question, because with a chain of directories:

a -> b -> c -> d -> e -> f -> ... -> t -> u -> v

if we make a change to 't', we need to percolate that change back up the tree properly. Does your 'query' command return both file and directory changes?
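
In map-of-totals terms, that percolation is just a walk up the path. A hedged Go sketch (hypothetical helper, not how duc itself stores things):

    package main

    import (
        "fmt"
        "strings"
    )

    // percolate applies a size delta to the changed directory and every
    // ancestor up to the root, assuming slash-separated paths.
    func percolate(sizes map[string]int64, dir string, delta int64) {
        for {
            sizes[dir] += delta
            i := strings.LastIndex(dir, "/")
            if i < 0 {
                return
            }
            dir = dir[:i]
        }
    }

    func main() {
        sizes := map[string]int64{}
        percolate(sizes, "a/b/c", 42)
        fmt.Println(sizes) // map[a:42 a/b:42 a/b/c:42]
    }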

I haven't really done any thinking about incremental updates, but you're more than welcome to take a look at src/libduc/index.c and see how stuff is done.

It might be that you need to sort the JSON report depth-first, so changes can get applied down below first, and then the tree walked back up with updates as needed.
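
A sketch of that ordering (assuming plain slash-separated paths; deepest entries first guarantees every child is handled before its parent):

    package main

    import (
        "fmt"
        "sort"
        "strings"
    )

    // sortDepthFirst orders paths deepest-first so per-directory updates
    // can be applied before their parents are re-totaled.
    func sortDepthFirst(paths []string) {
        sort.Slice(paths, func(i, j int) bool {
            di, dj := strings.Count(paths[i], "/"), strings.Count(paths[j], "/")
            if di != dj {
                return di > dj // deeper paths first
            }
            return paths[i] < paths[j]
        })
    }

    func main() {
        paths := []string{"a", "a/b", "a/b/c", "a/x"}
        sortDepthFirst(paths)
        fmt.Println(paths) // [a/b/c a/b a/x a]
    }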

So how long does 'duc' take to do a full index? On average, since I know you have multiple projects. Unless you can contribute some code and commit to running tests, I don't think I'll be able to do much in the near term. I'm planning on releasing v1.5.0a soon for people to test, but I've still got rough edges to polish.

> There's also a fullname column, but we are trying to keep a database per /project directory so people can't see info about other groups' directories. So we put a duc.db in each user's dir with the same permissions as that user's directory, plus a wrapper to automatically pick the right database.

Sure, that's sorta kinda what I do with the CGI: I have an indexer which generates reports on a per-filesystem basis, and the main index.html (see the contributed scripts) builds an index based off the DBs it finds in a directory.

spikebike commented 1 month ago

Sadly brindexer does not keep any per-directory totals; it just returns the actual metadata size for the dir, just like ls -ald /foo/dir. So you basically have to walk the tree, add things up (for the current dir and all parents), and do the work yourself. Much like rbh-du (the Robinhood equivalent).

I do have some simple code that just takes the output of brindexer and keeps a running total for each dir in RAM (including updating parent dirs), hence the concerns about running out of RAM. But in my last scan of our filesystem the biggest dir had over 50M files and the JSON was 7.5GB. I don't see any easy way to do updates, and wasn't planning on doing so, at least in the first version.

So I'm just looking for a way to start a new duc.db from scratch with a JSON import from brindexer.

> Does your 'query' command return both file and directory changes?

I'm pretty sure I could query based on last-modified time; not sure it's worth it. Currently, running brindexer on our largest dir (50M files) takes about 20 minutes, which I'd be willing to do daily or so, or by request. That would let users who get complaints about using too much disk space quickly tell what their dir sizes are. Your pending top-N files feature sounds useful, and top-N dirs would be useful as well if that is planned.

> So how long does 'duc' take to do a full index?

Worst case for brindexer is 20 minutes on our largest dir; duc index seems around 5 times slower, but duc has to do real I/O, not just a DB lookup. Here's an example:

    $ time ./duc index /projects/MyProj
    real 5m17.199s
    user 0m1.150s
    sys 1m1.683s

    $ time /opt/cray/brindexer/bin/query --json -C path,name,size,type /kfs2/projects/MyProj
    real 1m3.355s
    user 3m11.541s
    sys 0m29.087s

The way I used duc before was just to do a daily index and replace the old index when the new index finished. We had some DB corruptions and index crashes; it didn't seem worth trying to update the databases, so I just started over with each index.

> Unless you can contribute some code and commit to running tests, I don't think I'll be able to do much in the near term.

I'd consider it. Would you be likely to accept a pull request that allowed a JSON import -> new DB? Would you prefer a new binary or a flag to duc index?

The other possibility would be that I just contribute code that ends up in your project's contrib/ directory.

I'll review the source files you mentioned. I haven't really decided if I should:

1) write some C to ingest JSON that's preprocessed by my Go code to get per-directory totals, or
2) write some Go (easier/safer multithreading) to port index.c and write the DB files myself. I noticed tkrzw has Go bindings.

l8gravely commented 1 month ago

"Bill" == Bill Broadley @.***> writes:

I'm going to be travelling the rest of this week, so I won't have a ton of time to work on this, but I'll be thinking about it for sure.

Hmm... I wonder if I could set up a small test Lustre system at home; even if it's slow, I could play with it. It's an idea at least. And I see that HPE offers RPMs of the brindexer stuff. Hmm...

> Sadly brindexer does not keep any per-directory totals; it just returns the actual metadata size for the dir, just like ls -ald /foo/dir. So you basically have to walk the tree, add things up (for the current dir and all parents), and do the work yourself. Much like rbh-du (the Robinhood equivalent).

How does brindexer work with child directories? Does it get the results for the selected directory and all child directories?

> I do have some simple code that just takes the output of brindexer and keeps a running total for each dir in RAM (including updating parent dirs), hence the concerns about running out of RAM. But in my last scan of our filesystem the biggest dir had over 50M files and the JSON was 7.5GB. I don't see any easy way to do updates, and wasn't planning on doing so, at least in the first version.

So I'd probably just ingest and throw away the JSON input as fast as possible. Or is there an API for brindexer you can call directly? I don't have any access to Cray stuff (it's been 26+ years since I last touched Cray hardware), but this is all Lustre FS stuff...

Is this brindexer part of a commercial Lustre product?

Can I get it as a Debian package, since that's what I mostly run at home?

> So I'm just looking for a way to start a new duc.db from scratch with a JSON import from brindexer.

> > Does your 'query' command return both file and directory changes?

> I'm pretty sure I could query based on last-modified time; not sure it's worth it.

You misunderstood my question. I was asking if brindexer will return just changes to the size or number of files in a directory, or also the fact that N files changed in the directory?

> Currently, running brindexer on our largest dir (50M files) takes about 20 minutes, which I'd be willing to do daily or so, or by request. That would let users who get complaints about using too much disk space quickly tell what their dir sizes are. Your pending top-N files feature sounds useful, and top-N dirs would be useful as well if that is planned.

> > So how long does 'duc' take to do a full index?

> Worst case for brindexer is 20 minutes on our largest dir; duc index seems around 5 times slower, but duc has to do real I/O, not just a DB lookup. Here's an example:

> $ time ./duc index /projects/MyProj
> real 5m17.199s
> user 0m1.150s
> sys 1m1.683s
>
> $ time /opt/cray/brindexer/bin/query --json -C path,name,size,type /kfs2/projects/MyProj
> real 1m3.355s
> user 3m11.541s
> sys 0m29.087s

So this /kfs2/projects/MyProj has 50 million files, but they're spread across a bunch of sub-directories, right, in a tree structure? I've not really had any exposure to Lustre.

> The way I used duc before was just to do a daily index and replace the old index when the new index finished. We had some DB corruptions and index crashes; it didn't seem worth trying to update the databases, so I just started over with each index.

> Unless you can contribute some code and commit to running tests, I don't think I'll be able to do much in the near term.

> I'd consider it. Would you be likely to accept a pull request that allowed a JSON import -> new DB? Would you prefer a new binary or a flag to duc index?

I don't know yet; let's see what you get. I think at first it might make sense to just create a new indexer (cmd-brindex.c maybe?) which sucks in the JSON-formatted data (or talks to the API directly to save time/space) and generates the duc DB from that info.

> The other possibility would be that I just contribute code that ends up in your project's contrib/ directory.

Sure! But honestly, I'd be happy to expand duc's reach.

> I'll review the source files you mentioned. I haven't really decided if I should:

> 1) write some C to ingest JSON that's preprocessed by my Go code to get per-directory totals, or

Ugh, no! What is your Go code talking to? A Lustre API? Does it offer a C interface? Or is it an HTTP-type API? Hmm... while I don't know much about Go, it does look like it can call C libraries. So maybe calling into the libduc/ stuff from your Go code would work? I just hate the idea of going from Go -> JSON -> C when it doesn't make sense.

> 2) write some Go (easier/safer multithreading) to port index.c and write the DB files myself. I noticed tkrzw has Go bindings.

That might be a better thing to do, but then your code would definitely live in contrib/ because we couldn't be expected to keep things in sync. You would have to write to a specific duc DB format version and make sure you check it.

It certainly sounds like a fun project!

Please feel free to post some sample code, and if you have some instructions on setting up a simple lustre test case, that would be ideal.

John

spikebike commented 1 month ago

> Hmm... I wonder if I could set up a small test Lustre system at home; even if it's slow, I could play with it. It's an idea at least. And I see that HPE offers RPMs of the brindexer stuff. Hmm...

Lustre is open source. It wouldn't be my first choice for a small test parallel filesystem, but it's got a common design for parallel filesystems: a metadata server that tracks all file metadata, including which OSTs have which ranges of blocks. The Lustre driver is in the kernel (which is kinda painful, but performant), so you talk to the metadata server (MDS) and it tells you which block ranges are on which OST. Each OST has a native filesystem (ext4 and ZFS are common) to hold the blobs, but they don't look like normal files; it's not like ~joeUser is on OST1 and ~bobUser is on OST2 or anything. Striping of directories and/or files can be across one or more OSTs and can be configured at runtime.

I dug around some, and it seems like brindexer is part of ClusterStor. I did find a 5-year-old copy on GitHub, but no license file or signs of life. It seems like it's part of ClusterStor's policy engine; it's not clear if it's open source.

> How does brindexer work with child directories? Does it get the results for the selected directory and all child directories?

Yes, much like find. You can make queries, so for a given dir tree (and subdirs) you could list all files, all dirs, all files over 1GB, or run any SQL query based on metadata, including timestamps. You can use this with a policy engine to say things like "all big files not touched in a month go to cheap storage".

Some examples/overview: https://wiki.lustre.org/images/8/8b/LUG2024-Scalable_Auto_Tiering-Jabas.pdf

> So I'd probably just ingest and throw away the JSON input as fast as possible.

Sensible, or just have duc json-index accept a pipe.

If you are interested in brindexer, I found a comparison between it and GUFI that has a fair bit of detail: https://dl.acm.org/doi/pdf/10.5555/3571885.3571960

Presumably if we can get brindexer -> json -> duc working it should be pretty simple to do the same for GUFI.

> Is this brindexer part of a commercial Lustre product?

I'm 90% sure it is; I can't find any hint of a source repo, except for things like this that I don't think work:

    import (
        "cray.com/brindexer/fsentity"
        "cray.com/brindexer/indexing"
        "cray.com/brindexer/scan"
    )

And stale and possibly not licensed: https://github.com/arnabkpaul/cray_brindexer/

GUFI sounds pretty similar and open source: https://github.com/mar-file-system/GUFI

The quickstart to build the code, build an index, and query the index looks very simple and easy.

What I don't know is if GUFI can ingest Lustre changelogs. I'd rather not walk 100PB just to find the new files.

> You misunderstood my question. I was asking if brindexer will return just changes to the size or number of files in a directory, or also the fact that N files changed in the directory?

I believe you can write any SQL query on the metadata, so something like "all files since midnight" should work. It would be somewhat painful since a dir walk might take 20 minutes, but if you track the timestamp of the newest file in each dir, then using that date for incremental updates (select files newer than TIMESTAMP) should work.
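
A Go sketch of that bookkeeping (the map, the helper names, and the "mtime" column are all assumptions for illustration, not brindexer's documented schema):

    package main

    import (
        "fmt"
        "time"
    )

    // newest tracks the newest mtime seen per directory during a scan,
    // to be used as the cutoff for the next incremental query.
    var newest = map[string]time.Time{}

    func observe(dir string, mtime time.Time) {
        if mtime.After(newest[dir]) {
            newest[dir] = mtime
        }
    }

    func incrementalQuery(dir string) string {
        // "mtime" as a queryable column is an assumption
        return fmt.Sprintf("select size,path from entries_0 ... WHERE mtime > %d", newest[dir].Unix())
    }

    func main() {
        observe("duc/db", time.Now())
        fmt.Println(incrementalQuery("duc/db"))
    }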

> So this /kfs2/projects/MyProj has 50 million files, but they're spread across a bunch of sub-directories, right, in a tree structure? I've not really had any exposure to Lustre.

Heh, not sure; I launched a DB query, hopefully it takes less than 20 minutes. From the perspective of duc, brindexer, and similar tools it's just a filesystem; nothing Lustre-specific is required. Various knobs impact performance (like caching, different pool performance characteristics, and striping across OSTs), but all the usual commands like find, du, ls, etc. work the same. The pain point is that it's easy to have a dir with 50M files under it, which can take a long time to run find, ls, or du on; thus the need for duc.

Ah:

    $ /opt/cray/brindexer/bin/query --json -q "select size,path from entries_0 join path on pmd5=pathmd5 WHERE type='d'" /projects/MyProj | wc -l
    10857063

Ouch, 10M dirs for 50M files.

> I don't know yet; let's see what you get. I think at first it might make sense to just create a new indexer (cmd-brindex.c maybe?) which sucks in the JSON-formatted data (or talks to the API directly to save time/space) and generates the duc DB from that info.

Sounds good, I'll give it a shot.

> Ugh, no! What is your Go code talking to? A Lustre API? Does it offer a C interface? Or is it an HTTP-type API? Hmm... while I don't know much about Go, it does look like it can call C libraries. So maybe calling into the libduc/ stuff from your Go code would work? I just hate the idea of going from Go -> JSON -> C when it doesn't make sense.

My Go code is mostly:

            // for each record, add its size to its own dir and every
            // ancestor dir ("a", "a/b", "a/b/c", ...)
            components := strings.Split(record.Path, "/")
            for i := 1; i <= len(components); i++ {
                prefix := strings.Join(components[:i], "/")
                directorySizes[prefix] += record.Size
            }

Basically, for each parent dir, add the record size to it. Nice and concise, but my favorite part of Go is the thread-safe multiple-producer -> multiple-consumer channels that are part of the language. They make it very easy to throw X CPUs at Y bits of work and have it "just work".
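
For instance, a tiny self-contained sketch of that pattern (illustrative only, not duc or brindexer code): a fixed pool of workers draining one channel and feeding another:

    package main

    import (
        "fmt"
        "runtime"
        "sync"
    )

    func main() {
        work := make(chan string, 1024)
        results := make(chan string, 1024)

        // one worker per CPU, all consuming the same work channel
        var wg sync.WaitGroup
        for i := 0; i < runtime.NumCPU(); i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for p := range work {
                    results <- fmt.Sprintf("processed %s", p) // stand-in for real work
                }
            }()
        }
        go func() { wg.Wait(); close(results) }()

        // producer: feed paths in, then close the channel
        go func() {
            for _, p := range []string{"duc/projects", "duc/db/scratch", "duc/scratch"} {
                work <- p
            }
            close(work)
        }()

        for r := range results {
            fmt.Println(r)
        }
    }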

> Please feel free to post some sample code, and if you have some instructions on setting up a simple lustre test case, that would be ideal.

It's a fair bit of work and involves recompiling the kernel; here's a good overview: https://wiki.lustre.org/Installing_the_Lustre_Software

Generally I'd consider CephFS to be easier to set up and manage, but either will be a good intro to parallel filesystems. I'd plan on at least 3 nodes (1 metadata server and 2 OSTs), but they could be virtual. Ceph is included in Ubuntu, and probably in Debian as well.