zevv / duc

Dude, where are my bytes: Duc, a library and suite of tools for inspecting disk usage
GNU Lesser General Public License v3.0

Exceeding max DB size? #259

Open spikebike opened 4 years ago

spikebike commented 4 years ago

I compiled duc-1.4.4, I ran:

duc index --debug -p -d /projects/foo/.duc-index/duc.db /projects/foo

It ran much longer than I expected, and if I strace it I get a seemingly infinite and unchanging stream of:

pread64(3, "\260\27\253\5$\320\0051CY\342\371\270\352\2R\201\343\6\201=\r\262Z\35M`\0006\4W\17"..., 256, 1655162880) = 256

Maybe I've hit the max DB size:

$ ls -alh /projects/foo/.duc-index/duc.db
-rwxrwx--- 1 130348 foo 1.8G Aug 20 20:08 /projects/foo/.duc-index/duc.db

I ran strace for a few seconds and put the log in /tmp/foo.log

$ cat /tmp/foo.log | grep pread64 | wc -l
275003
$ cat /tmp/foo.log | grep pread64 | sort | uniq
pread64(3, "\260\27\253\5$\320\0051CY\342\371\270\352\2R\201\343\6\201=\r\262Z\35M`\0006\4W\17"..., 256, 1655162880) = 256
pread64(3, strace: Process 9187 detached
$

File handle 3 above is pointing at the duc.db file:

$ ls -al /proc/9187/fd/3
lrwx------ 1 root root 64 Aug 21 09:37 /proc/9187/fd/3 -> /projects/foo/.duc-index/duc.db

The filesystem has 4PB free, and I don't believe it's been full since I started the index.

l8gravely commented 4 years ago

"Bill" == Bill Broadley notifications@github.com writes:

Bill> I compiled duc-1.4.4, I ran:

Bill> duc index --debug -p -d /projects/foo/.duc-index/duc.db /projects/foo

Bill> It ran much longer than I expected, and if I strace I get an Bill> seemingly infinite and unchanging:

Bill> pread64(3, "\260\27\253\5$\320\0051CY\342\371\270\352\2R\201\343\6\201=\r\262Z\35M`\0006\4W\17"..., 256, 1655162880) = 256

That doesn't help me unfortunately, I'd really need to see a backtrace. But I wonder if the depth of your directories is hitting the PATH_MAX by any chance? Do you want to try a silly hack and see if that makes any difference?

In src/libduc/duc.h is

#define PATH_MAX 16384

and I'd like you to try doubling it if possible and see if that makes it through your index.
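For reference, the suggested experiment is literally a one-line change in that header (a sketch; nothing else should need to move):

#define PATH_MAX 32768    /* doubled from 16384 just for this test */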

Bill> Maybe I've hit the max DB size:

Bill> $ ls -alh /projects/foo/.duc-index/duc.db
Bill> -rwxrwx--- 1 130348 foo 1.8G Aug 20 20:08 /projects/foo/.duc-index/duc.db

Nah, I've run into indexes of that size myself, where I have 14Tb and 100 million files because my users are idiots.

Bill> I ran strace for a few seconds and put the log in /tmp/foo.log

Bill> $ cat /tmp/foo.log | grep pread64 | wc -l
Bill> 275003
Bill> $ cat /tmp/foo.log | grep pread64 | sort | uniq
Bill> pread64(3, "\260\27\253\5$\320\0051CY\342\371\270\352\2R\201\343\6\201=\r\262Z\35M`\0006\4W\17"..., 256, 1655162880) = 256
Bill> pread64(3, strace: Process 9187 detached
Bill> $

Bill> File handle 3 above is pointing at the duc.db file:

Bill> $ ls -al /proc/9187/fd/3
Bill> lrwx------ 1 root root 64 Aug 21 09:37 /proc/9187/fd/3 -> /lustre/eaglefs/projects/foo/.duc-index/duc.db

Bill> The filesystem has 4PB free, and I don't believe it's been full since I started the index.

How many files does your filesystem have? If you have lots of projects under /lustre/eaglefs/projects/, maybe it's time to just index each project separately? I know it's not ideal, but maybe it will get you going.

In any case, maybe you can do:

$ gdb duc
gdb> run index -d /projects/foo/.duc-index/duc.db /projects/foo

I would also suggest not putting the duc DB in the same directory as the one you are indexing, though it shouldn't really matter.

Anyway, let the above run until it crashes, and then do a 'bt' to get a back trace.

Or if you have a core file from a previous crash, do:

$ gdb duc core
gdb> bt

and send that to me directly if you want to keep details out of github. I won't share. I can't promise I'll have a good answer right away though...

Thanks for all your testing! John

spikebike commented 3 years ago

I found another wedged duc. I don't think it's PATH_MAX, unless the limit is 100 characters or so.

I checked on the duc process and none of the file handles are changing, even when I look 10 minutes later. duc.db is suspiciously near 2GB again:

-rwxrwx--- 1 130348 baz 2019534848 Aug 30 11:21 /projects/baz/.duc-index/duc.db

Is there any hard limit to the number of sub directories duc can handle in a single directory?

Not seeing any I/O related timeouts, just file handles like the below that don't change:

lrwx------ 1 root root 64 Sep 21 07:38 0 -> /dev/pts/3
lrwx------ 1 root root 64 Sep 21 07:38 1 -> /dev/pts/3
lr-x------ 1 root root 64 Sep 21 07:38 10 -> /projects/foo/results/simulation_output/run112/depth119/run
l-wx------ 1 root root 64 Sep 16 14:10 2 -> /projects/baz/.duc-index/duc.log
lrwx------ 1 root root 64 Sep 21 07:38 3 -> /projects/baz/.duc-index/duc.db
lr-x------ 1 root root 64 Sep 21 07:38 4 -> /projects/baz
lr-x------ 1 root root 64 Sep 21 07:38 5 -> /projects/foo
lr-x------ 1 root root 64 Sep 21 07:38 6 -> /projects/foo/results
lr-x------ 1 root root 64 Sep 21 07:38 7 -> /projects/foo/results/simulation_output
lr-x------ 1 root root 64 Sep 21 07:38 8 -> /projects/foo/results/simulation_output/run112
lr-x------ 1 root root 64 Sep 21 07:38 9 -> /projects/foo/results/simulation_output/run112/depth119

strace -p on the duc process shows a near-infinite number of the same pread64 call as before: the same record read from duc.db over and over.

I forced it to core with gcore, and used gdb for the backtrace:

Thread 1 (Thread 0x7fffedadd900 (LWP 10850)):
#0  0x00007fffebbec033 in __pread_nocancel () at /lib64/libc.so.6
#1  0x00007fffec99e3c4 in tchdbseekreadtry () at /lib64/libtokyocabinet.so.9
#2  0x00007fffec99e492 in tchdbreadrec () at /lib64/libtokyocabinet.so.9
#3  0x00007fffec99f31e in tchdbputimpl () at /lib64/libtokyocabinet.so.9
#4  0x00007fffec9a5871 in tchdbput () at /lib64/libtokyocabinet.so.9
#5  0x00007fffec9a9621 in tcbdbleafsave () at /lib64/libtokyocabinet.so.9
#6  0x00007fffec9a9c55 in tcbdbleafcacheout () at /lib64/libtokyocabinet.so.9
#7  0x00007fffec9ad441 in tcbdbcacheadjust () at /lib64/libtokyocabinet.so.9
#8  0x00007fffec9ada72 in tcbdbputimpl () at /lib64/libtokyocabinet.so.9
#9  0x00007fffec9ae98f in tcbdbput () at /lib64/libtokyocabinet.so.9
#10 0x000000000040450c in db_put (db=<optimized out>, key=key@entry=0x7fffffff9000, key_len=<optimized out>, val=<optimized out>, val_len=<optimized out>) at src/libduc/db-tokyo.c:106
        r = <optimized out>
#11 0x0000000000406086 in scanner_free (scanner=scanner@entry=0x1958bf0) at src/libduc/index.c:549
        key = "83d08f32/20012fe3e015fa9\000\213\225\001\000\000\000"
        devino = 0x1958c50
        keyl = <optimized out>
        r = <optimized out>
        duc = 0x62e430
        req = 0x62e690
        report = 0x63ff00
#12 0x0000000000407ad1 in scanner_scan (scanner_dir=scanner_dir@entry=0xea02a0) at src/libduc/index.c:479
        scanner_ent = 0x1958bf0
        name = <optimized out>
        st_ent = {
          st_dev = 2211483442,
          st_ino = 144136071247126441,
          st_nlink = 2,
          st_mode = 16877,
          st_uid = 131078,
          st_gid = 130988,
          __pad0 = 0,
          st_rdev = 0,
          st_size = 4096,
          st_blksize = 4096,
          st_blocks = 8,
          st_atim = {
            tv_sec = 1597972316,
            tv_nsec = 0
          },
          st_mtim = {
            tv_sec = 1565899532,
            tv_nsec = 0
          },
          st_ctim = {
            tv_sec = 1565899532,
            tv_nsec = 0
          },
          __unused = {0, 0, 0}
        }
        ent = {
          name = 0x1871ca3 "run",
          type = DUC_FILE_TYPE_DIR,
          size = {
            actual = 4096,
            apparent = 4096,
            count = 1
          },
          devino = {
            dev = 2211483442,
            ino = 144136071247126441
          }
        }
        duc = 0x62e430
        req = 0x62e690
        report = 0x63ff00
        r = <optimized out>
        e = <optimized out>
#13 0x0000000000407ac9 in scanner_scan (scanner_dir=scanner_dir@entry=0xfcc280) at src/libduc/index.c:478
        scanner_ent = 0xea02a0
        name = <optimized out>
        st_ent = {
          st_dev = 2211483442,
          st_ino = 144136071247126438,
          st_nlink = 3,
          st_mode = 16888,
          st_uid = 131078,
          st_gid = 130988,
          __pad0 = 0,
          st_rdev = 0,
          st_size = 4096,
          st_blksize = 4096,
          st_blocks = 8,
          st_atim = {
            tv_sec = 1597972316,
            tv_nsec = 0
          },
          st_mtim = {
            tv_sec = 1565899532,
            tv_nsec = 0
          },
          st_ctim = {
            tv_sec = 1565899532,
            tv_nsec = 0
          },
          __unused = {0, 0, 0}
        }
        ent = {
          name = 0x5be5933 "depth119",
          type = DUC_FILE_TYPE_DIR,
          size = {
            actual = 4096,
            apparent = 4096,
            count = 1
          },
          devino = {
            dev = 2211483442,
            ino = 144136071247126438
          }
        }
        duc = 0x62e430
        req = 0x62e690
        report = 0x63ff00
        r = <optimized out>
        e = <optimized out>
#14 0x0000000000407ac9 in scanner_scan (scanner_dir=scanner_dir@entry=0x1e9d8e0) at src/libduc/index.c:478
        scanner_ent = 0xfcc280
        name = <optimized out>
        st_ent = {
          st_dev = 2211483442,
          st_ino = 144135898408178300,
          st_nlink = 1,
          st_mode = 17912,
          st_uid = 131078,
          st_gid = 130988,
          __pad0 = 0,
          st_rdev = 0,
          st_size = 20205568,
          st_blksize = 4096,
          st_blocks = 39512,
          st_atim = {
            tv_sec = 1597971033,
            tv_nsec = 0
          },
          st_mtim = {
            tv_sec = 1565969670,
            tv_nsec = 0
          },
          st_ctim = {
            tv_sec = 1565969670,
            tv_nsec = 0
          },
          __unused = {0, 0, 0}
        }
        ent = {
          name = 0x1961c5b "run112",
          type = DUC_FILE_TYPE_DIR,
          size = {
            actual = 20230144,
            apparent = 20205568,
            count = 1
          },
          devino = {
            dev = 2211483442,
            ino = 144135898408178300
          }
        }
        duc = 0x62e430
        req = 0x62e690
        report = 0x63ff00
        r = <optimized out>
        e = <optimized out>
#15 0x0000000000407ac9 in scanner_scan (scanner_dir=scanner_dir@entry=0xf80050) at src/libduc/index.c:478
        scanner_ent = 0x1e9d8e0
        name = <optimized out>
        st_ent = {
          st_dev = 2211483442,
          st_ino = 144135898173277655,
          st_nlink = 33,
          st_mode = 17912,
          st_uid = 131078,
          st_gid = 130988,
          __pad0 = 0,
          st_rdev = 0,
          st_size = 4096,
          st_blksize = 4096,
          st_blocks = 8,
          st_atim = {
            tv_sec = 1597912516,
            tv_nsec = 0
          },
          st_mtim = {
            tv_sec = 1565870580,
            tv_nsec = 0
          },
          st_ctim = {
            tv_sec = 1565870580,
            tv_nsec = 0
          },
          __unused = {0, 0, 0}
        }
        ent = {
          name = 0x116895b "simulation_output",
          type = DUC_FILE_TYPE_DIR,
          size = {
            actual = 4096,
            apparent = 4096,
            count = 1
          },
          devino = {
            dev = 2211483442,
            ino = 144135898173277655
          }
        }
        duc = 0x62e430
        req = 0x62e690
        report = 0x63ff00
        r = <optimized out>
        e = <optimized out>
#16 0x0000000000407ac9 in scanner_scan (scanner_dir=scanner_dir@entry=0x64c430) at src/libduc/index.c:478
        scanner_ent = 0xf80050
        name = <optimized out>
        st_ent = {
          st_dev = 2211483442,
          st_ino = 144135898173277654,
          st_nlink = 5,
          st_mode = 17912,
          st_uid = 131078,
          st_gid = 130988,
          __pad0 = 0,
          st_rdev = 0,
          st_size = 4096,
          st_blksize = 4096,
          st_blocks = 8,
          st_atim = {
            tv_sec = 1597912516,
            tv_nsec = 0
          },
          st_mtim = {
            tv_sec = 1565969783,
            tv_nsec = 0
          },
          st_ctim = {
            tv_sec = 1565969783,
            tv_nsec = 0
          },
          __unused = {0, 0, 0}
        }
        ent = {
          name = 0x64cecb "results",
          type = DUC_FILE_TYPE_DIR,
          size = {
            actual = 4096,
            apparent = 4096,
            count = 1
          },
          devino = {
            dev = 2211483442,
            ino = 144135898173277654
          }
        }
        duc = 0x62e430
        req = 0x62e690
        report = 0x63ff00
        r = <optimized out>
        e = <optimized out>
#17 0x0000000000407ac9 in scanner_scan (scanner_dir=scanner_dir@entry=0x643f60) at src/libduc/index.c:478
        scanner_ent = 0x64c430
        name = <optimized out>
        st_ent = {
          st_dev = 2211483442,
          st_ino = 144135941123016226,
          st_nlink = 5,
          st_mode = 17912,
          st_uid = 131078,
          st_gid = 130988,
          __pad0 = 0,
          st_rdev = 0,
          st_size = 36864,
          st_blksize = 4096,
          st_blocks = 72,
          st_atim = {
            tv_sec = 1597912516,
            tv_nsec = 0
          },
          st_mtim = {
            tv_sec = 1583785545,
            tv_nsec = 0
          },
          st_ctim = {
            tv_sec = 1583785545,
            tv_nsec = 0
          },
          __unused = {0, 0, 0}
        }
        ent = {
          name = 0x64431b "parametric_master",
          type = DUC_FILE_TYPE_DIR,
          size = {
            actual = 36864,
            apparent = 36864,
            count = 1
          },
          devino = {
            dev = 2211483442,
            ino = 144135941123016226
          }
        }
        duc = 0x62e430
        req = 0x62e690
        report = 0x63ff00
        r = <optimized out>
        e = <optimized out>
#18 0x0000000000407ca6 in duc_index (req=0x62e690, path=<optimized out>, flags=flags@entry=(unknown: 0)) at src/libduc/index.c:632
        duc = 0x62e430
        path_canon = 0x63fe90 "/projects/baz99"
        report = 0x63ff00
        scanner = 0x643f60
#19 0x000000000040d8f8 in index_main (duc=0x62e430, argc=<optimized out>, argv=<optimized out>) at src/duc/cmd-index.c:102
        report = <optimized out>
        siz_apparent = '\000' <repeats 31 times>
        siz_actual = '\000' <repeats 31 times>
        index_flags = (unknown: 0)
        open_flags = <optimized out>
        r = <optimized out>
#20 0x000000000040390d in main (argc=1, argv=0x7fffffffda88) at src/duc/main.c:177
        r = <optimized out>
        duc = 0x62e430
        cmd = 0x618de0 <cmd_index>
        ducrc = 0x62e450
        home = 0x7fffffffed83 "/root"
        log_level = <optimized out>
(gdb) quit
l8gravely commented 3 years ago

Bill, Thanks for the report. I've looked at my own DBs and the largest one I have is 1.2gb, but using TokyoCabinet (like you're using) we already do

int opts = BDBTLARGE;
if(compress) opts |= BDBTDEFLATE;
tcbdbtune(db->hdb, -1, -1, -1, -1, -1, opts);

before we open the DB, so duc should be able to open and use DBs larger than 2GB. Not sure how to test this without writing some code to fill one in by hand... But maybe we need to do some tuning for bigger numbers. Can you try the patch at the end for me?

Looking at your gdb trace (thank you!) I'm interested in:

    st_ent = {
      st_dev = 2211483442,
      st_ino = 144136071247126441,
      st_nlink = 2,
      st_mode = 16877,
      st_uid = 131078,
      st_gid = 130988,

And how large the st_ino and st_uid values are. On linux at least, the st_uid is a 32bit integer, so that st_uid is fine. The st_ino is a 64bit number, so even though that looks huge, it fits within 64bits. Hmmm... Let's try this if you don't mind:

diff --git a/src/libduc/db-tokyo.c b/src/libduc/db-tokyo.c
index 1eb7925..3eae431 100644
--- a/src/libduc/db-tokyo.c
+++ b/src/libduc/db-tokyo.c
@@ -61,8 +61,12 @@ struct db *db_open(const char *path_db, int flags, duc_errno *e)

int opts = BDBTLARGE;
if(compress) opts |= BDBTDEFLATE;

Compile and test this out and let us know if it craps out again. Basically, I just bumped all the tuning parameters to twice their default values. It's not clear to me whether this will break things with older DBs, so I'd recommend being careful about writing to old DBs with this new code.

Thanks, John

l8gravely commented 3 years ago

And I'm a moron and mis-read the docs and sent my patch before I did a test compile. Please use this patch instead. Those two parameters 512 and 2048 needed to be the exponents, not the values. Duh... but I still don't know if this will fix the problem:

diff --git a/src/libduc/db-tokyo.c b/src/libduc/db-tokyo.c
index 1eb7925..3006113 100644
--- a/src/libduc/db-tokyo.c
+++ b/src/libduc/db-tokyo.c
@@ -61,8 +61,12 @@ struct db *db_open(const char *path_db, int flags, duc_errno *e)

    int opts = BDBTLARGE;
    if(compress) opts |= BDBTDEFLATE;
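The added lines of the diff didn't survive above, but going by the defaults visible in the tcbmgr inform output later in this thread (max leaf 128, max node 256, bucket 32749, alignment 256, free block pool 1024), the "twice the defaults" description, and the 131072 value mentioned further down, the corrected tuning call was presumably something along these lines; a sketch, not the verbatim patch:

/* roughly double the TokyoCabinet B-tree defaults:
 * lmemb 128 -> 256, nmemb 256 -> 512, bnum 32749 -> 131072,
 * apow 8 -> 9 (alignment 512), fpow 10 -> 11 (free block pool 2048) */
tcbdbtune(db->hdb, 256, 512, 131072, 9, 11, opts);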

Also, how much memory was your duc index process using when it got into the loop? How much memory is on the indexing system?

John

spikebike commented 3 years ago

I will try your patch and rerun on that directory. The system has 192GB ram (188 as reported by free -h). I didn't look at the memory size, but it didn't cause any problems. Next time it hangs I'll get you the memory size. I suspect this will take a few days so I'll let you know. I'll delete the old database before launching.

spikebike commented 3 years ago

I recompiled, deleted the DB, and it's running. I did find a directory with a huge number of subdirectories:

$ ls -al |grep drw| wc -l
350002

So far:

[-#------] Indexed 5.0Gb in 72.8K files and 24.2K directories
l8gravely commented 3 years ago

"Bill" == Bill Broadley notifications@github.com writes:

Bill> I recompiled, deleted the DB, and it's running, I did find a directory:
Bill> $ ls -al |grep drw| wc -l
Bill> 350002

Bill> So far:

Bill> [-#------] Indexed 5.0Gb in 72.8K files and 24.2K directories

Do you know how many files the filesystem has? What does 'df -i' tell you? Also, how big is the DB now? We should be supporting DBs larger than 2GB without any problem... but who knows.

I wish I had a way to build a test case for this. Maybe I can feed in junk data somehow to catch this type of error.

spikebike commented 3 years ago

df -i works, but it covers the entire /projects filesystem, while I'm only running duc on one of its 250 subdirectories.

Filesystem                          Inodes IUsed IFree IUse% Mounted on
<ip>@ib:<ip>@ib:/eaglefs   6.6G  1.6G  5.0G   25% /projects

At one point there was a ton of directories, only somewhat more files, and not much storage. So 2-3 files per dir, and obviously very small files.

[----#---] Indexed 594.8Gb in 8.5M files and 2.8M directories

Currently:

[-----#--] Indexed 8.0Tb in 12.6M files and 4.2M directories
[--#-----] Indexed 8.4Tb in 18.0M files and 5.9M directories
154M Sep 23 21:45 /projects/<foo>/.duc-index/duc.db
228M Sep 24 08:45 /projects/<foo>/.duc-index/duc.db

I believe the directory (last I looked) is around 12.3TB. It would be very cool if this rate keeps up, but I fear quite a few million files remain to be scanned.

l8gravely commented 3 years ago

"Bill" == Bill Broadley notifications@github.com writes:

Bill> df -i works, but is for the entire projects dir, but I'm only Bill> running duc on one of 250 sub directories.

Bill> Filesystem                          Inodes IUsed IFree IUse% Mounted on
Bill> <ip>@ib:<ip>@ib:/eaglefs              6.6G  1.6G  5.0G   25% /projects

That's a crap load of files! I bet incremental backups suck suck suck... unless you can do block level snapshots or something like that.

Bill> At one point there was a ton of directories, only somewhat more Bill> files, and not much storage. So 2-3 files per dir, and obviously Bill> very small files.

Bill> [----#---] Indexed 594.8Gb in 8.5M files and 2.8M directories

Bill> Currently:

Bill> [------#-] Indexed 7.8Tb in 10.0M files and 3.3M directories
Bill> 122M Sep 23 14:48 /projects//.duc-index/duc.db

I've got filesystems with 75+ million files in under 8Tb of space, which duc handles without any problems.

Bill> I believe the directory (last I looked) and it's around Bill> 12.3TB. It would be very cool if this rate keeps up, but I fear Bill> quiet a few million files remain to be scanned.

It should handle that many without problems. I'm starting to wonder if it's the depth that's killing us, perhaps because TokyoCabinet's hashing starts crapping out due to hash collisions.

Just scanning all those files takes a long time, so getting something repeatable isn't going to be quick.

Do you have any idea how long the path to the deepest level of the filesystem is? I'm thinking that maybe PATH_MAX is getting hit somewhere... which should be easy enough to test with a simple test case.

John

spikebike commented 3 years ago

No particular idea on PATH_MAX, but I'm looking at all open file handles once every 10 minutes. So far I've not seen much over 110 characters for the path. And the max number of file handles duc is using is around 10.

Not sure what hash is used, but this mix does seem pretty challenging if it's not a good hash. There are directories called upXX from 01 to 40. In each of those are subdirectories called baz0000001 through baz0350000, and each of those has 6 files and 2 directories.

So the number of dirs is something like 40 × 350,000 × 2 = 28M directories, and 40 × 350,000 × 6 = 84M files.

Do you think it's worth writing a wrapper for tokyocabinet that accepts paths on stdin, so we can pump millions of pathnames in quickly? I could write a bit of code to generate the paths, which would likely be thousands of times faster than walking a filesystem.

Something like:

#!/bin/bash
base=/this/is/just/to/take/up/90/chars/or/so/to/be/an/accurate/test/of/the/hash/function
for i in `seq -w 1 40`; do
    for j in `seq -w 1 350000`; do
#dirs
        echo $base/bazfoo$i/$j;
        echo $base/bazfoo$i/$j/run;
#files
        echo $base/bazfoo$i/$j/in.osw
        echo $base/bazfoo$i/$j/out.osw
        echo $base/bazfoo$i/$j/singularity_output.log
        echo $base/bazfoo$i/$j/run/data_point_out.json
        echo $base/bazfoo$i/$j/run/data_point.zip
        echo $base/bazfoo$i/$j/run/finished.job
    done
done

On a random server I ran 1 of the 40 loops and it took under a minute, so the above script should take approximately 40 minutes to produce the listing of the 28M directories and 84M files.

It's possible that your patch fixed things, so maybe we should just wait. Currently:

[#-------] Indexed 8.5Tb in 19.5M files and 6.4M directories
l8gravely commented 3 years ago

"Bill" == Bill Broadley notifications@github.com writes:

Bill> No particular idea on PATH_MAX, but I'm looking at all open file Bill> handles once every 10 minutes. So far I've not seen much over Bill> 110 characters for the path. And the max number of file handles Bill> duc is using is around 10.

That's good info.

Bill> Not sure what hash is used, but this mix does seem pretty Bill> challenging, if it's not a good hash. There's directories Bill> called upXX from 01 to 40. In each of those is a sub directory Bill> called baz0000001 to baz0350000, and each of those has 6 files Bill> and 2 directories.

Yikes! It's like an old-time NNTP news spool directory structure! Are the six files and two directories in there also named in a regular pattern, or are they much more variable?

Bill> So the number of dirs is something like 40 × 350,000 × 2 = 28M directories and 40 × 350,000 × 6 = 84M files.

Pretty big, but as I've said I've got one filesystem with 75+ million files, so now I'm suspecting we're hitting hash bucket collisions here.

Bill> Do you think it's worth writing a wrapper for tokyocabinet to Bill> accept paths on so we can pump millions of pathnames quickly? I Bill> could write a bit of code to generate the that would likely be Bill> 1000s of times faster than walking a filesystem.

Yeah, it would be a simpler test to feed all this data in and see what it does without hitting the filesystem scan. I'll have to think about how to do this.

Another thought is to try and compile with a new backend DB if you have the time and inclination. I guess I'd go with LevelDB first off, but maybe KyotoCabinet would also work since it's a newer(ish) version of the same DB type as Tokyocabinet, but without the problems.

Bill> Something like:

Bill> #!/bin/bash
Bill> base=/this/is/just/to/take/up/90/chars/or/so/to/be/an/accurate/test/of/the/hash/function
Bill> for i in `seq -w 1 40`; do
Bill>     for j in `seq -w 1 350000`; do
Bill> #dirs
Bill>         echo $base/bazfoo$i/$j;
Bill>         echo $base/bazfoo$i/$j/run;
Bill> #files
Bill>         echo $base/bazfoo$i/$j/in.osw
Bill>         echo $base/bazfoo$i/$j/out.osw
Bill>         echo $base/bazfoo$i/$j/singularity_output.log
Bill>         echo $base/bazfoo$i/$j/run/data_point_out.json
Bill>         echo $base/bazfoo$i/$j/run/data_point.zip
Bill>         echo $base/bazfoo$i/$j/run/finished.job
Bill>     done
Bill> done

Yeah, time to see if it's simpler to hack duc to do this, or to just push data into tokyocabinet DB directly. Hmm... it might be simplest to use the 'tcbtest' and 'tcbmgr' tools to build a crappy test to see if it blows up. More in a bit as I play with this...

Bill> It's possible that your patch fixed things, so maybe we should just wait. Currently:

All I really did was double the hash structures and such, not any smart tuning by any means.

Bill> [#-------] Indexed 8.5Tb in 19.5M files and 6.4M directories

Good to see!

l8gravely commented 3 years ago

Bill, I'm running the following script... though it's not very fast, takes about 30 seconds to do 1000 inserts of the two directories and six files into the test DB.

My script looks like this, thanks to your base it was easy to whip up!

#!/bin/bash

db=/tmp/test.db

base=/this/is/just/to/take/up/90/chars/or/so/to/be/an/accurate/test/of/the/hash/function

tcbmgr create -tl $db
for i in `seq -w 1 40`; do
    echo i=$i
    for j in `seq 1 350000`; do
        #dirs
        tcbmgr put $db "$base/bazfoo$i/$j" 1
        tcbmgr put $db "$base/bazfoo$i/$j/run" 1

        #files
        tcbmgr put $db "$base/bazfoo$i/$j/in.osw" 10
        tcbmgr put $db "$base/bazfoo$i/$j/out.osw" 20
        tcbmgr put $db "$base/bazfoo$i/$j/singularity_output.log" 30
        tcbmgr put $db "$base/bazfoo$i/$j/run/data_point_out.json" 40
        tcbmgr put $db "$base/bazfoo$i/$j/run/data_point.zip" 50
        tcbmgr put $db "$base/bazfoo$i/$j/run/finished.job" 60
        if (( $j % 1000 == 0 )); then
            echo -n "."
        fi
    done
done

l8gravely commented 3 years ago

Bill, Can you do the following on your broken DB? It would be interesting to see if we get numbers out of it:

tcbmgr inform /path/to/duc.db

Here's some numbers I got from a random duc DB I had hanging around:

$ tcbmgr inform /tmp/duc.db
path: /tmp/duc.db
database type: btree
additional flags:
comparison function: lexical
max leaf member: 128
max node member: 256
leaf number: 1594
node number: 11
bucket number: 32749
used bucket number: 1605
alignment: 256
free block pool: 1024
inode number: 2228415
modified time: 2020-08-17T19:26:55-05:00
options: large deflate
record number: 38490
file size: 8370944

And here's the duc info -d /tmp/duc.db info as well.

$ ./duc info -d /tmp/duc.db | head -2
Date        Time     Files     Dirs    Size Path
2020-08-17 20:26:49  618.9K   37.8K  413.5G /home/john

So this is just a small DB of my home machine, and you can see the "record number" field pretty much gels with the number of dirs in the DB. So actually thinking even more about it, the test I'm running is going to push even more info into the DB than we really need, since duc itself only records directory size and file counts. Eh... it will be an ok test to see if we can break it quicker than actually scanning the filesystem.

I'll let it run until it breaks or it ends and post more info then.

John

spikebike commented 3 years ago

First the update:

[-------#] Indexed 9.4Tb in 33.5M files and 11.1M directories
432M Sep 25 12:14 /projects/foo/.duc-index/duc.d

l8gravely> Yikes! It's like you're an old time NNTP news spool directory l8gravely> structure! Are the six files and two directories in there also named l8gravely> in a regular pattern, or are they much more variable?

Heh, indeed, not many around recognize NNTP these days. When I was a wee lad I used to read 100% of usenet, of course that didn't last long.

The particularly problematic directory is near identical to that produced by my example script, one of 40 dirs with 350k dirs in each one, and that subdir has 6 files and one more subdir.

Other dirs are fine, like this one:

Date        Time     Files     Dirs    Size Path
2020-08-13 18:37:49  222.0M   52.5M   95.1T /projects/civid
2.4G Sep  9 06:43 civid.db

So it's clearly not the case that 83M files and 23M directories is, by itself, the problem.

l8gravely> Another thought is to try and compile with a new backend DB if you l8gravely> have the time and inclination. I guess I'd go with LevelDB first off, l8gravely> but maybe KyotoCabinet would also work since it's a newer(ish) version l8gravely> of the same DB type as Tokyocabinet, but without the problems.

I'm game, and thanks for mentioning KyotoCabinet vs Tokyocabinet I hadn't noticed. I'll let this one finish (or fail) first though.

I had hoped that some 750 directories (250 under /projects and 500 under /scratch) would be fine grained enough to allow reasonable runtimes by running them in parallel. I'm considering a few options:

  1. A parallel scanner written in go piped to duc or the storage layer more directly
  2. Something that can parse Lustre changelogs and pipe to duc and/or storage
  3. Some way to run multiple duc scans on each of /projects/foo/up[00-40] and get the result into a single duc.db

Your tcbmgr script is the right idea, but each put requires launching tcbmgr, which then has to do a full walk of the database/index before inserting. Considering that, 30 per second is pretty impressive. I'll look into writing a small wrapper that accepts piped-in directories, so we can just stream 100M test file paths to it without doing any I/O other than writing to the DB. On an SSD that should be pretty quick and would let us figure out what patterns cause it to explode much more quickly; once that works I can walk all 15-20PB looking for any problematic patterns.

If that's too slow it might be better to slurp up, say, 1M paths into a small btree, and then walk both btrees (small and Tokyocabinet); that way you don't walk the entire tree per file.
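A minimal sketch of such a wrapper, assuming paths arrive one per line on stdin and each one is stored with a dummy value (the names and the line-length limit are made up; real duc keys are dev/inode pairs, so this only exercises the raw B-tree insert path):

#include <stdio.h>
#include <string.h>
#include <tcbdb.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <db-path>\n", argv[0]);
        return 1;
    }

    TCBDB *bdb = tcbdbnew();
    /* same 'large' option duc uses, so the file can grow past 2GB */
    tcbdbtune(bdb, -1, -1, -1, -1, -1, BDBTLARGE);
    if (!tcbdbopen(bdb, argv[1], BDBOWRITER | BDBOCREAT)) {
        fprintf(stderr, "open failed: %s\n", tcbdberrmsg(tcbdbecode(bdb)));
        return 1;
    }

    char line[16384];
    long n = 0;
    while (fgets(line, sizeof(line), stdin)) {
        size_t len = strcspn(line, "\n");   /* strip the trailing newline */
        if (!tcbdbput(bdb, line, (int)len, "1", 1)) {
            fprintf(stderr, "put failed at record %ld\n", n);
            break;
        }
        if (++n % 1000000 == 0)
            fprintf(stderr, "%ld records\n", n);
    }

    tcbdbclose(bdb);
    tcbdbdel(bdb);
    return 0;
}

Piping the earlier path-generating script into it, something like `bash makepaths.sh | ./tcstream /tmp/test.db` (both names hypothetical), would then hammer the DB without touching the filesystem at all.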

l8gravely> Can you do the following on your broken DB?

Next time, yes. I don't have any broken ones at the moment. Here's one of the larger working ones:

$ tcbmgr inform ./civic.db
path: ./civic.db
database type: btree
additional flags:
comparison function: lexical
max leaf member: 128
max node member: 256
leaf number: 624808
node number: 3590
bucket number: 32749
alignment: 256
free block pool: 1024
inode number: 162133303107829870
modified time: 2020-09-09T05:43:06-07:00
options: large deflate
record number: 52472956
file size: 2483456512
l8gravely commented 3 years ago

"Bill" == Bill Broadley notifications@github.com writes:

Bill> First the update:
Bill> [-------#] Indexed 9.4Tb in 33.5M files and 11.1M directories
Bill> 432M Sep 25 12:14 /projects/foo/.duc-index/duc.d

l8gravely> Yikes! It's like you're an old time NNTP news spool directory l8gravely> structure! Are the six files and two directories in there also named l8gravely> in a regular pattern, or are they much more variable?

Bill> Heh, indeed, not many around recognize NNTP these days. When I Bill> was a wee lad I used to read 100% of usenet, of course that Bill> didn't last long.

Bill> The particularly problematic directory is near identical to that Bill> produced by my example script, one of 40 dirs with 350k dirs in Bill> each one, and that subdir has 6 files and one more subdir.

I'm suspecting more and more that we're hitting a pathological case here with the hashing function. If you look in libduc/index.c near the end, you'll see this:

char key[32];
struct duc_devino *devino = &scanner->ent.devino;
size_t keyl = snprintf(key, sizeof(key), "%jx/%jx", (uintmax_t)devino->dev, (uintmax_t)devino->ino);
int r = db_put(duc->db, key, keyl, scanner->buffer->data, scanner->buffer->len);
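As a cross-check, the key visible in frame #11 of your backtrace, "83d08f32/20012fe3e015fa9", is exactly this dev/ino pair in hex: 0x83d08f32 = 2211483442 and 0x20012fe3e015fa9 = 144136071247126441, matching the st_dev/st_ino of the "run" directory in frame #12.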

And I suspect that maybe this key isn't big enough in your case to get around hash collisions because of how similar your paths are. So this is another place where we could possibly hack the code. But if the change I made before fixes things for you, then that's probably a better way, since I think it's backward compatible with older DBs.

Your directory setup probably requires some tweaks to how we hash things to make it work better. I'm not sure how to do this without custom code at this point, if the previous hack doesn't work.

Would it be possible for you to change your setup to NOT have 320,000 directories in a single directory? Do the old NNTP trick of making a level or two of directories down so that you only have a few thousand directories inside any one directory?

Bill> Other dirs are fine, like this one:

Bill> Date        Time     Files     Dirs    Size Path
Bill> 2020-08-13 18:37:49  222.0M   52.5M   95.1T /projects/civid
Bill> 2.4G Sep  9 06:43 civid.db

Bill> So it's clearly not that 83M files and 23M directories is the problem.

Yup, that's not the problem. But is your civid/ tree as regular as the problem tree?

l8gravely> Another thought is to try and compile with a new backend DB l8gravely> if you have the time and inclination. I guess I'd go with l8gravely> LevelDB first off, but maybe KyotoCabinet would also work l8gravely> since it's a newer(ish) version of the same DB type as l8gravely> Tokyocabinet, but without the problems.

And just to expand on this, we're thinking of changing the backend support with the next major release to drop support for older backends, to make things simpler for us. Especially since TokyoCabinet is not actively developed any more. And I think even Kyotocabinet is a bit un-maintained as well.

Bill> I'm game, and thanks for mentioning KyotoCabinet vs Tokyocabinet Bill> I hadn't noticed. I'll let this one finish (or fail) first Bill> though.

Bill> I had hoped that some 750 directories (250 under /projects and Bill> 500 under /scratch) would be fine grained enough to allow Bill> reasonable runtimes by running them in parallel. I'm considering Bill> a few options:

Bill> 1. A parallel scanner written in go piped to duc or the storage layer more directly

Unless your storage backend can handle it, it's not going to give you much gain. The big cost is the actual opendir(), readdir(), stat(), closedir() loop.

Way back when I was first poking at this, when it was Philesight and written in Ruby, I had hacked together a parallel descent version which did N parallel threads. This was before I realized that Ruby had the problem of the Global Thread Lock, so while you could write parallel code, it wasn't really parallel.

Zevv rewrote the core code in C, which did speed things up quite a bit, but making it a multi-writer DB tool is a lot more work. It might work if you had one writer and multiple children which descended into a tree to read for you. But keeping it all in sync isn't trivial.

Bill> 2. Something that can parse Lustre changelogs and pipe to duc and/or storage

Is there any way to break it down by Lustre filesystem?

Bill> 3. Some way to run multiple duc scans on each of Bill> /projects/foo/up[00-40] and get the result into a single Bill> duc.db

This is not an easy thing to do. But maybe an offline merge to take two DBs and merge them together might be an interesting tool to write. Have to think about it.

But

Bill> Your tcbmgr script is the right idea, but each put requires Bill> launching tcpmgr which then has to do a full walk of the Bill> database/index before inserting. Considering that, 30 per second Bill> is pretty impressive. I'll look into writing a small wrapper to Bill> allow piping the directories so we can just stream 100M test Bill> file paths to it, without doing any I/O outside of writing to Bill> the DB. On a SSD that should be pretty quick and would let us Bill> figure out what patterns cause it to explode much more quickly, Bill> once that works I can walk all 15-20PB looking for any Bill> problematic patterns.

Great! I started this last night, but ended up going with the tcbmgr script, which I couldn't get to blow up. I suspect I wasn't pushing data into it quite the same way.

Mostly because duc builds its key with "device_id/inode_num" and pushes in the data that way. So... it's replicating that part that needs to happen, I suspect.

So I just stopped my shell script for now since I had grown the DB file to 3+ gb without any problems.

Bill> If that's too slow might be be better to slurp up say 1M paths Bill> into a small btree, and then walk both btrees (small and Bill> Tokyocabinet) that way you don't walk the entire tree per file.

l8gravely> Can you do the following on your broken DB?

Bill> Next time, yes. I don't have any broken ones at the Bill> moment. Here's one of the larger working ones:

Bill> $ tcbmgr inform ./civic.db
Bill> path: ./civic.db
Bill> database type: btree
Bill> additional flags:
Bill> comparison function: lexical
Bill> max leaf member: 128
Bill> max node member: 256
Bill> leaf number: 624808
Bill> node number: 3590
Bill> bucket number: 32749
Bill> alignment: 256
Bill> free block pool: 1024
Bill> inode number: 162133303107829870
Bill> modified time: 2020-09-09T05:43:06-07:00
Bill> options: large deflate
Bill> record number: 52472956
Bill> file size: 2483456512

This is interesting: we see approx 52+ million entries, matching the number of directories you have. So they must not be nearly as pathological as the other tree.

How is your Lustre filesystem built? What is it based on for underlying filesystems? I suspect XFS if only because I think ext3/4 would simply bog down with 320,000 directories in a single directory. But I could be wrong here.

But as I've said before, splitting up your 250 projects into individual DBs gives you all the parallelism you need; then you have a single index script (see the examples/index.cgi) which takes all your DBs and generates the CGI files for each one, as well as the master index file you use to examine each project.

Now finding some way to tie them together into a single view would be interesting. It might be possible with a new DB format which is purely acting as a way to join DBs into a single view. Have to think about this.

It would keep the individual DBs for each tree/project/fileystem/etc, but give you a way to merge them into a single CGI interface so that users can browse things nicely.

Off the cuff thinking:

duc join -d master.db -a projects/*.db

where -a adds one or more DBs into the master DB.

This 'join' command just does a duc info on each DB and sums it up, and builds the tree. So if you index four DBs holding:

foo.db - /projects/foo
bar.db - /projects/bar
blah-one.db - /projects/blah/one
blah-two.db - /projects/blah/two

It would sum up and present you with /projects and then links to the four DBs to then be browsed through. When you jumped into a different DB, it would open that new DB file and present the data.

The trick is getting back up the tree from /projects/foo into the merge.db properly. I guess we'd just need to have a 'global' DB when we encountered one of these join'd DBs. It would be a special case.

Lots to think about here. Have a good weekend! John

spikebike commented 3 years ago

Good news, it finished:

[------#-] Indexed 21.0Tb in 159.7M files and 52.6M directories
/projects/foo done Writing to database "/projects/foo/.duc-index/duc.db"

And:

$ tcbmgr inform ./duc.db
path: ./duc.db
database type: btree
additional flags:
comparison function: lexical
max leaf member: 256
max node member: 512
leaf number: 583608
node number: 1749
bucket number: 147451
alignment: 512
free block pool: 2048
inode number: 162133238280620164
modified time: 2020-10-02T12:16:55-07:00
options: large deflate
record number: 52608733
file size: 2165248000

From the debug log:

<< /projects/foo actual:22465400747513 apparent:23083084365824
Indexed 159694681 files and 52608730 directories, (20.4TB apparent, 21.0TB actual) in 9d, 22 hours, 59 minutes, and 49.54 seconds.
238:59:52 3%CPU

So it looks like your patch fixed it.

l8gravely> Would it be possible for you to change your setup to NOT have 320,000 l8gravely> directories in a single directory? Do the old NNTP trick of making a l8gravely> level or two of directories down so that you only have a few thousand l8gravely> directories inside any one directory?

We can provide feedback to the users, but it's up to them; ideally, of course, duc would handle any legal filesystem layout.

l8gravely> Yup, that's not the problem. But is your civid tree as regular as l8gravely> the problem tree?

It was smaller and less regular, and scanned previously with no problem.

l8gravely> Unless your storage backend can handle it, it's not going to give you l8gravely> much gain. The big cost is the actual openddir(), readdir(), stat(), l8gravely> closedir() loop.

I think it can; it's the latency that's killing it. It's got 3 metadata servers and 48 object stores, and even when running 20 copies of duc in parallel I don't see a noticeable difference in the load.

l8gravely> Way back when I was first poking at this when it was Philesight and l8gravely> written in Ruby, I had hacked together a parallel descent version l8gravely> which did N parallel threads. This was before I realized that Ruby l8gravely> had the problem if the Global Thread Lock, so while you could write l8gravely> parallel code, it wasn't really parallel.

Heh, we've been down many of the same paths. I wanted a high performance crawler. First I tried Java, only to find I had to do weird things to get stat() working; it wasn't part of the language standard and I didn't want an external call to C. So I rewrote it in Python, before I knew of the GIL, and hit a serious performance bottleneck. I then wrote it in Go and was quite pleased with the result.

l8gravely> Zevv re-wrote the core code into C which did speed up things quite a l8gravely> bit, but making it a multi-writer DB tool is alot more work. It might l8gravely> work if you had one writer, and multiple children which descended into l8gravely> a tree to read for you. But keeping it all in sync isn't trivial.

I was planning on just having a single database thread being fed by N walkers. Something like:

walker
   dir=getDirFromQueue
   foreach <itemInDir>
      if IsDir:
          if dir_not_queued_or_scanned(dir):
              queueDir(dir)
      if IsFile
          st=stat(file)
          queueFile(st.Size)

I've not benchmarked the database, but hopefully it's not the bottleneck.

l8gravely> Is there any way to break it down by Lustre filesystem?

Not sure I get exactly what you mean. Hopefully this answers your question. It's tricky, there's metadata servers, but they mostly track the filename, which dir it's in, and which object stores have the blocks. They do NOT track filesize. There is a Lustre change log and a tool called robinhood was commonly used to track filesizes. It wasn't perfect, errors could accumulate, but it was reasonable. Unfortunately the changelog format changed, and we also had some problems with robinhood not keeping up, the changelogs would accumulate, and the filesystem would go down.

l8gravely> How is your Lustre filesystem built? What is it based on for l8gravely> underlying filesystems? I suspect XFS if only because I think ext3/4 l8gravely> would simply bog down with 320,000 directories in a single directory. l8gravely> But I could be wrong here.

I'll have to check, I believe Lustre is relatively FS agnostic and I think the road map includes support for ZFS. But old ext2/3 filesystems did use a linked list which had major issues. But since then there's a commonly supported flag called dir_index which allows using a hashed btree (sometimes called an htree) which should handle large directories reasonably.

l8gravely> But as I've said before, splitting up your 250 projects into l8gravely> individual DBs gives your all the parallelism you need, then you have l8gravely> a single index script (see the examples/index.cgi) which takes all l8gravely> your DBs and generates the CGI files for each one, as well as the l8gravely> master index file you use to examine each project.

Sadly splitting it into 250 projects and some 100s of /scratch dirs isn't enough. A single /project dir took 9 days and we'd like to be able to scan more often than that. Having a separate DB per /project dir and a simple awk script can give as a total for all directories. So a parallel scanner, or merging directories in post processing so that something like /project/foo/up<1 to 100> could be scanned in parallel then merged in post processing could work.

The proposed duc join does seem like a reasonable approach. I could walk 246 of the 250 directories and put them all in their own duc.db, then for the directories that take more than a week to scan I cold do those in parallel into a database for each one, and then join them together afterwards. Interesting, I'll have to ponder.

l8gravely commented 3 years ago

Bill> Good news, it finished:
Bill> [------#-] Indexed 21.0Tb in 159.7M files and 52.6M directories
Bill> /projects/foo done Writing to database "/projects/foo/.duc-index/duc.db"

Nice! Glad that it finally finished. That's a crap load of directories and files too, about 2.5x my biggest collection. So the simple thing is to just apply those changes to the next release.

Thanks for your testing, it's been a big help.

Would you be willing to run some tests where we make a couple of different changes to see which one made the difference, or if we really need all of them?

While it's a simple patch, it's five different tweaks, but I suspect in your case it's the bnum, since the man page says:

`bnum' specifies the number of elements of the bucket array.  If it is
not more than 0, the default value is specified.  The default value is
16381.  Suggested size of the bucket array is about from 1 to 4 times
of the number of all pages to be stored.

So if you feel like changing the '131072' back to '-1' in the tcbdbtune() call in db-tokyo.c and re-running the test, that might tell us something.
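Assuming the patched call looks like the sketch earlier in the thread, that A/B variant would presumably be just the one parameter reverted:

/* keep the doubled leaf/node members, alignment and free block pool,
 * but let bnum fall back to the TokyoCabinet default */
tcbdbtune(db->hdb, 256, 512, -1, 9, 11, opts);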

But in any case, I think this is probably a safe change to make since it shouldn't do anything but increase the overhead for smaller DBs, and let monster ones like yours actually work.

Unfortunately, the 'tcbmgr inform' doesn't really give us enough indication on what actually made the difference here. So for now, I'll push this change.

I really need to get my butt in gear to spin out a new release.

John

spikebike commented 3 years ago

Would you be willing to run some tests where we make a couple of different changes to see which one made the difference, or if we really need all of them?

In theory, yes. If each run takes 9 days I'm not sure it's worth it. I've yet to find time to look, but it seems like something like this would let us test much faster:

#ifdef DEBUG
   sscanf(...)    /* get path and size from stdin */
#else
   readdir(...)
   stat(...)
#endif

I suspect an SSD could spew 160M file names and insert them into a DB in less than an hour instead of 9 days. Hrm, well, maybe in a day anyway. I'll guesstimate 10-100x faster.

Which do you think would be easier to implement:

  1. A parallel scanner combined with a sequential database thread. Each reader takes an entry from the directory list and appends to that list when it finds a directory. You'd need thread-safe queues, and a thread-safe way to check whether each directory has been visited yet. Go makes this easy; I'm less sure of the best approach in C/C++. I've written pthreads code before, but nothing particularly complicated.
  2. Manual parallelism with some post processing way to merge databases.

As for maximum sizes, my biggest single directory tree (out of 400) as measured by files:

/projects/foo1 files=   221966601 dirs=    52472953 sumBytesTB=    95 filesPerDir=     4 averageFileSizeMB=     0

By dirs:

  /projects/foo2 files=   159694681 dirs=    52608730 sumBytesTB=    20 filesPerDir=     3 averageFileSizeMB=     0

By size:

/projects/foo3 files=    25555665 dirs=     2458317 sumBytesTB=  1821 filesPerDir=    10 averageFileSizeMB=    74

Files per dir:

/projects/foo4 files=     4074022 dirs=          40 sumBytesTB=     0 filesPerDir=  101850 averageFileSizeMB=     0
spikebike commented 3 years ago

Hrm, I do think the patched duc leaves old databases kind of weird and broken if they aren't recreated.

For instance I ran a scan today and then:

# /nopt/nrel/utils/bin/duc info -d /projects/foo12/.duc-index/duc.db /projects/foo12
Date       Time       Files    Dirs    Size Path
2020-10-06 23:00:37   42.0M    6.8M    9.7T /projects/foo12
# ls -al /projects/foo12/.duc-index/duc.db
349777664 Oct 15 13:58 /projects/foo12/.duc-index/duc.db

I'll zero out all my databases and see if it acts normally.

l8gravely commented 3 years ago

"Bill" == Bill Broadley notifications@github.com writes:

Bill> Hrm, I do think the patched do make the databases kinda of weird Bill> and broken if not updated. For instance I ran a scan today and Bill> then:

I must be suffering from -ENOCAFFEINE cause I'm not seeing the problem. Oh wait, did your /projects/novonutr come out as foo12 instead?

Bill> # /nopt/nrel/utils/bin/duc info -d /projects/foo12r/.duc-index/duc.db /projects/novonutr
Bill> Date       Time       Files    Dirs    Size Path
Bill> 2020-10-06 23:00:37   42.0M    6.8M    9.7T /projects/foo12
Bill> # ls -al /projects/foo12/.duc-index/duc.db
Bill> 349777664 Oct 15 13:58 /projects/foo12/.duc-index/duc.db

Bill> I'll zero out all my databases and see if it acts normally.

Just to be clear, is this using the new version of duc with the updated DB settings, causing problems with older DBs created without those settings?

In any case, I need more details please.

spikebike commented 3 years ago

I was careful, but made some mistakes pasting in the output. It seems like the old (pre-patch) duc makes a database that silently doesn't update with the patched duc: duc info and duc ls just report the old date, and rescans finish without error, but the date never updates. Once I delete the databases it seems to work great.

spikebike commented 3 years ago

I wrote a scanner in Go, so far it just counts files, counts directories, and counts total bytes. It doesn't store anything in a database, but it is quite a bit faster:

$ time duc index /projects/foo/running/130/trashFolder/136
real    1m8.265s
user    0m16.402s
sys 0m28.423s

$ time go run dir1.go /projects/foo/running/130/trashFolder/136
Total Files=593566 Total Dirs=220418 Total Size=422181633387
real    0m6.541s
user    0m8.294s
sys 1m13.236s

$ time du -hs /projects/foo/running/130/trashFolder/136
395G    /projects/foo/running/130/trashFolder/136
real    0m45.030s
user    0m2.402s
sys 0m24.926s

Not sure how interesting that is though; it doesn't write to a database, and is only likely to be a significant speedup on parallel filesystems. There are Tokyo Cabinet bindings for Go, so a Go scanner could still be compatible with the rest of duc. So far my implementation is 73 lines of Go.

l8gravely commented 3 years ago

Bill> I wrote a scanner in Go, so far it just counts files, counts Bill> directories, and counts total bytes. It doesn't store anything Bill> in a database, but it is quite a bit faster:

Interesting! But you do need to be sure to drop any/all caches between tests. You need to do the following before each test, and ideally run the test three times and average the results.

echo 3 > /proc/sys/vm/drop_caches

And you will need to be root to do this.

Bill> $ time duc index /projects/foo/running/130/trashFolder/136
Bill> real    1m8.265s
Bill> user    0m16.402s
Bill> sys 0m28.423s

Bill> $ time go run dir1.go /projects/foo/running/130/trashFolder/136
Bill> Total Files=593566 Total Dirs=220418 Total Size=422181633387
Bill> real    0m6.541s
Bill> user    0m8.294s
Bill> sys 1m13.236s

Bill> $ time du -hs /projects/foo/running/130/trashFolder/136
Bill> 395G    /projects/foo/running/130/trashFolder/136
Bill> real    0m45.030s
Bill> user    0m2.402s
Bill> sys 0m24.926s

Bill> Not sure how interesting that is though, it doesn't write to a Bill> database, and is only likely to be a significant speedup on Bill> parallel filesystems. There are tokoyo cabinet bindings for go, Bill> so a go scanner could still be compatible with the rest of Bill> duc. So far my implementation is 73 lines of go.

The duc format is pretty simple, so you might be able to write something that will interoperate, even if only for the generation of the index at first.

But I really want to see your numbers from a cold cache before we really take this any farther.

John

spikebike commented 3 years ago

With a parallel file system, dropping caches is tricky. There are client-side caches, metadata caches (on the metadata servers), and caches on the object stores as well. To complicate things, there are 2600 other nodes making requests of the parallel file system. I ran my code twice, the first time with cold caches, the second with warm, and then duc (also with a warm cache).

I'll try to quantify how much dropping the client caches changes things... later.

l8gravely commented 3 years ago

"Bill" == Bill Broadley notifications@github.com writes:

Bill> With a parallel file system dropping caches is tricky. There's Bill> client side caches, metadata caches (on the meta data server), Bill> and cache on the object stores as well. To complicate things Bill> there's 2600 other nodes making requests of the parallel file Bill> system. I ran my code twice, the first with cold caches, the 2nd Bill> with warm, and then duc (also with a warm cache).

Can you explain your setup better? I really just want you to drop the caches on the Linux system doing the indexing; I don't want to tell the storage to drop its caches. Once they're big enough, they should spill no matter what.

Bill> I'll try to quantify what differences dropping the client caches Bill> changes things ... later.

So maybe run the test three times and average things. Or four times and drop the first run as you warm the caches.

John

spikebike commented 3 years ago

Well, part of the problem is that the node in question is already running duc scans; the length of these scans (one running since Oct 15th) is why I'm looking for faster solutions:

$  ps acux | grep duc
root     24403  2.4  0.0  57488 32612 pts/3    S+   Nov09  79:56 duc
root     26617  1.0  0.0  72464 46036 pts/1    R    Oct15 412:06 duc
root     27126  3.0  0.0 119756 88712 pts/2    S+   Nov02 407:47 duc

It's one of 2600 or so clients of lustre, which has 3 meta data servers and 48 or so object stores. The file system is pretty much always busy.

Truly cold, my code takes around 2 minutes 15 seconds for the benchmark dir. But I can only test that once, and can't easily flush the remote caches. If I run with warm caches:

real    0m8.124s
real    0m7.881s
real    0m7.909s

If I do this in a loop with the drop caches I get:

Total Files=742634 Total Dirs=273695 Total Size=380835906433
real    0m36.399s
Total Files=742634 Total Dirs=273695 Total Size=380835906433
real    0m28.817s
Total Files=742634 Total Dirs=273695 Total Size=380835906433
real    0m32.781s

If I do the same with duc, again with drop_caches immediately before every run:

for i in 1 2 3; do echo $i;  echo 3 > /proc/sys/vm/drop_caches; time duc index /projects/test/running/130/trashFolder/137; done
real    10m32.848s
real    14m53.847s
real    7m33.314s

That's more variation than usual, but you get the idea.

zevv commented 3 years ago

Hi Folks,

Sorry to drop into the discussion this late.

Quoting Bill Broadley (2020-11-04 11:32:40)

I wrote a scanner in Go, so far it just counts files, counts directories, and counts total bytes. It doesn't store anything in a database, but it is quite a bit faster:

It is significantly faster indeed. I assume you do your scanning in parallel in multiple threads/goroutines, or is it just sequential like Duc does?

I'm 100% sure that for parallel file systems like Lustre, CEPH and whatever more, Duc is far from optimal: it scans the whole tree recursively from a single thread and thus makes no use of the inherent parallelism offered by the underlying file system.

In the early days of Duc I made some attempts at properly parallelizing the scanning, but I felt the gains on "normal" file systems were not worth the additional complexity.

Can you share some more details about your scanner implementation, possibly sharing the source, or maybe show a few thousand lines of strace output from a run?

Thanks!

Ico

-- :wq ^X^Cy^K^X^C^C^C^C

spikebike commented 3 years ago

zevv> I'm 100% sure that for parallel file systems like Lustre, CEPH and zevv> whatever more, Duc is far from optional: it's scans the whole tree zevv> recursively from a single thread and thus makes no use of the inherent zevv> parallelism offerd by the underlying file system.

I did try breaking up a large (several PB) filesystem into 400 directories and running 30-40 copies of duc in parallel. For 95% or so of our directories it's fine. But for others I end up with week- or even month-long scans, one started on Oct 15th (28 days ago).

Lustre has metadata servers, but that just includes the filename, permissions, and which object stores the file is using. Sadly this does NOT include the filesize. So a simple stat() call for size requires client -> metadata server -> talk to object stores (OSTs). Clearly the latency is going to be higher with network traffic to multiple other servers for a simple stat, so a parallel approach really helps to increase throughput.

With warm caches I see things like:

THREADS  runtime
     16  real  0m8.200s   // 9.3x faster with 16x threads
      8  real  0m11.263s
      4  real  0m20.874s
      2  real  0m39.100s
      1  real  1m16.306s

For this reason my first attempt was:

   for {
      list, err := file.Readdirnames(1024) // read names in batches, hoping to avoid a stat() call
      if err != nil {
         break // io.EOF once the directory is exhausted
      }
      for _, name := range list {
         fullName := dirName + "/" + name
         fileNames <- fullName // send name to a Go channel so workers can run stat()
      }
   }

Unfortunately that wasn't any faster than the simpler:

      for {
         list, err := dir.Readdir(256) // read in batches of 256 (passing 0 would read everything at once)
         if err != nil {
            break
         }
         for _, file := range list {
            totalSize += file.Size()
            if file.IsDir() {
               totalDirs++
               dirNames <- dirName + "/" + file.Name() // queue any found dirs
            } else {
               totalFiles++
            }
         }
      }

To see the above in a working example https://github.com/spikebike/go-dirscan

zevv commented 3 years ago

So a simple stat() call for size requires client -> metadata server -> talk to object stores (OSTs).

Right, and that kind of latency really adds up.

Part of the problem I ran into with parallelizing the scan as a whole is that at some point the subtrees meet and have to be consolidated in the common root. Traversing the FS in parallel is not hard, but keeping the total bookkeeping is harder.

Part of my problem is that threading and parallelization is just a pain-in-the-ass in C, in general. I got this to work at one point, but the resulting code was just getting too complicated. I guess the scanner logic could greatly benefit from being written in a language with better threading primitives - like Go.

If the stat()ing is really the main problem, I can see a relatively clean solution where Duc could introduce a thread pool only for doing the stats in parallel. So in this case only one directory will still be traversed at any moment, all entries will be retrieved with readdir(), and then all the individual lstat() calls will be performed on N threads in parallel. I guess this also helps on non-distributed network file systems as it just parallelizes the server round-trip time.
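For concreteness, here is a minimal Go sketch of that pattern (illustrative only, not duc's actual code): one readdir() of a single directory, with the per-entry lstat() calls fanned out to a pool of N worker goroutines. The names (statDir, nWorkers) are made up for the example.

package main

import (
   "fmt"
   "os"
   "path/filepath"
   "sync"
   "sync/atomic"
)

// statDir reads one directory sequentially (one directory at a time, as in the
// idea above) but performs the per-entry lstat() calls on nWorkers goroutines.
func statDir(dir string, nWorkers int) (files, bytes int64) {
   entries, err := os.ReadDir(dir)
   if err != nil {
      return 0, 0
   }
   names := make(chan string)
   var wg sync.WaitGroup
   for i := 0; i < nWorkers; i++ {
      wg.Add(1)
      go func() {
         defer wg.Done()
         for name := range names {
            if fi, err := os.Lstat(filepath.Join(dir, name)); err == nil && !fi.IsDir() {
               atomic.AddInt64(&files, 1)
               atomic.AddInt64(&bytes, fi.Size())
            }
         }
      }()
   }
   for _, e := range entries {
      names <- e.Name()
   }
   close(names)
   wg.Wait()
   return files, bytes
}

func main() {
   f, b := statDir(".", 16)
   fmt.Println("files:", f, "bytes:", b)
}

On a high-latency filesystem each lstat() is a network round trip, so overlapping 16 of them is where the win would come from; on a local disk it mostly just adds goroutine overhead.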

For this reason my first attempt was: [...]

Unfortunately that wasn't any faster than the simpler: [...]

I think we could learn a lot by properly measuring and/or tracing the individual libc/system calls that are run during a parallel scan to see where the low-hanging fruit is.

That said: I stalled my active development on Duc for a long time, mostly because it Works For Me, and I kind of lost interest in the project. John is a terrific help for me, as he actually takes the time and effort to respond to most (or even all?) of the Github issues. I'm not sure if I can find the time and motivation to really work on this. But if we think the solution above (parallelizing the stats only) would be enough help, I think we can give that a try, as it would be a nice local change affecting only index.c:scanner_scan().

-- :wq ^X^Cy^K^X^C^C^C^C

spikebike commented 3 years ago

zevv> Part of the problem I ran into with parallelizing the scan as a whole is that at some point the subtrees meet and have to be consolidated in the common root. Traversing the FS in parallel is not hard, but keeping the total bookkeeping is harder.

Not really, just queue one dir, then as you find new dirs queue them, that's it. No testing if you've seen them, no merging trees, no consolidation. Hrm, my code doesn't test for links, I should just ignore them. After all it's not strictly a tree with links.

zevv> If the stat()ing is really the main problem, I can see a relatively clean solution where Duc could introduce a threadpoold only for doing the stats in parallel. So in this case one directory will still be traversed at any moment, all entries will be retreived with readdir(), and then all the indidual lstat() calls will be performed on N threads in parallel.

Several of our problematic directories have something like 3-4 files per directory, so I didn't want to stick to a single directory at time.

zevv>That said: I stalled my active development on Duc for a long time, mostly because it Works For Me, and I kind of lost interest in the project. John is a teriffic help for me, as he actually takes the time and effort to respond to most (or even all?) of the Github issues. I'm not sure if I can find the time and motivation to really work on this. But if we thing the solution above (parallizing the stats only) would be enough help, I think we can give that a try, as that would be a nice local change affecting only index.c:scanner_scan()

Very fair, totally makes sense. For the common case the parallel scan is a waste; on my desktop with a fast NVMe (Samsung 950 Pro):

left:~/git/go-dirscan$ time go run dirscan.go .. 8; time du -hs ..
workers = 8
Total Files=1074232 Total Dirs=994209 Total Size=27519874975

real    0m7.369s
user    0m21.723s
sys 0m30.572s
26G ..

real    0m9.901s
user    0m4.233s
sys 0m5.595s

So that's clearly not much of a speedup (7.4 seconds for dirscan vs. 9.9 seconds for du) and not worth any additional complexity. I think the best way forward is to add the data I collect directly to the same backend duc uses. That way anyone with a similar problem can use this scanner and still use the other duc tools.

Any idea which of the database backends would be fastest for writes (i.e. fastest for writing scan metadata)? I believe it was mentioned that tokyocabinet is not under active development, but kyotocabinet is.

spikebike commented 3 years ago

Just noticed the INSTALL file:

----------------------------------
 Database        Run time Db size
                      (s)    (kB)
----------------------------------
 tokyocabinet [*]     8.4   19.2
 leveldb              7.1   31.5
 sqlite3             13.5   71.1
 lmdb                 5.9   78.7
 kyotocabinet         8.3   26.7
----------------------------------

l8gravely commented 3 years ago

"Bill" == Bill Broadley notifications@github.com writes:

zevv> I'm 100% sure that for parallel file systems like Lustre, CEPH and zevv> whatever more, Duc is far from optional: it's scans the whole tree zevv> recursively from a single thread and thus makes no use of the inherent zevv> parallelism offerd by the underlying file system.

Bill> I did try breaking up a large (several PB) filesystem into 400 Bill> directories and running 30-40 copies of duc in parallel. For 95% Bill> or so of our directories it's fine. But for others I end up with Bill> week or even month long scans, one in mid Oct (28 days ago).

Ouch, you're in a different league than my needs; I only had approx 300 TB to index. Sheer number of files is what slows things down the most.

Bill> Lustre has metadata servers, but that just includes the Bill> filename, permissions, and which object stores the file is Bill> using. Sadly this does NOT include the filesize. So a simple Bill> stat() call for size requires client -> metadata server -> talk Bill> to object stores (OSTs). Clearly the latency is going to be Bill> higher with network traffic to multiple other servers for a Bill> simple stat, so a parallel approach really helps to increase Bill> throughput.

Does Lustre spread directory entries out across object stores, or does it group a directory onto one or a few object stores? That would impact the scanning algorithm for sure.

Bill> With warm caches I see things like:

Bill> THREADS runtime Bill> 16 real 0m8.200s // 9.3x faster with 16x threads Bill> 8 real 0m11.263s Bill> 4 real 0m20.874s Bill> 2 real 0m39.100s Bill> 1 real 1m16.306s

Bill> For this reason my first attempt was:

That's a big improvement! For Lustre (and possibly Ceph, and maybe NFS on NetApps) that kind of parallelism would be a big win.

Bill> for { Bill> list, err := file.Readdirnames(1024) // hoping to avoid a stat() call Bill> // fmt.Println(err) Bill> for _, name := range list { Bill> fullName := dirName + "/" + name Bill> fileNames <- fullName // send name to go channel to have workers run stat() Bill> } Bill> }

Bill> Unfortunately that wasn't any faster than the simpler:

Bill> for { Bill> list, err := dir.Readdir(256) // 0 to read all files and folders Bill> if err != nil { Bill> break Bill> } Bill> for _, file := range list { Bill> totalSize += file.Size() Bill> if file.IsDir() { Bill> totalDirs++ Bill> dirNames <- dirName + "/" + file.Name() // queue any found dirs Bill> } else { Bill> totalFiles++ Bill> } Bill> } Bill> }

Bill> To see the above in a working example https://github.com/spikebike/go-dirscan

Thanks for sharing this. Now how hard is it to get your Go code and Ico's C code to interoperate nicely? The locking of having a separate indexer process might get tricky if you're reading the DB at the same time.

This is the update problem too, which is why most scripts index to a new DB each time, and then move it into place so the end users can look at it with the CGI scripts.

Way way back when, I did a hack on philesight to fire off a new thread for each directory as it encountered it, up to a maximum. This was before I realized that Ruby still had a single global thread lock. But it was a nice way to scan the directory tree with multiple threads, keeping them busy pretty much all the way through.

Since duc does depth-first scanning, each time you hit a directory, just send a new thread scanning down into that directory, since you need to wait for the totals to come back up before you can continue.

So I think your code above needs to actually be more recursive in design, with a thread count somewhere. I haven't touched Go at all, so I don't know how that would work.

Some quick googling (and no looking at your code!) seems to imply that sync.WaitGroup might be a good way to scale the number of readers up.

I'll have to mull this over more... and maybe learn some Go.
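As a rough sketch of that idea in Go: a sync.WaitGroup keeps the depth-first shape (a directory only returns its total after all its subdirectories report back), and a buffered channel acts as a semaphore that caps concurrent directory scans. The names and the limit of 16 are placeholders, not anything in duc.

package main

import (
   "fmt"
   "os"
   "path/filepath"
   "sync"
   "sync/atomic"
)

var sem = make(chan struct{}, 16) // at most 16 directory scans in flight

// scan returns the total size of everything under dir, recursing into
// subdirectories on new goroutines when a semaphore slot is free.
func scan(dir string) int64 {
   var total int64
   entries, err := os.ReadDir(dir)
   if err != nil {
      return 0
   }
   var wg sync.WaitGroup
   for _, e := range entries {
      path := filepath.Join(dir, e.Name())
      if e.IsDir() {
         select {
         case sem <- struct{}{}: // slot free: scan the subdir concurrently
            wg.Add(1)
            go func() {
               defer wg.Done()
               atomic.AddInt64(&total, scan(path))
               <-sem
            }()
         default: // no slot free: scan it in this goroutine
            atomic.AddInt64(&total, scan(path))
         }
      } else if fi, err := os.Lstat(path); err == nil {
         atomic.AddInt64(&total, fi.Size())
      }
   }
   wg.Wait() // a directory's total isn't known until all subdirs report back
   return total
}

func main() {
   root := "."
   if len(os.Args) > 1 {
      root = os.Args[1]
   }
   fmt.Println(scan(root), "bytes under", root)
}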

The trick will be writing the DB in a consistent manner so that it can be read by the existing code properly. If that issue gets solved, then version 2.0 of duc will have to come out! Or maybe this would be a separate project called 'pduc' or something like that. :-)

The trick will be giving the end user enough rope to be able to scale up nicely, but not enough for them to hang themselves. :-)

Also, speaking of backend DBs, Ico and I have been talking on and off about reducing the number of supported backends, esp since tokyocabinet is abandonware, kyotocabinet also was getting a little less love, and sqlite just doesn't scale well at all.

Leveldb seemed to be where we would end up going eventually, probably with the v2.0 release, since I'm planning on cutting a v1.5 release with some fixes, better docs, more examples, more tests, etc.

Just haven't had the time to do it, since I wanted to spin up some VMs with Vagrant to do my test builds on Debian Buster, CentOS, Fedora and Suse images. That led me down the Vagrant rathole of course. LOL!

John

spikebike commented 3 years ago

l8gravely> Thanks for sharing this. Now how hard is it to get your go code and Ico's C code to interoperate nicely? The locking of having a separate indexer process might get tricky if you're reading the DB at the same time.

I've not played with calling C from Go. Not sure why I'd need to read from the DB. Just launch for a particular directory, walk it completely, writing a new DB as you go. I was only expecting to write to the database from a single thread. Each scanner would write pathname and filesize to a channel, and the single database writer would read from that channel.

l8gravely> Way way back when I did a hack on philesight to fire off a new thread for each directory as it encountered it, upto a maximum. This was before I realized that Ruby still had a single global thread lock. But it was a nice way to scan the directory tree with multiple threads, and keeping them busy pretty much all the way through.

The Go code launches the number of workers you specify as the second argument, but does not wait for subdirs to finish. I had thought maybe the database layer was hierarchical and handled the summing. My current code just produces a list of files and their sizes. It gets the right totals, but there's no easy way to display the total for a given subdirectory. I'll have to ponder the best way to get per-directory totals.

l8gravely> The trick will be writing the DB in a consistent manner so that it can be read by the existing code properly. If that issue gets solved, then version 2.0 of duc will have to come out! Or maybe this would be a separate project called 'pduc' or something like that. :-)

Sure, was thinking of just asking for a mention in the README along the lines of "If using a parallel filesystem go-dirscan writes a compatible database". Certainly if you want to include it in duc under CONTRIB or something, whatever you think best. I did play with recompiling duc with SQLite and trying to figure out the DB format by inspection. I don't really get it so far, so I'll have to look closer to see how to write a compatible database from dirscan. I was expecting something like a pathname and size for each record. Although now that I realize I need to collect the totals, I'll have to make some changes to dirscan so I can get totals for each directory, not just each file. Not sure if recursion is best or more channels.

l8gravely> Also, speaking of backend DBs, Ico and I have been talking on and off about reducing the number of supported backends, esp since tokyocabinet is abandonware, kyotocabinet also was getting a little less love, and sqlite just doesn't scale well at all.

I found SQLite quite useful to understand what's going on. You can dump the database, examine the schema, and it's crazy portable, robust, and well maintained. I'd advocate keeping that one just for reference purposes. Something to benchmark how much faster everything else is ;-). Based on the perf numbers in the INSTALL file I was going to target LMDB; it seems well maintained, fast, popular, and has bindings for many languages, including Go. It supports parallel readers, a single writer, and seems pretty slick, and its special append mode sounds particularly promising. Would you think LevelDB might be faster than LMDB? A random quote from Mozilla.github.io: "In a run of this benchmark, LMDB was approximately an order of magnitude faster than LevelDB to open a database and read entries, while being roughly equivalent to 3x faster to write entries (depending on the type of write)".

l8gravely> Just haven't had the time to do it, since I wanted to spin up some VMs with Vagrant to do my test builds on Debian Buster, CentOS, Fedora and Suse images. That led me down the Vagrant rathole of course. LOL!

Seen SPACK? Seems to be picking up steam, makes it easy to do builds for different environments, and even has support for things like building inside containers for multiple environments.

l8gravely commented 3 years ago

"Bill" == Bill Broadley notifications@github.com writes:

l8gravely> Thanks for sharing this. Now how hard is it to get your go l8gravely> code and Ico's C code to interoperate nicely? The locking l8gravely> of having a separate indexer process might get tricky if l8gravely> you're reading the DB at the same time.

Bill> I've not played with calling C from Go. Not sure why I'd need to Bill> read from the DB. Just launch for a particular directory, walk Bill> it completely, writing a new DB as you go. I was only expecting Bill> to write to the database from a single thread. Each scanner Bill> would write pathname and filesize to a channel, and the single Bill> database writer would read from that channel.

l8gravely> Way way back when I did a hack on philesight to fire off a new thread for each Bill> directory as it encountered it, upto a maximum. This was before I realized that Ruby still had a Bill> single global thread lock. But it was a nice way to scan the directory tree with multiple threads, Bill> and keeping them busy pretty much all the way through.

Bill> The go code launches the number of workers you specify as the Bill> second argument, but does not wait for subdirs to finish. I had Bill> thought maybe the database layer was hierarchical and handled Bill> the summing. My current code just produces a list of files and Bill> their sizes. It gets the right totals, but no easy way to Bill> display the total for a given subdirectory. I'll have to ponder Bill> the best way to get per directory totals.

No, the DB layer doesn't actually do anything but hold totals for each level, and the levels below it. It's a very simple key-based DB format. It should be trivial to write to it from different tools in a consistent manner.

l8gravely> The trick will be writing the DB in a consistent manner so l8gravely> that it can be read by the existing code properly. If that l8gravely> issue gets solved, then version 2.0 of duc will have to l8gravely> come out! Or maybe this would be a separate project called l8gravely> 'pduc' or something like that. :-)

Bill> Sure, was thinking of just asking for a mention in the README Bill> along the lines of "If using a parallel filesystem go-dirscan Bill> writes a compatible database". Certainly if you want to include Bill> it in duc under CONTRIB or something, whatever you think best. I Bill> did play with recompiling duc with SQLLITE and trying to figure Bill> out the DB format by inspection. Don't really get it so far, so Bill> I'll have to look closer to see how to write a compatible Bill> database from dirscan. I was expecting something like for each Bill> record. Although now that I realize that I need to collect the Bill> totals I'll have to make some changes to dirscan so I can totals Bill> for each directory, not just each file. Not sure recursion is Bill> best or more channels.

You need to use recursion, since each directory has to sum the total of all the directories underneath it. It's really simple in terms of scanning, since it's basically:

  1. start in a directory, read all the entries.
  2. if you find a directory, scan into it and wait until you get a total.
  3. if it's a file, just add it to the directory total
  4. when done, return the size of all objects in the directory.

So it's just a depth-first directory traversal, which is very easy to scan in parallel, since each directory should be independent. Note that sym-links aren't followed, so that mess is avoided. And hardlinks are recognized by the inode being the same, which is true of most (all?) POSIX filesystems.
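As a rough illustration of those four steps (plus the symlink/hardlink handling), here is a sequential Go sketch. The totals map keyed by full path is only meant to suggest how per-directory records could later be written out, not duc's actual format, and the (device, inode) check assumes a Unix system.

package main

import (
   "fmt"
   "os"
   "path/filepath"
   "syscall"
)

var (
   totals = map[string]int64{}   // full path -> total size of everything below it
   seen   = map[[2]uint64]bool{} // (device, inode) pairs already counted, for hardlinks
)

func scan(dir string) int64 {
   var total int64
   entries, _ := os.ReadDir(dir) // 1. read all the entries
   for _, e := range entries {
      path := filepath.Join(dir, e.Name())
      fi, err := os.Lstat(path) // Lstat: symlinks are not followed
      if err != nil || fi.Mode()&os.ModeSymlink != 0 {
         continue
      }
      if fi.IsDir() {
         total += scan(path) // 2. recurse and wait for the subtree's total
      } else {
         st := fi.Sys().(*syscall.Stat_t)
         key := [2]uint64{uint64(st.Dev), st.Ino}
         if !seen[key] { // a hardlinked file is only counted once
            seen[key] = true
            total += fi.Size() // 3. add plain files to this directory's total
         }
      }
   }
   totals[dir] = total // 4. return (and record) the size of everything in dir
   return total
}

func main() {
   fmt.Println(scan("."), "bytes total;", len(totals), "directories")
}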

l8gravely> Also, speaking of backend DBs, Ico and I have been talking l8gravely> on and off about reducing the number of supported backends, l8gravely> esp since tokyocabinet is abandonware, kyotocabinet also l8gravely> was getting a little less love, and sqlite just doesn't l8gravely> scale well at all.

Bill> I found SQLLITE quite useful to understand what's going on. You Bill> can dump the database, examine the schema, it's crazy portable, Bill> robust, and well maintained. I'd advocate keeping that one just Bill> for reference purposes.

Maybe, but if you look at the DB, there's no actual schema in use really (famous last words, I haven't looked deeply at that code recently). I'd suggest you use lmdb, leveldb or kyotocabinet for your tests, whichever works in Go for you. Then if that really works well, we would probably think about moving duc in general to just that single backend.

Bill> Something to benchmark how much faster everything else is Bill> ;-). Based on the perf numbers in the INSTALL file I was going Bill> to target LMDB, seems well maintained, fast, popular, and had Bill> bindings for many languages, including Go. It supports parallel Bill> readers, single writer, and seems pretty slick and it's special Bill> append mode sounds particularly promising. Would you think Bill> LevelDB might be faster than LMDB? A random quote from Bill> Mozilla.github.io "In a run of this benchmark, LMDB was Bill> approximately an order of magnitude faster than LevelDB to open Bill> a database and read entries, while being roughly equivalent to Bill> 3x faster to write entries (depending on the type of write)".

If you want to go with LMDB in your testing, please do. We're actually pretty flexible. Some of my recent work has been making duc more tolerant of DBs in non-compiled in formats, esp since we don't support multiple backends in one binary. So honestly, moving to one backend isn't the end of the world in my view. We would first make it the new default, but still support old versions.

l8gravely> Just haven't had the time to do it, since I wanted to spin l8gravely> up some VMs with Vagrant to do my test builds on Debian l8gravely> Buster, CentOS, Fedora and Suse images. That led me down l8gravely> the Vagrant rathole of course. LOL!

Bill> Seen SPACK? Seems to be picking up steam, makes it easy to do Bill> builds for different environments, and even has support for Bill> things like building inside containers for multiple Bill> environments.

Another rat-hole to dive down it seems! My main machine at home is an old AMD Phenom II X4 945 processor, it's probably ten years old now, but it works for what I need. But damn those new AMD ones look nice, esp with 32gb of RAM so I can run a bunch of VMs nicely.

l8gravely commented 3 years ago

"Ico" == Ico Doornekamp notifications@github.com writes:

Ico> Sorry to drop into the discussion this late.

No no, not a problem! I've been super busy lately too, which is why I'm replying out of order.

Ico> Quoting Bill Broadley (2020-11-04 11:32:40)

I wrote a scanner in Go, so far it just counts files, counts directories, and counts total bytes. It doesn't store anything in a database, but it is quite a bit faster:

Ico> It is significantly faster indeed. I assume you do your Ico> scanning parallel in multiple threads/goroutines, or is it just Ico> sequential like Duc is doing?

Ico> I'm 100% sure that for parallel file systems like Lustre, CEPH Ico> and whatever more, Duc is far from optional: it's scans the whole Ico> tree recursively from a single thread and thus makes no use of Ico> the inherent parallelism offerd by the underlying file system.

Right, recursively scanning with multiple threads isn't hard as long as you don't leave a directory until all threads have finished and returned their totals.

Ico> In the early days of Duc I made some attempts of proper Ico> parallelizing the scanning, but I felt the gains on "normal" file Ico> systems were not worth the additional complexity.

Yup, I did this with philesight in Ruby too, but it didn't scale for Ruby reasons.

But I think the basic idea is simple: say I have 5 threads; when I do a readdir() to find entries, just fork off a thread for each directory entry you encounter (as long as you're under the limit), and wait until it returns with a size.

Wait... shit. I'm stupid. It's inserting the data into the DB in parallel which makes this hard. Dammit... I'm too tired to think about this properly, but I know this is a solved problem in some space.

I guess running parallel scanners which feed into a single writer, and the scanners go in a depth first manner is the way to do it. You still need to go depth first, because you can't assume the directory tree is balanced, and so the master process will have to just insert into the DB in a consistent way. Hmmm...

But since it's just a key/value store, and the key contains the full path... it shouldn't be hard. I gotta spend more time looking at the core code. I generally just spend time around the edges with the docs, tests, packaging, etc.

John

spikebike commented 3 years ago

I've made some progress, not quite working, but I think avoiding recursion makes it a bit more straightforward. My approach is:

  1. start with a single directory, each directory discovered goes into the queue of dirs (dirQueue) to be scanned

  2. Up to a limit (I use around 20) processes listen to the queue and:

          while <dirQueue>
          {
               while <readdir>
               {
                    fs.stat()
                    if fs.IsDir() {
                         send Dir -> dirQueue   # so next idle worker can start scanning
                    }
               }
               send DirFileTotal DirSizeTotal -> SumQueue
          }
  3. A single SumDir will track all the directory totals, add them up, and write them to a database. The nice thing is the data is already greatly simplified, since most of the work (all the per-file stat() calls) happens in the scanners and only directory totals are reported. There's no disk I/O to slow things down, except for appending to the database, and LMDB seems to have a streaming sequential append mode that should avoid expensive seeks. I expect SumDir to easily handle the output from 20 scanners (that are mostly calling stat() on files). A Go sketch of steps 2 and 3 follows this list.
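Here is a rough Go sketch of that design (names like dirQueue, sumQueue, and dirTotal are illustrative): workers drain a channel of directories, stat() each entry, push any subdirectories back onto the queue, and send one total per directory to a single summing goroutine, which is also where a DB write would happen. A WaitGroup counts directories that have been queued but not yet finished, so we know when everything is done.

package main

import (
   "fmt"
   "os"
   "path/filepath"
   "sync"
)

type dirTotal struct {
   path  string
   files int64
   bytes int64
}

// worker scans whole directories, one at a time, queueing any subdirectories
// it finds and reporting one dirTotal per directory to the summer.
func worker(dirQueue chan string, sumQueue chan<- dirTotal, pending *sync.WaitGroup) {
   for dir := range dirQueue {
      t := dirTotal{path: dir}
      entries, _ := os.ReadDir(dir)
      for _, e := range entries {
         path := filepath.Join(dir, e.Name())
         fi, err := os.Lstat(path)
         if err != nil {
            continue
         }
         if fi.IsDir() {
            pending.Add(1)
            // queue from a separate goroutine so a full channel can't deadlock the workers
            go func() { dirQueue <- path }()
         } else {
            t.files++
            t.bytes += fi.Size()
         }
      }
      sumQueue <- t
      pending.Done() // this directory is completely finished
   }
}

func main() {
   dirQueue := make(chan string, 1024)
   sumQueue := make(chan dirTotal, 1024)
   var pending sync.WaitGroup

   root := "."
   if len(os.Args) > 1 {
      root = os.Args[1]
   }
   pending.Add(1)
   dirQueue <- root

   for i := 0; i < 20; i++ { // "up to a limit (I use around 20)"
      go worker(dirQueue, sumQueue, &pending)
   }
   go func() { pending.Wait(); close(dirQueue); close(sumQueue) }()

   var files, bytes int64
   for t := range sumQueue { // single consumer: this is where SumDir would write the DB
      files += t.files
      bytes += t.bytes
   }
   fmt.Println("files:", files, "bytes:", bytes)
}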

l8gravely commented 3 years ago

"Bill" == Bill Broadley notifications@github.com writes:

Bill> I've made some progress, not quite working, but I think avoiding Bill> recursion makes it a bit more straight forward, my approach is:

Bill> 1. start with a single directory, each directory discovered Bill> goes into the queue of dirs (dirQueue) to be scanned

I don't think this will work; you need to descend into a directory, all the way down to the end of the chain, before you can return size data back up the chain.

You're not going to return any data until you reach the bottom of the chain and start ascending back up.

Bill> 2. Up to a limit (I use around 20) processes listen to the queue and:

Bill> while Bill> { Bill> while Bill> { Bill> fs.stat() Bill> if fs.IsDir() { Bill> send Dir -> dirQueue # so next idle worker can start scanning Bill> } Bill> } Bill> Send DirFileTotal DirSizeTotal -> SumQueue Bill> }

Bill> 3. A single SumDir will track all the directory totals, add Bill> them up, and write them to a database. The nice thing is the Bill> data is already great simplified since most of the work (all Bill> the per file stat() calls), and only the directories are Bill> reported. No disk I/O to slow things down, except for Bill> appending to the database and LMDB seems to have a streaming Bill> sequential append mode that should avoid expensive seeks. I Bill> expect SumDir to easily handle the output from 20 scanners Bill> (that are mostly calling stat() on files).

Right, so assume you have two readers and one writer, and the following tree. How does your code work on it? I suspect I need to pull down your code and learn Go if I'm really going to help here. grin

Anyway, a directory tree like this, where each directory has a single 1 MB file plus one or more subdirectories.

/foo/1/a/ /foo/1/b /foo/1/c /foo/1/d /foo/1/e/bar/gar/g1 /g2 /g3 /foo/2/a/dog/ cat/ mouse/ deer/ /foo/2/b/lego/ /blocks/ /plastic/ /him/4/beer /pizza /bacon /burger/pickles /fries /condiments/ketchup/heinz/57/

So your algorithm would start at /, then find foo/ and then foo/1/, and both threads would then block, right? Because you need to go down to /foo/1/a, then /foo/1/b, then eventually to /foo/1/e/...., but you can't figure out the /foo/1/e/ total until you sum up the g1, g2, g3 directories first.

This is why it needs to be a recursive algorithm, because you sum up from the bottom of the tree, not from the top down.

This link looks to be possibly what is needed here:

https://blog.golang.org/pipelines

Especially the last example which is bounded. But it's not quite the right example either, because it doesn't explicitly walk the tree from the bottom back up, returning the size of the nodes (directories).

John

spikebike commented 3 years ago

l8gravely> I don't think this will work, you need to descend into a directory and all the way down to the end of the chain before you return size data back up the chain.

Yes, for that approach. I wanted to A) maximize performance, which means maximizing locality and avoiding pausing one dir to do another dir, then coming back, then going to a 3rd dir ...., and B) handle large numbers of directories (at least 75M). So to avoid those problems I:

   1) Walk each dir from beginning to end and get a total of files/dirs for that directory
   2) send the totals to a pipeline
   3) build a tree for the totals in memory
   4) walk the tree for the totals

l8gravely> You're not going to return any data until you reach the bottom of the chain and start ascending back up.

Much simpler: just open a dir, call readdir till the end, then report the totals. Let the other threads read from the pending queue of directories. No recursion, no waiting, no maximum stack size, MUCH less random IO, etc.

l8gravely> Right, so assume you have a two readers, and one writer, and the following tree. How does your code work on it? I need to pull down your code and learn Go I suspect if I'm really going to help here. grin

Say you have 3 directories, something like

/a/b <dir>
/a/b/1
/a/b/c <dir>
/a/b/2
/a/b/3
/a/b/c/4
/a/b/c/5

So thread 1 opens dir "a/b", stat() file 1, reads subdir c, appends "a/b/c" to the directory queue, stat() file 2, stat() file 3, closes the directory, then reports the totals. When thread 1 is done, it waits for a new directory from the directory queue.

Thread 2 reads from the directory queue and reads "a/b/c" then stat() file 4 and file 5. Any found dirs get queued. When done it reports the results and goes back to listening to the directory queue.

l8gravely> Example directory tree

The whitespace was hard to tell, but my best guess was:

/foo/1/a
/foo/1/b
/foo/1/c
/foo/1/d
/foo/1/e/bar/gar/g1
/foo/1/e/bar/gar/g2
/foo/1/e/bar/gar/g3
/foo/2/a/dog
/foo/2/a/cat
/foo/2/a/mouse
/foo/2/a/deer
/foo/2/b/lego
/foo/2/b/lego/blocks
/foo/2/b/lego/plastic
/him/4/beer
/him/4/bacon
/him/4/burgers
/him/4/burgers/pickles
/him/4/burgers/fries
/him/4/burgers/condiments/ketchup/heinz/57

I'm happy to say it looks like it works, here's my debugging output:

Walking
size= 20 depth=1 fullpath=/
size= 14 depth=2 fullpath=/foo
size=  7 depth=3 fullpath=/foo/1
size=  1 depth=4 fullpath=/foo/1/a
size=  1 depth=4 fullpath=/foo/1/b
size=  1 depth=4 fullpath=/foo/1/c
size=  1 depth=4 fullpath=/foo/1/d
size=  3 depth=4 fullpath=/foo/1/e
size=  3 depth=5 fullpath=/foo/1/e/bar
size=  3 depth=6 fullpath=/foo/1/e/bar/gar
size=  1 depth=7 fullpath=/foo/1/e/bar/gar/g1
size=  1 depth=7 fullpath=/foo/1/e/bar/gar/g2
size=  1 depth=7 fullpath=/foo/1/e/bar/gar/g3
size=  7 depth=3 fullpath=/foo/2
size=  4 depth=4 fullpath=/foo/2/a
size=  1 depth=5 fullpath=/foo/2/a/dog
size=  1 depth=5 fullpath=/foo/2/a/cat
size=  1 depth=5 fullpath=/foo/2/a/mouse
size=  1 depth=5 fullpath=/foo/2/a/deer
size=  3 depth=4 fullpath=/foo/2/b
size=  3 depth=5 fullpath=/foo/2/b/lego
size=  1 depth=6 fullpath=/foo/2/b/lego/blocks
size=  1 depth=6 fullpath=/foo/2/b/lego/plastic
size=  6 depth=2 fullpath=/him
size=  6 depth=3 fullpath=/him/4
size=  4 depth=4 fullpath=/him/4/burgers
size=  1 depth=5 fullpath=/him/4/burgers/pickles
size=  1 depth=5 fullpath=/him/4/burgers/fries
size=  1 depth=5 fullpath=/him/4/burgers/condiments
size=  1 depth=6 fullpath=/him/4/burgers/condiments/ketchup
size=  1 depth=7 fullpath=/him/4/burgers/condiments/ketchup/heinz
size=  1 depth=8 fullpath=/him/4/burgers/condiments/ketchup/heinz/57
size=  1 depth=4 fullpath=/him/4/beer
size=  1 depth=4 fullpath=/him/4/bacon

It gets total=20, which I believe is the right answer. I plan to make a fake load of 50M directories and see how it does on memory, speed, and accuracy. My best guess on memory overhead is something like the average length of a directory name (likely less than 16 characters) + a pointer (8 bytes) + 2 counters (files and size, 16 bytes), so roughly 40 bytes per directory. So even 75M dirs would be under 3GB; I won't know for sure until I test.

l8gravely> So you're algorithm would start a /, then find the foo/ and then foo 1/ and both threads would then block, right? Because you need to go down to /foo/1/a, then /foo/1/b, then eventually to /foo/1/e/...., but you can't figure out the /foo/1/e/ until you sum up the g1,g2,g3 directories first.

There's no such thing as an algorithm that requires recursion. Once it reaches steady state, my approach would just run 20 threads pulling work off the queue; each one would completely finish the directory it's working on before looping back and waiting on another directory.

Then once all directories are scanned, and all the totals are in an in-memory tree, it would walk the in-memory tree to get accurate totals per directory. I suspect that's nearly no work; I expect even 50M directories to finish in seconds if not less.

spikebike commented 3 years ago

Just did a load test, here's my generator (not talking to a Go channel ... yet):

package main

import "fmt"

func main() {
   for i := 0; i < 40; i++ {
      for j := 0; j < 350000; j++ {
         fmt.Printf("d/up%02d/bldg%07d\n", i, j)
         fmt.Printf("d/up%02d/bldg%07d/run\n", i, j)
      }
   }
}

It generates 28M directories, modeled on one of the trees causing serious issues for duc. The test case takes about 32 seconds to A) read 28M lines of input, B) parse them for directory names, C) build an in-memory tree of 28M nodes, D) walk the tree to get the per-directory totals, and E) output said totals ... including the right answer for the grand total. Considering duc has been running since Oct 15th, I'll consider all of the above about 0 in the grand scheme of things.

The above models what I need to do when each dir scanner finishes a directory and sends the totals to the sumWorker, which is the one that builds the tree, walks the tree for the totals, and writes them to a DB.

My test does hit 9GB of RAM, which is a bit more than I'd like, but each node has a parent pointer that I don't think I need, so dropping it would save some memory. Most of the memory is a map of strings (that component of the path name) to other nodes of the tree.
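For what it's worth, a small Go sketch of the kind of tree described here (the names are illustrative): each node maps one path component to a child node and carries file/size counters; add() records the per-directory totals reported by the scanners, and rollup() does the final walk that sums each directory plus everything below it.

package main

import (
   "fmt"
   "strings"
)

type node struct {
   children map[string]*node
   files    int64
   size     int64
}

func newNode() *node { return &node{children: map[string]*node{}} }

// add records the totals a scanner reported for one directory path.
func (n *node) add(path string, files, size int64) {
   cur := n
   for _, part := range strings.Split(strings.Trim(path, "/"), "/") {
      if part == "" {
         continue
      }
      child, ok := cur.children[part]
      if !ok {
         child = newNode()
         cur.children[part] = child
      }
      cur = child
   }
   cur.files += files
   cur.size += size
}

// rollup turns each node's own counters into cumulative totals (itself plus
// everything below it) and returns the totals for this node.
func (n *node) rollup() (files, size int64) {
   files, size = n.files, n.size
   for _, child := range n.children {
      f, s := child.rollup()
      files += f
      size += s
   }
   n.files, n.size = files, size
   return
}

func main() {
   root := newNode()
   root.add("/foo/1/a", 1, 100) // per-directory totals, as reported by the scanners
   root.add("/foo/1", 2, 250)
   root.add("/foo/2", 1, 50)
   f, s := root.rollup()
   fmt.Println("total files:", f, "total size:", s)
}

Dropping the parent pointer mentioned above fits this shape: the map already holds the downward links, and the roll-up never needs to walk back up.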

l8gravely commented 4 days ago

Yikes! I'm sorry I dropped the ball on this for so long. I've just released v1.5.0-rc1, which is mostly for implementing a new backend DB and deprecating the others. tkrzw has been shown to handle larger file systems better than tokyocabinet.

As for the parallel scanning... I think right now the best bet is to manually split up your monster Lustre filesystem into smaller chunks. I'm no closer to learning Go, and no closer to taking the core indexing code and making it even slightly parallel. It might be possible, but I don't have any parallel filesystem to test against.