zevv / duc

Dude, where are my bytes: Duc, a library and suite of tools for inspecting disk usage
GNU Lesser General Public License v3.0

Indexing algorithm #161

Open jbd opened 7 years ago

jbd commented 7 years ago

Hello,

This is more of a suggestion than an issue.

The duc indexing is already quite fast, but you might be interested in the filesystem crawling algorithm of robinhood (https://github.com/cea-hpc/robinhood), also written in C, if you don't already know the project:

To go beyond the performance of classical scanning tools, robinhood implements a multi-threaded version of depth-first traversal[4]. To parallelize the scan, the namespace traversal is split into individual tasks that consist in reading single directories. A pool of worker threads performs these tasks following a depth-first strategy (as illustrated on figure 3).

from Taking back control of HPC file systems with Robinhood Policy Engine paper. See here for more.

If the storage could handle parallel requests nicely, that's a huge win. Right now I don't have any metrics to compare the crawling performance between robinhood and duc, but I'm using robinhood on a petabyte filer through nfs3 with nearly 1 billion files on it, and the crawling algorithm performance scales quite linearly up to 8 threads. After that, I can't tell if the filer or the mysql database that holds the results is the bottleneck at this moment.
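To make the idea concrete, here is a minimal sketch of such a worker-pool traversal, assuming pthreads. It is purely illustrative (not robinhood's or duc's actual code, and error handling is mostly omitted): each task is "read one directory", and any subdirectories a worker finds are pushed back onto a shared LIFO stack, which gives the traversal its depth-first flavour.

/* worker-pool directory traversal sketch (pthreads, illustrative only) */
#include <dirent.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

struct task { char *path; struct task *next; };

static struct task *stack;      /* shared LIFO of directories still to read */
static int pending;             /* directories queued or currently being read */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

static void push(const char *path)
{
    struct task *t = malloc(sizeof *t);
    t->path = strdup(path);
    pthread_mutex_lock(&lock);
    t->next = stack;
    stack = t;
    pending++;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (stack == NULL && pending > 0)
            pthread_cond_wait(&cond, &lock);
        if (stack == NULL && pending == 0) {          /* nothing queued, nothing in flight: done */
            pthread_mutex_unlock(&lock);
            return NULL;
        }
        struct task *t = stack;
        stack = t->next;
        pthread_mutex_unlock(&lock);

        DIR *d = opendir(t->path);
        if (d) {
            struct dirent *e;
            while ((e = readdir(d))) {
                if (!strcmp(e->d_name, ".") || !strcmp(e->d_name, ".."))
                    continue;
                char full[4096];
                snprintf(full, sizeof full, "%s/%s", t->path, e->d_name);
                struct stat st;
                if (lstat(full, &st) == 0 && S_ISDIR(st.st_mode))
                    push(full);                       /* subdirectory: new task for the pool */
                /* a real indexer would record st.st_size etc. here */
            }
            closedir(d);
        }

        pthread_mutex_lock(&lock);
        pending--;                                    /* this directory is finished */
        if (pending == 0)
            pthread_cond_broadcast(&cond);            /* wake everyone so they can exit */
        pthread_mutex_unlock(&lock);
        free(t->path);
        free(t);
    }
}

int main(int argc, char **argv)
{
    pthread_t th[8];
    push(argc > 1 ? argv[1] : ".");
    for (int i = 0; i < 8; i++) pthread_create(&th[i], NULL, worker, NULL);
    for (int i = 0; i < 8; i++) pthread_join(th[i], NULL);
    return 0;
}

A real indexer would of course record the stat() results somewhere instead of throwing them away, which is where coordinating the database writes comes in.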

zevv commented 7 years ago

Quoting jbd (2016-12-27 22:38:12)

The duc indexing is already quite fast but you might be interested in the filesystem crawling algorithm of robinhood ([1]https://github.com/cea-hpc/robinhood) also written in C, if you don't already know the project:

Thanks for that. I didn't know about Robinhood, but it seems that they have taken the idea a few steps further already; it's like Duc on mass steroids with a hundred additional features. These people are way ahead of me.

Robinhood seems to have a different audience than Duc though; it is not a simple tool you can set up in a few minutes, but requires some experience and knowledge to configure and run. I guess they pick up where Duc hits its limits :)

To go beyond the performance of classical scanning tools, robinhood implements a multi-threaded version of depth-first traversal[4]. To parallelize the scan, the namespace traversal is split into individual tasks that consist in reading single directories. A pool of worker threads performs these tasks following a depth-first strategy (as illustrated on figure 3).

Parallelization has been on my wish list for a long time (second only to incremental indexing), and I'm sure a lot of users would benefit greatly from it. On a single spindle and a modern Linux kernel, multiple threads will not add much performance, but as soon as multiple physical disks are involved, doing things in parallel will speed things up a lot, especially on Lustre-scale file systems.

I made a proof of concept of a parallel indexing algorithm for Duc a few years ago. The implementation was close to your description above, but I was not too happy with the increased complexity of the resulting code, so it never made it into Duc in the end. I will definitely take a look at the internals of robinhood to see if I can steal any good ideas!

If the storage could handle parallel requests nicely, that's a huge win.

I guess that even with no concurrency on the database the speedup will still be significant, since the overhead of reading metadata from the file system is so much higher than that of writing the results to the database.

Right now I don't have any metrics to compare the crawling performance between robinhood and duc, but I'm using robinhood on a petabyte filer through nfs3 with nearly 1 billion files on it, and the crawling algorithm performance scales quite linearly up to 8 threads. After that, I can't tell if the filer or the mysql database that holds the results is the bottleneck at this moment.

I would be very interested in some numbers comparing Duc to Robinhood on a file system this size, though; if you ever get any, please let me know!

-- :wq ^X^Cy^K^X^C^C^C^C

l8gravely commented 7 years ago

"Ico" == Ico Doornekamp notifications@github.com writes:

Ico> Quoting jbd (2016-12-27 22:38:12)

The duc indexing is already quite fast but you might be interested in the filesystem crawling algorithm of robinhood ([1]https://github.com/cea-hpc/robinhood) also written in C, if you don't already know the project:

Ico> Thanks for that. I didn't know about Robinhood, but it seems that they have taken the idea a few steps further already; it's like Duc on mass steroids with a hundred additional features. These people are way ahead of me.

I've looked at this too, and I think they have a crucial flaw, which is using MySQL as the backend instead of a dedicated, optimized DB like Duc uses. Or even what Philesight used; duc was a big performance and size improvement over that.

As for the parallel code, it will help up until the backend (NFS on NetApp in my case, mostly) hits its limits, I fear.

Ico> Robinhood seems to have a different audience than Duc though; it is not a simple tool you can set up in a few minutes, but requires some experience and knowledge to configure and run. I guess they pick up where Duc hits its limits :)

And they gloss over the performance issues of mysql in a big way!

To go beyond the performance of classical scanning tools, robinhood implements a multi-threaded version of depth-first traversal[4]. To parallelize the scan, the namespace traversal is split into individual tasks that consist in reading single directories. A pool of worker threads performs these tasks following a depth-first strategy (as illustrated on figure 3).

Ico> Parallelization has been on my wish list for a long time (second only to incremental indexing), and I'm sure a lot of users would benefit greatly from it. On a single spindle and a modern Linux kernel, multiple threads will not add much performance, but as soon as multiple physical disks are involved, doing things in parallel will speed things up a lot, especially on Lustre-scale file systems.

Ico> I made a proof of concept of a parallel indexing algorithm for Duc a few years ago. The implementation was close to your description above, but I was not too happy with the increased complexity of the resulting code, so it never made it into Duc in the end. I will definitely take a look at the internals of robinhood to see if I can steal any good ideas!

I did the same with Philesight in Ruby, but ran into the global interpreter lock issue of Ruby threading. I don't know if they've fixed that in newer versions or not.

If the storage could handle parallel requests nicely, that's a huge win.

Ico> I guess that even with no concurrency on the database the speedup will still be significant, since the overhead of reading metadata from the file system is so much higher than that of writing the results to the database.

Except if you do a crappy MySQL setup... In my view, the use case is so specialized that doing a dedicated DB tool makes the most sense. Using a general SQL DB for storage just doesn't scale. Look at the crap that Bacula goes through.

Right now I don't have any metrics to compare the crawling performance between robinhood and duc, but I'm using robinhood on a petabyte filer through nfs3 with nearly 1 billion files on it, and the crawling algorithm performance scales quite linearly up to 8 threads. After that, I can't tell if the filer or the mysql database that holds the results is the bottleneck at this moment.

Ico> I would be very interested in some numbers comparing Duc to Robinhood on a file system this size, though; if you ever get any, please let me know!

I've just compiled robinhood 3.0, but I need to find and setup a mysql host first. Hopefully I'll get you some numbers in the next day or so.

John

jbd commented 7 years ago

I've looked at this too, and I think they have a crucial flaw, which is using MySQL as the backend

I tend to think like you, but I'm not a database expert. And if you want to run stuff like that on a huge file system, you'll have to build a solid database backend. The people behind robinhood seem to be quite involved and active on this matter (you can have a look at their mailing list), and the program is used in production on multi-petabyte storage systems. But again, I think that duc hits a sweet spot by using a KV backend. And my suggestion was only concerning the crawling algorithm =)

As for the parallel code, it will help up until the backend (NFS on NetApp in my case, mostly) hits its limits, I fear.

Yes, no magic here. But the speedup obtained while hammering/breaking the storage system could be noticeable.

zevv commented 7 years ago

Quoting John (2016-12-28 17:35:04)

I've just compiled robinhood 3.0, but I need to find and setup a mysql host first. Hopefully I'll get you some numbers in the next day or so.

Great, thanks for that!

-- :wq ^X^Cy^K^X^C^C^C^C

l8gravely commented 7 years ago

"jbd" == jbd notifications@github.com writes:

jbd> I've looked at this too, and I think they have a crucial flaw, which is using MySQL as the backend

jbd> I tend to think like you, but I'm not a database expert. And if you want to run stuff like that on a huge file system, you'll have to build a solid database backend. The people behind robinhood seem to be quite involved and active on this matter (you can have a look at their mailing list), and the program is used in production on multi-petabyte storage systems. But again, I think that duc hits a sweet spot by using a KV backend. And my suggestion was only concerning the crawling algorithm =)

That's true, the crawling algorithm was the major thrust, but in a lot of ways it's not even that hard to do, really. The hard part is doing the inserts into the DB in a consistent manner. And if you want the DB to be consistent at all times, that makes the problem even harder! Though it might explain why RobinHood uses MySQL, because it punts that issue to the DB developers.

jbd> As for the parallel code, it will help up until the backend (NFS on NetApp in my case, mostly) hits its limits, I fear.

jbd> Yes, no magic here. But the speedup obtained while hammering/breaking the storage system could be noticeable.

Yes, spreading the load makes a noticeable difference. I've got multiple multi-terabyte filesystems with up to 60 million files in some of them, so we just index them all separately on a weekly schedule. It takes approximately six to eight hours on some of them to complete.

Looking at Robinhood, they take advantage of Lustre's metadata tracking, which I think is really cool. And I'd love to see if ext4 could be hacked to include that into the filesystem itself as well. Instead of having outside tools do the work, put it into the filesystem. Which is a much much much harder problem in a lot of ways, but as filesystems expand... will only help.

But back to the core issue... even getting just two crawlers in parallel would probably be a big win, except on single spindle drives. And of course coordinating the inserts into the DB.
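For what it's worth, one common way to coordinate those inserts is to keep a single writer: crawler threads push their results onto a mutex-protected queue and one dedicated thread drains it, so the database itself never sees concurrent writes. A rough sketch, with hypothetical names (struct result, enqueue, finish), not duc's API:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct result { char path[4096]; long long size; struct result *next; };

static struct result *head;             /* pending results (LIFO; order is irrelevant for a KV store) */
static int done;                        /* set once all crawler threads have finished */
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t qcond = PTHREAD_COND_INITIALIZER;

void enqueue(struct result *r)          /* called by the crawler threads */
{
    pthread_mutex_lock(&qlock);
    r->next = head;
    head = r;
    pthread_cond_signal(&qcond);
    pthread_mutex_unlock(&qlock);
}

void finish(void)                       /* called once, when all crawlers are done */
{
    pthread_mutex_lock(&qlock);
    done = 1;
    pthread_cond_broadcast(&qcond);
    pthread_mutex_unlock(&qlock);
}

void *writer(void *arg)                 /* the only thread that ever touches the DB */
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&qlock);
        while (head == NULL && !done)
            pthread_cond_wait(&qcond, &qlock);
        struct result *batch = head;    /* grab everything queued so far */
        head = NULL;
        int finished = done;
        pthread_mutex_unlock(&qlock);

        while (batch) {                 /* write outside the lock */
            struct result *next = batch->next;
            printf("db_put %s %lld\n", batch->path, batch->size);  /* stand-in for a real DB write */
            free(batch);
            batch = next;
        }
        if (finished)
            return NULL;
    }
}

int main(void)
{
    pthread_t w;
    pthread_create(&w, NULL, writer, NULL);

    struct result *r = calloc(1, sizeof *r);   /* pretend a crawler produced one result */
    snprintf(r->path, sizeof r->path, "/example/dir");
    r->size = 4096;
    enqueue(r);

    finish();
    pthread_join(w, NULL);
    return 0;
}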

John

l8gravely commented 7 years ago

"Ico" == Ico Doornekamp notifications@github.com writes:

Ico> Quoting John (2016-12-28 17:35:04)

I've just compiled robinhood 3.0, but I need to find and setup a mysql host first. Hopefully I'll get you some numbers in the next day or so.

Ico> Great, thanks for that!

Some preliminary numbers are that duc wins in terms of speed and DB size compared to Robinhood. It's not even finished indexing and the DB is up to 7.1 GB for a 6.1 TB filesystem with 18 million files. I did zero tuning on the MySQL side of things, which is obviously impacting the performance of the index. I think I need to just move it all to /tmp (swap-backed ramfs) and up the numbers and see how it goes then.

Duc's DB size for the same filesystem (though probably more heavily used at the time) is 240 MB. This is using the tokyocabinet backend, not the lmdb stuff, which I still need to look at.

So my initial opinion is that duc and a dedicated DB design is a real win for this situation. And I say this because I've also used different backup tools, both with custom DBs (NetWorker) and SQL-based (Commvault), and the custom DB wins hands down there in terms of size, speed and capacity.

Quick note, v1.4.2 needs to get updated on the DB options and where to find them.

John

zevv commented 7 years ago

Hi John,

Thanks for testing this!

Quoting John (2016-12-29 18:36:46)

"Ico" == Ico Doornekamp notifications@github.com writes:

Ico> Quoting John (2016-12-28 17:35:04)

I've just compiled robinhood 3.0, but I need to find and setup a mysql host first. Hopefully I'll get you some numbers in the next day or so.

Ico> Great, thanks for that!

Some preliminary numbers are that duc wins in terms of speed and DB size compared to Robinhood.

I'm pretty sure Duc wins on the DB size, since that's one of the things I've optimized, using varints and tokyocabinet with compression. I am surprised about the indexing speed, though. Duc does the simplest possible thing here with no smart optimization at all; this makes me wonder if there is not much to gain in using threading (in your particular setup), or if there is much more overhead in Robinhood which might cause it to slow down.

(I haven't looked at all the details of Robinhood, but maybe they store a lot more data than Duc?)
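(For reference, the varint trick boils down to something like the LEB128-style encoder below: small values, which dominate in practice, take a single byte. This is just a sketch of the idea, not necessarily the exact encoding duc uses.)

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Encode an unsigned 64-bit value using 7 payload bits per byte; the high
   bit marks "more bytes follow". Returns the number of bytes written (1..10). */
size_t varint_encode(uint64_t v, uint8_t *buf)
{
    size_t n = 0;
    while (v >= 0x80) {
        buf[n++] = (uint8_t)((v & 0x7F) | 0x80);
        v >>= 7;
    }
    buf[n++] = (uint8_t)v;
    return n;
}

int main(void)
{
    uint8_t buf[10];
    printf("7   -> %zu byte(s)\n", varint_encode(7, buf));    /* 1 */
    printf("300 -> %zu byte(s)\n", varint_encode(300, buf));  /* 2 */
    return 0;
}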

Quick note, v1.4.2 needs to get updated on the DB options and where to find them.

Yeah, I'm afraid the docs are always a bit behind. I'll add lmdb to the INSTALL file, but I guess it would pay to describe this all a bit more.

On the other hand - maybe we should choose a single backend and drop support for the plethora of databases. I'll create a separate ticket to discuss this.

-- :wq ^X^Cy^K^X^C^C^C^C

l8gravely commented 7 years ago

"Ico" == Ico Doornekamp notifications@github.com writes:

Ico> Hi John,
Ico> Thanks for testing this!

It's going slowly for sure with robinhood. But from the DB size, I suspect it won't work in my environment.

John> Some preliminary numbers are that duc wins in terms of speed and DB size compared to Robinhood.

Ico> I'm pretty sure Duc wins on the DB size, since that's one of the things I've optimized, using varints and tokyocabinet with compression. I am surprised about the indexing speed, though. Duc does the simplest possible thing here with no smart optimization at all; this makes me wonder if there is not much to gain in using threading (in your particular setup), or if there is much more overhead in Robinhood which might cause it to slow down.

I don't know yet, honestly. I'm running the DB on a local disk, so it might just be that I'm hitting the disk IOPS limits. Which again tells me to move it to swap-backed storage, and to increase the memory buffers and such. Should have done it last night. Will probably kill it off and do it now anyway.

Ico> (I haven't looked at all the details of Robinhood, but maybe they store a lot more data than Duc?)

Dunno... but it's certainly not optimal in terms of size at all. The DB looks like this:

mysql> show tables;
+------------------------------+
| Tables_in_robinhood_sjscratc |
+------------------------------+
| ACCT_STAT                    |
| ANNEX_INFO                   |
| ENTRIES                      |
| NAMES                        |
| SOFT_RM                      |
| VARS                         |
+------------------------------+
6 rows in set (0.00 sec)

And just doing a "SELECT COUNT(*) FROM NAMES;" is taking forever... so it's almost certainly my lack of DB tuning that is hurting robinhood.

Quick note, v1.4.2 needs to get updated on the DB options and where to find them.

Ico> Yeah, I'm afraid the docs are always a bit behind. I'll add lmdb to the INSTALL file, but I guess it would pay to describe this all a bit more.

Ico> On the other hand - maybe we should choose a single backend and drop support for the plethora of databases. I'll create a separate ticket to discuss this.

Absolutely! We should move from tokyocabinet to lmdb, and I'd also suggest that we pull lmdb into the duc source tree as well, to make it simpler to deploy. It's only 32k in size, so it's trivial. Just make it part of the process to see about pulling upstream changes from time to time if they don't cause problems.

John

jbd commented 7 years ago

zevv> I haven't looked at all the details of Robinhood, but maybe they store a lot more data than Duc

You can have a rough idea of what is stored in robinhood database by quickly looking at the rbh-find command options.

l8gravely> We should move from tokyocabinet to lmdb, and I'd also suggest that we pull lmdb into the duc source tree as well.

FWIW, I fully agree =)

zevv commented 7 years ago

Quoting jbd (2016-12-29 22:09:06)

l8gravely> We should move from tokyocabinet to lmdb, and I'd also suggest that we pull lmdb into the duc source tree as well.

I fully agree.

see https://github.com/zevv/duc/issues/164

-- :wq ^X^Cy^K^X^C^C^C^C

jbd commented 7 years ago

I lost trace of an interesting paper, but I found it again: On Distributed File Tree Walk of Parallel File Systems.

jbd commented 7 years ago

Out of curiosity, I've tested what looks like an implementation of the previous paper: dwalk.c from the mpifileutils project, which uses libdftw underneath. It uses MPI to handle parallelism, but I'm only using the local machine. The goal is only to see the real benefit of a distributed file tree walk.

See also http://lustre.ornl.gov/ecosystem-2015/documents/LustreEco2015-Tutorial5.pdf

To test on a CentOS 7 system:

$ sudo yum install openmpi openmpi-devel libattr-devel
$ source /etc/profile.d/modules.sh
$ module load mpi/openmpi-x86_64
$ module list
Currently Loaded Modulefiles:
  1) mpi/openmpi-x86_64
$ git clone https://github.com/hpc/mpifileutils/
$ cd mpifileutils
$ ./buildme_dependencies # requires internet connection
$ ./buildme 

The yum openmpi library seems to have some problems, but it is working:

./install/bin/dwalk 
localhost.31857hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out

Usage: dwalk [options] <path> ...

Options:
  -i, --input <file>  - read list from file
  -o, --output <file> - write processed list to file
  -l, --lite          - walk file system without stat
  -s, --sort <fields> - sort output by comma-delimited fields
  -p, --print         - print files to screen
  -v, --verbose       - verbose output
  -h, --help          - print usage

Fields: name,user,group,uid,gid,atime,mtime,ctime,size

Let's test (as root, to access my no_root_squash NFS share, on an 8-core/16 GB virtual machine). I've run each test 5 times and took the best result, because the filer could be in heavy use at some point in time. dwalk performs a stat for each entry.

The folder I'm indexing:

Items: 764324
  Directories: 51247
  Files: 713077
  Links: 0
Data: 0.992 TB (1.459 MB per file)
# module load mpi/openmpi-x86_64

One mpi process:

# echo 3 | tee /proc/sys/vm/drop_caches
# mpirun -np 1 --allow-run-as-root ./install/bin/dwalk -v /data
[snip]
Walked 764324 files in 71.579805 seconds (10677.927944 files/sec)

Two:

# echo 3 | tee /proc/sys/vm/drop_caches
# mpirun -np 2 --allow-run-as-root ./install/bin/dwalk -v /data
[snip]
Walked 764324 files in 50.769305 seconds (15054.844655 files/sec)

Four:

# echo 3 | tee /proc/sys/vm/drop_caches
# mpirun -np 4 --allow-run-as-root ./install/bin/dwalk -v /data
[snip]
Walked 764324 files in 32.848336 seconds (23268.271489 files/sec)

Eight:

# echo 3 | tee /proc/sys/vm/drop_caches
# mpirun -np 8 --allow-run-as-root ./install/bin/dwalk -v /data
[snip]
Walked 764324 files in 21.685690 seconds (35245.546718 files/sec)

And for fun, 16 =)

# echo 3 | tee /proc/sys/vm/drop_caches
# mpirun -np 16 --allow-run-as-root ./install/bin/dwalk -v /data
[snip]
Walked 764324 files in 18.811476 seconds (40630.729880 files/sec)

I don't know where the bottleneck is (NFS, local machine or remote storage), but the gain is substantial. It would be interesting to have a dry-run option for the duc index command that skips the db_put call in index.c, to compare the results.

zevv commented 7 years ago

Quoting jbd (2016-12-30 11:45:22)

I lost trace of an interesting paper, but I found it again: On Distributed File Tree Walk of Parallel File Systems.

Thanks for the link; I think I've seen this when looking for an appropriate algorithm, but if I recall correctly I did not figure out how to make this fit Duc's purpose. The problem solved in this paper is to visit each directory exactly once and do some work there, but for Duc the file system needs to be walked recursively and depth first, because the size of each directory is the sum of the sizes of all its files and subdirectories. Order matters here.

While typing the above, it occurred to me that it might be possible to change the process though: Duc could first do a complete walk of the file system and only record the total size of the files in each directory - this could be a completely parallel process using one of the algorithms from the paper. After the indexing run, the directory sizes could be recursively added up from the database without having to touch the file system.

This sounds like a decent solution. I'll ponder this for some time and do some prototyping to see if I can grok and implement the parallelization and rerecursivization.
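A sketch of what that second, recursive pass could look like, assuming the parallel walk has already recorded a per-directory local size; the types and field names here are hypothetical, not duc's real data model:

#include <stddef.h>
#include <stdio.h>

struct dir_node {
    long long local_size;        /* bytes of plain files directly in this directory (from the parallel walk) */
    long long total_size;        /* filled in by aggregate() below */
    struct dir_node **children;  /* subdirectories */
    size_t n_children;
};

/* Recursively add up subtree sizes; no file system access is needed here,
   everything comes from data gathered during the scan. */
static long long aggregate(struct dir_node *d)
{
    long long total = d->local_size;
    for (size_t i = 0; i < d->n_children; i++)
        total += aggregate(d->children[i]);
    d->total_size = total;
    return total;
}

int main(void)
{
    struct dir_node sub  = { .local_size = 100 };
    struct dir_node *kids[] = { &sub };
    struct dir_node root = { .local_size = 40, .children = kids, .n_children = 1 };
    printf("%lld\n", aggregate(&root));   /* prints 140 */
    return 0;
}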

-- :wq ^X^Cy^K^X^C^C^C^C

jbd commented 7 years ago

Order matters here.

Indeed, that was something I didn't have in mind.

jbd commented 7 years ago

It would be interesting to have a dry-run option for the duc index command that skips the db_put call in index.c, to compare the results.

I don't know if you're interested, but I've tried to implement a --dry-run option to the index (https://github.com/zevv/duc/pull/166). Given your last remark, maybe it is not as useful as I thought... =)

zevv commented 7 years ago

Quoting jbd (2016-12-30 16:37:05)

Out of curiosity, I've tested what looks like an implementation of the previous paper: dwalk.c from the mpifileutils project, which uses libdftw underneath. It uses MPI to handle parallelism, but I'm only using the local machine. The goal is only to see the real benefit of a distributed file tree walk.

See also http://lustre.ornl.gov/ecosystem-2015/documents/LustreEco2015-Tutorial5.pdf

Thanks for testing, nice to know what kind of speedup can be expected using parallelism.

Your test results:

1: 71.579805 seconds (10677.927944 files/sec)
2: 50.769305 seconds (15054.844655 files/sec)
4: 32.848336 seconds (23268.271489 files/sec)
8: 21.685690 seconds (35245.546718 files/sec)
16: 21.121180 seconds (36187.561490 files/sec)

That makes sense: http://zevv.nl/div/graph.png

I don't know where the bottleneck is (NFS, local machine or remote storage), but the gain is substantial. It would be interesting to have a dry-run option for the duc index command that skips the db_put call in index.c, to compare the results.

Test branch for you:

https://github.com/zevv/duc/tree/index-noop

Use the --no-operation flag to the 'duc index' command. Duc now traverses and does fstat() but does not serialize data or write to the database.

(Do you know if the MPI test app actually calls fstat() for each file?)

-- :wq ^X^Cy^K^X^C^C^C^C

jbd commented 7 years ago

Grrr, I've lost my answer.

Test branch for you

Thank you for merging my pull request. Let's hope I didn't introduce a subtle bug in the process =)

Here is the result (best of five runs) on the same dataset:

# echo 3 | tee /proc/sys/vm/drop_caches;
# rm -f /root/.duc.db* 
# time ./duc-lmdb index --dry-run -x -p /data
real    0m53.694s
user    0m1.114s
sys     0m11.823s

A bit faster! That gives us a rough idea.

(Do you know if the MPI test app actually calls fstat() for each file?)

Since the dwalk has a "--lite - walk file system without stat" option, I hope that the default behaviour is to perform a stat for each entry. An strace -f -estat,lstat confirms that.

By quickly looking at the code, the bayer_flist_walk_paths function enqueues walk_stat_process callbacks that issue an lstat call in the end. bayer_flist_walk_paths is also doing other stuff regarding user and group (but I don't know if my previous runs fell inside this code path).

zevv commented 7 years ago

Quoting jbd (2016-12-30 18:29:02)

Grrr, I've lost my answer.

Test branch for you

Thank you for merging my pull request. Let's hope I didn't introduce a subtle bug in the process =)

I usually do, it would be nice to be able to blame someone every now and then :)

Here is the result (best of five runs) on the same dataset:

# echo 3 | tee /proc/sys/vm/drop_caches;
# rm -f /root/.duc.db* 
# time ./duc-lmdb index --dry-run -x -p /data
real    0m53.694s
user    0m1.114s
sys     0m11.823s

A bit faster! That gives us a rough idea.

(Do you know if the MPI test app actually calls fstat() for each file?)

Since the dwalk has a "--lite - walk file system without stat" option, I hope that the default behaviour is to perform a stat for each entry.

That makes sense.

So, using this algorithm for parallel walking should bring us a 2 or 3 times speedup, but there is still the problem of map/reducing the total directory sizes in recursive order.

My implementation (which I seem to have thrown away, a pity) did something like this:

I was pretty happy with the performance (2.5x speedup, iirc), but the implementation was just too complicated. I'm probably just not smart enough to write and maintain multithreaded applications.

-- :wq ^X^Cy^K^X^C^C^C^C

zevv commented 7 years ago

(Some ramblings in case I leave this for a year and forget about it - this happens to me far too often.)

Ok, I've spent some time thinking and prototyping, but there's one nasty problem on the road:

At this time Duc recursively chdir()s itself through the file system, thus only needing relative file names for opendir(). This breaks when doing multiple threads since there is only one CWD (current working directory) for all threads. The proper solution for this is to use openat() and fdopendir() which allow referencing of file names relative to a parent file descriptor. The only problem is that for some reason this is not supported on MacOS X and Windows. I've stopped caring for windows a long time ago, but I really would like to keep MacOS in the list of supported operating systems.

Two solutions come to mind:

• Use processes instead of threads. Actually not that bad, although this will complicate the communication in the worker pool with some overhead. (pipe write/read = system calls)
• Have duc always keep track of absolute path names and pass those to opendir(). This causes a bit of overhead with some string handling, but is not that bad.
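For reference, a minimal sketch of the openat()/fdopendir() style of traversal described above (the approach that is problematic on MacOS X): every directory is opened relative to its parent's file descriptor, so no thread ever needs to touch the process-wide current working directory and the kernel never has to re-resolve full paths. Single-threaded and illustrative only, not duc's code:

#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Recursively sum apparent file sizes; each level keeps one directory fd open. */
static long long scan(int parent_fd, const char *name)
{
    long long size = 0;
    int fd = openat(parent_fd, name, O_RDONLY | O_DIRECTORY | O_NOFOLLOW);
    if (fd < 0) return 0;
    DIR *d = fdopendir(fd);                  /* takes ownership of fd */
    if (!d) { close(fd); return 0; }

    struct dirent *e;
    while ((e = readdir(d))) {
        if (!strcmp(e->d_name, ".") || !strcmp(e->d_name, ".."))
            continue;
        struct stat st;
        if (fstatat(dirfd(d), e->d_name, &st, AT_SYMLINK_NOFOLLOW) != 0)
            continue;
        if (S_ISDIR(st.st_mode))
            size += scan(dirfd(d), e->d_name);   /* recurse relative to this directory's fd */
        else
            size += st.st_size;
    }
    closedir(d);
    return size;
}

int main(int argc, char **argv)
{
    printf("%lld bytes\n", scan(AT_FDCWD, argc > 1 ? argv[1] : "."));
    return 0;
}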

stuartthebruce commented 7 years ago

On Jan 1, 2017, at 11:25 AM, Ico Doornekamp notifications@github.com wrote:

(Some ramblings in case I leave this for a year and forget about it - this happens to me far too often.)

Ok, I've spent some time thinking and prototyping, but there's one nasty problem on the road:

At this time Duc recursively chdir()s itself through the file system, thus only needing relative file names for opendir(). This breaks when doing multiple threads since there is only one CWD (current working directory) for all threads. The proper solution for this is to use openat() and fdopendir() which allow referencing of file names relative to a parent file descriptor. The only problem is that for some reason this is not supported on MacOS X and Windows.

Another option for MacOS X support would be to check if the nextgen Apple filesystem supports relative paths, and potentially wait for that, https://developer.apple.com/library/content/documentation/FileManagement/Conceptual/APFS_Guide/Introduction/Introduction.html

-- Stuart Anderson anderson@ligo.caltech.edu http://www.ligo.caltech.edu/~anderson

l8gravely commented 7 years ago

Ico> (Some ramblings in case I leave this for a year and forget about it - this happens to me far too often.)

Same here.... :-(

Ico> Ok, I've spent some time thinking and prototyping, but there's one nasty problem on the road:

Ico> At this time Duc recursively chdir()s itself through the file system, thus only needing relative file names for opendir(). This breaks when doing multiple threads since there is only one CWD (current working directory) for all threads. The proper solution for this is to use openat() and fdopendir() which allow referencing of file names relative to a parent file descriptor. The only problem is that for some reason this is not supported on MacOS X and Windows. I've stopped caring for windows a long time ago, but I really would like to keep MacOS in the list of supported operating systems.

What about fchdir() instead? You opendir() a directory, and then do a 'fchdir(dir)' into that directory. Keep it per-thread.

Ico> Two solutions come to mind:

Ico> • Use processes instead of threads. Actually not that bad, although this will complicate the communication in the worker pool with some overhead. (pipe write/read = system calls)

I'm not as familiar with MacOS/Windows IPC models, so I can't really comment here. For the Unix side, it shouldn't be too hard, it's mostly the waiting for completion that is a little tricky, and keeping count of course.

Ico> • Have duc always keep track of absolute path names and pass those to opendir(). This causes a bit of overhead with some string handling, but is not that bad.

This is probably the simplest, especially since it's trivial to just keep track of the directory name in each thread. The memory overhead shouldn't be too bad, PATH_MAX bytes + some misc per-thread.

John

zevv commented 7 years ago

Quoting John (2017-01-01 22:06:17)

Ico> • Have duc always keep track of absolute path names and pass those to opendir(). This causes a bit of overhead with some string handling, but is not that bad.

This is probably the simplest, especially since it's trivial to just keep track of the directory name in each thread. The memory overhead shouldn't be too bad, PATH_MAX bytes + some misc per-thread.

It's not really about the string overhead, more about the extra work for the file system layer to resolve the same path over and over again. For example, when indexing something like:

/usr/src/linux-headers-4.6.0-1-amd64/include/config/bridge/igmp
/usr/src/linux-headers-4.6.0-1-amd64/include/config/bridge/nf
/usr/src/linux-headers-4.6.0-1-amd64/include/config/bridge/ebt

Now, duc simply chdirs to the 'bridge' directory and does a stat() for each entry in the dir. Without chdir, the stat() will be done on the full path, so the VFS layer will have to find 'usr', then 'src', then 'linux-headers-4.6.0-1-amd64', and so on. Of course a lot of this will live in cache, but there's still more work to be done in the kernel. This is why openat() exists (albeit not on MacOS X).

I'm working to do some tests on a patched duc that does just that (in a single thread, still) to see what the impact would be.

-- :wq ^X^Cy^K^X^C^C^C^C

msm595 commented 5 years ago

The proper solution for this is to use openat() and fdopendir() which allow referencing of file names relative to a parent file descriptor. The only problem is that for some reason this is not supported on MacOS X and Windows. I've stopped caring for windows a long time ago, but I really would like to keep MacOS in the list of supported operating systems.

As of MacOS 10.10 (Darwin 14.0), which was released at the end of 2014, openat() is supported. You can see the full list of supported syscalls for this version here: https://opensource.apple.com/source/xnu/xnu-2782.1.97/bsd/kern/syscalls.master.auto.html

zevv commented 5 years ago

Quoting Alex (2019-01-19 18:48:47)

The proper solution for this is to use openat() and fdopendir() which allow referencing of file names relative to a parent file descriptor. The only problem is that for some reason this is not supported on MacOS X and Windows. I've stopped caring for windows a long time ago, but I really would like to keep MacOS in the list of supported operating systems.

As of MacOS 10.10 (Darwin 14.0), which was released at the end of 2014, openat() is supported. You can see the full list of supported syscalls for this version here: https://opensource.apple.com/source/xnu/xnu-2782.1.97/bsd/kern/syscalls.master.auto.html

Thanks, that is a good reason to rethink those requirements and possibly refactor the indexer. I have never heard of a single user running on Windows, so I would gladly drop support for that.

Side note: Duc also runs on Android (via Termux), which also has openat() and friends.

-- :wq ^X^Cy^K^X^C^C^C^C

muhammadharis commented 3 years ago

I'm currently working on a worker-pool implementation based on the DFS algorithm described in this paper:

https://www.lrde.epita.fr/~bleton/doc/parallel-depth-first-search.pdf

If anyone is interested in a proof-of-concept, I can provide a Python script that minimally illustrates the filesystem crawling algorithm. If the maintainers have suggestions/opinions on this crawling algorithm, please let me know -- I'm open to suggestions.

l8gravely commented 3 years ago

"Muhammad" == Muhammad Haris @.***> writes:

Muhammad> I'm currently working on a worker-pool implementation based on the DFS algorithm described in this paper:

Muhammad> https://www.lrde.epita.fr/~bleton/doc/parallel-depth-first-search.pdf

Muhammad> If anyone is interested in a proof-of-concept, I can provide a Python script that minimally illustrates the filesystem crawling algorithm. If the maintainers have suggestions/opinions on this crawling algorithm, please let me know -- I'm open to suggestions.

Sure, but I think you might want to implement this in C instead, because (as I dimly recall) Python isn't quite thread-safe and doesn't scale well because of locking overhead. But I could be wrong.

Please feel free to post any results you get along with sample code for us to look over.

John