wingo / fibers

Concurrent ML-like concurrency for Guile
GNU Lesser General Public License v3.0

In procedure pipe: Too many open files #52

Open linas opened 2 years ago

linas commented 2 years ago

After monkeying with fibers for 5-10 minutes, I get the error:

In fibers.scm:
WARNING: (guile-user)(guile-user)   113:22  5 :
:  
In fibers/scheduler.scm:
imported mm   105:14  4 (make-scheduler #:parallelism _ #:prompt-tag _)
oduu
In fibers/epoll.scm:
le (fibers) oo    94:31  3 (epoll-create #:close-on-exec? _ #:maxevents _)
verrides      53:14  2 (make-wake-pipe)
cc
In unknown file:
           1 (pipe)
ore binn
In ice-9/boot-9.scm:
ding `sleep''  1669:16  0 (raise-exception _ #:continuable? _)

ice-9/boot-9.scm:1669:16: In procedure raise-exception:
In procedure pipe: Too many open files

I've gotten this several times now, each time after a clean restart. This is using git clone of today's git.

I do not have a simple test case to reproduce this: my fibers code is very simple, but it runs inside a large & complex framework of other code (specifically, inside https://github.com/opencog/learn/; you won't find fibers code in there, because I'm still experimenting with it).

(I'm trying to see if I can use fibers to parallelize a work-farm (a local async procedure call): I've got a subroutine that is slow but is called many times. I cannot use par-for-each, because that subroutine is called from several different places under quite different circumstances. So fibers seemed ideal for this. Of course I could just create a thread-pool, too.)
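
Roughly, the shape I have in mind is something like the sketch below, with a hypothetical slow-computation standing in for the real subroutine; this is just an illustration, not code from that repo:

    (use-modules (fibers)
                 (fibers channels))

    ;; Hypothetical stand-in for the real slow subroutine.
    (define (slow-computation x)
      (* x x))

    ;; Spawn n worker fibers that read (input . reply-channel) requests from a
    ;; shared channel, run the slow job, and send the result back.
    (define (make-work-farm n-workers)
      (let ((requests (make-channel)))
        (do ((i 0 (1+ i)))
            ((= i n-workers) requests)
          (spawn-fiber
           (lambda ()
             (let loop ()
               (let ((req (get-message requests)))
                 (put-message (cdr req) (slow-computation (car req)))
                 (loop))))))))

    (run-fibers
     (lambda ()
       (let ((farm (make-work-farm 8))
             (reply (make-channel)))
         ;; A "local async procedure call": post one request, wait for the reply.
         (put-message farm (cons 21 reply))
         (format #t "result: ~a~%" (get-message reply)))))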

linas commented 2 years ago

The print above is garbled, but running the code a second time gives a cleaner print:

WW
In fibers.scm:
ARNING: (guile-user): imported module (fibers) overrides core binding `sleep'
   113:22  5 (run-fibers _ #:hz _ #:scheduler _ #:parallelism _ #:cpus _ # _ #:drain? …)
In fibers/scheduler.scm:
   105:14  4 (make-scheduler #:parallelism _ #:prompt-tag _)
In fibers/epoll.scm:
    94:31  3 (epoll-create #:close-on-exec? _ #:maxevents _)
    53:14  2 (make-wake-pipe)
In unknown file:
           1 (pipe)
In ice-9/boot-9.scm:
  1669:16  0 (raise-exception _ #:continuable? _)
ice-9/boot-9.scm:1669:16: In procedure raise-exception:
In procedure pipe: Too many open files

At the repl, I see:

> ,scheds
No schedulers.

Manually running (gc) a bunch of times does not clean up whatever it is that leaked the pipes.

linas commented 2 years ago

This may be related to issue #36, except that, for me, running (gc) a bunch of times does not fix the leak. My version of Guile is the default one on Ubuntu focal 20.04:

$ guile -v
guile (GNU Guile) 3.0.1
Packaged by Debian (3.0.1-deb+1-2)
Copyright (C) 2020 Free Software Foundation, Inc.

Issue #36 says:

So the epoll file descriptor and the two wake-pipe file descriptors for each remote scheduler stay open until garbage-collected (which until recently still wouldn't close them due to a guile bug with revealed-count).

I have no idea whether this is fixed in 3.0.1 or not.

linas commented 2 years ago

My problem is not fixed by the patch provided in issue #36, although it does give me a clean stack trace, unlike the garbled ones above:

In fibers.scm:                              
   113:22 11 (run-fibers _ #:hz _ #:scheduler _ #:parallelism _ #:cpus _ # _ #:drain? …)
In fibers/scheduler.scm:                                                                
   121:22 10 (make-scheduler #:parallelism _ #:prompt-tag _)
In ice-9/boot-9.scm:                                                                    
   222:29  9 (map1 (0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22))
   222:29  8 (map1 (1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22))
   222:29  7 (map1 (2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22))
   222:29  6 (map1 (3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22))
   222:17  5 (map1 (4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22))
In fibers/scheduler.scm:                                                                
   105:14  4 (make-scheduler #:parallelism _ #:prompt-tag _)
In fibers/epoll.scm:
    94:31  3 (epoll-create #:close-on-exec? _ #:maxevents _)
    53:14  2 (make-wake-pipe)
In unknown file:
           1 (pipe)
In ice-9/boot-9.scm:
  1669:16  0 (raise-exception _ #:continuable? _)
ice-9/boot-9.scm:1669:16: In procedure raise-exception:
In procedure pipe: Too many open files

aconchillo commented 2 years ago

@linas any way you can try 3.0.2?

linas commented 2 years ago

I'll try. My datasets are in a production environment, so I'll have to futz around a bit to port everything to a test environment. My understanding of the comments in #36 is that the patch there should fix things, even for 3.0.1, and it doesn't. (It should have worked, because the close is explicit, whereas the close-via-gc needs something newer than 3.0.2.)

Reepca commented 2 years ago

Some stuff worth noting:

  1. Each scheduler (there are as many created as #:parallelism specifies, which by default is (current-processor-count)) requires at least 5 file descriptors: 2 for a wake pipe for the scheduler, 2 for a sleep pipe created internally by Guile for each kernel thread, and 1 for the epoll file descriptor used by the scheduler. If interrupts are done using threads (as they are on my system), another thread is also created for each scheduler, which adds another 2 file descriptors, for a total of 7.
  2. From the stack trace you posted, it looks like you have (= (current-processor-count) 24). This means that under normal conditions you should be using at least around 120 file descriptors (24 schedulers × 5) within the extent of run-fibers. This shouldn't be particularly close to the limit, but if your process is already using many file descriptors, there's a good chance of hitting the limit there, especially if the file descriptor limit is low. It would be useful to know what ulimit -n reports. On my system it's 1024, able to be increased up to 2048 with normal permissions. (A small sketch for checking and raising this limit from within Guile appears at the end of this comment.)
  3. I'm not sure what's causing the file descriptor exhaustion you see, but there is a test that can rule out #36: run run-fibers with an empty body in a loop 10000 or so times from a fresh Guile. If you don't run out of file descriptors from that, #36 isn't at play. If you do, then I'm not sure what to do except triple-check that the patched version is what's actually getting loaded (maybe throw in a print statement somewhere to be sure?).
  4. If you're on Linux, you could try (system* "ls" "-l" (string-append "/proc/" (number->string (getpid)) "/fd")) at various points in your code. It should give you an idea of how many file descriptors are in use at a given point, and what's using them. If you're running into #36, it should list a ton of pipes, and some epoll instances (looks like anon_inode:[eventpoll]).
  5. From my experimenting, it seems that the resources of native threads (e.g. as produced by call-with-new-thread) aren't freed until they are garbage collected. This may happen some time after they are canceled and joined. If you inspect (all-threads) and also look at the number of file descriptors used right after calling run-fibers, you'll find that some of them stick around even after it returns (though these disappear when (gc) is called). These are most likely the sleep pipe fds for each thread. These shouldn't be affected by the port-revealed bug in guile 3.0.1 and earlier, though, because they're never wrapped in ports and only ever used internally. With #:parallelism 24, I had 49 threads live during run-fibers and between 175 and 179 file descriptors open. Immediately after run-fibers, 61 file descriptors were open, though that returned to the initial 13 after waiting a little. With certain timing patterns, these leftover file descriptors and threads can accumulate. Below is a code example that can reproduce this accumulation. On my system a given snapshot typically has over 400 open file descriptors and between 140 and 160 existent threads. Each scheduler should only have 7 file descriptors and 2 threads, for a total of 168 extra file descriptors and 48 extra threads. In rare cases it can even get high enough that when run-fibers attempts to create a thread, creating the sleep pipe fails and guile aborts.
    
    (use-modules (fibers)
                 (ice-9 threads)
                 (srfi srfi-1))

    (define (count-used-fds)
      (let next-fd ((fd 0) (used 0))
        (if (< fd 4096)
            (next-fd (1+ fd)
                     (if (false-if-exception (fcntl fd F_GETFL))
                         (1+ used)
                         used))
            used)))

    (define (report)
      (format #t "~S~%" (count-used-fds))
      (format #t "all threads count: ~S~%" (length (all-threads)))
      (format #t "live threads count: ~S~%"
              (count (negate thread-exited?) (all-threads))))

    (begin
      (let loop ((n 0))
        (when (< n 10000)
          (run-fibers (lambda ()
                        ;; for some reason this timing pattern causes leftover
                        ;; fds and threads to accumulate
                        (when (= (remainder n 50) 0)
                          (format #t "~%In run-fibers (n = ~S):~%" n)
                          (report))
                        #t)
                      #:drain? #t
                      #:parallelism 24)
          (loop (1+ n))))
      (system* "ls" "-l" (string-append "/proc/" (number->string (getpid)) "/fd"))
      (report))


I don't know if your usage patterns look anything like this, but it is in theory still possible to exhaust file descriptors by repeatedly calling `run-fibers` even after #36 has been fixed; it's just a faint possibility instead of a certainty. It's worth noting that I only have `(= (current-processor-count) 4)`, so the timing I get may be quite different from yours even though I pass `#:parallelism 24` in my tests.
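
As an aside on point 2: the file descriptor limit can also be inspected, and the soft limit raised up to the hard limit, from within Guile itself. A minimal sketch using Guile's getrlimit/setrlimit with the 'nofile resource:

    ;; Sketch: check the open-file-descriptor limit and raise the soft limit
    ;; to the hard limit (the most an unprivileged process can do, roughly
    ;; what `ulimit -n <hard>` would do in the shell).
    (call-with-values
        (lambda () (getrlimit 'nofile))
      (lambda (soft hard)
        (format #t "fd limit: soft=~a hard=~a~%" soft hard)
        (setrlimit 'nofile hard hard)))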

Hope that helps.