zkat / chanl

Portable channel-based concurrency for Common Lisp
Other
169 stars 17 forks source link

[BUG] Deadlock cycle from thread pool implementation #31

Open adlai opened 3 days ago

adlai commented 3 days ago

Describe the bug As initially reported in adlai#5, SBCL occasionally reports deadlock caused by the thread pool.

To Reproduce Steps to reproduce the behavior:

  1. Run multiple parallel threads that constantly produce new tasks, similarly to scalpl's actors.
  2. Wait for RNGesus to schedule the deadlock cycle.
  3. Receive Condition of type SB-THREAD:THREAD-DEADLOCK
  4. Report this fact, and possibly additional findings, in this Issue; or invoke the ABORT restart and enjoy the hardworking actors.lisp rescheduling of the tasks.

Expected behavior Ideally, either there should not be any deadlock; alternatively, tasks could be restartable automatically.

Screenshots or Error Logs

Deadlock cycle detected:
    #1=#<SB-THREAD:THREAD #2="ChanL Thread Pool Worker" RUNNING {1001369D53}>
  waited for:
    #<SB-THREAD:MUTEX "thread pool lock" free owner=0>
  owned by:
    #<SB-THREAD:THREAD "ChanL Thread Pool Worker" waiting on: #3=#<MUTEX "thread leader lock" contested owner=#2#> {100163B403}>
  waited for:
    (#3#)
  owned by:
    #1#
   [Condition of type SB-THREAD:THREAD-DEADLOCK]

Restarts:
 0: [ABORT] abort thread (#<THREAD "ChanL Thread Pool Worker" RUNNING {1001369D53}>)

Backtrace:
  0: (SB-THREAD::CHECK-DEADLOCK)
  1: (SB-THREAD::%WAIT-FOR-MUTEX #<SB-THREAD:MUTEX "thread pool lock" free owner=0> NIL NIL NIL NIL NIL NIL)
  2: ((FLET "CLEANUP-FUN-8" :IN SB-THREAD::%CONDITION-WAIT)) [cleanup]
  3: ((FLET "WITHOUT-INTERRUPTS-BODY-3" :IN SB-THREAD::%CONDITION-WAIT))
  4: (SB-THREAD:CONDITION-WAIT #<SB-THREAD:WAITQUEUE Anonymous condition variable {1003EEFA43}> #<SB-THREAD:MUTEX "thread pool lock" free owner=0> :TIMEOUT NIL)
  5: (BORDEAUX-THREADS:CONDITION-WAIT #<SB-THREAD:WAITQUEUE Anonymous condition variable {1003EEFA43}> #<SB-THREAD:MUTEX "thread pool lock" free owner=0> :TIMEOUT NIL)
      Locals:
        CONDITION-VARIABLE = #<SB-THREAD:WAITQUEUE Anonymous condition variable {1003EEFA43}>
        LOCK = #<SB-THREAD:MUTEX "thread pool lock" free owner=0>
        TIMEOUT = NIL
  6: ((FLET SB-THREAD::WITH-MUTEX-THUNK :IN CHANL::NEW-WORKER-THREAD))
      [No Locals]
  7: ((FLET "WITHOUT-INTERRUPTS-BODY-1" :IN SB-THREAD::CALL-WITH-MUTEX))
      Locals:
        GOT-IT = T
        MUTEX = #<SB-THREAD:MUTEX "thread pool lock" free owner=0>
  8: (SB-THREAD::CALL-WITH-MUTEX #<FUNCTION (FLET SB-THREAD::WITH-MUTEX-THUNK :IN CHANL::NEW-WORKER-THREAD) {7F4363D3E51B}> #<SB-THREAD:MUTEX "thread pool lock" free own$
      Locals:
        GOT-IT = T
        MUTEX = #<SB-THREAD:MUTEX "thread pool lock" free owner=0>
        SB-C::THING = #<FUNCTION (FLET SB-THREAD::WITH-MUTEX-THUNK :IN CHANL::NEW-WORKER-THREAD) {7F4363D3E51B}>
        TIMEOUT = NIL
        WAITP = T
  9: ((FLET SB-THREAD::WITH-MUTEX-THUNK :IN CHANL::NEW-WORKER-THREAD))
      [No Locals]
 10: ((FLET "WITHOUT-INTERRUPTS-BODY-1" :IN SB-THREAD::CALL-WITH-MUTEX))
      Locals:
        GOT-IT = T
        MUTEX = #<SB-THREAD:MUTEX "thread leader lock" contested owner=ChanL Thread Pool Worker>
 11: (SB-THREAD::CALL-WITH-MUTEX #<FUNCTION (FLET SB-THREAD::WITH-MUTEX-THUNK :IN CHANL::NEW-WORKER-THREAD) {7F4363D3E79B}> #<SB-THREAD:MUTEX "thread leader lock" contes$
      Locals:
        GOT-IT = T
        MUTEX = #<SB-THREAD:MUTEX "thread leader lock" contested owner=ChanL Thread Pool Worker>
        SB-C::THING = #<FUNCTION (FLET SB-THREAD::WITH-MUTEX-THUNK :IN CHANL::NEW-WORKER-THREAD) {7F4363D3E79B}>
        TIMEOUT = NIL
        WAITP = T
 12: ((LAMBDA NIL :IN CHANL::NEW-WORKER-THREAD))
      Locals:
        TASK = #<TASK 6:55:56 3P4 USDCAUD fee tracker [TERMINATED] {1006535A93}>
        CHANL::THREAD-POOL = #<CHANL::THREAD-POOL {1003857FC3}>
 13: ((LABELS BORDEAUX-THREADS::%BINDING-DEFAULT-SPECIALS-WRAPPER :IN BORDEAUX-THREADS::BINDING-DEFAULT-SPECIALS))

Source locations for frames 12, 9, and 6:

(defun new-worker-thread (thread-pool &optional task)
  (push (bt:make-thread
         (lambda ()
           (unwind-protect
                (loop
                   (when task
                     (unwind-protect
                          (progn (setf (task-thread task) (current-thread))
                                 (setf (task-status task) :alive)
                                 (funcall (task-function task)))
                       (setf (task-thread task) nil)
                       (setf (task-status task) :terminated)
                       (bt:with-lock-held ((pool-lock thread-pool))
                         (setf (pool-tasks thread-pool)
                               (remove task (pool-tasks thread-pool))))))
                   (bt:with-lock-held ((pool-lock thread-pool))
                     (if (and (pool-soft-limit thread-pool)
                              (> (length (pool-threads thread-pool))
                                 (pool-soft-limit thread-pool)))
                         (return)
                         (incf (free-thread-counter thread-pool))))
                   (bt:with-lock-held ((pool-leader-lock thread-pool))                  ; [12]
                     (bt:with-lock-held ((pool-lock thread-pool))                        ; [9]
                       (setf task
                             (loop until (pool-pending-tasks thread-pool)
                                do (bt:condition-wait (pool-leader-notifier thread-pool) ; [6]
                                                      (pool-lock thread-pool))
                                finally (return (pop (pool-pending-tasks thread-pool)))))
                       (decf (free-thread-counter thread-pool)))))
             (bt:with-lock-held ((pool-lock thread-pool))
               (setf (pool-threads thread-pool)
                     (remove (bt:current-thread) (pool-threads thread-pool))))))
         :name "ChanL Thread Pool Worker")
        (pool-threads thread-pool)))

System (all systems where this has been observed):

Test Suite Regrettably, the test suite does not detect this deadlock. I'm considering whether it is worth including a sufficiently vigorous test case that the deadlock occurs with high probability.

adlai commented 12 hours ago

32 and 4cbbcda begin fixing this