wlandau / crew

A distributed worker launcher
https://wlandau.github.io/crew/

Reliable segfault in a test on Mac OS #104

Closed: wlandau closed this issue 1 year ago

wlandau commented 1 year ago

@shikokuchuo, I am rerunning all my local crew tests on Mac OS using https://github.com/shikokuchuo/mirai/commit/c1dd3ffcd64f5134c83e57bdcd25a96e4f6a53e8 and https://github.com/shikokuchuo/nanonext/commit/229e3c63bd4b6449cea24342def7594124c2f696. I am still really excited about https://github.com/shikokuchuo/mirai/commit/3f15eadc04045252d501772d519c54cecc583f0a because it seems to solve host/dispatcher disconnection issues (except for this one instance).

I only found one issue, and it occurs in https://github.com/wlandau/crew/blob/main/tests/throughput/test-transient-wait.R. The code below is a slightly simplified version of the test. When I submit 100 tasks and wait for just one of them, restarting the host R session afterwards segfaults, and the dispatcher keeps running indefinitely. Luckily, this time the crash happens every time on my machine, so I should be able to make this example simpler.

library(crew)
x <- crew_controller_local(
  name = "test",
  tasks_max = 1L,
  workers = 4L
)
x$start()
for (index in seq_len(100)) {
  x$push(command = Sys.sleep(10))
}
x$wait(mode = "one")
rstudioapi::restartSession() # segfaults here

wlandau commented 1 year ago

Simplified down to this:

library(crew)
x <- crew_controller_local()
x$start()
for (index in seq_len(100)) {
  x$push(command = Sys.sleep(10))
}
x$wait(mode = "one")
rstudioapi::restartSession()

wlandau commented 1 year ago

Even further:

library(crew)
x <- crew_controller_local()
x$start()
for (index in seq_len(3)) {
  x$push(command = Sys.sleep(1))
}
x$wait(mode = "one")
rstudioapi::restartSession()

wlandau commented 1 year ago

Even simpler, and without crew:

mirai::daemons(n = 1L, url = "ws://127.0.0.1:5004", dispatcher = TRUE, token = FALSE)
tasks <- replicate(2L, mirai::mirai(TRUE))
Sys.sleep(1)
rstudioapi::restartSession()

wlandau commented 1 year ago

I can reliably reproduce the crash from https://github.com/wlandau/crew/issues/104#issuecomment-1668483672 on my local Ubuntu machine too. So it looks like a non-crew issue, at the level of mirai or below. I will file a new issue in the mirai repo.