wlandau / crew

A distributed worker launcher
https://wlandau.github.io/crew/
Other
123 stars 4 forks source link

When used in targets, crew backend R instances do not close automatically after user cancelled #141

Closed psychelzh closed 9 months ago

psychelzh commented 9 months ago

Prework

Description

I am using the lastest 0.7 version of crew package. After I run targets::tar_make(), the front-end R terminal will invoke back-end R instances. However, if I canceled the pipeline manually, these instances won't close and continue to run until they finished.

Reproducible example

This pipeline is very useful to debug this problem.

# _targets.R
library(targets)
tar_option_set(
  controller = crew::crew_controller_local(workers = 8)
)

list(
  tarchetypes::tar_rep(
    test,
    Sys.sleep(100),
    batches = 30,
    reps = 1
  )
)

Expected result

After I canceled the pipeline, all backend R instances should terminate automatically.

wlandau commented 9 months ago

crew uses mirai, and the behavior you see is due to https://github.com/shikokuchuo/mirai/issues/87 (also c.f. https://github.com/shikokuchuo/mirai/issues/86#issuecomment-1846032626). Unfortunately there is nothing I can do from the perspective of crew or targets, but those crew workers will exit on their own once they finish the tasks that were in progress during the interrupt.

targets does have a callback to manually terminate its crew workers on error:

https://github.com/ropensci/targets/blob/cc9a6a23b4c25c7048a303535da3ed05c0682d9a/R/class_crew.R#L262-L277

Unfortunately, because the pipeline runs in a background callr process, the on.exit() callbacks are skipped when the inner process is abruptly terminated. I think counteracting this behavior would require installing non-default operating system signal handlers, which seems risky.

wlandau commented 9 months ago

That said, I may reopen this issue, depending on the result of https://github.com/shikokuchuo/mirai/issues/87

wlandau commented 9 months ago

From https://github.com/shikokuchuo/mirai/issues/87 and https://github.com/shikokuchuo/mirai/pull/88, it looks like there will eventually be support from mirai via nanonext. Reopening.

wlandau commented 9 months ago

Fixed in aeeef8f19895418e2015b1227c84934409527a9c. Requires development nanonext and mirai for now.

If the pipeline is interrupted, worker processes now send themselves SIGKILL to force quit abruptly. If you want a softer exit which allows cleanup/writing to complete before exiting, you can supply tools::SIGINT to the signal argument of crew_controller_local(). It will take some time before I can propagate the signal argument to launcher plugins in other packages.

wlandau commented 9 months ago

If the pipeline is interrupted, worker processes now send themselves SIGKILL to force quit abruptly. If you want a softer exit which allows cleanup/writing to complete before exiting, you can supply tools::SIGINT to the signal argument of crew_controller_local(). It will take some time before I can propagate the signal argument to launcher plugins in other packages.

Rethinking that choice, c.f. https://github.com/shikokuchuo/mirai/issues/87#issuecomment-1852426711:

An update: as I was testing just now, I found out that tools::SIGKILL is actually NA on Windows. That led me to do some more digging on signals, and I learned SIGKILL on a parent process can lead to zombie child processes. Zombies would defeat the whole purpose of my original preference for SIGKILL, so I am moving away from it. Instead, crew now prefers SIGTERM because the intention to kill the process is more explicit than SIGINT. If for some reason that signal is not defined on the user's platform, then it uses SIGQUIT. And if SIGQUIT is not defined, it uses SIGINT as a last resort. I don't think I will expose the choice of signal to crew users.

psychelzh commented 9 months ago

Thanks for all the kind considerations into this. Here confirmed that the development version works as expected perfectly on Windows, too.