When used in targets, crew backend R instances do not close automatically after user cancelled - Githubissues

wlandau / crew

A distributed worker launcher

https://wlandau.github.io/crew/

Other

123 stars 4 forks source link

When used in targets, crew backend R instances do not close automatically after user cancelled #141

Closed psychelzh closed 9 months ago

psychelzh commented 9 months ago

Prework

[x] Read and agree to the Contributor Code of Conduct and contributing guidelines.
[x] Confirm that your issue is a genuine bug in the crew package itself and not a user error, known limitation, or issue from another package that crew depends on. For example, if you get errors running tar_make_clustermq(), try isolating the problem in a reproducible example that runs clustermq and not crew. And for miscellaneous troubleshooting, please post to discussions instead of issues.
[x] If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue.
[x] Post a minimal reproducible example like this one so the maintainer can troubleshoot the problems you identify. A reproducible example is:
- [x] Runnable: post enough R code and data so any onlooker can create the error on their own computer.
- [x] Minimal: reduce runtime wherever possible and remove complicated details that are irrelevant to the issue at hand.
- [x] Readable: format your code according to the tidyverse style guide.

Description

I am using the lastest 0.7 version of crew package. After I run targets::tar_make(), the front-end R terminal will invoke back-end R instances. However, if I canceled the pipeline manually, these instances won't close and continue to run until they finished.

Reproducible example

[x] Post a minimal reproducible example so the maintainer can troubleshoot the problems you identify. A reproducible example is:
- [x] Runnable: post enough R code and data so any onlooker can create the error on their own computer.
- [x] Minimal: reduce runtime wherever possible and remove complicated details that are irrelevant to the issue at hand.
- [x] Readable: format your code according to the tidyverse style guide.

This pipeline is very useful to debug this problem.

# _targets.R
library(targets)
tar_option_set(
  controller = crew::crew_controller_local(workers = 8)
)

list(
  tarchetypes::tar_rep(
    test,
    Sys.sleep(100),
    batches = 30,
    reps = 1
  )
)

Expected result

After I canceled the pipeline, all backend R instances should terminate automatically.

wlandau commented 9 months ago

crew uses mirai, and the behavior you see is due to https://github.com/shikokuchuo/mirai/issues/87 (also c.f. https://github.com/shikokuchuo/mirai/issues/86#issuecomment-1846032626). Unfortunately there is nothing I can do from the perspective of crew or targets, but those crew workers will exit on their own once they finish the tasks that were in progress during the interrupt.

targets does have a callback to manually terminate its crew workers on error:

https://github.com/ropensci/targets/blob/cc9a6a23b4c25c7048a303535da3ed05c0682d9a/R/class_crew.R#L262-L277

Unfortunately, because the pipeline runs in a background callr process, the on.exit() callbacks are skipped when the inner process is abruptly terminated. I think counteracting this behavior would require installing non-default operating system signal handlers, which seems risky.

wlandau commented 9 months ago

That said, I may reopen this issue, depending on the result of https://github.com/shikokuchuo/mirai/issues/87

wlandau commented 9 months ago

From https://github.com/shikokuchuo/mirai/issues/87 and https://github.com/shikokuchuo/mirai/pull/88, it looks like there will eventually be support from mirai via nanonext. Reopening.

wlandau commented 9 months ago

Fixed in aeeef8f19895418e2015b1227c84934409527a9c. Requires development nanonext and mirai for now.

If the pipeline is interrupted, worker processes now send themselves SIGKILL to force quit abruptly. If you want a softer exit which allows cleanup/writing to complete before exiting, you can supply tools::SIGINT to the signal argument of crew_controller_local(). It will take some time before I can propagate the signal argument to launcher plugins in other packages.

wlandau commented 9 months ago

If the pipeline is interrupted, worker processes now send themselves SIGKILL to force quit abruptly. If you want a softer exit which allows cleanup/writing to complete before exiting, you can supply tools::SIGINT to the signal argument of crew_controller_local(). It will take some time before I can propagate the signal argument to launcher plugins in other packages.

Rethinking that choice, c.f. https://github.com/shikokuchuo/mirai/issues/87#issuecomment-1852426711:

An update: as I was testing just now, I found out that tools::SIGKILL is actually NA on Windows. That led me to do some more digging on signals, and I learned SIGKILL on a parent process can lead to zombie child processes. Zombies would defeat the whole purpose of my original preference for SIGKILL, so I am moving away from it. Instead, crew now prefers SIGTERM because the intention to kill the process is more explicit than SIGINT. If for some reason that signal is not defined on the user's platform, then it uses SIGQUIT. And if SIGQUIT is not defined, it uses SIGINT as a last resort. I don't think I will expose the choice of signal to crew users.

psychelzh commented 9 months ago

Thanks for all the kind considerations into this. Here confirmed that the development version works as expected perfectly on Windows, too.