wlandau / crew

A distributed worker launcher
https://wlandau.github.io/crew/
Other
129 stars 4 forks source link

mirai scheduling with callr workers #29

Closed wlandau closed 1 year ago

wlandau commented 1 year ago

Prework

Related GitHub issues and pull requests

Summary

This PR rewrites crew using mirai as the underlying task scheduler. @shikokuchuo, thanks to your hard work on mirai and nanonext, crew looks much more promising. I really appreciate all the time you spent helping me and implementing new features.

Because of the one-port-per-worker requirement of the active queue, I still intend to write an alternative backend using rrq/Redis as the task scheduler. Either one will eventually win out over the other, or we can keep both and let users decide which one to run. An all-angles approach seems constructive at the moment.

codecov-commenter commented 1 year ago

Codecov Report

Patch coverage: 100.00% and no project coverage change

Comparison is base (f6a5769) 100.00% compared to head (8cff4e8) 100.00%.

:mega: This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #29 +/- ## ========================================== Coverage 100.00% 100.00% ========================================== Files 14 16 +2 Lines 595 458 -137 ========================================== - Hits 595 458 -137 ``` | [Impacted Files](https://codecov.io/gh/wlandau/crew/pull/29?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Will+Landau) | Coverage Δ | | |---|---|---| | [R/class\_monad.R](https://codecov.io/gh/wlandau/crew/pull/29?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Will+Landau#diff-Ui9jbGFzc19tb25hZC5S) | `100.00% <100.00%> (ø)` | | | [R/crew\_eval.R](https://codecov.io/gh/wlandau/crew/pull/29?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Will+Landau#diff-Ui9jcmV3X2V2YWwuUg==) | `100.00% <100.00%> (ø)` | | | [R/crew\_mirai\_controller.R](https://codecov.io/gh/wlandau/crew/pull/29?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Will+Landau#diff-Ui9jcmV3X21pcmFpX2NvbnRyb2xsZXIuUg==) | `100.00% <100.00%> (ø)` | | | [R/crew\_mirai\_controller\_callr.R](https://codecov.io/gh/wlandau/crew/pull/29?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Will+Landau#diff-Ui9jcmV3X21pcmFpX2NvbnRyb2xsZXJfY2FsbHIuUg==) | `100.00% <100.00%> (ø)` | | | [R/crew\_mirai\_launcher\_callr.R](https://codecov.io/gh/wlandau/crew/pull/29?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Will+Landau#diff-Ui9jcmV3X21pcmFpX2xhdW5jaGVyX2NhbGxyLlI=) | `100.00% <100.00%> (ø)` | | | [R/crew\_mirai\_router.R](https://codecov.io/gh/wlandau/crew/pull/29?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Will+Landau#diff-Ui9jcmV3X21pcmFpX3JvdXRlci5S) | `100.00% <100.00%> (ø)` | | | [R/crew\_multi\_controller.R](https://codecov.io/gh/wlandau/crew/pull/29?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Will+Landau#diff-Ui9jcmV3X211bHRpX2NvbnRyb2xsZXIuUg==) | `100.00% <100.00%> (ø)` | | | [R/crew\_wait.R](https://codecov.io/gh/wlandau/crew/pull/29?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Will+Landau#diff-Ui9jcmV3X3dhaXQuUg==) | `100.00% <100.00%> (ø)` | | | [R/utils\_condition.R](https://codecov.io/gh/wlandau/crew/pull/29?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Will+Landau#diff-Ui91dGlsc19jb25kaXRpb24uUg==) | `100.00% <100.00%> (ø)` | | | [R/utils\_ids.R](https://codecov.io/gh/wlandau/crew/pull/29?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Will+Landau#diff-Ui91dGlsc19pZHMuUg==) | `100.00% <100.00%> (ø)` | | | ... and [3 more](https://codecov.io/gh/wlandau/crew/pull/29?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Will+Landau) | | Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Will+Landau). Have a feature suggestion? [Share it here.](https://app.codecov.io/gh/feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Will+Landau)

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

wlandau commented 1 year ago

Tests look good on Unix-like systems, but Windows GitHub Actions runners may have stalled. Not sure why.

wlandau commented 1 year ago

I think I'll merge now and then figure out what to do about Windows later.

shikokuchuo commented 1 year ago

Tests look good on Unix-like systems, but Windows GitHub Actions runners may have stalled. Not sure why.

Sometimes on GH tests try to terminate before all background processes exit causing a hang. Just need a suitable sleep towards the end of your tests most likely. Not so common on Winbuilder - but you will need to test there to be sure.

wlandau commented 1 year ago

How long of a Sys.sleep() has worked for you?

shikokuchuo commented 1 year ago

I am using 3L - but only because I use ephemeral mirai in my tests (non-daemon) and they require 2s to shut down. So I am guessing 1L should be enough in your case.

The 2s is to guard against the R session being reaped faster than the underlying C code can finish sending say a very large object (e.g. GBs). It can happen! But that only applies to the short-lived mirai which exit after each task, not daemons which run continuously.

shikokuchuo commented 1 year ago

I have noticed your use of 'transient' workers by setting max_tasks = 1L. I have added the appropriate 2s guard in mirai in f80e4e6 for those processes exiting due to task/time outs, otherwise the daemon can exit before it can send back it's results. So yes, for tests Sys.sleep(3L) would probably be a good idea.

wlandau commented 1 year ago

Thanks for the tip! Hopefully https://github.com/wlandau/crew/commit/fa924fd5a4f926ce14ba368c8f5548608da2a3cf does the trick.