wlandau / crew

A distributed worker launcher
https://wlandau.github.io/crew/
Other
129 stars 4 forks source link

Set max consecutive futile launches #102

Closed wlandau closed 1 year ago

wlandau commented 1 year ago

Prework

Related GitHub issues and pull requests

Summary

This PR implements a new launch_max argument to the controllers and launchers which sets the maximum number of allowed launch attempts in a row which do not complete any tasks. @shikokuchuo suggested this as a robust uniform solution to https://github.com/wlandau/crew.cluster/issues/19 and https://github.com/shikokuchuo/mirai/issues/20. Indeed it prevents a nightmare of indefinitely racking up costs on a malfunctioning platform, but given https://github.com/wlandau/crew/issues/79, launch_max can almost never be zero. So at best, this PR is only an approximate solution. I hope there is a more precise / less wasteful way to solve this long-term. Maybe it will have to involve understanding why some workers become backlogged in the first place and to prevent that outcome entirely.

FYI @multimeric

codecov-commenter commented 1 year ago

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (cd1f4e9) 100.00% compared to head (49f3039) 100.00%.

:exclamation: Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #102 +/- ## ========================================= Coverage 100.00% 100.00% ========================================= Files 22 22 Lines 1261 1285 +24 ========================================= + Hits 1261 1285 +24 ``` | [Files Changed](https://app.codecov.io/gh/wlandau/crew/pull/102?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Will+Landau) | Coverage Δ | | |---|---|---| | [R/crew\_controller\_local.R](https://app.codecov.io/gh/wlandau/crew/pull/102?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Will+Landau#diff-Ui9jcmV3X2NvbnRyb2xsZXJfbG9jYWwuUg==) | `100.00% <100.00%> (ø)` | | | [R/crew\_launcher.R](https://app.codecov.io/gh/wlandau/crew/pull/102?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Will+Landau#diff-Ui9jcmV3X2xhdW5jaGVyLlI=) | `100.00% <100.00%> (ø)` | | | [R/crew\_launcher\_local.R](https://app.codecov.io/gh/wlandau/crew/pull/102?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Will+Landau#diff-Ui9jcmV3X2xhdW5jaGVyX2xvY2FsLlI=) | `100.00% <100.00%> (ø)` | |

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

shikokuchuo commented 1 year ago

Re. your comment on #79, whether a worker is 'backlogged' shouldn't affect the actual completed count - if a worker does a task this will increment regardless. I think you'd want to set it higher than 1 to allow for random errors though.

wlandau commented 1 year ago

By the way: @shikokuchuo, it's starting to look like https://github.com/shikokuchuo/mirai/commit/3f15eadc04045252d501772d519c54cecc583f0a really fixed https://github.com/wlandau/crew/issues/76 and https://github.com/ropensci/targets/discussions/1101! I almost always get lots of host segfaults and dangling dispatchers when I develop and test crew for a feature like this, but this time around all those problems went away! Not a single instance.

wlandau commented 1 year ago

Re. your comment on https://github.com/wlandau/crew/issues/79, whether a worker is 'backlogged' shouldn't affect the actual completed count - if a worker does a task this will increment regardless. I think you'd want to set it higher than 1 to allow for random errors though.

So then I can trust that backlogged workers will have at least 1 completed task? I thought if completed < assigned then it might be the case that completed = 0 when it should be 1 (or at least 1 task should have actually completed).

shikokuchuo commented 1 year ago

Re. your comment on #79, whether a worker is 'backlogged' shouldn't affect the actual completed count - if a worker does a task this will increment regardless. I think you'd want to set it higher than 1 to allow for random errors though.

So then I can trust that backlogged workers will have at least 1 completed task? I thought if completed < assigned then it might be the case that completed = 0 when it should be 1 (or at least 1 task should have actually completed).

Yes, it is hard to remember how things were... but now it's all cumulative stats. So if 'backlogged' as you've termed it - assigned will be 1 larger than complete. At most 1. All that means is there is a task waiting to be completed at the socket. If a daemon dials in and completes that task complete will increment. Conversely, if it wasn't backlogged, assigned = complete to start with, then it gets sent a task, it completes it, both increase. So whether it is 'backlogged' shouldn't affect this feature. I just wanted to be sure there wasn't something getting mixed up here.