I thought my update for 0.8.7.9003 would have fixed this. At least you are just getting the timeout (which is intentional) and not the other errors/warnings, I assume. Any ideas on how to reproduce it?
There is an existing 3-second timeout if nothing comes back from the dispatcher. This is to prevent hanging. I have not made this tuneable, as 3 seconds should be plenty: there is not much to send, and it is a local process using fast IPC sockets. If it does indeed produce this timeout, I'd assume the dispatcher died. Unless that's not the case?
EDIT: I've just pushed v0.8.7.9004. This finishes the job of the previous build, which was quite hastily put together yesterday. I don't think it addresses this issue directly, but just in case... it might well have fixed it?
Again, I really appreciate your work on this! I just tried 0.8.7.9004 on the example from https://github.com/wlandau/crew/issues/75#issue-1709009643 several times, and the timeout happens far less often. However, I still see occurrences on my Ubuntu machine. One time I ran into it when the example was just starting:
```
# [1] 1
# Error:
# ! invalid daemons: structure(5L, class = "errorValue")
```
And another time after all tasks were pushed and were in the process of being popped:
```
...
[1] "3d7e364e8c1e7994ba448da75798b86d3d12df1e"
[1] "cd08685844d62fac3c213ee1cd643b7bd6906704"
[1] "e48993ae1e1dbad82ecd1697955c579becb2e471"
[1] "9fceae13b365a397000b4b4cae7ac70d15fc79ef"
[1] "fe58a2ffa1522d6136df58c6cee392608b0ca36f"
[1] "82f03a6a32f4b31f4dc979f0947fd12617754a84"
Error:
! invalid daemons: structure(5L, class = "errorValue")
```
In the latter case, there was only 1 task left in the crew queue, and all 5999 available results were already popped. Calling $pop() once to start a server and then again to return the result took care of the last task. All the while, the dispatcher never stopped running: I could see it in htop, and I confirmed the PID matches the one from rownames(daemons()$daemons).
Btw, this is not Ubuntu-specific as such: I have not managed to reproduce it even once on my Ubuntu laptop. But given you can get this on an Ubuntu machine, it is a real problem.
From what you're describing, this seems to point to something like daemons() being called in a spin loop, and as it happens at the start and the end, possibly to do with its use in scaling.
You should consider adding msleep(1L) or even msleep(10L) calls where appropriate. There is a reason Garrett D'Amore provides this utility as part of NNG, and this is a legitimate use for cases where you can't solve things through synchronisation. If you are calling daemons() constantly, it also prevents the dispatcher from doing real work.
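To illustrate the general pattern (a minimal sketch, not crew's actual code): poll for completion with a short msleep() between checks rather than busy-waiting, so the dispatcher stays free to do real work.

```r
library(mirai)
library(nanonext)

daemons(2L)

# Launch a few tasks asynchronously.
tasks <- lapply(1:4, function(i) mirai(i * 2L, i = i))

# Poll for completion, sleeping ~10 ms between checks instead of spinning.
while (any(vapply(tasks, unresolved, logical(1L)))) {
  msleep(10L)
}
lapply(tasks, function(x) x$data)

daemons(0L)  # reset daemons when finished
```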
Reframing the problem: I am not seeing a good reason why a daemons() request should get stuck. That's not to say it's impossible, and factually it does seem to get stuck, say, 0.1% of the time, but the other 99.9% of the time it works as expected. If crew does not probe the 0.1% of cases where it does get stuck, that is likely to be a more sustainable solution going forward.
Also, I made a couple of small tweaks in v0.8.7.9005. It would be great to know if these are enough to eliminate the timeouts!
I've finally seen this:
```
Error:
! invalid daemons: structure(5L, class = "errorValue")
```
when I reverted the control socket back to "bus"! I'm taking a slightly different approach that should prevent these altogether.
In mirai 3c04d04 (v0.8.7.9006) I've re-implemented the dispatcher control socket.
This eliminates the timeouts for me, even though it still only uses the same lightweight "bus" sockets; the issue is addressed through synchronisation techniques instead.
However, as I was not able to reproduce the timeouts with the req/rep sockets before, I still need you to confirm that it works on your Ubuntu machine.
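For context, a "bus" socket is just NNG's lightweight many-to-many protocol. A toy nanonext illustration over local IPC (illustration only; the actual dispatcher internals differ, and the IPC path is arbitrary):

```r
library(nanonext)

s1 <- socket("bus", listen = "ipc:///tmp/bus-demo")
s2 <- socket("bus", dial = "ipc:///tmp/bus-demo")
msleep(100L)  # give the asynchronous dial a moment to connect

send(s1, "status request")   # default serialised send
recv(s2, block = 1000)       # wait up to 1 s for the message

close(s1)
close(s2)
```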
That certainly improves things! I can no longer reliably reproduce the timeouts locally on Ubuntu (at most they are very rare), and I only got one timeout on GitHub Actions: https://github.com/ropensci/targets/actions/runs/5006076955.
Meanwhile, I implemented basic throttling in crew's auto-scaling to be nicer to the dispatcher. (Auto-scaling does not need to happen again if it already took place, e.g., half a second ago.) This also has the effect of doubling the speed of my tests, particularly the one at the top of #75, so that should make timeouts even less likely.
Currently testing on GitHub Actions.
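The throttling idea is roughly the following (a hypothetical sketch with made-up names, not crew's actual implementation): remember when auto-scaling last ran and skip it if that was less than a fraction of a second ago.

```r
# Build a closure that runs an action at most once per `seconds`.
throttler <- function(seconds = 0.5) {
  last <- -Inf
  function(action) {
    now <- as.numeric(proc.time()[["elapsed"]])
    if (now - last < seconds) return(invisible(FALSE))  # too soon: skip
    last <<- now
    action()
    invisible(TRUE)
  }
}

maybe_scale <- throttler(seconds = 0.5)
maybe_scale(function() message("scaling workers"))  # runs
maybe_scale(function() message("scaling workers"))  # skipped: ran < 0.5 s ago
```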
Not to propose a new direction in development, but this does get me thinking big-picture about the mirai/crew integration. mirai has a dispatcher as a separate process, which frees the user session. But in crew, because of auto-scaling and transient workers, the main R process needs to be an event loop that continuously calls push(), pop(), or wait() (similar to Gabor's task queue idea for callr). The crew event loop is even more important now that there is throttling. All this makes me wonder if crew itself could manually run an iteration of dispatch each time e.g. pop() is called, rather than overburdening the mirai dispatcher. It might not be feasible due to the scope and intent of mirai, but I keep wondering about this.
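To make the event-loop shape concrete, here is a hedged sketch using crew's controller API as I understand it; the task names, counts, and sleep intervals are arbitrary.

```r
library(crew)

controller <- crew_controller_local(workers = 2L)
controller$start()

# Push tasks; auto-scaling happens inside push()/pop() calls.
for (i in seq_len(10L)) {
  controller$push(name = paste0("task_", i), command = Sys.sleep(0.1))
}

# Event loop: keep polling (with a brief pause) until all results are popped.
results <- list()
while (length(results) < 10L) {
  out <- controller$pop()   # pop() can also trigger (throttled) auto-scaling
  if (!is.null(out)) {
    results[[length(results) + 1L]] <- out
  } else {
    Sys.sleep(0.01)         # short pause instead of spinning
  }
}

controller$terminate()
```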
Unfortunately, I still got instances of 'errorValue' int 7 | Object closed for the task in the GitHub Actions targets tests. Example: https://github.com/ropensci/targets/actions/runs/5006290328/jobs/8971368720#step:9:257
This is new to me. Have you seen this recently? It would be good to know when it started popping up. I'm not immediately sure what this is about.
Sorry, how am I meant to read the logs? Is 7 | Object closed the return value of a mirai, or is there an unresolved mirai and that was the return value of a daemons() call?
Perhaps we should open a new issue to prevent this one from getting too long.
I will open a new issue, sorry to throw so many things here at once.
Just posted #77.
The specific "'errorValue' int 5" messages started to creep back just after I implemented #87. At first I only saw them intermittently on GitHub Actions macOS runners, but today I saw it for the first time on my local MacBook during R CMD check. The local instance happened here in a test, which traces back to launcher$tally() and ultimately to mirai::daemons(). Good news though: I added more detail to the error message so we can find out whether the dispatcher is still alive when this happens.
If the dispatcher is still running when the error happens, then I might be able to implement a retry/timeout mechanism in crew as a workaround. That wouldn't work if the dispatcher had already exited, but in that case, maybe the root cause would be easier to solve.
From https://github.com/wlandau/crew/actions/runs/5410657252/jobs/9832405357#step:8:167, it looks like the dispatcher is actually still running when the daemons() error happens. So maybe I should try implementing retries.
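What I have in mind is something along these lines (a hypothetical sketch with a made-up helper name, not crew's actual code): if daemons() returns an errorValue, pause briefly and retry a few times before giving up.

```r
library(mirai)
library(nanonext)

daemons_status_with_retry <- function(attempts = 5L, wait_ms = 100L) {
  for (i in seq_len(attempts)) {
    status <- daemons()$daemons
    if (!is_error_value(status)) return(status)  # success: return the status
    msleep(wait_ms)                              # back off before retrying
  }
  stop("daemons() still returned an errorValue after ", attempts, " attempts")
}
```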
I'm not sure, but https://github.com/shikokuchuo/mirai/issues/64 might be a clue as to what is happening. If the dispatcher is busy or being called the moment a server dials in, I wonder if that would cause problems. Unfortunately, crew needs asyncdial = FALSE as a safeguard to make sure workers shut down when they need to shut down.
I don't think https://github.com/shikokuchuo/mirai/issues/64 is related.
Is there an actual deadlock here at the dispatcher? I.e., if you just record but otherwise ignore the error at the client and retry/let things continue, does the pipeline finish?
I don't know, actually. I implemented retries a day or 2 ago, but I haven't had time to really test it. Currently testing in branches with-retries and without-retries in this repo.
I tested 200 builds with retries and 200 without, and only one build had connection issues (in the latter). I expect more issues without retries, but regardless, something seems to have fixed this.
Related: #75, https://github.com/shikokuchuo/mirai/issues/53. Happens both with and without targets, although most often in GitHub Actions targets R CMD check runs and in targets pipelines on my local Ubuntu machine.
I am starting to think this issue is specific to Ubuntu. My local Ubuntu installation is 22.04, and ubuntu-latest on GitHub Actions is also 22.04.
I wrote a Dockerfile at https://github.com/ropensci/targets/blob/debug-crew/Dockerfile to get set up for reproducible examples on Ubuntu. (I suppose the version in FROM ubuntu:20.04 might need to change.)
@shikokuchuo, which systems do you test with?