wlandau / crew

A distributed worker launcher
https://wlandau.github.io/crew/
Other
123 stars 4 forks source link

No need to use mirai::saisei() #136

Closed wlandau closed 10 months ago

wlandau commented 10 months ago

The most recent version of mirai locks dispatcher sockets, so only one worker can connect at a time. That means crew should no longer need to rotate websocket URLs (I plan to test this locally).

wlandau commented 10 months ago

Using {nanonext} 0.10.4.9000 and {mirai} 0.11.1.9000, I am finding I can part with saisei() just fine. But with the latest CRAN releases (0.10.4 and 0.11.1), it appears that two daemons can connect to the same websocket, and that this flips the "online" bit back to 0 as predicted at https://github.com/wlandau/crew/issues/31#issuecomment-1464013505. I guess we'll be depending on the dev versions for a while.

wlandau commented 10 months ago

On the bright side, eliminating saisei() dramatically reduces manual interactions with the dispatcher, which just might solve https://github.com/wlandau/crew/discussions/128 and user-end troubles like it.

wlandau commented 10 months ago

But with the latest CRAN releases (0.10.4 and 0.11.1), it appears that two daemons can connect to the same websocket, and that this flips the "online" bit back to 0 as predicted at https://github.com/wlandau/crew/issues/31#issuecomment-1464013505.

It is straightforward to reproduce this locally (R 4.3.2, Mac OS 14.1). In a host process, I call:

# terminal 1 R session (host)
mirai::daemons(n = 1L, url = "ws://127.0.0.1:5700")
#> [1] 1
mirai::status()
#> $connections
#> [1] 1
#> 
#> $daemons
#>                     i online instance assigned complete
#> ws://127.0.0.1:5700 1      0        0        0        0

Then I start a daemon process in a different terminal.

# terminal 2
R -e 'mirai::daemon(url = "ws://127.0.0.1:5700", asyncdial = FALSE)'

The host process sees this new daemon. So far so good.

# terminal 1 R session (host)
mirai::status()
#> $connections
#> [1] 1
#> 
#> $daemons
#>                     i online instance assigned complete
#> ws://127.0.0.1:5700 1      1        1        0        0

But now when I start a second daemon process in a different terminal, that daemon also connects.

# terminal 3
# The daemon R session stays running:
R -e 'mirai::daemon(url = "ws://127.0.0.1:5700", asyncdial = FALSE)'
# terminal 1 R session (host)
mirai::status()
#> $connections
#> [1] 1
#> 
#> $daemons
#>                     i online instance assigned complete
#> ws://127.0.0.1:5700 1      0        1        0        0

At this point, both daemons are running (terminals 2 and 3), and the online bit flipped back to 0. Interestingly, mirai::status()$connections still shows 1.

FYI @shikokuchuo

wlandau commented 10 months ago

I am rediscovering that I need to manually reset the instance counter each time I launch a {crew} worker. Otherwise, it's hard to reason about how many worker instances I am expecting and whether the "current" one completed. So I think crew attached to saisei() after all.

wlandau commented 10 months ago

Reopened to close as "not planned".

shikokuchuo commented 10 months ago

But with the latest CRAN releases (0.10.4 and 0.11.1), it appears that two daemons can connect to the same websocket, and that this flips the "online" bit back to 0 as predicted at #31 (comment).

It is straightforward to reproduce this locally (R 4.3.2, Mac OS 14.1). In a host process, I call:

# terminal 1 R session (host)
mirai::daemons(n = 1L, url = "ws://127.0.0.1:5700")
#> [1] 1
mirai::status()
#> $connections
#> [1] 1
#> 
#> $daemons
#>                     i online instance assigned complete
#> ws://127.0.0.1:5700 1      0        0        0        0

Then I start a daemon process in a different terminal.

# terminal 2
R -e 'mirai::daemon(url = "ws://127.0.0.1:5700", asyncdial = FALSE)'

The host process sees this new daemon. So far so good.

# terminal 1 R session (host)
mirai::status()
#> $connections
#> [1] 1
#> 
#> $daemons
#>                     i online instance assigned complete
#> ws://127.0.0.1:5700 1      1        1        0        0

But now when I start a second daemon process in a different terminal, that daemon also connects.

# terminal 3
# The daemon R session stays running:
R -e 'mirai::daemon(url = "ws://127.0.0.1:5700", asyncdial = FALSE)'
# terminal 1 R session (host)
mirai::status()
#> $connections
#> [1] 1
#> 
#> $daemons
#>                     i online instance assigned complete
#> ws://127.0.0.1:5700 1      0        1        0        0

At this point, both daemons are running (terminals 2 and 3), and the online bit flipped back to 0. Interestingly, mirai::status()$connections still shows 1.

FYI @shikokuchuo

Your reprex works as expected for me on Ubuntu 22.04.3, R 4.3.2. The second terminal instance exits straight away. The first one remains connected. With nanonext 0.10.4, mirai 0.11.1. But you report it working with the dev versions anyway so I'm not really concerned.

shikokuchuo commented 10 months ago

I am rediscovering that I need to manually reset the instance counter each time I launch a {crew} worker. Otherwise, it's hard to reason about how many worker instances I am expecting and whether the "current" one completed. So I think crew attached to saisei() after all.

I am not sure what is meant here, but it seems possible to create (duplicate to) another counter for this purpose, you don't actually need to manipulate the actual condition value inside the cv right? In my opinion, eliminating saisei() is not a small matter for your semi-transient workers, having to re-create the listeners is just pure overhead - I wouldn't be so quick to dismiss.

wlandau commented 10 months ago

crew needs a way to figure out if it needs to relaunch a worker. It uses the instance counter to help determine whether an instance of a worker successfully connected. If there are lurking instances in the background that could potentially connect, this creates ambiguity. That's why I need to rotate the websocket URL.

Say I launch a worker as an AWS Batch spot instance. It could take a few seconds or several minutes to connect to the host, depending pricing fluctuations on the cloud. crew's policy is to wait for some fixed time interval for it to start, then give up if it does not connect within that window. By "give up", I mean send a REST API request to cancel the job, and rotate the websocket URL. If crew does not rotate the websocket and the cancellation request arrives too late, then this could happen:

  1. crew sends the REST request to cancel the worker instance.
  2. To replace the first worker instance, crew launches a second one.
  3. The first worker connects in spite of the cancellation request.
  4. The REST cancellation request succeeds and the first worker abruptly terminates.
  5. The second worker connects.

Here, we would expect an instance count of 1, but it would actually be 2. In general, instance could be either 0, 1, or 2, depending on a hidden race condition. Rotating the websocket with saisei() excludes lurking worker instances and guarantees that crew can trust instance.

We had a similar discussion about this in #31 leading up to the implementation of saisei().

shikokuchuo commented 10 months ago

Thanks, that's very clear now. Locked sockets will ensure that only one daemon can connect at any one time, but it doesn't guarantee that the correct one does in your example above.