Closed wlandau closed 1 year ago
Closing this issue because the initial implementation of the plugin is up, and testing can happen whenever the right folks are available.
Having a look!
Thanks @nviets! Also cc @brendanf.
Your example works great! How can I pass through my own slurm template? In clustermq, I would set options(clustermq.template = "slurm.tmpl") and pass through arguments for partitions, resources, and so on.
Your example works great!
That's wonderful to hear!
How can I pass through my own slurm template? In clustermq, I would set options(clustermq.template = "slurm.tmpl") and pass through arguments for partitions, resources, and so on.
I am trying to get away from template files because I think they are confusing for most R users and inconvenient in most situations. I am opting for the formal slurm_* arguments in crew_controller_slurm(), with script_lines to input custom lines to the job script. You can see most of the lines that will be produced in the temporary job script using controller$launcher$script().
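A minimal sketch of that workflow, with placeholder values and the slurm_* arguments named later in this thread (check the crew.cluster documentation for the exact set available in your version):

library(crew.cluster)

# Configure SLURM resources through formal arguments rather than a template file.
controller <- crew_controller_slurm(
  name = "example",
  workers = 2L,
  seconds_idle = 120,
  slurm_cpus_per_task = 2L,
  slurm_memory_gigabytes_per_cpu = 4,
  script_lines = "module load R" # custom lines copied into the job script
)

# Preview the job script that will be submitted with sbatch.
controller$launcher$script()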
My first attempt failed, but this is because of the inconvenient way my cluster has chosen to implement the R environment module, which results in sbatch being invisible from R. So I will need to make a separate container with R and all necessary packages in it... (I also have to do this for targets with clustermq, so I guess it was to be expected...) I will report back once I get this built, although I expect I will have the same success as @nviets.
By default, crew_controller_slurm() detects sbatch from Sys.which("sbatch") and scancel from Sys.which("scancel"), but there are arguments command_submit and command_delete where you can supply the full paths to these programs. Would that be more convenient?
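For instance (the paths below are purely illustrative, not real defaults):

controller <- crew_controller_slurm(
  name = "example",
  command_submit = "/opt/slurm/bin/sbatch",  # hypothetical full path to sbatch
  command_delete = "/opt/slurm/bin/scancel"  # hypothetical full path to scancel
)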
The default R installation on the cluster runs inside a Singularity container, which perversely does not have the external /usr/bin mapped, so it's really not possible to run sbatch from R. The only solution I've found is just to make my own container!
I see the slurm_* arguments for memory and cpus, but how can I set the partition? My clusters have a variety of partitions for different sets of servers and resources, e.g. batch/interactive and GPU machines. Also, how is the path to R inherited by workers? In clusters, I often have a variety of R versions installed in non-standard locations. I usually control the path via the tmpl file. Is R_HOME inherited by the workers - is it overridable?
The slurm templates can be a bit clunky, but they allow for complete interactions with the cluster.
@nviets I think you can add additional #SBATCH ... lines to the script_lines argument of crew_controller_slurm() if you have additional arguments to pass to SLURM. This seems to work for me (I need to pass --account=...), although I am having some issues still.
It might be convenient to implement some helper to create these lines? Along the lines of:
controller <- crew_controller_slurm(
  ...,
  slurm_args = list(
    account = "project_00001",
    `gres:nvme` = 100,
    M = "large"
  )
)
Which would produce in the script:
#SBATCH --account="project_00001"
#SBATCH --gres:nvme=100
#SBATCH -M="large"
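A rough sketch of such a helper (purely hypothetical, not an existing crew.cluster function; it skips the quoting shown above):

# Hypothetical helper: turn a named list into #SBATCH lines for script_lines.
slurm_sbatch_lines <- function(args) {
  flags <- ifelse(
    nchar(names(args)) == 1L,
    paste0("-", names(args)),  # single-letter names become short flags
    paste0("--", names(args))  # everything else becomes long flags
  )
  paste0("#SBATCH ", flags, "=", unlist(args))
}

slurm_sbatch_lines(list(account = "project_00001", `gres:nvme` = 100, M = "large"))
# returns "#SBATCH --account=project_00001", "#SBATCH --gres:nvme=100", "#SBATCH -M=large"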
@wlandau For me, the launcher works and the worker job starts, but it apparently never does the task. During controller$wait(), each job gets canceled after 1 minute and a new one is submitted. I've modified the script to include both additional SBATCH arguments and to print the worker logs:
controller <- crew_controller_slurm(
  name = "my_workflow",
  workers = 1L,
  seconds_idle = 300,
  script_lines = c(
    "#SBATCH --account={my_project_id}"
    # I don't need to load the R environment module; my custom R is on PATH, which is inherited by the worker.
  ),
  verbose = TRUE,
  slurm_log_output = "crew_log_%A.txt"
)
The worker logs show:
> crew::crew_worker(settings = list(url = "ws://10.140.128.243:38007/107689e0026c9c7bf1d2f7a8742fe1043626c34a", maxtasks = Inf, idletime = 3e+05, walltime = Inf, timerstart = 0L, exitlinger = 100, cleanup = 1L, asyncdial = FALSE), launcher = "my_workflow", worker = 1L, instance = "107689e0026c9c7bf1d2f7a8742fe1043626c34a")
I see the slurm_* arguments for memory and cpus, but how can I set the partition? My clusters have a variety of partitions for different sets of servers and resources, e.g. batch/interactive and GPU machines. Also, how is the path to R inherited by workers? In clusters, I often have a variety of R versions installed in non-standard locations. I usually control the path via the tmpl file. Is R_HOME inherited by the workers - is it overridable?
I agree with @brendanf. Memory and CPUs have explicit formal arguments because they are common, and everything else can be handled through script_lines, which puts those lines in the job script. script_lines is also where you load R, however it needs to be loaded. On my cluster, I need to set script_lines = "module load R". For you, it may be different, e.g. a matter of exporting environment variables (although a project-level .Renviron file might be a better alternative).
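As a hedged illustration of that approach, a partition and an environment module can both go through script_lines (the partition and module names below are placeholders):

controller <- crew_controller_slurm(
  name = "example",
  script_lines = c(
    "#SBATCH --partition=batch", # choose the partition here
    "module load R"              # or however R gets on the PATH on your cluster
  )
)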
For me, the launcher works and the worker job starts, but it apparently never does the task. During controller$wait(), each job gets canceled after 1 minute and a new one is submitted.
Sounds like the worker is having trouble dialing into the client. You could confirm this by calling controller$summary() and checking if there is a worker online/connected.
Maybe your container setup is making the network programming more difficult. Your crew::crew_worker() call has url = ws://10.140.128.243:38007/..., which means it is trying to reach port 38007 at IP address 10.140.128.243. 10.140.128.243 is the local IP address you see from running getip::getip() from the R session where you call controller$start(). The Singularity container may have a different IP masking this one, and so when the worker tries to connect back, it may not be able to find 10.140.128.243.
I don't know enough about Docker or Singularity to make a recommendation, but I hear these tools are network programming-aware and there must be a way to expose IPs and ports to the local network.
But on the other hand, clustermq works the same way, and if it works, then my guess above might not be right.
You could do a more lightweight test with mirai alone and not crew. On the local R process on a login node, run:
library(mirai)
daemons(
  n = 1L,
  url = "ws://10.140.128.243:38007",
  dispatcher = TRUE,
  token = TRUE
)
Then call daemons()$daemons to get the url, which should have a long token at the end of it. While the mirai dispatcher is running, open a different node interactively and launch a mirai server:
library(mirai)
launch_server(url = "URL_FROM_DAEMONS")
When that server starts running, you should see the "online" column equal to 1 when you call daemons()$daemons on the login node where you ran daemons() first.
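When the test is done, the dispatcher can be torn down from the login node with standard mirai usage:

daemons(0) # reset: disconnect the server and stop the dispatcher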
The mirai-only test also fails for me. Can you think of any other implementation details that might distinguish this case from clustermq, which does work?
I looked a bit more in the worker log of a clustermq worker, and found that it was referring to the main machine by host name rather than IP. I re-tried the mirai-only test using url = paste0("ws://", Sys.info()["nodename"], ":38007") and now it works.
Then I tried adding host = Sys.info()["nodename"] to the crew_controller_slurm() call, and now both of your test scripts execute with no problem. So apparently this is some issue with the IP returned by getip::getip(), potentially resulting from the fact I'm running inside a Singularity container.
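For anyone hitting the same issue, the workaround amounts to something like this (other arguments omitted for brevity):

controller <- crew_controller_slurm(
  name = "my_workflow",
  host = Sys.info()[["nodename"]] # host name instead of the IP from getip::getip()
)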
So glad you figured it out!
Odd that getip::getip(type = "local") returns a hostname and Sys.info()["nodename"] returns an IP address. It's the complete opposite on the machines I have access to.
I searched everywhere, but beyond getip::getip(), I cannot seem to find a universal or reliable way to get the local IP address of the correct network interface of the local machine. I would love to find a better way, and I would love to be able to recommend something for IPv6.
@mschubert, how does clustermq get the local IP address that the workers connect back to?
@wlandau I think I wasn't clear; clustermq (which has always worked for me) uses a host name, while mirai was using an IP. So I tried using the host name instead of the IP in mirai, and now that works too. For some reason my setup works with a host name, but not an IP. Presumably this is actually because I can get the correct host name from Sys.info()["nodename"], but I cannot get the correct IP from getip::getip(type = "local").
I see, that makes sense. I wonder, is Sys.info()["nodename"] safe in the general case on a local network? Because if so, I think I would like to use it instead of getip::getip() as a crew default. It would solve issues like yours, and it would allow me to drop a package dependency. @mschubert, have you had any issues using the human-readable hostname for TCP sockets? @shikokuchuo, is this a robust enough way to construct NNG websockets?
Technically, from nanonext/NNG's perspective, both numeric IPs and hostnames are equally valid/supported for both TCP and websockets. The only thing I can think of is that it does require a DNS lookup.
library(nanonext)
?transports
Not sure you want to drop getip() entirely though, as with what's coming, crew may not need to be limited to a local network :)
Also, from experience, Sys.info()["nodename"] is not safe for a local network, esp. for custom configured Linux distros. The "just-works" case assumes there is something like avahi-daemon installed - from what I remember... this was some time ago.
I see, that makes sense. I wonder, is Sys.info()["nodename"] safe in the general case on a local network? Because if so, I think I would like to use it instead of getip::getip() as a crew default. It would solve issues like yours, and it would allow me to drop a package dependency. @mschubert, have you had any issues using the human-readable hostname for TCP sockets?
I found Sys.info()["nodename"] to be more reliable because it will (usually) resolve the host on all local network interfaces, while the IP address is specific to one interface. But when you have the right interface, the IP should always work if the hostname does.
The only issues I saw were (1) when localhost did not resolve to 127.0.0.1 (hence clustermq uses the IP for multicore), and (2) when two interfaces are resolved but one did not allow incoming connections (hence we've got the clustermq.host option to set an interface manually).
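For reference, the clustermq workaround mentioned here is a one-line option; the value below (an interface name) is just an example:

# In the R session that submits clustermq jobs:
options(clustermq.host = "ib0") # an interface name, hostname, or IP that workers can reach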
@wlandau just FYI as I am only starting to experiment, but it seems using the hostname may make it easier for authenticating TLS certificates.
Thanks, that's useful to know. Maybe a solution to https://github.com/wlandau/crew/issues/74, depending on what else your experiments find?
Actually, an IP address works equally well; we can choose either if we're creating a certificate ourselves.
It might just be slightly more compatible with existing certificates, e.g. one generated by openssl previously, which is more likely to use a hostname than an IP.
Thanks Will for another great addition to the targets-verse! I am able to run my test pipeline on SLURM using crew_controller_slurm() only when I set a required walltime flag for SBATCH via
script_lines = c(
  "#SBATCH --time=00:30:00"
)
It would be much simpler to tune requirements for heterogeneous targets if a slurm_walltime argument were provided, so this could be set like the companion arguments slurm_memory_gigabytes_per_cpu and slurm_cpus_per_task.
Good idea, the (few) SLURM clusters I have used in the past seem to require an explicitly set wall time. (@brendanf and @nviets, please correct me if I am wrong on this.) Implemented in efc4b5ab6c2718155da2b5e2284531b28f7efd87.
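Assuming the new walltime argument follows the naming pattern of its companions (the name slurm_time_minutes below is an assumption; check the current crew.cluster documentation for the exact name), usage would look roughly like:

controller <- crew_controller_slurm(
  name = "example",
  slurm_time_minutes = 30,           # assumed name of the new walltime argument
  slurm_cpus_per_task = 1L,
  slurm_memory_gigabytes_per_cpu = 2
)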
Agreed, this is helpful. I believe my cluster does let me submit a job without an explicit walltime, but the default is 1 minute, which is rarely useful in a real setting (although it was good enough for the tests earlier in this thread).
Incidentally, does the crew controller recover and resubmit jobs if a worker is killed by slurm due to time-out (or anything else)? I guess this would rely on the connection between the worker and the controller timing out?
Agreed, this is helpful. I believe my cluster does let me submit a job without an explicit walltime, but the default is 1 minute, which is rarely useful in a real setting (although it was good enough for the tests earlier in this thread).
Thanks! What would be a good default value? I currently set it to 1 day, but could adjust.
Incidentally, does the crew controller recover and resubmit jobs if a worker is killed by slurm due to time-out (or anything else)? I guess this would rely on the connection between the worker and the controller timing out?
It should, and yes it relies on the network connections. crew uses counters in mirai to detect if a worker did not complete all its assigned tasks, and it uses that information to resubmit the worker to run its backlog and potentially new tasks.
What would be a good default value? I currently set it to 1 day, but could adjust.
I think that is an ok default; it is long enough that many basic pipelines will be able to complete (or at least the worker will complete some targets before it times out), but also within the limits for the default partition on all the Slurm clusters I have used. Users who know that they don't need their workers to live that long should give a smaller value, because shorter jobs may get scheduled faster if the cluster utilization is high, but that is made easy by providing it as an argument.
I see, that makes sense. I wonder, is Sys.info()["nodename"] safe in the general case on a local network? Because if so, I think I would like to use it instead of getip::getip() as a crew default. It would solve issues like yours, and it would allow me to drop a package dependency. @mschubert, have you had any issues using the human-readable hostname for TCP sockets?
I found Sys.info()["nodename"] to be more reliable because it will (usually) resolve the host on all local network interfaces, while the IP address is specific to one interface. But when you have the right interface, the IP should always work if the hostname does.
The only issues I saw were (1) when localhost did not resolve to 127.0.0.1 (hence clustermq uses the IP for multicore), and (2) when two interfaces are resolved but one did not allow incoming connections (hence we've got the clustermq.host option to set an interface manually)
While troubleshooting other issues, I have been able to confirm that (2) is what is going on for me. Nodes on my cluster have both Ethernet and InfiniBand. getip::getip() returns the address of the Ethernet adapter, but they can only talk to each other via the InfiniBand. If I look up the IP address of the InfiniBand interface manually and pass it as host to crew_controller_slurm(), then the tests work.
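A hedged sketch of that manual lookup, assuming the InfiniBand interface is named ib0 (yours may differ; check ip addr):

# Find the InfiniBand IPv4 address on the submitting node (interface name assumed to be ib0).
ip_line <- system("ip -4 -o addr show ib0", intern = TRUE)
ib_ip <- sub(".*inet ([0-9.]+)/.*", "\\1", ip_line)

controller <- crew_controller_slurm(
  name = "example",
  host = ib_ip # workers connect back over the InfiniBand interface
)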
If I look up the IP address of the infiniband manually and pass it as host
Note that for any clustermq backend you can also use the interface name as host, e.g. ib0 for a previous infiniband I was using (check ifconfig or ip addr).
This is probably similar for mirai-backed workers.
@mschubert Sadly that doesn't seem to work in mirai; crew_controller_slurm() fails with initial sync with dispatcher timed out after 10s.
Hi @nviets, I just finished sketching an experimental SLURM plugin for crew.cluster. Would you be willing to try it out and see if it works for you? I wrote a couple of test scripts at https://github.com/wlandau/crew.cluster/tree/main/tests/slurm. From your end, it would be great to confirm that the controller produced by crew_controller_slurm() actually submits SLURM jobs and that crew tasks actually run on different nodes than the local node. The scripts in that folder are already written to test these cases. In addition, it would be great to know what you think about