wlandau / crew.cluster

crew launcher plugins for traditional high-performance computing clusters
https://wlandau.github.io/crew.cluster

Support SLURM #1

Closed wlandau closed 1 year ago

wlandau commented 1 year ago

Hi @nviets, I just finished sketching an experimental SLURM plugin for crew.cluster. Would you be willing to try it out and see if it works for you? I wrote a couple test scripts at https://github.com/wlandau/crew.cluster/tree/main/tests/slurm. From your end, it would be great to confirm that the controller produced by crew_controller_slurm() actually submits SLURM jobs and that crew tasks actually run on different nodes than the local node. The scripts in that folder are already written to test these cases. In addition, it would be great to know what you think about

wlandau commented 1 year ago

Closing this issue because the initial implementation of the plugin is up, and testing can happen whenever the right folks are available.

nviets commented 1 year ago

Having a look!

wlandau commented 1 year ago

Thanks @nviets! Also cc @brendanf.

nviets commented 1 year ago

Your example works great! How can I pass through my own slurm template? In clustermq, I would set options(clustermq.template = "slurm.tmpl") and pass through arguments for partitions, resources, and so on.

wlandau commented 1 year ago

Your example works great!

That's wonderful to hear!

How can I pass through my own slurm template? In clustermq, I would set options(clustermq.template = "slurm.tmpl") and pass through arguments for partitions, resources, and so on.

I am trying to get away from template files because I think they are confusing for most R users and inconvenient in most situations. I am opting for the formal slurm_* arguments in crew_controller_slurm(), with script_lines to input custom lines to the job script. You can see most of the lines that will be produced in the temporary job script using controller$launcher$script().
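
For example, a minimal sketch based on the arguments mentioned in this thread (exact argument names, defaults, and the script() signature may differ across crew.cluster versions):

library(crew.cluster)

controller <- crew_controller_slurm(
  name = "example",
  workers = 2L,
  seconds_idle = 300,
  slurm_cpus_per_task = 1L,
  slurm_memory_gigabytes_per_cpu = 4
)

# Preview (most of) the lines of the temporary job script.
controller$launcher$script()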

brendanf commented 1 year ago

My first attempt failed, but this is because of the inconvenient way my cluster has chosen to implement the R environment module, which results in sbatch being invisible from R. So I will need to make a separate container with R and all necessary packages in it... (I also have to do this for targets with clustermq, so I guess it was to be expected...) I will report back once I get this built, although I expect I will have the same success as @nviets.

wlandau commented 1 year ago

By default, crew_controller_slurm() detects sbatch from Sys.which("sbatch") and scancel from Sys.which("scancel"), but there are arguments command_submit and command_delete where you can supply the full paths to these programs. Would that be more convenient?
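
For example (a sketch; the paths are hypothetical placeholders for wherever your cluster installs these binaries):

controller <- crew_controller_slurm(
  name = "example",
  command_submit = "/appl/slurm/bin/sbatch",  # hypothetical full path to sbatch
  command_delete = "/appl/slurm/bin/scancel"  # hypothetical full path to scancel
)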

brendanf commented 1 year ago

The default R installation on the cluster runs inside a singularity container, which perversely does not have the external /usr/bin mapped, so it's really not possible to run sbatch from R. The only solution I've found is just to make my own container!

nviets commented 1 year ago

I see the slurm_* arguments for memory and cpus, but how can I set the partition? My clusters have a variety of partitions for different sets of servers and resources, e.g. batch/interactive and GPU machines. Also, how is the path to R inherited by workers? In clusters, I often have a variety of R versions installed in non-standard locations. I usually control the path via the tmpl file. Is R_HOME inherited by the workers - is it overridable?

The slurm templates can be a bit clunky, but they allow for complete interactions with the cluster.

brendanf commented 1 year ago

@nviets I think you can add extra #SBATCH ... lines via the script_lines argument of crew_controller_slurm() if you have additional arguments to pass to SLURM. This seems to work for me (I need to pass --account=...), although I am still having some issues.

It might be convenient to implement a helper that generates these lines, for example:

controller <- crew_controller_slurm(
    ...
    slurm_args = list(
        account = "project_00001",
        `gres:nvme` = 100,
        M = "large"
    )
)

which would produce the following lines in the job script:

#SBATCH --account="project_00001"
#SBATCH --gres:nvme=100
#SBATCH -M="large"
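
Until something like that exists, a user-side helper is easy to sketch (hypothetical code, not part of crew.cluster; it only handles plain short and long flags, so entries like gres:nvme would need special treatment):

# Hypothetical helper: turn a named list into "#SBATCH" lines for script_lines.
sbatch_lines <- function(args) {
  short <- nchar(names(args)) == 1L
  flags <- ifelse(short, paste0("-", names(args), " "), paste0("--", names(args), "="))
  paste0("#SBATCH ", flags, vapply(args, as.character, character(1L)))
}

sbatch_lines(list(account = "project_00001", M = "large"))
#> [1] "#SBATCH --account=project_00001" "#SBATCH -M large"
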
brendanf commented 1 year ago

@wlandau For me, the launcher works and the worker job starts, but it apparently never does the task. During controller$wait(), each job gets canceled after 1 minute and a new one is submitted. I've modified the script to include both additional SBATCH arguments, and to print the worker logs:

controller <- crew_controller_slurm(
  name = "my_workflow",
  workers = 1L,
  seconds_idle = 300,
  script_lines = c(
    "#SBATCH --account={my_project_id}"
    # I don't need to load the R environment module, my custom R is on PATH, which is inherited by the worker
  ),
  verbose = TRUE,
  slurm_log_output = "crew_log_%A.txt"
)

The worker logs show:

> crew::crew_worker(settings = list(url = "ws://10.140.128.243:38007/107689e0026c9c7bf1d2f7a8742fe1043626c34a", maxtasks = Inf, idletime = 3e+05, walltime = Inf, timerstart = 0L, exitlinger = 100, cleanup = 1L, asyncdial = FALSE), launcher = "my_workflow", worker = 1L, instance = "107689e0026c9c7bf1d2f7a8742fe1043626c34a")
wlandau commented 1 year ago

I see the slurm_* arguments for memory and cpus, but how can I set the partition? My clusters have a variety of partitions for different sets of servers and resources, e.g. batch/interactive and GPU machines. Also, how is the path to R inherited by workers? In clusters, I often have a variety of R versions installed in non-standard locations. I usually control the path via the tmpl file. Is R_HOME inherited by the workers - is it overridable?

I agree with @brendanf. Memory and CPUs have explicit formal arguments because they are common, and everything else can be handled through script_lines, which puts those lines in the job script. script_lines is also where you load R, however it needs to be loaded. On my cluster, I need to set script_lines = "module load R". For you, it may be different, e.g. a matter of exporting environment variables (although a project-level .Renviron file might be a better alternative).
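
For example, to target a specific partition and load R through an environment module (a sketch; the partition name and module command are placeholders for your site's setup):

controller <- crew_controller_slurm(
  name = "example",
  script_lines = c(
    "#SBATCH --partition=batch",  # placeholder; e.g. batch, interactive, gpu
    "module load R"               # or however R needs to be made available on your cluster
  )
)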

wlandau commented 1 year ago

For me, the launcher works and the worker job starts, but it apparently never does the task. During controller$wait(), each job gets canceled after 1 minute and a new one is submitted.

Sounds like the worker is having trouble dialing into the client. You could confirm this by calling controller$summary() and checking if there is a worker online/connected.
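
A minimal check from the client R session might look like this (a sketch; push(), wait(), and summary() are standard crew controller methods, but the exact columns reported by summary() depend on your crew version):

controller$start()
controller$push(name = "ping", command = TRUE)  # a trivial task so a worker gets launched
controller$wait(seconds_timeout = 60)           # assumed argument; adjust to your needs
controller$summary()                            # check whether a worker shows up as connected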

Maybe your container setup is making the network programming more difficult. The crew::crew_worker() call in your log has url = ws://10.140.128.243:38007/..., which means it is trying to reach port 38007 at IP address 10.140.128.243. That is the local IP address you see from running getip::getip() in the R session where you call controller$start(). The Singularity container may have a different IP masking this one, so when the worker tries to connect back, it may not be able to find 10.140.128.243.

I don't know enough about Docker or Singularity to make a recommendation, but I hear these tools are aware of networking and there must be a way to expose IPs and ports to the local network.

On the other hand, clustermq works the same way, so if clustermq works for you, then my guess above might not be right.

wlandau commented 1 year ago

You could run a more lightweight test with mirai alone, without crew. In the local R process on a login node, run:

library(mirai)
daemons(
  n = 1L,
  url = "ws://10.140.128.243:38007"
  dispatcher = TRUE,
  token = TRUE
)

Then call daemons()$daemons to get the URL, which should have a long token at the end of it. While the mirai dispatcher is running, open an interactive session on a different node and launch a mirai server:

library(mirai)
launch_server(url = "URL_FROM_DAEMONS")

When that server starts running, you should see the "online" column equal to 1 when you call daemons()$daemons on the login node where you first ran daemons().

brendanf commented 1 year ago

The mirai-only test also fails for me. Can you think of any other implementation details that might distinguish this case from clustermq, which does work?

brendanf commented 1 year ago

I looked a bit more closely at the log of a clustermq worker and found that it was referring to the main machine by host name rather than IP. I re-tried the mirai-only test using url = paste0("ws://", Sys.info()["nodename"], ":38007") and now it works.

Then I tried adding host = Sys.info()["nodename"] to the crew_controller_slurm() call, and now both of your test scripts execute with no problem. So apparently this is some issue with the IP returned by getip::getip(), potentially resulting from the fact that I'm running inside a Singularity container.
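
For reference, the workaround that worked here boils down to (other arguments omitted):

controller <- crew_controller_slurm(
  host = Sys.info()[["nodename"]]  # advertise the hostname instead of the IP from getip::getip()
)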

wlandau commented 1 year ago

So glad you figured it out!

Odd that getip::getip(type = "local") returns a hostname and Sys.info()["nodename"] returns an IP address. It's the complete opposite on the machines I have access to.

I searched everywhere, but beyond getip::getip(), I cannot seem to find a universal or reliable way to get the local IP address of the correct network interface of the local machine. I would love to find a better way, and I would love to be able to recommend something for IPv6.

@mschubert, how does clustermq get the local IP address that the workers connect back to?

brendanf commented 1 year ago

@wlandau I think I wasn't clear; Clustermq (which has always worked for me) uses a host name, while mirai was using an IP. So I tried using the host name instead of the IP in mirai, and now that works too. For some reason my setup works with a host name, but not an IP. Presumably this is actually because I can get the correct host name from Sys.info()["nodename"], but I cannot get the correct IP from getip::getip(type = "local").

wlandau commented 1 year ago

I see, that makes sense. I wonder, is Sys.info()["nodename"] safe in the general case on a local network? Because if so, I think I would like to use it instead of getip::getip() as a crew default. It would solve issues like yours, and it would allow me to drop a package dependency. @mschubert, have you had any issues using the human-readable hostname for TCP sockets? @shikokuchuo, is this a robust enough way to construct NNG websockets?

shikokuchuo commented 1 year ago

Technically, from nanonext/NNG's perspective, numeric IPs and hostnames are equally valid and supported for both TCP and websockets. The only thing I can think of is that a hostname does require a DNS lookup.

library(nanonext)
?transports

Not sure you want to drop getip() entirely though as with what's coming, crew may not need to be limited to a local network :)

shikokuchuo commented 1 year ago

Also, from experience, Sys.info()["nodename"] is not safe for a local network, especially for custom-configured Linux distros. The "just-works" case assumes there is something like avahi-daemon installed - from what I remember... this was some time ago.

mschubert commented 1 year ago

I see, that makes sense. I wonder, is Sys.info()["nodename"] safe in the general case on a local network? Because if so, I think I would like to use it instead of getip::getip() as a crew default. It would solve issues like yours, and it would allow me to drop a package dependency. @mschubert, have you had any issues using the human-readable hostname for TCP sockets?

I found Sys.info()["nodename"] to be more reliable because it will (usually) resolve the host on all local network interfaces, while the IP address is specific to one interface. But when you have the right interface, the IP should always work if the hostname does.

The only issues I saw were (1) when localhost did not resolve to 127.0.0.1 (hence clustermq uses IP for multicore), and (2) when two interfaces are resolved but one did not allow incoming connections (hence we've got the clustermq.host option to set an interface manually)

shikokuchuo commented 1 year ago

@wlandau just FYI as I am only starting to experiment, but it seems using the hostname may make it easier to authenticate TLS certificates.

wlandau commented 1 year ago

Thanks, that's useful to know. Maybe a solution to https://github.com/wlandau/crew/issues/74, depending on what else your experiments find?

shikokuchuo commented 1 year ago

Actually an IP address works equally well, we can choose either if we're creating a certificate ourselves.

It might just be slightly more compatible with existing certificates, e.g. one generated by openssl previously, which is more likely to use a hostname than an IP.

cfljam commented 1 year ago

Thanks Will for another great addition to the targets-verse! I am able to run my test pipeline on SLURM using crew_controller_slurm() only when I set a required walltime flag for sbatch via

script_lines = c("#SBATCH --time=00:30:00")

It would be much simpler to tune requirements for heterogeneous targets if a slurm_walltime argument were provided, so the wall time could be set in the same way as the companion arguments slurm_memory_gigabytes_per_cpu and slurm_cpus_per_task.

wlandau commented 1 year ago

Good idea, the (few) SLURM clusters I have used in the past seem to require an explicitly set wall time. (@brendanf and @nviets, please correct me if I am wrong on this.) Implemented in efc4b5ab6c2718155da2b5e2284531b28f7efd87.
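
A sketch of the resulting usage (assuming the new argument is named slurm_time_minutes; please check the documentation of your installed crew.cluster version):

controller <- crew_controller_slurm(
  slurm_time_minutes = 60,           # e.g. request a 1-hour wall time instead of the default
  slurm_memory_gigabytes_per_cpu = 4,
  slurm_cpus_per_task = 1L
)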

brendanf commented 1 year ago

Agreed, this is helpful. I believe my cluster does let me submit a job without an explicit walltime, but the default is 1 minute, which is rarely useful in a real setting (although it was good enough for the tests earlier in this thread).

Incidentally, does the crew controller recover and resubmit jobs if a worker is killed by slurm due to time-out (or anything else)? I guess this would rely on the connection between the worker and the controller timing out?

wlandau commented 1 year ago

Agreed, this is helpful. I believe my cluster does let me submit a job without an explicit walltime, but the default is 1 minute, which is rarely useful in a real setting (although it was good enough for the tests earlier in this thread).

Thanks! What would be a good default value? I currently set it to 1 day, but could adjust.

Incidentally, does the crew controller recover and resubmit jobs if a worker is killed by slurm due to time-out (or anything else)? I guess this would rely on the connection between the worker and the controller timing out?

It should, and yes it relies on the network connections. crew uses counters in mirai to detect if a worker did not complete all its assigned tasks, and it uses that information to resubmit the worker to run its backlog and potentially new tasks.

brendanf commented 1 year ago

What would be a good default value? I currently set it to 1 day, but could adjust.

I think that is an ok default; it is long enough that many basic pipelines will be able to complete (or at least the worker will complete some targets before it times out), but also within the limits for the default partition on all the Slurm clusters I have used. Users who know that they don't need their workers to live that long should give a smaller value, because shorter jobs may get scheduled faster if the cluster utilization is high, but that is made easy by providing it as an argument.

brendanf commented 9 months ago

I see, that makes sense. I wonder, is Sys.info()["nodename"] safe in the general case on a local network? Because if so, I think I would like to use it instead of getip::getip() as a crew default. It would solve issues like yours, and it would allow me to drop a package dependency. @mschubert, have you had any issues using the human-readable hostname for TCP sockets?

I found Sys.info()["nodename"] to be more reliable because it will (usually) resolve the host on all local network interfaces, while the IP address is specific to one interface. But when you have the right interface, the IP should always work if the hostname does.

The only issues I saw were (1) when localhost did not resolve to 127.0.0.1 (hence clustermq uses IP for multicore), and (2) when two interfaces are resolved but one did not allow incoming connections (hence we've got the clustermq.host option to set an interface manually)

While troubleshooting other issues, I have been able to confirm that (2) is what is going on for me. Nodes on my cluster have both Ethernet and InfiniBand. getip::getip() returns the address of the Ethernet adapter, but the nodes can only talk to each other via InfiniBand. If I look up the IP address of the InfiniBand interface manually and pass it as host to crew_controller_slurm(), then the tests work.
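
A hedged sketch of that manual lookup from R, assuming the InfiniBand interface is named ib0 and the iproute2 ip command is available (check ip addr on your nodes):

ib_line <- grep("inet ", system("ip -4 addr show ib0", intern = TRUE), value = TRUE)
ib_ip <- sub(".*inet ([0-9.]+)/.*", "\\1", ib_line[[1L]])
controller <- crew_controller_slurm(host = ib_ip)  # other arguments omitted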

mschubert commented 9 months ago

If I look up the IP address of the infiniband manually and pass it as host

Note that for any clustermq backend you can also use the interface name as host, e.g. ib0 for a previous infiniband I was using (check ifconfig or ip addr)

This is probably similar for mirai-backed workers

brendanf commented 9 months ago

@mschubert Sadly that doesn't seem to work in mirai; crew_controller_slurm() fails with "initial sync with dispatcher timed out after 10s".