Slurm adaptor got invalid key/value pair in output #72

Closed: arnikz closed this issue 4 years ago

arnikz commented 4 years ago

The GridEngine cluster at UMCU has recently been upgraded to use Slurm (v19), which will replace GE soon-ish. So I tested the sv-callers workflow, but all Slurm jobs failed (I also tried without the --max-memory arg, see release notes).

xenon -vvv scheduler slurm --location local:// submit --name smk.{rule} --inherit-env --cores-per-task {threads} --max-run-time 5 --max-memory {resources.mem_mb} --working-directory . --stderr stderr-%j.log --stdout stdout-%j.log
slurm adaptor: Got invalid key/value pair in output: Cgroup Support Configuration:
Error submitting jobscript (exit code 1):
13:18:55.487 [main] DEBUG n.e.x.a.s.ScriptingScheduler - creating sub scheduler for slurm adaptor at local://
13:18:55.498 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - Creating JobQueueScheduler for Adaptor local with multiQThreads: 4 and pollingDelay: 1000
13:18:55.501 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: Submitting job
13:18:55.506 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: Created Job local-0
13:18:55.507 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: Submitting job to queue unlimited
13:18:55.508 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: Waiting for interactive job to start.
13:18:55.543 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: getJobStatus for job local-0
13:18:55.543 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: findJob for job local-0
13:18:55.544 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: findJob for job local-0
13:18:55.544 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: findJob for job local-0
13:18:55.544 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: findJob for job local-0
13:18:55.545 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: cleanupJob for job local-0
13:18:55.545 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: cleanupJob for job local-0
13:18:55.545 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: cleanupJob for job local-0
13:18:55.545 [main] DEBUG n.e.x.a.s.RemoteCommandRunner - CommandRunner took 44 ms, executable = scontrol, arguments = [show, config], exitcode = 0, stdout:

arnikz commented 4 years ago

Perhaps, it's a good time to update the Docker images.

arnikz commented 4 years ago

@sverhoeven: could you give an estimate of how much time is required to fix this? Thanks.

sverhoeven commented 4 years ago

The ScriptingParser used in the SlurmScheduler class does not know about section headers (lines ending in :). I think it would take at least a day to write a robust parser and a couple of hours to create new Xenon and Xenon-* releases.

jmaassen commented 4 years ago

I've just added a single statement to ignore all lines without an = sign in them.
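For readers following along, the idea can be sketched in a few lines of Python. This is only an illustration of the "skip lines without an = sign" approach, not the actual Xenon code (which is Java); the function name `parse_key_value_lines` and the sample text are hypothetical, and the sample is a made-up fragment in the style of `scontrol show config` output, not output from the UMCU cluster.

```python
def parse_key_value_lines(output, separator="="):
    """Parse 'key = value' lines, ignoring lines without the separator."""
    result = {}
    for line in output.splitlines():
        if separator not in line:
            # Section headers such as "Cgroup Support Configuration:" and
            # blank lines carry no key/value pair, so they are skipped
            # instead of being reported as invalid.
            continue
        # Split on the first separator only, so values may contain "=".
        key, value = line.split(separator, 1)
        result[key.strip()] = value.strip()
    return result


# Hypothetical sample input (assumption, for illustration only).
sample = """Configuration data as of 2020-01-01T13:18:55
ClusterName             = example
SlurmctldPort           = 6817

Cgroup Support Configuration:
AllowedRAMSpace         = 100.0%
ConstrainCores          = no
"""

config = parse_key_value_lines(sample)
```

The trade-off is that section information is discarded: keys from every section end up in one flat map, which is enough as long as key names do not collide across sections.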

jmaassen commented 4 years ago

I'll add some tests and release a version with this fix

jmaassen commented 4 years ago

It's fixed in the jobstatus-bug branch of xenon. Or it parses the example shown above at least.

We would need a slurm 19 container to do proper testing?

arnikz commented 4 years ago

> It's fixed in the jobstatus-bug branch of xenon. Or it parses the example shown above at least.
>
> We would need a slurm 19 container to do proper testing?

arnikz commented 4 years ago

Hi, I've tested my workflow with xenon-cli 3.0.5beta1 + new slurm-19 image but the jobs are still failing. Please heeelp!

sverhoeven commented 4 years ago

The conda xenon-cli 3.0.5beta1 package was made only for non-Linux users (#73).

It does not include the fix from the https://github.com/xenon-middleware/xenon/tree/jobstatus-bug branch; it is a build with the Xenon v3.0.4 release.

jmaassen commented 4 years ago

Hmmm... my (new) unit test does parse the output correctly.

I think there may be some version mixup with xenon somewhere. I'll see if I can find the problem.

update: Ah, it seems the fix may be in the jobstatus-bug branch ;-)

jmaassen commented 4 years ago

I'll clean up the branch and test it with the other (non-slurm) scripting adaptors. I can then merge it into master and make a new release.

sverhoeven commented 4 years ago

I created a draft PR https://github.com/xenon-middleware/xenon/pull/670 for the jobstatus-bug branch, to see the test failures more easily.

jmaassen commented 4 years ago

Hmmm... most of the tests pass, except for one integration test. Apparently the sbatch argument "--workdir" has changed to "--chdir" at some point. Will fix.
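One way a job-script generator could cope with the renamed option is to pick the flag from the Slurm version. The Python sketch below is an illustration only and not how Xenon handles it; the helper names are hypothetical, and the cutoff version `(17, 11)` is an assumption, since the thread only says the option changed "at some point". Check the Slurm release notes before relying on it.

```python
# Assumed first version in which sbatch accepts --chdir (assumption, not from the thread).
ASSUMED_CHDIR_SINCE = (17, 11)


def working_directory_flag(slurm_version, chdir_since=ASSUMED_CHDIR_SINCE):
    """Return the sbatch working-directory flag for a version like '19.05.5'."""
    major, minor = (int(part) for part in slurm_version.split(".")[:2])
    if (major, minor) >= chdir_since:
        return "--chdir"
    return "--workdir"


def sbatch_directive(slurm_version, directory):
    """Build the #SBATCH line that sets the job's working directory."""
    return f"#SBATCH {working_directory_flag(slurm_version)}={directory}"


new_style = sbatch_directive("19.05.5", "/tmp/work")
old_style = sbatch_directive("17.02.9", "/tmp/work")
```

Comparing `(major, minor)` tuples avoids the pitfalls of comparing version strings lexically, where "9.1" would sort after "19.05".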

jmaassen commented 4 years ago

Fixed in the 3.1.0 release

sverhoeven commented 4 years ago

CLI v3.0.5 released on conda with Xenon 3.1.0. Please test

arnikz commented 4 years ago

All works fine with the latest release on Slurm 19. Thanks!