xenon-middleware / xenon-cli

Perform files and jobs operations with Xenon library from command line
http://nlesc.github.io/Xenon/
Apache License 2.0

Socket timed out for Slurm adaptor #62

Open arnikz opened 6 years ago

arnikz commented 6 years ago

I got the following error after submitting a couple of jobs to the queue. It occurs occasionally during builds on Travis CI.

slurm adaptor: Could not run command "scontrol" with stdin "null" arguments "[show, config]" at "nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler@142bb68f". Exit code = 1 Output:  Error output: slurm_load_ctl_conf error: Socket timed out on send/recv operation
Error submitting jobscript (exit code 1):
00:10:14.463 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.ScriptingScheduler - creating sub scheduler for slurm adaptor at local://
00:10:14.471 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - Creating JobQueueScheduler for Adaptor local with multiQThreads: 4 and pollingDelay: 1000
00:10:14.473 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: Submitting job
00:10:14.473 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: Created Job local-0
00:10:14.475 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: Submitting job to queue unlimited
00:10:14.475 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: Waiting for interactive job to start.
00:10:16.487 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: getJobStatus for job local-0
00:10:16.488 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: findJob for job local-0
00:10:16.488 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: findJob for job local-0
00:10:16.488 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: findJob for job local-0
00:10:16.488 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: findJob for job local-0
00:10:16.488 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: cleanupJob for job local-0
00:10:16.488 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: cleanupJob for job local-0
00:10:16.488 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: cleanupJob for job local-0
00:10:16.489 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.RemoteCommandRunner - CommandRunner took 2016 ms, executable = scontrol, arguments = [show, config], exitcode = 1, stdout:
stderr:
slurm_load_ctl_conf error: Socket timed out on send/recv operation

Interestingly, after restarting the Travis build this error disappeared.

jmaassen commented 6 years ago

The socket timeout is actually produced by Slurm itself; Xenon just reports it back.

Apparently there is a race condition that occurs every now and then, where part of Slurm is not yet running (or has already disappeared).
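
If the timeout really is a transient startup or shutdown race in Slurm, one possible workaround on the CI side is to probe the controller and retry until it answers before submitting jobs. The following is only a minimal sketch under that assumption, not something Xenon or xenon-cli does: the helper name `run_with_retry`, its parameters, and the default attempt count and delay are hypothetical. It retries only when stderr contains the exact message from the log above.

```python
import subprocess
import time

# Message taken from the stderr in the report above; treated as "transient".
TRANSIENT_MARKER = "Socket timed out on send/recv operation"


def run_with_retry(cmd, attempts=5, delay=2.0, run=subprocess.run, sleep=time.sleep):
    """Run cmd, retrying while it fails with the Slurm socket timeout.

    Hypothetical helper: `run` and `sleep` are injectable so the retry
    logic can be exercised without a real Slurm installation.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        result = run(cmd, capture_output=True, text=True)
        transient = result.returncode != 0 and TRANSIENT_MARKER in result.stderr
        # Return on success, on any other kind of failure, or when out of attempts.
        if not transient or attempt == attempts:
            return result
        sleep(delay)
```

In a CI setup step this could be called as `run_with_retry(["scontrol", "show", "config"])` and the build continued only if the returned `returncode` is 0. Any failure other than the socket timeout is returned immediately, so real configuration errors are not masked by the retries.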