xenon-middleware / xenon

A middleware abstraction library that provides a simple programming interface to various compute and storage resources.
http://xenon-middleware.github.io/xenon/
Apache License 2.0

Slurm adaptor getJobStatus fails in the wrong way when SSH connection is lost #668

Closed jmaassen closed 4 years ago

jmaassen commented 4 years ago

When getJobStatus is called on the Slurm adaptor, it executes up to 3 different Slurm commands in an attempt to find the job: squeue, sinfo and sacct. It sends these commands using an interactive job on a subscheduler such as SSH. If the first command does not produce a result, it tries the next, and so on.

However, if the SSH connection is down, the first command produces an exception instead of a result. The Slurm adaptor then logs a debug message (which is ignored by default) and goes on to try the next command. That command again produces an exception, and so on.

When all commands are tried and there is no result, a NoSuchJobException is thrown, regardless of whether the slurm commands executed correctly (but without finding the job) or incorrectly.

As a result, client applications such as xenon-flow cannot tell the difference between a job that cannot be found and a complete loss of the underlying SSH connection.

This is incorrect behavior. Instead, the NoSuchJobException should only be thrown if the Slurm commands were executed successfully, but the job could not be found. When the commands fail to run, a XenonException should be thrown.
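The intended behavior could be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual Xenon code: the exception classes are local stand-ins for the Xenon types of the same name, and StatusCommand, JobStatusLookup and the command map are hypothetical names introduced for the example. It also assumes that a lookup in which any command failed to run should be reported as a XenonException, which is one possible reading of the rule above.

```java
import java.util.Map;
import java.util.Optional;

// Local stand-ins for the Xenon exception types (assumption: simplified constructors).
class XenonException extends Exception {
    XenonException(String message, Throwable cause) { super(message, cause); }
}

class NoSuchJobException extends XenonException {
    NoSuchJobException(String message) { super(message, null); }
}

// Hypothetical abstraction of one Slurm query run via the subscheduler.
interface StatusCommand {
    // Returns the job state if the command lists the job, or empty if the
    // command ran correctly but did not list it. Throws if the command could
    // not be run at all, for example because the SSH connection is down.
    Optional<String> lookup(String jobId) throws Exception;
}

final class JobStatusLookup {
    // Tries each command in iteration order (pass a LinkedHashMap to fix the order).
    static String getJobStatus(String jobId, Map<String, StatusCommand> commands)
            throws XenonException {
        Exception lastFailure = null;

        for (Map.Entry<String, StatusCommand> entry : commands.entrySet()) {
            try {
                Optional<String> state = entry.getValue().lookup(jobId);
                if (state.isPresent()) {
                    return state.get();
                }
                // The command ran but did not know the job: try the next one.
            } catch (Exception e) {
                // Remember the failure and report it as a warning, not debug.
                lastFailure = e;
                System.err.println("WARN: " + entry.getKey() + " failed: " + e.getMessage());
            }
        }

        if (lastFailure != null) {
            // At least one command could not be run, so "not found" is unproven.
            throw new XenonException("Failed to retrieve status of job " + jobId, lastFailure);
        }
        // Every command ran correctly and none of them listed the job.
        throw new NoSuchJobException("Unknown job: " + jobId);
    }
}
```

With this shape, a client can catch NoSuchJobException to handle a job that is really unknown, and treat any other XenonException as a possibly transient failure such as a lost connection.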

In addition, perhaps the messages explaining why the commands failed should be logged as warnings instead of debug messages?

jmaassen commented 4 years ago

We should also check whether the other scripting adaptors show the same incorrect behavior.

jmaassen commented 4 years ago

Fixed in the 3.1.0 release.