When getJobStatus is called on the slurm adaptor, it executes up to three different Slurm commands in an attempt to find the job: squeue, sinfo and sacct. It sends these commands using an interactive job on a subscheduler such as SSH. If the first command does not produce a result, it tries the next, and so on.
However, if the SSH connection is down, the first command produces an exception instead of a result. The slurm adaptor then prints a debug message (which is ignored by default) and goes on to try the next command. That command again produces an exception, and so on.
Once all commands have been tried without a result, a NoSuchJobException is thrown, regardless of whether the Slurm commands executed correctly (but without finding the job) or failed to execute.
As a result, client applications such as xenon-flow cannot tell the difference between a job that cannot be found and a complete loss of the underlying SSH connection.
This is incorrect behavior. Instead, the NoSuchJobException should only be thrown if the Slurm commands were executed successfully, but the job could not be found. When the commands fail to run, a XenonException should be thrown.
In addition, should the messages explaining why the commands failed be logged as warnings instead of debug messages?
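The proposed behavior could look roughly like the sketch below. It is not the adaptor's actual code: StatusCommand, JobStatusLookup and the two exception classes are hypothetical stand-ins defined here so that the example compiles on its own, and java.util.logging is used only to keep it self-contained. It also assumes that a single failed command is enough to report a XenonException when the job was not found, which is one possible reading of the proposal.

```java
import java.util.List;
import java.util.Optional;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-ins for Xenon's exception types, so the sketch is self-contained.
class XenonException extends Exception {
    XenonException(String message) { super(message); }
    XenonException(String message, Throwable cause) { super(message, cause); }
}

class NoSuchJobException extends XenonException {
    NoSuchJobException(String message) { super(message); }
}

// Hypothetical abstraction over one Slurm query (squeue, sinfo, sacct).
@FunctionalInterface
interface StatusCommand {
    // Returns the status if the job was found, or empty if the command ran
    // but did not list the job. Throws if the command could not be run at all,
    // for example because the SSH connection is down.
    Optional<String> lookup(String jobId) throws Exception;
}

class JobStatusLookup {
    private static final Logger LOGGER = Logger.getLogger(JobStatusLookup.class.getName());

    static String getJobStatus(String jobId, List<StatusCommand> commands) throws XenonException {
        Exception lastFailure = null;

        for (StatusCommand command : commands) {
            try {
                Optional<String> status = command.lookup(jobId);
                if (status.isPresent()) {
                    return status.get();
                }
            } catch (Exception e) {
                // Logged as a warning rather than debug, so the failure is visible by default.
                LOGGER.log(Level.WARNING, "Slurm command failed while looking up job " + jobId, e);
                lastFailure = e;
            }
        }

        if (lastFailure != null) {
            // At least one command could not be run, so "not found" cannot be concluded.
            throw new XenonException("Failed to retrieve status of job " + jobId, lastFailure);
        }
        // Every command ran successfully and none of them knew the job.
        throw new NoSuchJobException("Unknown job: " + jobId);
    }
}
```

With this split, a client such as xenon-flow could catch NoSuchJobException to handle a missing job, and treat any other XenonException as a connection or execution problem that may be worth retrying.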