xenon-middleware / xenon

A middleware abstraction library that provides a simple programming interface to various compute and storage resources.
http://xenon-middleware.github.io/xenon/
Apache License 2.0
33 stars 17 forks source link

Torque adaptor cannot set stdin/stdout/stderr #601

Closed LourensVeen closed 6 years ago

LourensVeen commented 6 years ago

I'm trying to run a CWL runner via Xenon, which writes a description of the results produced on stdout. This works fine with SLURM, but when using Torque, Xenon complains that it cannot set the standard inputs/outputs. See also https://github.com/NLeSC/Xenon/issues/239

It seems simple enough to fix by just adding some redirection operators (i.e. "< stdin.txt >stdout.txt 2>stderr.txt") to the command in the job script that is being generated, in TorqueUtils.generateScriptContent(). The note in the issue above does not say if this was considered, just that it was decided to not support this at all, but well, I need this functionality.

It looks like it would probably work to add those redirection commands to the argument directly, but if Xenon is going to offer an API for this, then it should work.

jmaassen commented 6 years ago

If I remember correctly the stdin/out/err filenames could not be set in torque. Only the defaults can be used.

Either way, Xenon should hide this from the user. Will look into this.

LourensVeen commented 6 years ago

Yes, that's what it says in the other issue. I'm proposing to work around being unable to redirect the standard in/out/err of the job by redirecting the standard in/out/err of the command running inside the job.

LourensVeen commented 6 years ago

I have now worked around this in Cerise (which still uses Xenon 1), by effectively calling bash -c 'cd workdir ; command arguments... >stdout.txt 2>stderr.txt' instead of using Xenon's setWorkingDirectory(), setStdout() and setStderr(). That seems to fix it at least on the BinAC cluster. Note that we needed the explicit cd command. Somehow using setWorkingDirectory() on the bash command doesn't propagate to the command run inside of the subshell, not sure why.

jmaassen commented 6 years ago

We should apply this trick in the script generator for Torque.

jmaassen commented 6 years ago

Would it be OK to fix in the 2.x series, or should we patch a 1.x for you ?

LourensVeen commented 6 years ago

A fix for Xenon 2 would be nice in general I think, but with a working work-around there's no hurry from my side. I'm still planning to upgrade to Xenon 2 anyway; let's not waste time on maintaining 1.x.

jmaassen commented 6 years ago

The setWorkingDirectory is correctly passing the integration tests for Torque on 2.x

jmaassen commented 6 years ago

For 2.x calling setStdout on a JobDescription for Torque produces an InvalidJobDescriptionException on submission, which explains that this is not supported by torque.

Although correct, this requires the user to solve the problem which is not what we want...

LourensVeen commented 6 years ago

It seems that setWorkingDirectory affects the directory in which qsub is called, but on this cluster at least, the job script is apparently started in the home directory regardless of where qsub ran. So that's why we needed the additional cd workdir I think.

If I do the cd workdir but don't call setWorkingDirectory, my command runs in workdir, but the Torque-generated log files <jobname>.o<jobid> and <jobname>.e<jobid> end up in the home directory.

This is a bit of an issue I think. From the user's perspective, I/O has been redirected, and those Torque log files should not exist. We don't really want to force the user to clean them up, but we don't want them littering the remote account either. Hmm.

jmaassen commented 6 years ago

The default behavior for Torque is to use the home directory as workdir and redirect stdout and stderr to the <jobname>.o<jobid> files.

When the setWorkingDirectory is called we generate a #PBS -d <workdir> in the jobscript which should instruct Torque to change the working dir. For 2.x this seems to work correctly in our integration tests. This could be a 1.x bug, I'll see if I can find this.

I'm not sure if we can tell torque not to generate the <jobname>.o<jobid> files. I remember that we could not convince Torque to change their name, that's why we introduced the error. I'll dig a bit deeper...

LourensVeen commented 6 years ago

It's possible that there's something funny with the default bash configuration on this particular cluster as well. Since I'm starting bash, if there's a cd $HOME in the .profile or something, that could also mess things up.

jmaassen commented 6 years ago

I've added support for setStdout, setStderr and setStdin to the Torque adaptor.

The first two translate to:

  #PBS -o </workdir/>out.txt
  #PBS -e </workdir/>err.txt 

where the </workdir/> is explicitly added if set using setWorkingDirectory.

Torque has no direct support for settting stdin (as far as I can tell). So this is implemented by explicitly redirecting to the executable in the job script.

An example, that assumes that /home/xenon is $HOME, test_workdir is the workdir and test_in.txt in the workdir contains the input:

description.setWorkingDirectory("test_workdir");
description.setExecutable("/bin/cat");
description.setStdin("test_in.txt");
description.setStdout("test_out.txt");

This will produce the following job script:

#!/bin/sh
#PBS -S /bin/sh
#PBS -N xenon
#PBS -d /home/xenon/test_workdir
#PBS -o /home/xenon/test_workdir/test_out.txt
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:15:00
/bin/cat< test_in.txt

It seems to work fine in our Torque test setup.

jmaassen commented 6 years ago

Maybe you could test the above job script on your cluster to see what happens? Just replace /home/xenon/test_workdir with an existing workdir and ensure the test_in.txt exists there.

felipeZ commented 6 years ago

It works and produces two files: test_out.txt and xenon.e4979691 inside the test_workdir folder.

jmaassen commented 6 years ago

OK, so the approach used in 2.x seems to give the expect result.

Marking as closed for now. Reopen if you don't agree