xenon-middleware / xenon-cli

Perform file and job operations with the Xenon library from the command line
http://nlesc.github.io/Xenon/
Apache License 2.0

Write stdout/stderr logs per job #54

Closed · arnikz closed 5 years ago

arnikz commented 6 years ago

How can I write stderr/stdout log files (e.g. in the form <jobname>.{e,o}<jobID>) for each job?

Note: I'm using the latest Xenon v2.3.0

sverhoeven commented 6 years ago

For Slurm, '%j' in the filename is replaced with the job identifier (see https://slurm.schedmd.com/sbatch.html#lbAF). For example: --stdout test-%j.log
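
A full submission could then look something like this (just a sketch; the scheduler location is a placeholder and hostname stands in for the real executable):

```sh
xenon scheduler slurm --location ssh://cluster.example.org submit \
  --stdout test-%j.log --stderr test-%j.err hostname
```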

For Grid Engine you can use pseudo environment variables, as described in this snippet from the qsub man page:

          If the  pathname  contains  certain  pseudo
          environment  variables, their value will be expanded at
          runtime of the job and will be used to  constitute  the
          standard  error  stream path name. The following pseudo
          environment variables are supported currently:

          $HOME       home directory on execution machine
          $USER       user ID of job owner
          $JOB_ID     current job ID
          $JOB_NAME   current job name (see -N option)
          $HOSTNAME   name of the execution host
          $TASK_ID    array job task index number

For example --stdout test-\$JOB_ID.log
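
A complete call could then look like this (a sketch with local:// as the location; the backslash keeps the local shell from expanding $JOB_ID so that SGE can substitute it at run time — extra escaping may be needed depending on the calling context):

```sh
xenon scheduler gridengine --location local:// submit \
  --stdout out-\$JOB_ID.log --stderr err-\$JOB_ID.log hostname
```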

sverhoeven commented 6 years ago

Xenon could be extended to support a job-identifier placeholder like %j in the filenames passed to the JobDescription.setStdout/setStderr methods. Each adaptor would then either inject the job identifier into the actual filename itself, or reformat the placeholder and pass it along to the underlying scheduler.
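
From the CLI side that could look something like this (purely hypothetical; %j is not currently understood by the Grid Engine adaptor, Xenon would have to translate or substitute it per scheduler):

```sh
xenon scheduler gridengine --location local:// submit \
  --stdout out-%j.log --stderr err-%j.log hostname
```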

@jmaassen what do you think?

arnikz commented 6 years ago

I've tested xenon-cli (v2.2.1) on SGE with the options --stdout out-\$JOB_ID.log --stderr err-\$JOB_ID.log, but got files named out-.log and err-.log. However, when submitting a dummy job with echo "echo \$JOB_ID" | qsub, one gets both files with the job ID in their names (STDIN.e[jobid], STDIN.o[jobid]). So xenon does not substitute the remote JOB_ID but uses the local one (which is of course not set on the submission node).

sverhoeven commented 6 years ago

It's the SGE job script that needs $JOB_ID, which then gets replaced by the actual ID. Xenon should just pass the $JOB_ID string as-is into the job script it submits to SGE.

You might need some extra escaping. I expect the job script submitted to SGE contains the string #$ -e err-.log, because Snakemake does some interpolation.
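
A small quoting sketch (assuming each layer of shell/tool interpolation strips one level of escaping; the goal is that the literal text $JOB_ID reaches the generated job script as #$ -e 'err-$JOB_ID.log'):

```sh
printf '%s\n' err-\$JOB_ID.log     # backslash protects the $ from the local shell -> err-$JOB_ID.log
printf '%s\n' "err-\$JOB_ID.log"   # same inside double quotes
printf '%s\n' 'err-$JOB_ID.log'    # single quotes need no backslash at all
```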

sverhoeven commented 6 years ago

When I run

xenon -vvv scheduler gridengine --location ssh://localhost:10023 \
--user xenon --password javagat  submit \
--stderr test-\$JOB_ID.log hostname

I get a log file called test-1.log and a log message

17:40:21.538 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.gridengine.GridEngineUtils - Created job script:%n#!/bin/sh
#$ -S /bin/sh
#$ -N xenon
#$ -l h_rt=00:15:00
#$ -o /dev/null
#$ -e 'test-$JOB_ID.log'

Can you run xenon-cli in -vvv mode?

arnikz commented 6 years ago

I'm using xenon-cli within snakemake as follows:

snakemake ... --cluster "xenon -vvv scheduler gridengine --location local:// submit ... --stderr stderr-\$JOB_ID.log --stdout stdout-\$JOB_ID.log" &>smk.log&

My smk.log seems truncated and does not show the lines with the batch script:

Submitted job 8 with external jobid '18:07:09.721 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.ScriptingScheduler - creating sub scheduler for gridengine adaptor at local://'.

B.t.w., have you tried xenon-cli with --location local:// instead, after logging in to the node? (Also add --stdout ...)

arnikz commented 6 years ago

I can't reproduce your run using xenon-cli alone.

$ xenon --version
Xenon CLI v2.4.0, Xenon Library v2.6.0
$ xenon -vvv scheduler gridengine --location local:// submit --stderr err-\$JOB_ID.log --stdout out-\$JOB_ID.log --max-run-time 1 --working-directory . echo.sh &> xenon.log &
$ ls
echo.sh  err-.log  out-.log  xenon.log
$ cat xenon.log
18:42:09.576 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.ScriptingScheduler - creating sub scheduler for gridengine adaptor at local://
18:42:09.582 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - Creating JobQueueScheduler for Adaptor local with multiQThreads: 4 and pollingDelay: 1000
18:42:09.640 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: Submitting job
18:42:09.641 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: Created Job local-0
18:42:09.641 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: Submitting job to queue unlimited
18:42:09.641 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: Waiting for interactive job to start.
18:42:09.708 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: getJobStatus for job local-0
18:42:09.709 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: findJob for job local-0
18:42:09.709 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: findJob for job local-0
18:42:09.709 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: findJob for job local-0
18:42:09.709 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: findJob for job local-0
18:42:09.709 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: cleanupJob for job local-0
18:42:09.709 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: cleanupJob for job local-0
18:42:09.709 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: cleanupJob for job local-0
18:42:09.709 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.RemoteCommandRunner - CommandRunner took 69 ms, executable = qconf, arguments = [-sql], exitcode = 0, stdout:
all.q
test.q

stderr:

18:42:09.711 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: Submitting job
18:42:09.711 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: Created Job local-1
18:42:09.712 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: Submitting job to queue unlimited
18:42:09.712 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: Waiting for interactive job to start.
18:42:09.758 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: getJobStatus for job local-1
18:42:09.758 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: findJob for job local-1
18:42:09.758 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: findJob for job local-1
18:42:09.758 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: findJob for job local-1
18:42:09.758 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: findJob for job local-1
18:42:09.758 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: cleanupJob for job local-1
18:42:09.759 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: cleanupJob for job local-1
18:42:09.759 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: cleanupJob for job local-1
18:42:09.759 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.RemoteCommandRunner - CommandRunner took 48 ms, executable = qconf, arguments = [-sq, all.q,test.q], exitcode = 0, stdout:
qname                 all.q
hostlist              @allhosts
seq_no                0,[@n_01_10=1],[@n_11_20=2],[@n_21_30=3],[@n_31_40=4], \
                      [@n_41_50=5],[@n_51_57=6],[@n_58_61=7],[@n_62_70=8], \
                      [@n_71_80=9],[@n_81_90=10],[n0065.compute.hpc=11]
load_thresholds       np_load_avg=1.05
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               mpi threaded
rerun                 FALSE
slots                 12,[@cores_36=36],[@cores_12=12],[@cores_48=48], \
                      [@cores_24=24],[@cores_40=40]
tmpdir                /tmp
shell                 /bin/bash
prolog                root@/opt/sge/umc_scripts/prolog.sh
epilog                root@/opt/sge/umc_scripts/epilog.sh
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         disabled
s_rt                  INFINITY
h_rt                  2400:0:0
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY
qname                 test.q
hostlist              n0059.compute.hpc
seq_no                100
load_thresholds       np_load_avg=1.05
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 INTERACTIVE
ckpt_list             NONE
pe_list               make smp mpi threaded
rerun                 FALSE
slots                 1
tmpdir                /tmp
shell                 /bin/bash
prolog                root@/opt/sge/umc_scripts/prologtest.sh
epilog                root@/opt/sge/umc_scripts/epilog.sh
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            testqusers
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         disabled
s_rt                  INFINITY
h_rt                  2400:0:0
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY

stderr:

18:42:09.763 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: Submitting job
18:42:09.764 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: Created Job local-2
18:42:09.764 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: Submitting job to queue unlimited
18:42:09.764 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: Waiting for interactive job to start.
18:42:09.819 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: getJobStatus for job local-2
18:42:09.819 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: findJob for job local-2
18:42:09.819 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: findJob for job local-2
18:42:09.819 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: findJob for job local-2
18:42:09.819 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: findJob for job local-2
18:42:09.820 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: cleanupJob for job local-2
18:42:09.820 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: cleanupJob for job local-2
18:42:09.820 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: cleanupJob for job local-2
18:42:09.820 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.RemoteCommandRunner - CommandRunner took 57 ms, executable = qconf, arguments = [-spl], exitcode = 0, stdout:
make
mpi
smp
threaded

stderr:

18:42:09.820 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: Submitting job
18:42:09.820 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: Created Job local-3
18:42:09.820 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: Submitting job to queue unlimited
18:42:09.820 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: Waiting for interactive job to start.
18:42:09.867 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: getJobStatus for job local-3
18:42:09.867 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: findJob for job local-3
18:42:09.867 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: findJob for job local-3
18:42:09.867 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: findJob for job local-3
18:42:09.867 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: findJob for job local-3
18:42:09.867 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: cleanupJob for job local-3
18:42:09.867 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: cleanupJob for job local-3
18:42:09.867 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: cleanupJob for job local-3
18:42:09.867 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.RemoteCommandRunner - CommandRunner took 47 ms, executable = qconf, arguments = [-sp, make, -sp, mpi, -sp, smp, -sp, threaded], exitcode = 0, stdout:
pe_name            make
slots              1
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE
qsort_args         NONE
pe_name            mpi
slots              250
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
qsort_args         NONE
pe_name            smp
slots              1
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    $pe_slots
control_slaves     TRUE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary TRUE
qsort_args         NONE
pe_name            threaded
slots              6000
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
qsort_args         NONE

stderr:

18:42:09.870 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.gridengine.GridEngineSetup - Created setup info, queues = {test.q=QueueInfo [name=test.q, slots=1, parallelEnvironments=[make, smp, mpi, threaded]], all.q=QueueInfo [name=all.q, slots=12, parallelEnvironments=[mpi, threaded]]}, parallel environments = {mpi=ParallelEnvironmentInfo [name=mpi, slots=250, allocationRule=ROUND_ROBIN, ppn=0], smp=ParallelEnvironmentInfo [name=smp, slots=1, allocationRule=PE_SLOTS, ppn=0], threaded=ParallelEnvironmentInfo [name=threaded, slots=6000, allocationRule=PE_SLOTS, ppn=0], make=ParallelEnvironmentInfo [name=make, slots=1, allocationRule=ROUND_ROBIN, ppn=0]}
18:42:09.875 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.gridengine.GridEngineUtils - Created job script:%n#!/bin/sh
#$ -S /bin/sh
#$ -N xenon
#$ -wd '/home/cog/akuzniar/sv-callers/snakemake/sv-calling/test/.'
#$ -l h_rt=00:01:00
#$ -o 'out-.log'
#$ -e 'err-.log'

echo.sh

18:42:09.875 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: Submitting job
18:42:09.875 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: Created Job local-4
18:42:09.875 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: Submitting job to queue unlimited
18:42:09.875 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: Waiting for interactive job to start.
18:42:09.960 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: getJobStatus for job local-4
18:42:09.960 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: findJob for job local-4
18:42:09.960 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: findJob for job local-4
18:42:09.960 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: findJob for job local-4
18:42:09.961 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: findJob for job local-4
18:42:09.961 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: cleanupJob for job local-4
18:42:09.961 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: cleanupJob for job local-4
18:42:09.961 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.JobQueueScheduler - local: cleanupJob for job local-4
18:42:09.961 [main] DEBUG nl.esciencecenter.xenon.adaptors.schedulers.RemoteCommandRunner - CommandRunner took 86 ms, executable = qsub, arguments = [], exitcode = 0, stdout:
Your job 3643979 ("xenon") has been submitted

stderr:

3643979
sverhoeven commented 6 years ago

Hmm, it seems the shadow dist (./gradlew installShadowDist) and the normal dist (./gradlew installDist) behave differently. I was using the normal/non-shadow version.

It seems the application shell scripts differ:

diff build/install/xenon/bin/xenon build/install/xenon-cli-shadow/bin/xenon
```diff
158,162c158,161
< save () {
<     for i do printf %s\\n "$i" | sed "s/'/'\\\\''/g;1s/^/'/;\$s/\$/' \\\\/" ; done
<     echo " "
< }
< APP_ARGS=$(save "$@")
---
> for s in "${@}" ; do
>     s=\"$s\"
>     APP_ARGS=$APP_ARGS" "$s
> done
164,165c163,164
< # Collect all arguments for the java command, following the shell quoting and substitution rules
< eval set -- $DEFAULT_JVM_OPTS $JAVA_OPTS $XENON_OPTS -classpath "\"$CLASSPATH\"" nl.esciencecenter.xenon.cli.Main "$APP_ARGS"
---
> # Collect JVM options
> JVM_OPTS=$DEFAULT_JVM_OPTS" "$JAVA_OPTS" "$XENON_OPTS
172c171
< exec "$JAVACMD" "$@"
---
> eval \"$JAVACMD\" $JVM_OPTS -jar \"$CLASSPATH\" $APP_ARGS
\ No newline at end of file
```

When using the shadow dist, a double escape is required:

./gradlew installShadowDist
./build/install/xenon-cli-shadow/bin/xenon scheduler gridengine \
--location ssh://localhost:10022 --user xenon --password javagat submit \
--stderr err-\\\$JOB_ID.log --stdout out-\\\$JOB_ID.log hostname
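
The extra backslashes are needed because the shadow launcher's final eval line expands the arguments a second time. A minimal demonstration of that effect (assuming JOB_ID is not set in the local environment):

```sh
arg='err-$JOB_ID.log'   # what one level of escaping leaves behind
echo "$arg"             # prints err-$JOB_ID.log (the normal dist passes it on as-is)
eval echo "$arg"        # prints err-.log (eval expands $JOB_ID a second time)
```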
arnikz commented 6 years ago

Yeah, the double escape worked on its own, but not within the double quotes of the snakemake command. I tried single quotes instead and that worked! However, we should make this less cryptic for the end user, I think.