yuch7 / cwlexec

A new open source tool to run CWL workflows on LSF

Clarification: Why not use "bsub -w"? #17

Open davisjam opened 6 years ago

davisjam commented 6 years ago

My understanding

As far as I can tell from perusal of the cwlexec source and the description of its behavior here in the README:

(If I misunderstand, please correct me!).

My question

LSF has built-in job dependency monitoring via bsub -w. Why does cwlexec dynamically monitor dependency states instead of offloading the job to LSF?

As a note, this would have the side effect of permitting reasoning about the CWL job from the LSF side using bjdepinfo, which might be useful in its own right. Unless bjdepinfo already tracks dependencies listed by bwait -- does it?

skeeey commented 6 years ago

After we parse the CWL workflow definition, we know the dependency relationships between the jobs, and we then submit all of the workflow's jobs to LSF. This helps LSF queue the jobs more efficiently. However, at that moment we cannot yet know the actual command for some jobs, because a CWL job's command can be constructed from its inputs and arguments.

For example, consider a flow with two jobs, J1 and J2, where J2 depends on J1: J2 uses the outputs of J1 as its inputs, and J2's command is constructed from those inputs. If we used bsub -w, the submission would be bsub -w done(J1) commandOfJ2, but at submission time we cannot know the actual command of J2 because J1's outputs do not exist yet. We need to wait for J1 to finish (bwait -w done(J1)).
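The constraint described above can be illustrated with a minimal Python sketch. It is not cwlexec's code: `build_j2_command` and `static_submission` are hypothetical helpers, and the `echo` tool and the `/path/to/Hello.java` path are placeholders. It only shows that a static `bsub -w` line needs J2's full command at submission time, which is not available until J1's outputs exist.

```python
import shlex
from typing import List, Optional


def build_j2_command(j1_outputs: Optional[dict]) -> Optional[List[str]]:
    """Build J2's command line from J1's outputs (hypothetical helper).

    Returns None while J1 has not finished, because the argument
    that J2 needs does not exist yet.
    """
    if j1_outputs is None:
        return None
    return ["echo", j1_outputs["example_out"]]


def static_submission(j1_job_id: int, j2_command: List[str]) -> str:
    """Render a `bsub -w` line; this needs J2's full command up front."""
    dependency = shlex.quote(f"done({j1_job_id})")
    return " ".join(["bsub", "-w", dependency, shlex.join(j2_command)])


# At submission time J1 has produced nothing, so there is no command to pass.
before = build_j2_command(None)

# Only after J1 finishes can the command be constructed (placeholder path).
after = build_j2_command({"example_out": "/path/to/Hello.java"})
```

Calling `static_submission(238, after)` only works once `after` is known, which is the point at which a static dependency is no longer needed.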

davisjam commented 6 years ago

@skeeey Thanks, that makes sense. Do you know whether bjdepinfo tracks dynamic dependencies as introduced by bwait?

skeeey commented 6 years ago

Sorry, I don't know about that. You may need to find an LSF dev to ask :)

davisjam commented 6 years ago

OK. I will see what bjdepinfo says and post that here, then close this issue.

davisjam commented 6 years ago

Summary

bjdepinfo does not track dynamic dependencies introduced by bwait. Thus the inter-stage dependencies described in the cwl file cannot be seen at the LSF layer using bjdepinfo.

Evidence

I ran a modified version of 1st-workflow-simplify.cwl since the current version takes a long time to complete the docker pull.

Here is my version:

# cat 1st-workflow-simplify.cwl 
cwlVersion: v1.0
class: Workflow
inputs:
  inp: File
  ex: string

outputs:
  []

steps:
  untar:
    run: tar-param.cwl
    in:
      tarfile: inp
      extractfile: ex
    out: [example_out]

  echo:
    run: ../1st-tool.cwl
    in:
      message: untar/example_out
    out: []

The message for the echo stage depends on the output from the untar stage, thus requiring cwlexec to use its bwait/bresume logic.
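The bwait/bresume logic mentioned above can be sketched as the sequence of LSF commands a client issues once both jobs are submitted. This is a rough sketch under assumptions, not cwlexec's implementation: `hold_wait_resume_plan` is a hypothetical name, the job IDs are illustrative, and the downstream job is assumed to have been submitted on hold with `bsub -H`. How the held job's placeholder is replaced with the real command is not covered here.

```python
from typing import List


def hold_wait_resume_plan(upstream_id: int, held_id: int) -> List[List[str]]:
    """Return the LSF commands issued after both jobs are submitted.

    The held job is assumed to have been submitted with `bsub -H`,
    so it stays suspended until `bresume` is called.
    """
    return [
        ["bwait", "-w", f"done({upstream_id})"],  # block until upstream is done
        ["bresume", str(held_id)],                # release the held job
        ["bwait", "-w", f"done({held_id})"],      # block until it is done too
    ]


# Illustrative job IDs only.
plan = hold_wait_resume_plan(upstream_id=238, held_id=237)
for argv in plan:
    print(" ".join(argv))
```

In this pattern the ordering is enforced by the client process that runs bwait and bresume, not by a dependency expression attached to the job at submission.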

Execution and output:

[root@fin13p 09:42:45 /tmp/1st-workflow] # /ghome/jamiedavis/src/cwl/cwlexec-0.1/cwlexec 1st-workflow-simplify.cwl 1st-workflow-job.yml 
...
[09:42:48.452] INFO  - Workflow "1st-workflow-simplify" started to execute.
[09:42:48.458] INFO  - Started job (untar) with
bsub \
-cwd \
/root/cwl-workdir/6d4ad383-f64c-45bc-98ee-90822c3ab50a/untar \
-o \
%J_out \
-e \
%J_err \
-env \
TMPDIR=/root/cwl-workdir/6d4ad383-f64c-45bc-98ee-90822c3ab50a \
tar xf /root/cwl-workdir/6d4ad383-f64c-45bc-98ee-90822c3ab50a/untar/hello.tar Hello.java
[09:42:48.459] INFO  - Pre-submitted job (echo) with a placeholder command:
bsub \
-cwd \
/root/cwl-workdir/6d4ad383-f64c-45bc-98ee-90822c3ab50a/echo \
-o \
%J_out \
-e \
%J_err \
-env \
TMPDIR=/root/cwl-workdir/6d4ad383-f64c-45bc-98ee-90822c3ab50a \
-H \
/root/cwl-workdir/6d4ad383-f64c-45bc-98ee-90822c3ab50a/echo/echo
[09:42:48.477] INFO  - Job (untar) was submitted. Job <238> is submitted to default queue <normal>.
[09:42:48.477] INFO  - Job (echo) was submitted. Job <237> is submitted to default queue <normal>.
[09:42:48.494] INFO  - Started to wait for jobs by
bwait \
-w \
done(238)
[09:42:51.089] INFO  - The job (untar) <238> is done with stdout from LSF:
....
[09:42:51.093] INFO  - Resuming job (echo) <237> with
bresume \
237
[09:42:51.105] INFO  - Started to wait for jobs by
bwait \
-w \
done(237)
[09:42:53.661] INFO  - The job (echo) <237> is done with stdout from LSF:
...
File:/root/cwl-workdir/6d4ad383-f64c-45bc-98ee-90822c3ab50a/untar/Hello.java

So:

However, bjdepinfo indicates no dependencies:

[root@fin13p 09:55:24 /tmp/1st-workflow] # bjdepinfo 237
Job <237> does not depend on other jobs. 
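A script that wants to repeat this check could test captured bjdepinfo output for the message shown above. This is a minimal sketch, assuming the message text is the same on other LSF versions (not verified here); the helper name is hypothetical and the output format for jobs that do have dependencies is deliberately not modeled.

```python
# Message fragment taken from the bjdepinfo output shown above.
NO_DEPENDENCY_MARKER = "does not depend on other jobs"


def lsf_reports_no_dependencies(bjdepinfo_stdout: str) -> bool:
    """True if captured bjdepinfo output says the job has no dependencies."""
    return NO_DEPENDENCY_MARKER in bjdepinfo_stdout


sample = "Job <237> does not depend on other jobs."
```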

davisjam commented 6 years ago

@skeeey Is this behavior worth documenting?

skeeey commented 6 years ago

@davisjam, thanks. I don't think we need to document this. We can keep this issue open, so anyone interested in this topic can find it here :)