yuch7 / cwlexec

A new open source tool to run CWL workflows on LSF

Arrays are not scattered when passed to subworkflows #20

Open biokcb opened 6 years ago

biokcb commented 6 years ago

Hi,

We have a simple example workflow in which array inputs seem to be passed to lower-level scripts without being scattered.

top_workflow.cwl calls -> subworkflow.cwl calls -> echocat.cwl calls -> echocat.sh, which takes 3 inputs (string, file, file).

subworkflow.cwl has a single step that takes a string input and a File[] input and passes them to the command line tool. Run on its own, this works fine with cwlexec. When I use top_workflow.cwl to scatter over an array of strings or an array of arrays of files, the inputs are not scattered. Instead they are passed directly to the command line tool: the string array arrives as a single string and the File array of arrays as a single array. The tool then fails because the shell script cannot use the inputs in that form. The example is attached; in output.txt, line 646 shows the incorrectly built command.
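For readers without the attachment, the top-level workflow has roughly the following shape. This is a minimal sketch, not the attached file: the input names `messages` and `file_groups` and the subworkflow input ids `message` and `files` are assumptions chosen for illustration.

```yaml
cwlVersion: v1.0
class: Workflow

requirements:
  ScatterFeatureRequirement: {}
  SubworkflowFeatureRequirement: {}

inputs:
  # Hypothetical input names; the attached example may use different ones.
  messages: string[]
  file_groups:
    type:
      type: array
      items:
        type: array
        items: File

outputs: []

steps:
  per_sample:
    # Scatter the whole subworkflow: one invocation per (message, files) pair.
    run: subworkflow.cwl
    scatter: [message, files]
    scatterMethod: dotproduct
    in:
      message: messages      # assumed to be the subworkflow's string input
      files: file_groups     # assumed to be the subworkflow's File[] input
    out: []
```

With `dotproduct`, the expectation is that each invocation of subworkflow.cwl receives one string and one File[]; the behavior reported here is that the full arrays reach the tool instead.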

SubworkflowArrayScatterError.tar.gz

skeeey commented 6 years ago

@biokcb Currently, cwlexec does not support scattering a step at the subflow level. This is because cwlexec submits all of the jobs in a flow to LSF at the beginning so that they are queued more efficiently, which means cwlexec has to expand every job in the flow up front. A scattered subflow makes this more complex: for example, if the subflow depends on other steps, we must wait until those steps are done before we can expand it. Since there is always a workaround that bypasses the scattered subflow, we decided to treat this as low priority. I think we will support it in the future.

For your case, you can scatter echocat.cwl inside subworkflow.cwl (attached: SubworkflowArrayScatterWorkaround.zip).
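A rough sketch of that workaround, with the scatter moved down onto the tool step, might look like this. It is not the content of the attached zip: it assumes echocat.cwl exposes inputs named `message` (string) and `files` (File[]), and the workflow input names are likewise hypothetical.

```yaml
cwlVersion: v1.0
class: Workflow

requirements:
  ScatterFeatureRequirement: {}

inputs:
  # The subworkflow now accepts the full arrays instead of a single sample.
  messages: string[]
  file_groups:
    type:
      type: array
      items:
        type: array
        items: File

outputs: []

steps:
  echocat:
    # Scatter the command line tool directly, not the enclosing subworkflow.
    run: echocat.cwl
    scatter: [message, files]
    scatterMethod: dotproduct
    in:
      message: messages      # assumed tool input id
      files: file_groups     # assumed tool input id
    out: []
```

The top-level workflow then calls subworkflow.cwl once, without `scatter`, and passes the arrays through unchanged.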

biokcb commented 6 years ago

@skeeey Thanks for the update! I can definitely implement the workaround for now, but the example I gave was a minimal one-step subworkflow that reproduced the error. For some of my workflows there are multiple steps that I'd like to group into a subworkflow so that each sample can proceed through the steps independently. If I scatter per command line tool step, each step expects an array and must wait until all samples have been processed by the previous step. If other samples don't need to wait on one particularly time-intensive sample, our overall processing time can be reduced. I believe this would be a useful feature for us, so if you are able to support it in the future that would be great. Thanks!
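To put rough numbers on this, consider a hypothetical case with two samples and two steps, A then B. The durations are invented for illustration, and the estimate assumes there are enough LSF slots to run all jobs in parallel. Let $a_i$ and $b_i$ be the runtimes of step A and step B for sample $i$:

$$
T_{\text{per-step}} = \max_i a_i + \max_i b_i,
\qquad
T_{\text{subworkflow}} = \max_i \,(a_i + b_i) \;\le\; T_{\text{per-step}}
$$

With $(a_1, b_1) = (10, 1)$ hours and $(a_2, b_2) = (1, 10)$ hours, scattering each step separately takes $10 + 10 = 20$ hours, while scattering the subworkflow takes $\max(11, 11) = 11$ hours.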

drjrm3 commented 6 years ago

@skeeey Can you explain this workaround a bit more? I don't quite see that this workaround helps our situation, but I want to understand what you mean by this first.

> there is always a workaround can bypass the scattered subflow, so we finally decide to put this as a low priority, I think we will support this in future.

skeeey commented 6 years ago

@drjrm3 The workaround is the one @biokcb described: scatter every step inside the subflow instead of scattering the whole subflow. It does have the drawback that @biokcb pointed out. We are currently focused on implementing ExpressionTool; I think we can solve this problem once that is finished.

skeeey commented 6 years ago

Also need to test #33

nick018905 commented 2 years ago

Hi @skeeey, cwlexec is a very convenient CWL engine for dispatching jobs to IBM LSF, but it is unfortunate that it cannot scatter a subworkflow. Is there any plan to support this? Thank you.