wustl-oncology / analysis-wdls

Scalable genomic analysis pipelines, written in WDL
MIT License

Troubleshoot failures during localization step #154

Open malachig opened 1 week ago

malachig commented 1 week ago

In several immuno.wdl runs we have encountered errors during the localization step.

These errors look something like this:

Sep 02 11:31:08 malachi-immuno-5120-34 java[12774]: 2024-09-02 11:31:08,676 cromwell-system-akka.dispatchers.engine-dispatcher-25529 INFO  - WorkflowManagerActor: Workflow 011e8c48-c534-4240-9638-3357a2d68bc9 failed (during ExecutingWorkflowState): java.lang.Exception: Task somaticExome.cnvkit:NA:4 failed. The job was stopped before the command finished. PAPI error code 10. The assigned worker has failed to complete the operation

The bucket for this task is essentially empty, and the main log file indicates that the step did not even finish copying the input files to the VM's local disk. We have seen these failures in the following tasks: cnvkit, docm-cle, mutect.

We seem to have some examples where simply restarting the workflow without changing anything leads to success.

There are at least three theories for the cause of the failure:

For now, the goal of this issue is just to gather more examples of the steps that fail and what their resource requests look like.
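
To make collecting examples easier, something like the following Python sketch could pull the failed task name and PAPI error code out of Cromwell log lines. It is only modeled on the single log line quoted above, so the regular expression may need adjusting for other messages. The names `FAILURE_RE`, `parse_failures`, and `count_by_task` are made up for illustration, and reading `NA:4` as "shard index" and "attempt number" is an assumption about the Cromwell message format, not something confirmed in this thread.

```python
import re
from collections import Counter

# Pattern modeled on the one log line quoted in this issue. Other Cromwell
# versions or backends may word the message differently.
# Assumption: the two fields after the task name are shard index and attempt.
FAILURE_RE = re.compile(
    r"Workflow (?P<workflow>[0-9a-f-]{36}) failed .*?"
    r"Task (?P<task>[\w.\-]+):(?P<shard>\w+):(?P<attempt>\d+) failed\."
    r".*?PAPI error code (?P<code>\d+)"
)

# The example line from this issue, used here as sample input.
SAMPLE_LINE = (
    "Sep 02 11:31:08 malachi-immuno-5120-34 java[12774]: 2024-09-02 11:31:08,676 "
    "cromwell-system-akka.dispatchers.engine-dispatcher-25529 INFO  - "
    "WorkflowManagerActor: Workflow 011e8c48-c534-4240-9638-3357a2d68bc9 failed "
    "(during ExecutingWorkflowState): java.lang.Exception: Task "
    "somaticExome.cnvkit:NA:4 failed. The job was stopped before the command "
    "finished. PAPI error code 10. The assigned worker has failed to complete "
    "the operation"
)


def parse_failures(lines):
    """Yield one dict per line that matches the failure pattern."""
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            yield match.groupdict()


def count_by_task(failures):
    """Count how many matched failures were reported for each task."""
    return Counter(failure["task"] for failure in failures)


if __name__ == "__main__":
    failures = list(parse_failures([SAMPLE_LINE]))
    for failure in failures:
        print(failure["task"], "PAPI error code", failure["code"])
    print(count_by_task(failures))
```

In practice the input would be the lines of a saved Cromwell log file rather than `SAMPLE_LINE`. The resource requests for each task would still need to be looked up separately, since they do not appear in this log message.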

ksinghal28 commented 1 week ago

I also saw this failure during the pvacseq.ps step in my recent test run on the 5120-34 case using the new release candidate branch. This case previously failed with the same localization error code 10 during the cnvkit and docm-cle steps. For my most recent run I increased memory for those steps from 4GB to 8GB. This run got past those steps, but then failed with the same error in pvacseq.

I am restarting the run with pvacseq memory increased from 16GB to 32GB.