Open malachig opened 1 year ago
To get this working we also had to change the way VarScan was run to use pipes instead of redirection. That did allow the errors to be caught by Cromwell correctly. But the issue remained that when streaming from a bucket we encounter bgzf_read_block errors.
In a test run I observed such failures in 3 out of 50 VarScan shards. In one of these, the task succeeded on a reattempt. However, in the other two, both reattempts also failed, in a very similar but non-identical fashion, e.g.:
attempt-1/stderr:[E::bgzf_read_block] Failed to read BGZF block data at offset 7179935145 expected 10896 bytes; hread returned -1
attempt-1/stderr:[E::bgzf_read] Read block operation failed with error 4 after 155 of 229 bytes
attempt-2/stderr:[E::bgzf_read_block] Failed to read BGZF block data at offset 7179459514 expected 23195 bytes; hread returned -1
attempt-2/stderr:[E::bgzf_read] Read block operation failed with error 4 after 967 of 2007 bytes
attempt-3/stderr:[E::bgzf_read_block] Failed to read BGZF block data at offset 7178631036 expected 10117 bytes; hread returned -1
attempt-3/stderr:[E::bgzf_read] Read block operation failed with error 4 after 222 of 226 bytes
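For context on why moving from redirection to pipes changes what Cromwell can see, here is a minimal bash sketch. It is not the actual VarScan command from the WDL: flaky_producer is a hypothetical stand-in for an upstream reader that writes some output and then fails, and the use of process substitution and `set -o pipefail` is an assumption about what the two forms look like.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for an upstream reader that emits partial output
# and then exits non-zero, as happens on a failed BGZF block read.
flaky_producer() {
  echo "partial pileup line"
  return 1
}

# 1) Process substitution (redirection): the shell only reports the exit
#    status of the consumer (cat), so the producer's failure is hidden.
cat < <(flaky_producer) > /dev/null
redirect_status=$?

# 2) Pipe with pipefail: the pipeline's exit status is that of the last
#    command to exit non-zero, so the producer's failure surfaces.
set -o pipefail
flaky_producer | cat > /dev/null
pipe_status=$?

echo "redirection exit status: $redirect_status"
echo "pipe exit status: $pipe_status"
```

Run under bash, this prints an exit status of 0 for the redirection form and 1 for the pipe form. A workflow engine that decides success from the command's return code can therefore only detect the failure in the second case.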
We can investigate our options here (e.g. trying different versions of htslib, making more reattempts, changing the way VarScan does parallel work, not using VarScan, etc.), but I think the short-term fix is to disable localization_optional for now for VarScan. Something about the way this is working in this context does not seem robust enough for production.
We have also been encountering problems with localization_optional: true for mutect. Here is a related pull request
We have observed that shards of VarScan work are sporadically failing, and the failures are not being caught by Cromwell.
This manifests as statements like this in the VarScan stderr:
To get a clean run of VarScan results for comparison, we can turn localization_optional off by removing this: https://github.com/wustl-oncology/analysis-wdls/blob/abc7e5828dccb96256cf1cdfcfec5133d0a6486a/definitions/tools/varscan_somatic.wdl#L22-L27
Then, to hopefully cause Cromwell to detect the failures and retry, we can add the following:
here https://github.com/wustl-oncology/analysis-wdls/blob/abc7e5828dccb96256cf1cdfcfec5133d0a6486a/definitions/tools/varscan_somatic.wdl#L41-L43