wustl-oncology / analysis-wdls

Scalable genomic analysis pipelines, written in WDL
MIT License

Investigate inconsistent variant results from varscan when using localization_optional #109

Open · malachig opened this issue 1 year ago

malachig commented 1 year ago

We have observed that some shards of VarScan work sporadically fail without the failures being caught by Cromwell.

This manifests as messages like the following in the VarScan stderr:

[E::bgzf_read_block] Failed to read BGZF block data at offset 2296748725 expected 9949 bytes; hread returned -1
[E::bgzf_read] Read block operation failed with error 4 after 65 of 234 bytes
samtools mpileup: error reading from input file

To get a clean set of VarScan results for comparison, we can turn off localization_optional by removing these lines: https://github.com/wustl-oncology/analysis-wdls/blob/abc7e5828dccb96256cf1cdfcfec5133d0a6486a/definitions/tools/varscan_somatic.wdl#L22-L27

Then, to hopefully cause Cromwell to detect the failures and retry, we can add the following:

set -o pipefail

at this location: https://github.com/wustl-oncology/analysis-wdls/blob/abc7e5828dccb96256cf1cdfcfec5133d0a6486a/definitions/tools/varscan_somatic.wdl#L41-L43
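A minimal, self-contained sketch of why set -o pipefail matters here: by default, a bash pipeline's exit status is that of its last command, so a failure in an upstream reader is masked. The function failing_reader is a hypothetical stand-in for samtools mpileup hitting a BGZF read error, and cat stands in for the downstream consumer; neither is the actual command in varscan_somatic.wdl.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for an upstream reader that fails partway through
# (e.g. samtools mpileup hitting a BGZF read error).
failing_reader() {
  echo "partial output"
  return 1
}

# Without pipefail, the pipeline's status is that of its last command (cat),
# so the reader's failure is silently dropped.
set +o pipefail
without_pipefail=0
failing_reader | cat > /dev/null || without_pipefail=$?

# With pipefail, the rightmost non-zero status in the pipeline is returned,
# so the reader's failure surfaces and the task can be marked as failed.
set -o pipefail
with_pipefail=0
failing_reader | cat > /dev/null || with_pipefail=$?

echo "without pipefail: exit ${without_pipefail}"
echo "with pipefail: exit ${with_pipefail}"
```

The `|| status=$?` form records the exit code without aborting, so the sketch also behaves the same when run under set -o errexit.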

malachig commented 1 year ago

To get this working, we also had to change the way VarScan was run, using pipes instead of redirection.
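The exact before and after commands are not shown here. Assuming "redirection" refers to feeding VarScan its input through bash process substitution (`<(...)`), the sketch below illustrates why that form can hide upstream failures even with pipefail set, while a plain pipe does not. The functions producer and consumer are hypothetical stand-ins for samtools mpileup and VarScan, respectively.

```shell
#!/usr/bin/env bash
set -o pipefail

# Hypothetical stand-in for samtools mpileup: emits some data, then fails.
producer() {
  echo "chr1 100 A 10"
  return 1
}

# Hypothetical stand-in for VarScan: reads a file argument, or stdin ("-")
# if no argument is given, and succeeds.
consumer() {
  cat "${1:--}" > /dev/null
}

# Process substitution: the producer runs as a separate background process
# whose exit status is not folded into the consumer's status, so the
# failure is lost even with pipefail set.
via_process_substitution=0
consumer <(producer) || via_process_substitution=$?

# Pipe: the producer is part of the pipeline, so pipefail propagates its
# non-zero status as the status of the whole pipeline.
via_pipe=0
producer | consumer || via_pipe=$?

echo "process substitution: exit ${via_process_substitution}"
echo "pipe: exit ${via_pipe}"
```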

That change allowed Cromwell to catch the errors correctly. However, the underlying problem remained: when streaming from a bucket, we still encounter bgzf_read_block errors. In one test run, I observed such failures in 3 of 50 VarScan shards. One of these succeeded on a reattempt. In the other two, both reattempts also failed, in very similar but not identical ways, e.g.:

attempt-1/stderr:[E::bgzf_read_block] Failed to read BGZF block data at offset 7179935145 expected 10896 bytes; hread returned -1
attempt-1/stderr:[E::bgzf_read] Read block operation failed with error 4 after 155 of 229 bytes
attempt-2/stderr:[E::bgzf_read_block] Failed to read BGZF block data at offset 7179459514 expected 23195 bytes; hread returned -1
attempt-2/stderr:[E::bgzf_read] Read block operation failed with error 4 after 967 of 2007 bytes
attempt-3/stderr:[E::bgzf_read_block] Failed to read BGZF block data at offset 7178631036 expected 10117 bytes; hread returned -1
attempt-3/stderr:[E::bgzf_read] Read block operation failed with error 4 after 222 of 226 bytes
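The attempt-N/stderr: prefixes above are the format grep produces when searching several files at once. A sketch of that kind of scan, assuming (as in the excerpt) one attempt-N/stderr file per attempt inside a shard directory; the function names scan_bgzf_errors and failing_offsets are hypothetical, and the real Cromwell directory layout may differ by backend.

```shell
#!/usr/bin/env bash
set -o pipefail

# Print every htslib BGZF read error found in the per-attempt stderr files
# under a shard directory, prefixed with the file name.
scan_bgzf_errors() {
  local shard_dir=$1
  (
    cd "$shard_dir" || exit 2
    rc=0
    grep -H -E '^\[E::bgzf_read(_block)?\]' attempt-*/stderr || rc=$?
    # grep exits 1 when nothing matched, which is not an error here;
    # 2 or more means a real problem (e.g. no attempt-*/stderr files).
    [ "$rc" -le 1 ]
  )
}

# Print only the file offsets reported by bgzf_read_block, one per line.
failing_offsets() {
  scan_bgzf_errors "$1" | sed -n -E 's/.*at offset ([0-9]+) .*/\1/p'
}
```

Listing the offsets side by side makes it easy to see whether reattempts fail at the same position in the BAM or, as in the excerpt above, at nearby but different positions.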

We can investigate our options here (e.g. trying different versions of htslib, making more reattempts, changing the way VarScan parallelizes its work, or dropping VarScan altogether), but I think the short-term fix is to disable localization_optional for VarScan for now. Streaming the inputs from a bucket, as it currently works in this context, does not seem robust enough for production.

malachig commented 6 months ago

We have also been encountering problems with localization_optional: true for Mutect. Here is a related pull request:

https://github.com/wustl-oncology/analysis-wdls/pull/139