wustl-oncology / cloud-workflows

Infrastructure and tooling required to get genomic workflows running in the cloud

Investigate use of Google Batch instead of Life Sciences API as a Cromwell backend #40

Open · malachig opened this issue 1 month ago

malachig commented 1 month ago

The current backend we are using with Cromwell on GCP is deprecated and will be turned off July 8, 2025. Google now recommends migrating to Google Batch: https://cloud.google.com/life-sciences/docs/getting-support

Newer versions of Cromwell now support GCP Batch as a backend. Cromwell documentation on using Batch, including an example cromwell.conf file, can be found here:

https://cromwell.readthedocs.io/en/develop/tutorials/Batch101/
https://github.com/broadinstitute/cromwell/blob/develop/cromwell.example.backends/GCPBATCH.conf
https://cromwell.readthedocs.io/en/develop/backends/GCPBatch/
https://github.com/broadinstitute/cromwell/blob/develop/CHANGELOG.md#gcp-batch

The Cromwell version, and the way the cromwell.conf file that specifies the backend is created, are determined by helper scripts and config files in this repo (these are used in our tutorial on how to run the pipeline on GCP).

In very basic terms:

resources.sh sets up the Google Cloud environment (buckets, network, etc.) and creates two Cromwell configuration-related files, cromwell.conf and workflow_options.json, with some user-specific parameters populated. These are then copied to specified locations on the Cromwell VM started on Google Cloud.

start.sh launches a VM and specifies that server_startup.py be run as part of the startup process. During this process, the specified version of Cromwell is installed and launched (using systemctl start cromwell). A rough sketch of this launch step appears after this list.

manual-workflows/cromwell.service defines some parameters for how the Cromwell server is started, including the locations of the Cromwell jar and cromwell.conf files.
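For orientation, the kind of command start.sh issues probably looks roughly like the sketch below. This is not the actual script: the variable names, machine type, and flags other than the metadata ones are illustrative placeholders; only the metadata flags mirror what is shown later in this issue.

# Rough sketch only; names and flags other than the metadata ones are placeholders
gcloud compute instances create "$INSTANCE_NAME" \
       --machine-type=e2-standard-2 \
       --service-account="$SERVER_ACCOUNT" \
       --metadata=cromwell-version="$CROMWELL_VERSION",analysis-release="$ANALYSIS_RELEASE" \
       --metadata-from-file=startup-script=server_startup.py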

malachig commented 1 month ago

In order to perform a test run using an updated version of Cromwell and Google Batch as the backend, I believe the following changes will be required:

Change from Google Life Sciences API to Batch

manual-workflows/base_cromwell.conf

 backend.providers.default {
-  actor-factory = "cromwell.backend.google.pipelines.v2beta.PipelinesApiLifecycleActorFactory"
+  actor-factory = "cromwell.backend.google.batch.GcpBatchBackendLifecycleActorFactory"
...

... and delete two related entries that should not be needed

-  endpoint-url = "https://lifesciences.googleapis.com/"
-  include "papi_v2_reference_image_manifest.conf"

manual-workflows/start.sh: update to a version of Cromwell that supports Google Batch

-  --metadata=cromwell-version=71,analysis-release="$ANALYSIS_RELEASE" \
+  --metadata=cromwell-version=87,analysis-release="$ANALYSIS_RELEASE" \
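After making both changes, a quick sanity check on the Cromwell head node could look something like the following. The paths are assumptions based on the /shared/cromwell locations used elsewhere in this setup; adjust them to wherever cromwell.conf and the Cromwell jar actually land.

# Assumed locations; adjust to the actual paths used on the head node
grep actor-factory /shared/cromwell/cromwell.conf   # expect GcpBatchBackendLifecycleActorFactory
java -jar /shared/cromwell/cromwell.jar --version   # expect a Batch-capable release (e.g. 87)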
malachig commented 1 month ago

When attempting this for the first time, this error was encountered:

Oct 09 17:09:02 malachi-immuno java[14044]: Caused by: io.grpc.StatusRuntimeException: PERMISSION_DENIED: Batch API has not been used in project $PROJECT_ID before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/batch.googleapis.com/overview?project=$PROJECT_ID then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.

Enabling the API was seemingly as simple as visiting that URL and hitting the "ENABLE API" button.

I believe this could be done automatically for the project in question by modifying scripts/enable_api.sh, which is called by manual-workflows/resources.sh.

To replace:

gcloud services enable lifesciences.googleapis.com

with:

gcloud services enable batch.googleapis.com

In the short term, since we will be experimenting with this backend while continuing to use the Life Sciences API, we will want to add to, rather than replace, the enabled APIs.
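A minimal sketch of that additive change in scripts/enable_api.sh (keeping Life Sciences enabled alongside Batch for now):

# Keep the existing Life Sciences API for now and enable Batch alongside it
gcloud services enable lifesciences.googleapis.com
gcloud services enable batch.googleapis.com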

malachig commented 1 month ago

Next errors encountered:

Oct 09 19:30:43 malachi-immuno java[14010]: com.google.api.gax.rpc.PermissionDeniedException: io.grpc.StatusRuntimeException: PERMISSION_DENIED: Permission 'batch.jobs.create' denied on 'projects/griffith-lab/locations/us-central1/jobs/job-9a2008b1-f615-4f9d-9470-93ea163e2eaa'

Oct 09 19:30:43 malachi-immuno java[14010]: Caused by: io.grpc.StatusRuntimeException: PERMISSION_DENIED: Permission 'batch.jobs.create' denied on 'projects/griffith-lab/locations/us-central1/jobs/job-9a2008b1-f615-4f9d-9470-93ea163e2eaa'

It seems that additional Google Cloud IAM permissions may be required. My initial guess is that the service account we are using here would need something like Batch Job Editor (roles/batch.jobsEditor) on the project, or perhaps some combination from this more extensive list compiled by reading about Batch generally:

We should already have that last one. Service accounts are currently configured in scripts/create_resources.sh which is called by manual-workflows/resources.sh.

At present we have defined two service accounts according to IAM:

cromwell-compute@griffith-lab.iam.gserviceaccount.com (described as "Cromwell backend compute") with roles:

cromwell-server@griffith-lab.iam.gserviceaccount.com (described as "Cromwell Task Compute VM") with roles:

In the short term, since we will be experimenting with this backend while continuing to use the Life Sciences API, we will want to add rather than replace permissions. I think this can be done simply by updating the two scripts mentioned above and rerunning the resources.sh step.

In summary I think we could start by trying to add the following currently missing role(s) in scripts/create_resources.sh:

gcloud projects add-iam-policy-binding $PROJECT \
       --member="serviceAccount:$SERVER_ACCOUNT" \
       --role='roles/batch.jobsEditor' > /dev/null
malachig commented 1 month ago

Adding the jobsEditor role did seem to allow Cromwell to request VMs and launch jobs. However, everything still seemed to be failing. One apparent error message was:

no VM has agent reporting correctly within the time window 1080 seconds

This sounds related to this:

The job's VMs do not have sufficient permissions. A job's VMs require specific permissions to report their state to the Batch service agent. You can provide these permissions for a job's VMs by granting the Batch Agent Reporter role (roles/batch.agentReporter) to the job's service account.

So I will next add the following to scripts/create_resources.sh:

gcloud projects add-iam-policy-binding $PROJECT \
       --member="serviceAccount:$SERVER_ACCOUNT" \
       --role='roles/batch.agentReporter' > /dev/null
malachig commented 1 month ago

Still no success after this last change.

I do see Cromwell report that task code is being created and jobs are being launched.

The basic structure of the run files, including localization and run scripts, is being created in the Google bucket by Cromwell, but nothing is coming back from the VMs.

And I see VMs being started and running in the console. Cromwell seems to be requesting specific resources for each task and I see things like this in the Cromwell logging:

instances {
       machine_type: "e2-standard-32"
       provisioning_model: SPOT
       task_pack: 1
       boot_disk {
         type: "pd-balanced"
         size_gb: 175
         image: "projects/batch-custom-image/global/images/batch-cos-stable-official-20240925-00-p00"
       }
     }
...
instances {
       machine_type: "e2-highmem-2"
       provisioning_model: SPOT
       task_pack: 1
       boot_disk {
         type: "pd-balanced"
         size_gb: 36
         image: "projects/batch-custom-image/global/images/batch-cos-stable-official-20240925-00-p00"
       }
     }

But if I log onto one of these instances through the console, while I do see the different amounts of memory and CPU, I see no evidence that such a storage disk has been attached. Nothing seems to be happening. I suspect Cromwell tries something, times out, and fails the task.

I'm still getting these events as well:

no VM has agent reporting correctly within the time window 1080 seconds

I have not seen any other informative logging in the Cromwell log.

One thing I don't fully understand is that we are setting this in our cromwell.conf file:

genomics.compute-service-account = "cromwell-compute@griffith-lab.iam.gserviceaccount.com"

However, all the IAM permissions we have been conferring are to: cromwell-server@griffith-lab.iam.gserviceaccount.com

We could try adding these permissions to that service account user as well...

malachig commented 1 month ago

This last change seems to have helped, and now input data is being written to a separate volume and mount point on a machine that I logged into: /mnt/disks/cromwell_root

Summary of the addition of permissions so far for SERVER_ACCOUNT:

gcloud projects add-iam-policy-binding $PROJECT \
       --member="serviceAccount:$SERVER_ACCOUNT" \
       --role='roles/batch.jobsEditor' > /dev/null
gcloud projects add-iam-policy-binding $PROJECT \
       --member="serviceAccount:$SERVER_ACCOUNT" \
       --role='roles/batch.agentReporter' > /dev/null

Summary of the addition of permissions so far for COMPUTE_ACCOUNT:

gcloud projects add-iam-policy-binding $PROJECT \
       --member="serviceAccount:$COMPUTE_ACCOUNT" \
       --role='roles/batch.jobsEditor' > /dev/null
gcloud projects add-iam-policy-binding $PROJECT \
       --member="serviceAccount:$COMPUTE_ACCOUNT" \
       --role='roles/batch.agentReporter' > /dev/null
gcloud projects add-iam-policy-binding $PROJECT \
       --member="serviceAccount:$COMPUTE_ACCOUNT" \
       --role='roles/compute.instanceAdmin' > /dev/null
malachig commented 1 month ago

In my latest test, tasks are now actually running as expected and a few have succeeded. But there seems to be a problem related to the use of preemptible machines.

Our current strategy is to specify something like this in the runtime blocks of individual tasks:

  runtime {
    preemptible: 1
    maxRetries: 2
...

And also in the workflow_options.json that gets created on the Cromwell VM.

{
    "default_runtime_attributes": {
        "preemptible": 1,
        "maxRetries": 2
    },
...
}

When using this with the Google Life Sciences API, this gets interpreted as: try at most 1 attempt on a preemptible (much cheaper) instance. If that gets preempted, try again on a non-preemptible instance. If that fails, try again 2 more times, again only on non-preemptible instances. This was all working as expected on the old backend.

So far, in my limited testing of GCP Batch, the main problem I have observed is that failed or preempted attempts do not appear to fall back to non-preemptible instances; every attempt seems to be provisioned as a SPOT VM.

The reason I think this is that, when I query the Cromwell log like this:

journalctl -u cromwell | grep provisioning_model | tr -s ' ' | cut -d ' ' -f 6,7 | sort | uniq -c

I get something like this: 318 provisioning_model: SPOT.

In other words, the provisioning model is always reported as SPOT, even though I am getting many failed tasks, including some that report logging like this:

status_events {
description: "Job state is set from RUNNING to FAILED for job projects/190642530876/locations/us-central1/jobs/job-c5294312-3480-4865-a880-a8605fb4ba2e.Job failed due to task failure. Specifically, task with index 0 failed due to the following task event: \"Task state is updated from RUNNING to FAILED on zones/us-central1-b/instances/4400131179675275438 due to Spot VM preemption with exit code 50001.\""

From lurking on GitHub issues for Cromwell and Nextflow, it sounds like support for a mixed model of preemptible/non-preemptible instances in failure handling is perhaps half-baked, perhaps with changes to the GCP Batch API itself still being contemplated.

This would be unfortunate from a cost perspective, but also from a testing perspective.

Is there even a convenient way to do a test run with no use of preemptible instances? Every task in the WDL currently sets the above parameters. Does the setting in workflow_options.json override those task-specific settings, or is it just a default used when no specific retry behavior is specified in a task? If it is the latter, we might need to modify the WDLs to move forward here.

Next testing ideas: First, try changing just the /shared/cromwell/workflow_options.json file to have "preemptible": 0
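One possible way to make that edit on the Cromwell VM, assuming jq is available there (which I have not verified):

# Set the default to non-preemptible; jq availability on the VM is an assumption
jq '.default_runtime_attributes.preemptible = 0' /shared/cromwell/workflow_options.json > /tmp/workflow_options.json
sudo mv /tmp/workflow_options.json /shared/cromwell/workflow_options.json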

Note that according to the Cromwell docs:

You can supply a default for any Runtime Attributes by adding a default_runtime_attributes map to your workflow options file. Use the key to provide the attribute name and the value to supply the default. These defaults replace any defaults in the Cromwell configuration file but are themselves replaced by any values explicitly provided by the task in the WDL file.

If that doesn't work (which seems likely), then we must change every relevant WDL on the VM:

cd /shared/analysis-wdls
grep preemptible definitions/*.wdl definitions/subworkflows/*.wdl definitions/tools/*.wdl
sudo sed -i 's/preemptible\: 1/preemptible\: 0/g' definitions/*.wdl definitions/subworkflows/*.wdl definitions/tools/*.wdl
sudo sed -i 's/preemptible_tries \= 3/preemptible_tries \= 0/g' definitions/*.wdl definitions/subworkflows/*.wdl definitions/tools/*.wdl
grep preemptible definitions/*.wdl definitions/subworkflows/*.wdl definitions/tools/*.wdl
refresh_zip_deps

Note that when running these sed commands on a Mac I had to tweak them slightly:

cd analysis-wdls
grep preemptible definitions/*.wdl definitions/subworkflows/*.wdl definitions/tools/*.wdl
sed -i '' 's/preemptible\: 1/preemptible\: 0/g' definitions/*.wdl definitions/subworkflows/*.wdl definitions/tools/*.wdl
sed -i '' 's/preemptible_tries \= 3/preemptible_tries \= 0/g' definitions/*.wdl definitions/subworkflows/*.wdl definitions/tools/*.wdl
grep preemptible definitions/*.wdl definitions/subworkflows/*.wdl definitions/tools/*.wdl
sh zip_wdls.sh
malachig commented 1 month ago

Note this change log: https://github.com/broadinstitute/cromwell/blob/develop/CHANGELOG.md#gcp-batch

The changelog for the next release of Cromwell (v88, not actually available at this time) appears to describe some updates relevant to preemption:

Fixes the preemption error handling, now, the correct error message is printed, this also handles the other potential exit codes. Fixes error message reporting for failed jobs.

Not sure if that fixes the problem we have, or just makes the error messages more clear.

It seems that maybe the gcp-batch branch is abandoned though?

This branch is 214 commits ahead of, 219 commits behind develop.

The last commit to that branch was Jul 14, 2023. The develop branch, by contrast, is very active and has multiple recent commits related to GCP Batch, which gives hope that a version 88 will continue to improve support from a current state that already seems very close to working.

If one wanted to experiment with building a .jar from the current develop branch of the Cromwell code, @tmooney shared this documentation on how he did so in the past: https://github.com/genome/genome/wiki/Cromwell
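As a rough sketch (not something I have tested here), building a snapshot jar from develop follows the standard Cromwell build process, assuming a JDK and sbt are installed:

# Build an assembled Cromwell jar from the develop branch (requires a JDK and sbt)
git clone https://github.com/broadinstitute/cromwell.git
cd cromwell
git checkout develop
sbt assembly
# The assembled jar is written under server/target/ (exact path varies by Scala/Cromwell version)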

malachig commented 1 month ago

Using the method above to change all WDLs (only locally on the VM, for experimentation purposes) to use no preemptible instances resulted in the following apparent change in the Cromwell log output: instances blocks now state provisioning_model: STANDARD instead of provisioning_model: SPOT.

malachig commented 1 month ago

Note: It is possible that the issue with handling preemption has been fixed and will be incorporated into the next release of Cromwell. Various reported issues mention GCP Batch and preemption, such as https://github.com/broadinstitute/cromwell/issues/7407, with associated PRs that have been merged into the develop branch of Cromwell.

malachig commented 1 month ago

The current test got very far, but failed on the germline VEP task. I believe I saw this failure with the stable pipeline as well (I believe I had to increase disk space for this step). In this failure, I don't see any relevant error messages and it seems like no second attempt was made... Perhaps running out of disk resulted in an unusual failure mode. But this could also indicate additional issues with the reattempt logic for GCP Batch not working as expected.

Will increase disk space for VEP for this test and attempt a restart to see if the call caching is working.

I think this is not calculating size needs aggressively enough anyway (we have seen this failure a few times now):

-  Float cache_size = 2*size(cache_dir_zip, "GB")  # doubled to unzip
+  Float cache_size = 3*size(cache_dir_zip, "GB")  # tripled to allow for unzipping

-  Float vcf_size = 2*size(vcf, "GB")  # doubled for output vcf
+  Float vcf_size = 3*size(vcf, "GB")  # tripled for output vcf

-  Int space_needed_gb = 10 + round(reference_size + vcf_size + cache_size + size(synonyms_file, "GB"))
+  Int space_needed_gb = 20 + round(reference_size + vcf_size + cache_size + size(synonyms_file, "GB"))
malachig commented 1 month ago

Call caching seems to be working at first glance, though some steps appear to be redoing work that was already done.

Note this is far from the first time I have observed Cromwell to redo work that was completed in a previous run. We have never dug too deep to understand these cases. In other words, this may not have anything to do with GCP Batch.

tmooney commented 1 month ago

The diff API endpoint is good for finding why call caching didn't match the second time: https://cromwell.readthedocs.io/en/stable/api/RESTAPI/#explain-hashing-differences-for-2-calls If the inputs are somehow busting the cache, it could be a simple change to get it to work.
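For reference, a hedged example of hitting that endpoint against a Cromwell server (the workflow IDs and fully qualified call names below are placeholders, and the host/port assume the default server settings):

# Placeholders: substitute real workflow IDs and a fully qualified call name
curl -s "http://localhost:8000/api/workflows/v1/callcaching/diff?workflowA=<WORKFLOW_A_ID>&callA=<workflow.call_name>&workflowB=<WORKFLOW_B_ID>&callB=<workflow.call_name>" | jq .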

malachig commented 1 month ago

My first end-to-end test worked (with one restart) and I was able to pull the results down. The results superficially seem to look good but of course more formal testing and comparisons must be done.

Other than having to turn off use of preemptible nodes, the other thing that did not work was the estimate_billing.py script.

python3 $WORKING_BASE/git/cloud-workflows/scripts/estimate_billing.py $WORKFLOW_ID $GCS_BUCKET_PATH/workflow_artifacts/$WORKFLOW_ID/metadata/ > costs.json

This gave the following Python error:

Traceback (most recent call last):
  File "/storage1/fs1/gillandersw/Active/Project_0001_Clinical_Trials/compassionate_use/analysis/pici_case_4/gcp_immuno/v1.2.1/git/cloud-workflows/scripts/estimate_billing.py", line 265, in <module>
    cost = cost_workflow(args.metadata_dir.rstrip('/'), args.workflow_id)
  File "/storage1/fs1/gillandersw/Active/Project_0001_Clinical_Trials/compassionate_use/analysis/pici_case_4/gcp_immuno/v1.2.1/git/cloud-workflows/scripts/estimate_billing.py", line 231, in cost_workflow
    call_costs_by_name[ck] = cost_workflow(location, call["subWorkflowId"])
  File "/storage1/fs1/gillandersw/Active/Project_0001_Clinical_Trials/compassionate_use/analysis/pici_case_4/gcp_immuno/v1.2.1/git/cloud-workflows/scripts/estimate_billing.py", line 231, in cost_workflow
    call_costs_by_name[ck] = cost_workflow(location, call["subWorkflowId"])
  File "/storage1/fs1/gillandersw/Active/Project_0001_Clinical_Trials/compassionate_use/analysis/pici_case_4/gcp_immuno/v1.2.1/git/cloud-workflows/scripts/estimate_billing.py", line 229, in cost_workflow
    call_costs_by_name[ck] = cost_cached_call(location, call, metadata)
  File "/storage1/fs1/gillandersw/Active/Project_0001_Clinical_Trials/compassionate_use/analysis/pici_case_4/gcp_immuno/v1.2.1/git/cloud-workflows/scripts/estimate_billing.py", line 203, in cost_cached_call
    return cost_task(call_data)
  File "/storage1/fs1/gillandersw/Active/Project_0001_Clinical_Trials/compassionate_use/analysis/pici_case_4/gcp_immuno/v1.2.1/git/cloud-workflows/scripts/estimate_billing.py", line 148, in cost_task
    assert is_run_task(task)
AssertionError

My first guess would be that updating the Cromwell version we are using has changed the structure of the metadata being parsed by this code. At first glance the .json metadata files appear to contain the relevant information.

malachig commented 1 month ago

To facilitate further testing of GCP Batch and Cromwell 87, I have created a pre-release for cloud-workflows and analysis-wdls.

If you check out the cloud-workflows pre-release, it will automatically take into account all the changes described above, including automatically cloning the analysis-wdls pre-release on the Cromwell head node. In other words, once you clone that pre-release, you can follow the usual procedure for performing a test run.

analysis-wdls: v1.2.2-gcpbatch
cloud-workflows: v1.4.0-gcpbatch
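For example, to pick up the cloud-workflows pre-release (the repo URL here is assumed from the org/repo names above):

# Repo URL assumed from the org/repo names above
git clone https://github.com/wustl-oncology/cloud-workflows.git
cd cloud-workflows
git checkout v1.4.0-gcpbatch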

I am currently using this version for a test with the hcc1395 full test data.

malachig commented 1 month ago

This test completed smoothly without any need to restart or any other issues.

Conclusion for now: we can start using GCP Batch if we need/want to, with the following two known caveats: use of preemptible instances has to be turned off for now, and the estimate_billing.py cost estimation script does not yet work with the new metadata.

ldhtnp commented 1 month ago

I looked into the estimate billing issue. According to the output, the error is occurring with the line "assert is_run_task(task)". The function is_run_task is checking if the "jes" key is in the associated json data. This key is present in the v1.1.4 json data, but is not present in the v1.2.1 json data. This function is then returning "False", which is causing the assertion error. This "jes" key contains the following entries: endpointUrl, machineType, googleProject, executionBucket, zone, and instanceName. I do not find these entries anywhere in the v1.2.1 json data.
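A quick way to confirm this difference locally (the file names below are placeholders for metadata JSON pulled down from each run):

# Placeholders for metadata pulled down from a v1.1.4 run and a v1.2.1 (GCP Batch) run
grep -c '"jes"' v1.1.4_metadata.json   # non-zero for the Life Sciences runs
grep -c '"jes"' v1.2.1_metadata.json   # expected to be 0 for the GCP Batch runs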