wustl-oncology / analysis-wdls

Scalable genomic analysis pipelines, written in WDL
MIT License

Fix germline variant calling inputs for the pvactool proximal variation analysis #125

Closed malachig closed 11 months ago

malachig commented 1 year ago

It seems that the current germline analysis filters out common SNPs via a gnomAD filter here:

https://github.com/wustl-oncology/analysis-wdls/blob/773feeb3a02c90a5d0272e00232a4f1debb9bb3b/definitions/subworkflows/germline_detect_variants.wdl#L83-L92

This makes sense if the goal is to identify putative pathogenic or de novo variants, or perhaps if the goal is to use this pipeline for germline-only somatic variant calling.

BUT it does not make sense if we want germline SNPs for phasing with somatic variants and correcting peptide/neoantigen sequences to account for these proximal variants. In that context we want all germline variants that are real. A full germline variant VCF would also be useful in other contexts, for example to get SNPs for LOH analysis.

We should update the pipeline to:

malachig commented 1 year ago

Currently, in detect variants (used by somatic), the gnomAD allele frequency (AF) cutoff is hard-coded to 0.001 here: https://github.com/wustl-oncology/analysis-wdls/blob/d952ad5c29f1f3aa400e15664e49c4ed0fd4c0ad/definitions/detect_variants.wdl#L65

Currently, in germline detect variants, the gnomAD allele frequency (AF) cutoff is hard-coded to 0.05 here: https://github.com/wustl-oncology/analysis-wdls/blob/d952ad5c29f1f3aa400e15664e49c4ed0fd4c0ad/definitions/subworkflows/germline_detect_variants.wdl#L40

malachig commented 1 year ago

Minor point: unless these two variables are deliberately named differently, I am not sure we should have such a subtle difference between their names.

One way to fix the problem would be to simply change the hard-coded value from 0.05 to 1.1. Since an allele frequency can never exceed 1, a cutoff of 1.1 effectively disables the filter.

A better option might be to rename the variable to germline_filter_gnomAD_maximum_population_allele_frequency and pass it up to immuno so that the user can override the default for different purposes: set it to 1.1 to allow common variants in the result, or to something low (0.05 or lower) when looking for rare variants in a pathogenicity analysis.
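To make the effect of the two settings concrete, here is a minimal Python sketch of the threshold logic. It is not the pipeline's actual filter. The function name, the example variants, and the comparison rule (keep a variant when it has no gnomAD AF or its AF is strictly below the cutoff) are all assumptions made for illustration. The real filter's handling of missing values and boundary cases should be checked against the WDL task.

```python
from typing import List, Optional, Tuple

# Hypothetical example variants: (label, gnomAD allele frequency or None if absent).
VARIANTS: List[Tuple[str, Optional[float]]] = [
    ("rare", 0.0004),
    ("uncommon", 0.03),
    ("common", 0.42),
    ("fixed", 1.0),
    ("novel", None),
]


def passes_gnomad_filter(gnomad_af: Optional[float], max_af: float) -> bool:
    """Assumed rule: keep variants with no gnomAD AF, or with AF strictly below max_af."""
    if gnomad_af is None:
        return True
    return gnomad_af < max_af


def kept(max_af: float) -> List[str]:
    """Return the labels of the variants that survive the cutoff."""
    return [name for name, af in VARIANTS if passes_gnomad_filter(af, max_af)]


if __name__ == "__main__":
    print("cutoff 0.05:", kept(0.05))  # rare-variant setting
    print("cutoff 1.1: ", kept(1.1))   # keep-everything setting
```

Under the assumed strict comparison, a cutoff of exactly 1.0 would still drop a variant with AF 1.0, which is one reason to prefer 1.1 for the "keep all real germline variants" use case.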