weinstockj / passenger_count_variant_calling

Count passenger mutations from WGS


Passenger count variant calling

Below is the passenger count variant calling procedure that was developed for "Clonal hematopoiesis is driven by aberrant activation of TCL1A". For questions, contact Josh Weinstock.

The procedure assumes that Mutect2 has already been run in 'tumor-only' mode across the entire genome for the samples of interest. For details on running Mutect2, please refer to the linked Mutect2 documentation, which describes a WDL workflow for running it. The scripts assume that the resulting files carry the suffix *-filtered.vcf.gz, though BCF files will also work if the suffix is changed accordingly. Storing these files as raw, uncompressed VCFs (rather than bgzip-compressed VCFs or BCFs) is strongly discouraged.
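As a rough illustration of the file-naming convention above, the following sketch collects every file in a directory whose name ends in -filtered.vcf.gz. The function name `find_filtered_vcfs` is hypothetical, and treating the part of the file name before the suffix as the sample identifier is an assumption; this is not the logic used in create_sample_list.py.

```python
from pathlib import Path


def find_filtered_vcfs(vcf_dir, suffix="-filtered.vcf.gz"):
    """Return {sample_id: path} for each file in vcf_dir ending in suffix.

    Assumption: the file name minus the suffix is the sample identifier.
    """
    samples = {}
    for path in sorted(Path(vcf_dir).iterdir()):
        # Index files such as *.vcf.gz.tbi do not end in the suffix and are skipped.
        if path.is_file() and path.name.endswith(suffix):
            sample_id = path.name[: -len(suffix)]
            samples[sample_id] = path
    return samples
```

If the calls are stored as BCF instead, passing a different `suffix` (for example "-filtered.bcf", also an assumed name) would be the only change needed in this sketch.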

Currently, the directory containing these files is hard-coded in create_sample_list.py and should be changed to the appropriate user directory. In addition, these scripts assume that a sample manifest with metadata on CHIP carriers has already been created. This file (its path is also hard-coded at the moment) should be a tab-separated file with one row per sample and the following column headers:

  1. Sample (unique sample identifier)
  2. INFERRED_SEX (a column coded as 1/2 for genotype inferred sex)
  3. Gene (a column indicating the mutated driver gene)
  4. haschip (a binary column where CHIP carriers are coded as 1)
  5. STUDY (indicating the cohort the sample is from)
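A manifest with the columns listed above could be checked before running the pipeline with something like the sketch below. It uses only the standard library; the names `read_manifest` and `chip_carriers` are hypothetical, and the assumption that non-carriers are coded as 0 in haschip is not stated in this README.

```python
import csv

REQUIRED_COLUMNS = ["Sample", "INFERRED_SEX", "Gene", "haschip", "STUDY"]


def read_manifest(handle):
    """Parse a tab-separated manifest and check the layout described above."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"manifest is missing columns: {missing}")

    rows = list(reader)
    seen = set()
    for row in rows:
        if row["Sample"] in seen:
            raise ValueError(f"duplicate sample identifier: {row['Sample']}")
        seen.add(row["Sample"])
        if row["INFERRED_SEX"] not in {"1", "2"}:
            raise ValueError(f"INFERRED_SEX must be 1 or 2 for {row['Sample']}")
        # Assumption: non-carriers are coded as 0.
        if row["haschip"] not in {"0", "1"}:
            raise ValueError(f"haschip must be 0 or 1 for {row['Sample']}")
    return rows


def chip_carriers(rows):
    """Return the sample identifiers of rows coded as CHIP carriers."""
    return [row["Sample"] for row in rows if row["haschip"] == "1"]
```

A typical call would be `read_manifest(open(path, newline=""))`, where `path` points to the user's own manifest.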

In addition, the scripts assume a second, variant-level metadata file (also hard-coded) in tab-separated form. This file should contain one row per driver variant and have the following headers:

  1. Sample
  2. Gene
  3. AD (allelic depths of REF and ALT reads coded as {REF},{ALT})
  4. VAF (alt / (ref + alt))

This file is primarily used to subset the CHIP carriers to those with a single driver.
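The single-driver subsetting and the VAF definition above can be sketched as follows. The function names are hypothetical, rows are assumed to be dictionaries keyed by the headers listed above, and the repository's scripts may implement this step differently.

```python
from collections import Counter


def vaf_from_ad(ad):
    """Compute alt / (ref + alt) from an AD string coded as '{REF},{ALT}'."""
    ref, alt = (int(x) for x in ad.split(","))
    total = ref + alt
    # Return NaN rather than dividing by zero when there are no reads.
    return alt / total if total else float("nan")


def single_driver_samples(variant_rows):
    """Return the samples that appear exactly once in the variant-level file.

    Because the file holds one row per driver variant, a sample with a
    single row carries a single driver.
    """
    counts = Counter(row["Sample"] for row in variant_rows)
    return {sample for sample, n in counts.items() if n == 1}
```

For example, `vaf_from_ad("90,10")` corresponds to 10 ALT reads out of 100 total, and a sample listed on two rows would be excluded by `single_driver_samples`.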

For the variant calling itself, some secondary files are needed:

  1. "bravo-dbsnp-all.bcf" A BCF of the TOPMed Bravo sitelist of pass variants to exclude. See here for how to download this file.
  2. A file of low complexity regions (bed/mdust.bed/gz)
  3. A file of segmental duplications (bed/genomicSuperDups.bed), available from UCSC

For convenience, files 2 and 3 are included in a bed subdirectory.
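To illustrate how BED files such as the low-complexity and segmental-duplication tracks can be used to flag variants, here is a minimal sketch of an interval lookup. It assumes the first three tab-separated columns are chromosome, start, and end in the usual BED convention (0-based, half-open), while VCF positions are 1-based. The function names are hypothetical, and this is not necessarily how the repository's scripts apply these files.

```python
from bisect import bisect_right
from collections import defaultdict


def load_bed(lines):
    """Build {chrom: sorted, merged list of (start, end)} from BED lines."""
    regions = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        regions[chrom].append((int(start), int(end)))

    merged = {}
    for chrom, intervals in regions.items():
        intervals.sort()
        out = [intervals[0]]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:
                # Overlapping or adjacent: extend the previous interval.
                out[-1] = (out[-1][0], max(out[-1][1], end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


def in_regions(regions, chrom, pos):
    """True if the 1-based position pos lies inside an interval on chrom."""
    intervals = regions.get(chrom, [])
    zero_based = pos - 1
    # Index of the last interval whose start is <= zero_based.
    i = bisect_right(intervals, (zero_based, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= zero_based < intervals[i][1]
```

A gzipped BED file can be read with `gzip.open(path, "rt")` and passed directly to `load_bed`. Merging intervals first keeps each lookup to a single binary search.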

Python (>= 3.6) dependencies

  1. pandas
  2. numpy
  3. variantkey
  4. pyfaidx
  5. pyarrow
  6. cyvcf2
  7. logging
Output

The output of create_singleton_dump.py is an Apache parquet file with one row per variant. Several variant quality metrics are included in the output to facilitate downstream filtering.

Helper scripts

Helper scripts are provided for downloading the Bravo site list and installing variantkey.

Notes on portability

This analysis has only been tested on Ubuntu 18.04 with Python (>=3.6).

License

This code is dual-licensed. See the license for further details.

Copyright University of Michigan.