zeeev / wham

Structural variant detection and association testing

Filtering the VCF file #22

Closed abolia closed 8 years ago

abolia commented 8 years ago

Hi Zev,

Do you have any recommendations on setting parameters for filtering of the VCF files generated by WHAM?

Thanks, Ashini

zeeev commented 8 years ago

Ashini,

Can you tell me a little more about your experimental design? There are many filtering strategies, so with a little more knowledge I can point you to the correct one.

--Zev

abolia commented 8 years ago

Hi Zev,

The aim of my project is to detect translocations. However, the current problem we are facing with WHAM is far too many false positives (thousands), and we are trying ways to reduce them.

Moreover, I have also played with parameters controlling sensitivity and specificity in my WHAM runs:

For example:
m (min # of soft clips supporting SV start) = 15 (set pretty high)
p (exclude soft-clipped reads below mapping value) = 20
q (exclude soft-clipped reads with average base quality below Phred-scaled value) = 30

Therefore, I am trying to set filters on the VCF file that can reduce the number of these false positives.

So far I have played with Read Depth (RD) > 1500. It detects the true positive but still the # of false positives is way too high.

So, if you can provide any filtering strategy, that would be immensely helpful.

Thanks. Ashini

zeeev commented 8 years ago

Ashini,

Sorry for the late reply. I would suggest using the "AT" field as a set of filters, specifically the 4th datum in the AT field. You would expect a true translocation to have a value greater than 0.05.
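As a rough illustration of that AT check, here is a minimal Python sketch. It assumes AT is an INFO key holding comma-separated numbers and that "4th datum" means the fourth value in that list; the function names `at_datum` and `passes_at_filter` are hypothetical and not part of WHAM.

```python
def at_datum(info, index=3):
    """Return the AT value at `index` (0-based) from a VCF INFO string, or None.

    Assumption: AT is written as AT=v1,v2,v3,v4,... inside the INFO column.
    """
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "AT":
            parts = value.split(",")
            if index >= len(parts):
                return None
            try:
                return float(parts[index])
            except ValueError:
                return None
    return None


def passes_at_filter(vcf_line, threshold=0.05):
    """Keep header lines and records whose 4th AT datum exceeds `threshold`."""
    if vcf_line.startswith("#"):
        return True
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 8:  # INFO is the 8th VCF column
        return False
    value = at_datum(fields[7])
    return value is not None and value > threshold
```

Applied line by line to a VCF, this keeps the header and drops any record whose AT field is missing or too short, which is a choice of this sketch rather than documented WHAM behavior.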

If you have control data that will also help substantially.

If you send me a snippet of your VCF file, I'd be happy to help provide a filtering program for you.

--Zev

abolia commented 8 years ago

Zev,

Thanks for the reply. Attached is a snippet of my VCF file containing the translocation we are looking for. The translocation is between Chr22 (29,684,094-29,684,602) and Chr11 (32,415,739-32,416,247). I see that it is being called, but the values in the 4th AT datum are still not greater than 0.05.

sample.txt

Thanks for all your help. Ashini.

zeeev commented 8 years ago

@BrettKennedy @abolia

Here is a filter that should get you much closer.

SVLEN = 0 <- translocations
CF < 0.2 <- remove sites with excessive CIGAR operations
CU < 10 <- remove sites where there is excessive soft clipping nearby
MQ > 30 <- average mapping quality greater than 30
NC > 10 <- number of soft clips supporting the breakpoint

In the file Brett sent me there were 868 calls. There is one left after filtering.

~/tools/vcflib/bin/vcffilter -f "SVLEN = 0" -f "CF < 0.2" -f "CU < 10" -f "MQ > 30" XXXX-ALK.wham.raw.vcf | perl -lne 'if($_ =~ /^#/){print}else{print if $_ =~ /NC=(.*?);/ && $1 > 10}'
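For anyone without vcflib or Perl at hand, the same five thresholds can be sketched in plain Python. This is a minimal sketch, assuming SVLEN, CF, CU, MQ and NC each appear in the INFO column as a single numeric value; `parse_info`, `keep_record` and `filter_lines` are hypothetical helper names, and records missing any of the five keys are dropped, which may differ from how vcffilter treats them.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-only keys map to None."""
    out = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        out[key] = value if sep else None
    return out


def _num(info, key):
    """Return info[key] as a float, or None if absent or not numeric."""
    try:
        return float(info[key])
    except (KeyError, TypeError, ValueError):
        return None


def keep_record(vcf_line):
    """Keep header lines and records passing the thresholds from the post."""
    if vcf_line.startswith("#"):
        return True
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 8:  # INFO is the 8th VCF column
        return False
    info = parse_info(fields[7])
    svlen, cf, cu, mq, nc = (
        _num(info, k) for k in ("SVLEN", "CF", "CU", "MQ", "NC")
    )
    if None in (svlen, cf, cu, mq, nc):
        return False
    return svlen == 0 and cf < 0.2 and cu < 10 and mq > 30 and nc > 10


def filter_lines(lines):
    """Yield only the lines that keep_record accepts."""
    for line in lines:
        if keep_record(line):
            yield line
```

A typical use would be to open the raw VCF, pass the file object to `filter_lines`, and write each yielded line to a new file.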

abolia commented 8 years ago

Superb. Thanks Zev. I will try it on other samples too and let you know how it works.

Thanks again, Ashini

zeeev commented 8 years ago

Happy to help. I'm closing this issue, but feel free to open up another one.