zeeev / wham

Structural variant detection and association testing

Flags in the info field #25

Closed abolia closed 8 years ago

abolia commented 8 years ago

Hi Zev,

I am trying to understand the various flags in the INFO field, and I have a few questions about some of them.

_1) What is the difference between NC and NS?_ To my understanding, both seem to count soft-clipped reads. NS means "The number of primary reads supporting with a soft clip at POS", i.e. reads whose soft clipping starts exactly at the breakpoint.

NC means "The number of soft-clipped segments that were collapsed into the consensus sequence". Does that mean it counts the reads whose soft-clipped segment passes over the breakpoint? For example, if "." denotes the breakpoint, "r" is aligned read sequence, and "s" is soft-clipped sequence, then the reads counted in the NC flag should look like this diagram: . sssssrrrrr rrrrrrrsssssssss sssssrrrrrrrrrrrrrrrrrrrrrrrrrrrr ssssssssssrrrrrr rrrssssssssssssssss

The first three reads have soft clipping exactly at the breakpoint, so they are counted in NS, whereas the last two reads have a soft-clipped segment spanning the breakpoint. Also, in my Tx call outputs, I see that NS has a higher value than NC. That would mean more reads support the exact breakpoint than have a soft-clipped section passing over it.

Is my interpretation correct?

_2) What is the difference between NA and NS?_ NA is "The number of reads that support the structural variant listed in ALT". Does this also mean the number of reads that have soft clipping at the breakpoint? I see that this number is always higher than the NS value. Is that because it counts reads that might not pass the MQ filter, etc.? If so, NA is the total number of reads supporting the breakpoint, and NS is the number of reads that support the breakpoint and have passed the threshold filters for MQ and BQ.

Can you correct me if I am wrong in my interpretation?

Thank you so much for all your help. Ashini

zeeev commented 8 years ago

NC is the number of reads soft-clipped at the POS.

        X
    ssssrrrrrrrr
     sssrrrrrrrrr
      ssrrrrrrrrrr

NC is 3 in this case. Three reads soft-clip at the same position.

NS is the same as NC, but on a person-by-person basis (genotype field).

If you’re only calling one genome, NC == NS. If you’re joint calling, NC may or may not equal NS.

FORMAT=<ID=NS,Number=1,Type=Integer,Description="Number of reads with a softclip at POS for individual">

INFO=

I will try to make the docs more clear.

Does this help you?

—Zev

Zev Kronenberg Ph.D. Phone: 208 629 6224


abolia commented 8 years ago

Hi Zev,

Thanks for your reply. I have been calling a single genome (single-sample studies for translocation calling), and I never see NC == NS, which, as you mentioned, should hold. For example, here are two Tx calls that are true for an ALK-EML4 translocation sample.

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ALK_S4.bam

2 29448159 . N TGGTGAACATTTTAATGGTTCTGTAGATACTCTCAACNTCCACTTACNCACTTAAAAGATTACAAATTA . . LRT=0;WAF=.,0.500001,0.500001;GC=0,1;AT=0.996262,0.0598131,0.011215,0.0429907,0,0.00747664,0.00373832,0,0,0,0,0,0.102804,0.0224299,9.38974;CF=0.00186916;CISTART=29448141,29448175;CIEND=42525058,42525310;PU=90;SU=3;CU=94;RD=535;NC=77;MQ=60;MQF=0;SP=21,4,0;CHR2=2;DI=f;END=42525185;SVLEN=13077027 GT:GL:NR:NA:NS:RD 0/1:-3232.83,-370.834,-4158.47:301:234:150:535

2 42525164 . N AACCTTCCCCCCACNAGAGCAGCTGCAGTTNCCNGAGGAGCCCCTGATTCTGCACCTCAGNNNNNNNNNNANNN . . LRT=0;WAF=.,1,1;GC=0,1;AT=1,0.761905,0,0.761905,0,0,0,0,0,0,0,0,0.809524,0,0.368569;CF=0;CISTART=42525162,42525164;CIEND=29448006,29448148;PU=20;SU=0;CU=16;RD=21;NC=16;MQ=60;MQF=0;SP=12,0,0;CHR2=2;DI=b;END=29448078;SVLEN=13077085 GT:GL:NR:NA:NS:RD 1/1:-255,-255,-2.1e-05:0:21:20:21

In the first call, NC=77 and NS=150; in the second call, NC=16 and NS=20.

Ideally they should be equal in this case, but I don't understand why they are not.
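To compare the two values directly, a minimal Python sketch such as the following can pull NC out of the INFO column and NS out of each sample column. It assumes a tab-delimited VCF data line; `parse_counts` is a hypothetical helper name (not part of wham), and the record shown is the second call above with ALT and INFO shortened for readability.

```python
def parse_counts(vcf_line):
    """Return (site-level NC from INFO, list of per-sample FORMAT dicts)."""
    fields = vcf_line.rstrip("\n").split("\t")

    # INFO is column 8: semicolon-separated KEY=VALUE entries.
    info = {}
    for entry in fields[7].split(";"):
        key, _, value = entry.partition("=")
        info[key] = value

    # FORMAT is column 9; every following column is one sample.
    format_keys = fields[8].split(":")
    samples = [dict(zip(format_keys, col.split(":"))) for col in fields[9:]]
    return int(info["NC"]), samples


# Second call from this comment, with ALT and INFO abbreviated (assumption:
# the dropped INFO keys are not needed for this comparison).
record = "\t".join([
    "2", "42525164", ".", "N", "AACCTT", ".", ".",
    "PU=20;SU=0;CU=16;RD=21;NC=16;MQ=60;MQF=0",
    "GT:GL:NR:NA:NS:RD",
    "1/1:-255,-255,-2.1e-05:0:21:20:21",
])

nc, samples = parse_counts(record)
ns_total = sum(int(s["NS"]) for s in samples)
print(nc, ns_total)  # 16 20
```

Summing NS over all sample columns keeps the same comparison usable on a jointly called VCF, where there is more than one sample column.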

Also, can you please help me understand the difference between NA and NS?

Thank you so much. Ashini

zeeev commented 8 years ago

Ashini,

  1. As we talked about, here are the metrics that classify NA vs. NR:
    • same strand
    • soft clipped within 5bp of breakpoint (this is NS)
    • a read pair 2.5 SD outside normal mapping range
    • mate pair mapped to another chromosome
  2. Read depth includes supplementary reads. Only reads with three CIGAR operations are filtered.
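
The four criteria in item 1 can be sketched as a small classifier. This is an illustration only, not wham's actual code: the `ReadEvidence` fields, the function names, and the choice to treat any single criterion as sufficient are all assumptions. "Same strand" is taken here to mean that a read and its mate map to the same strand, which the thread does not confirm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReadEvidence:
    # Hypothetical fields, named for this sketch only.
    same_strand_as_mate: bool            # assumed meaning of "same strand"
    softclip_distance_bp: Optional[int]  # clip-to-breakpoint distance; None if unclipped
    insert_size: float
    mate_on_other_chrom: bool


def supports_alt(read, mean_insert, sd_insert):
    """True if the read meets at least one of the four listed criteria."""
    clipped_near = (
        read.softclip_distance_bp is not None
        and abs(read.softclip_distance_bp) <= 5
    )
    outside_range = abs(read.insert_size - mean_insert) > 2.5 * sd_insert
    return (
        read.same_strand_as_mate
        or clipped_near
        or outside_range
        or read.mate_on_other_chrom
    )


def tally(reads, mean_insert, sd_insert):
    """Count reads supporting ALT (NA) and the remainder (NR)."""
    na = sum(supports_alt(r, mean_insert, sd_insert) for r in reads)
    return {"NA": na, "NR": len(reads) - na}


reads = [
    ReadEvidence(False, None, 310.0, False),  # ordinary pair
    ReadEvidence(False, 3, 295.0, False),     # soft clip 3 bp from breakpoint
    ReadEvidence(False, None, 500.0, False),  # insert size far outside range
    ReadEvidence(False, None, 300.0, True),   # mate on another chromosome
]
print(tally(reads, mean_insert=300.0, sd_insert=30.0))  # {'NA': 3, 'NR': 1}
```

Under this reading, NS is the subset of NA that satisfies the soft-clip criterion, which would be consistent with NA being at least as large as NS.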

abolia commented 8 years ago

Thanks Zev. This is very helpful. I don't understand what "same strand" means, though. Aren't all the reads at the breakpoint on the same strand anyway? Also, I see that NR + NA = RD in most of my cases, which makes sense.
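
As a quick check of that relationship, here is a short Python sketch using the genotype-field values from the two calls I posted earlier. The function name `depth_is_partitioned` is just a label made up for this example.

```python
def depth_is_partitioned(nr, na, rd):
    """True when reference-supporting and ALT-supporting reads add up to depth."""
    return nr + na == rd


# NR, NA, and RD copied from the genotype fields of the two calls.
calls = {
    "2:29448159": {"NR": 301, "NA": 234, "RD": 535},
    "2:42525164": {"NR": 0, "NA": 21, "RD": 21},
}

for site, c in calls.items():
    print(site, depth_is_partitioned(c["NR"], c["NA"], c["RD"]))
# 2:29448159 True
# 2:42525164 True
```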

For directionality, the DI field tells whether the breakpoint is supported on the 5' end or the 3' end of the pileup for the "POS" position. However, is there a way to find this out for the "END" breakpoint too, even if the reciprocal translocation is not called?

Thanks again, Ashini